Hybrid Machine Learning Approaches for Phishing Email Detection

By: Gonipalli Bharath, Vel Tech University, Chennai, India; International Center for AI and Cyber Security Research and Innovations, Asia University, Taiwan. Email: gonipallibharath@gmail.com


Abstract:

Phishing has become one of the most common cyber-attacks against individuals and organizations. Phishing techniques have changed considerably over time, which makes them difficult for traditional rule-based methods to handle. Hybrid machine learning models that combine different algorithms have proved effective in improving the accuracy of phishing email detection. This article discusses hybrid models built on Random Forest and Support Vector Machine and compares them with other models. By leveraging URL-based, email header, and text content features, hybrid models can classify phishing emails more accurately. The article outlines the trend toward hybrid models for phishing detection and uses key performance metrics to show how they improve on traditional approaches.

Introduction:

Phishing is a type of cyber-attack in which attackers deceive victims into revealing sensitive information by impersonating legitimate entities. Although email filtering has matured, phishing attacks still manage to evade traditional detection mechanisms. Machine learning (ML) has contributed a great deal to automating phishing email detection, and hybrid approaches that incorporate multiple ML models perform better than single models[1]. This article discusses the use of hybrid ML models for phishing email detection, their advantages, and their performance in real-world applications.

Machine Learning Techniques in Phishing Detection:

Machine learning models used in phishing detection fall into supervised and unsupervised learning techniques. Supervised methods such as Decision Trees, Naive Bayes, and Neural Networks show high accuracy in classifying phishing emails. However, relying on a single ML model has limitations, which is why hybrid approaches have gained attention[2].

Traditional Models for Phishing Detection:

Random Forest (RF): An ensemble learning technique that improves classification by aggregating the outputs of many decision trees.

Support Vector Machine (SVM): A robust algorithm for binary classification tasks, widely used in phishing email detection.

Naive Bayes (NB): A lightweight probabilistic model for spam and phishing email classification.

XGBoost (Extreme Gradient Boosting): A powerful ensemble learning algorithm based on gradient boosting that handles missing values efficiently and limits overfitting through regularization. It is one of the most widely used algorithms for classification and regression on structured data because of its speed and accuracy.

Decision Tree (DT): A tree-based supervised learning algorithm that splits data into branches according to feature values for reaching a decision. It is simple, interpretable, and effective in classification tasks. However, it usually suffers from overfitting when deep trees are grown.

Logistic Regression (LR): A statistical model used for binary classification problems, which estimates the probability of an outcome using the logistic function. Though simple, LR performs well on linearly separable data and serves as a competitive baseline model in many classification tasks.

Hybrid Machine Learning Models:

Hybrid models combine several classifiers with the aim of enhancing detection accuracy and minimizing false positives. The hybrid framework proposed in [3] exploits synergies between individual classifiers to build a stronger combined classifier. Integration allows for better phishing detection by optimizing feature selection and decision-making processes[3].
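
As a minimal sketch of what combining classifiers can look like, the snippet below joins a Random Forest and an SVM with soft voting in scikit-learn. Soft voting is one common combination rule and is an assumption here, not necessarily the method used in [3]; the data is synthetic (make_classification), standing in for a real email feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for an email feature matrix (assumption: not real phishing data).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
# SVMs are sensitive to feature scale, so the SVM branch gets its own scaler.
svm = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))

# Soft voting averages the class probabilities predicted by both models.
hybrid = VotingClassifier(estimators=[("rf", rf), ("svm", svm)], voting="soft")
hybrid.fit(X_train, y_train)

hybrid_accuracy = hybrid.score(X_test, y_test)
print(f"Hybrid accuracy on held-out synthetic data: {hybrid_accuracy:.3f}")
```

The number printed here says nothing about real phishing data; it only confirms that the combined model trains and predicts end to end.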

Feature Engineering for Phishing Email Detection:

Feature extraction is one of the major tasks in phishing email detection. The most significant features used in phishing detection models are[4]:

  • URL-based features: URL length, use of special characters, use of IP addresses, and domain age.
  • Email header features: Authenticity of the sender’s address, reply-to address, and message ID mismatch.
  • Text content features: Word frequency, TF-IDF vectorization, and sentiment analysis of the email body.
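
Some of the URL and header features listed above can be sketched with the Python standard library alone. The function names, the set of "special" characters, and the sample message are all illustrative assumptions. Domain age is left out because it needs an external WHOIS lookup, and the text content features are omitted to keep the sketch short.

```python
import re
from email import message_from_string
from email.utils import parseaddr
from urllib.parse import urlparse

IPV4_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def domain_of(address: str) -> str:
    """Return the lower-cased part after the last '@', or '' if there is none."""
    return address.rpartition("@")[2].lower() if "@" in address else ""


def url_features(url: str) -> dict:
    """URL-based features: length, special characters, IP address as host."""
    host = urlparse(url).hostname or ""
    return {
        "url_length": len(url),
        # Hypothetical choice of characters to count as "special".
        "special_char_count": sum(url.count(c) for c in "@-_%=&?"),
        "host_is_ip": bool(IPV4_HOST.match(host)),
    }


def header_features(raw_email: str) -> dict:
    """Header features: reply-to mismatch and Message-ID domain mismatch."""
    msg = message_from_string(raw_email)
    from_addr = parseaddr(msg.get("From", ""))[1].lower()
    reply_addr = parseaddr(msg.get("Reply-To", ""))[1].lower()
    msg_id_domain = domain_of(msg.get("Message-ID", "").strip("<> "))
    return {
        "reply_to_mismatch": bool(reply_addr) and reply_addr != from_addr,
        "message_id_domain_mismatch": bool(msg_id_domain)
        and msg_id_domain != domain_of(from_addr),
    }


# Made-up sample inputs for illustration only.
sample_url = "http://192.168.10.5/secure-login?user=a@b.com"
sample_email = (
    "From: Support <support@bank.example>\n"
    "Reply-To: help@attacker.example\n"
    "Message-ID: <123@mailer.attacker.example>\n"
    "Subject: Verify your account\n"
    "\n"
    "Please log in.\n"
)

print(url_features(sample_url))
print(header_features(sample_email))
```

The resulting dictionaries could be turned into numeric feature vectors and passed to any of the classifiers discussed earlier.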

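| Model                  | K-Fold | Feature selection | Hyperparameter tuning |
|------------------------|--------|-------------------|-----------------------|
| Support Vector Machine | 95.00  | 95.95             | 96.99                 |
| XGBoost                | 97.16  | 97.99             | 98.11                 |
| Random Forest          | 97.11  | 98.15             | 98.50                 |
| Decision Tree          | 97.12  | 97.67             | 97.32                 |
| Logistic Regression    | 93.53  | 93.66             | 93.80                 |

Comparison of K-fold, feature selection, and hyperparameter tuning results (%) on the UCI dataset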

Of the models compared, RF was the most effective and accurate technique for detecting phishing attacks, reaching 98.50 after hyperparameter tuning[5].
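
The three stages compared in the table (K-fold evaluation, feature selection, hyperparameter tuning) can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the setup used in [5]: the data is synthetic, and the selector (SelectKBest with f_classif) and the small parameter grid are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the phishing dataset (assumption).
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=6, random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Stage 1: plain K-fold cross-validation of the base model.
baseline = RandomForestClassifier(n_estimators=50, random_state=0)
kfold_acc = cross_val_score(baseline, X, y, cv=cv, scoring="accuracy").mean()

# Stage 2: feature selection inside the pipeline, so it is refit on each fold.
selected = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
selected_acc = cross_val_score(selected, X, y, cv=cv, scoring="accuracy").mean()

# Stage 3: hyperparameter tuning over a small, hypothetical grid.
grid = GridSearchCV(
    selected,
    param_grid={"select__k": [5, 10, 20], "rf__max_depth": [None, 5]},
    cv=cv,
    scoring="accuracy",
)
grid.fit(X, y)
tuned_acc = grid.best_score_

print(f"K-fold: {kfold_acc:.4f}")
print(f"+ feature selection: {selected_acc:.4f}")
print(f"+ hyperparameter tuning: {tuned_acc:.4f} with {grid.best_params_}")
```

Because the best grid score is chosen on the same folds it is measured on, it is slightly optimistic; nested cross-validation or a separate held-out test set gives a fairer estimate.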

Hybrid model approach:

Overall performance of the hybrid framework in the proof of concept[6]:

| Approach               | Accuracy | F1-Score |
|------------------------|----------|----------|
| Hybrid model framework | 97.44%   | 96.56%   |

The authors of [6] propose a hybrid framework that incorporates the results of several single-analysis methods. To do this, they employ a stacking function, which aggregates the predictions of every model into a single prediction. By testing various stacking functions on the proof of concept, they identify a particularly effective stacking function for recognizing fraudulent websites with a hybrid architecture[6].
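
To make the idea of a stacking function concrete, here is a minimal sketch of two simple aggregation rules that turn per-model phishing probabilities into one decision. The rules (majority vote and weighted average), the weights, and the sample scores are assumptions for illustration; the source does not state which stacking function the authors of [6] selected.

```python
def majority_vote(probabilities, threshold=0.5):
    """Flag as phishing if more than half of the models exceed the threshold."""
    votes = sum(p >= threshold for p in probabilities)
    return votes * 2 > len(probabilities)


def weighted_average(probabilities, weights, threshold=0.5):
    """Flag as phishing if the weighted mean probability reaches the threshold."""
    score = sum(p * w for p, w in zip(probabilities, weights)) / sum(weights)
    return score >= threshold


# Hypothetical per-model phishing probabilities for one sample.
scores = {"url": 0.91, "content": 0.40, "dom": 0.35}
# Hypothetical weights that trust the URL model most.
weights = {"url": 3.0, "content": 1.0, "dom": 1.0}

probs = list(scores.values())
vote_result = majority_vote(probs)
weighted_result = weighted_average(probs, [weights[name] for name in scores])

print("majority vote says phishing:", vote_result)
print("weighted average says phishing:", weighted_result)
```

On this sample the two rules disagree: only one of three models votes phishing, while the weighted mean is pulled above the threshold by the heavily weighted URL score. That is why the choice of stacking function matters.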

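Performance of each individual model in the proof of concept[6]:

| Model    | Accuracy | Precision | Recall | F1-score | ROC-AUC |
|----------|----------|-----------|--------|----------|---------|
| URL      | 95.41%   | 93.15%    | 94.51% | 93.82%   | 94.96%  |
| Content  | 93.64%   | 95.29%    | 88.57% | 91.81%   | 93.97%  |
| DOM tree | 90.30%   | 86.70%    | 87.29% | 86.99%   | 89.58%  |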
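
The F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick consistency check, the short sketch below recomputes it from the precision and recall values reported for the individual models; the variable and function names are illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (works on fractions or percentages)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) taken from the individual-model results.
reported = {
    "URL": (93.15, 94.51, 93.82),
    "Content": (95.29, 88.57, 91.81),
    "DOM tree": (86.70, 87.29, 86.99),
}

differences = {}
for name, (precision, recall, reported_f1) in reported.items():
    computed = f1_from(precision, recall)
    differences[name] = abs(computed - reported_f1)
    print(f"{name}: computed F1 = {computed:.3f}%, reported F1 = {reported_f1:.2f}%")
```

For all three models the recomputed value agrees with the reported F1-score to within 0.01 percentage points, so the reported figures are internally consistent.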

Conclusion:

Phishing email detection is one of the critical challenges in cybersecurity, and hybrid machine learning approaches have considerably improved on traditional methods. By integrating RF and SVM, phishing detection systems can achieve high accuracy with few false positives. Future research could explore deep learning techniques and real-time deployment to further improve protection against phishing attacks.

References:

  1. S. Salloum, T. Gaber, S. Vadera, and K. Shaalan, “Phishing Email Detection Using Natural Language Processing Techniques: A Literature Survey,” Procedia Comput. Sci., vol. 189, pp. 19–28, Jan. 2021, doi: 10.1016/j.procs.2021.05.077.
  2. S. Muneer et al., “A Critical Review of Artificial Intelligence Based Approaches in Intrusion Detection: A Comprehensive Analysis,” Journal of Engineering, 2024, doi: 10.1155/2024/3909173. Accessed: Feb. 07, 2025. [Online]. Available: https://onlinelibrary.wiley.com/doi/full/10.1155/2024/3909173
  3. R. J. van Geest, G. Cascavilla, J. Hulstijn, and N. Zannone, “The applicability of a hybrid framework for automated phishing detection,” Comput. Secur., vol. 139, p. 103736, Apr. 2024, doi: 10.1016/j.cose.2024.103736.
  4. O. K. Sahingoz, E. Buber, O. Demir, and B. Diri, “Machine learning based phishing detection from URLs,” Expert Syst. Appl., vol. 117, pp. 345–357, Mar. 2019, doi: 10.1016/j.eswa.2018.09.029.
  5. “Detection of Phishing Attacks Using Machine Learning Techniques,” ResearchGate. Accessed: Feb. 07, 2025. [Online]. Available: https://www.researchgate.net/publication/382149371_DETECTION_OF_PHISHING_ATTACKS_USING_MACHINE_LEARNING_TECHNIQUES
  6. R. J. van Geest, G. Cascavilla, J. Hulstijn, and N. Zannone, “The applicability of a hybrid framework for automated phishing detection,” Comput. Secur., vol. 139, p. 103736, Apr. 2024, doi: 10.1016/j.cose.2024.103736.
  7. Lv, L., Wu, Z., Zhang, L., Gupta, B. B., & Tian, Z. (2022). An edge-AI based forecasting approach for improving smart microgrid efficiency. IEEE Transactions on Industrial Informatics, 18(11), 7946-7954.
  8. Mirsadeghi, F., Rafsanjani, M. K., & Gupta, B. B. (2021). A trust infrastructure based authentication method for clustered vehicular ad hoc networks. Peer-to-Peer Networking and Applications, 14, 2537-2553.
  9. Rahaman M (2024) Foundations of Phishing Detection Using Deep Learning: A Review of Current Techniques, Insights2Techinfo, pp.1

Cite As

Bharath G. (2025) Hybrid Machine Learning Approaches for Phishing Email Detection, Insights2Techinfo, pp.1
