By: Gonipalli Bharath, Vel Tech University, Chennai, India; International Center for AI and Cyber Security Research and Innovations, Asia University, Taiwan; Gmail: gonipallibharath@gmail.com
Abstract:
Phishing has emerged as one of the most common cyber-attacks against individuals and organizations. Phishing techniques have changed considerably over time, which makes them difficult for traditional rule-based methods to handle. Hybrid machine learning models that combine different algorithms have proved effective in improving the accuracy of phishing email detection. This article discusses hybrid models involving Random Forest and Support Vector Machine and compares them with other models. By leveraging URL-based, email header, and text content features, hybrid models can classify phishing emails more accurately. The article outlines the trend toward hybrid models for phishing detection and reports key performance metrics that show how they improve on traditional approaches.
Introduction:
Phishing is a type of cyber-attack in which attackers use deception to obtain sensitive information by impersonating legitimate entities. Although email filtering has matured, phishing attacks still manage to evade traditional detection mechanisms. Machine learning (ML) has contributed substantially to automating phishing email detection, and hybrid approaches that combine multiple ML models perform better than single models[1]. This article discusses the use of hybrid ML models for phishing email detection, their advantages, and their performance in real-world applications.
Machine Learning Techniques in Phishing Detection:
Machine learning models used in phishing detection fall into supervised and unsupervised learning techniques. Supervised methods such as Decision Trees, Naive Bayes, and Neural Networks show high accuracy in classifying phishing emails. However, relying on a single ML model has limitations, which is why hybrid approaches have gained attention[2].

Traditional Models for Phishing Detection:
Random Forest (RF): An ensemble learning technique that improves classification by aggregating the outputs of many decision trees.
Support Vector Machine (SVM): A robust algorithm for binary classification tasks, used in phishing email detection.
Naive Bayes (NB): A lightweight probabilistic model for spam and phishing email classification.
XGBoost (Extreme Gradient Boosting): A powerful ensemble learning algorithm based on gradient boosting that handles missing values efficiently and limits overfitting through regularization. It is one of the most widely used algorithms for classification and regression on structured data because of its speed and accuracy.
Decision Tree (DT): A tree-based supervised learning algorithm that splits data into branches according to feature values for reaching a decision. It is simple, interpretable, and effective in classification tasks. However, it usually suffers from overfitting when deep trees are grown.
Logistic Regression (LR): A statistical model used for binary classification problems, which estimates the probability of an outcome using the logistic function. Though simple, LR performs well on linearly separable data and serves as a competitive baseline model in many classification tasks.
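As a small illustration of how Logistic Regression turns a linear score into a probability, the sketch below applies the logistic function to two features. The feature choice, weights, and bias are made-up values for illustration only; they are not taken from any model discussed in this article.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_phishing_probability(features, weights, bias):
    """Compute the linear score w.x + b and pass it through the logistic function."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)


# Hypothetical weights for two illustrative features:
# [number of links in the email, 1 if the From and Reply-To domains differ]
weights = [0.4, 1.5]
bias = -2.0

p = predict_phishing_probability([3, 1], weights, bias)
label = "phishing" if p >= 0.5 else "legitimate"
print(f"probability = {p:.3f}, label = {label}")
```

In a trained model the weights and bias would be learned from labelled emails rather than set by hand.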
Hybrid Machine Learning Models:
These models combine several classifiers with the aim of improving detection accuracy and minimizing false positives. The hybrid framework proposed in [3] exploits synergies between individual classifiers to create a stronger classifier for the task. This integration improves phishing detection by optimizing feature selection and decision-making[3].
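One simple way to combine several classifiers is a majority vote over their predicted labels. The sketch below is a minimal illustration of that idea, not the framework from [3]; the function name, the tie-breaking rule, and the example predictions are all assumptions.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most base classifiers.

    `predictions` holds one label per classifier. Ties are resolved in
    favour of "phishing" (an assumed, deliberately cautious choice).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    return "phishing" if "phishing" in winners else winners[0]


# Hypothetical outputs of three base classifiers (e.g. RF, SVM, NB) for one email.
base_predictions = ["phishing", "legitimate", "phishing"]
verdict = majority_vote(base_predictions)
print(verdict)
```

A single classifier that misfires on an email is outvoted as long as the other two agree, which is the intuition behind the reduced error rate of combined models.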
Feature Engineering for Phishing Email Detection:
Feature extraction is one of the major tasks in phishing email detection. The significant features used in phishing detection models are[4]:
- URL-based features: Length of URLs, usage of special characters, usage of IP addresses, and age of the domain.
- Email header features: Authenticity of the sender’s address, reply-to address, and mismatch of the message ID.
- Text Content Features: Frequency of words, TF-IDF vectorization, and sentiment analysis of the email body.
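Some of the URL-based and header features listed above can be computed with the Python standard library alone. The sketch below is one possible reading of those features: the set of "special" characters, the function names, and the sample email are assumptions, and domain age is left out because it requires an external WHOIS lookup.

```python
import re
from email import message_from_string
from email.utils import parseaddr
from urllib.parse import urlparse

SPECIAL_CHARS = set("@-_%=&?~")  # assumed definition of "special characters"


def url_features(url: str) -> dict:
    """URL-based features: length, special character count, raw IP host."""
    host = urlparse(url).hostname or ""
    return {
        "url_length": len(url),
        "special_char_count": sum(ch in SPECIAL_CHARS for ch in url),
        "uses_ip_address": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
    }


def _domain(header_value) -> str:
    """Extract the lower-cased domain from an address header (empty if missing)."""
    address = parseaddr(header_value or "")[1]
    return address.rpartition("@")[2].lower()


def header_features(raw_email: str) -> dict:
    """Header features: Reply-To and Message-ID domains compared with From."""
    msg = message_from_string(raw_email)
    from_dom = _domain(msg["From"])
    reply_dom = _domain(msg["Reply-To"])
    msgid_dom = (msg["Message-ID"] or "").strip("<> ").rpartition("@")[2].lower()
    return {
        "reply_to_mismatch": bool(reply_dom) and reply_dom != from_dom,
        "message_id_mismatch": bool(msgid_dom) and msgid_dom != from_dom,
    }


# Made-up sample data for illustration.
sample_url = "http://192.168.10.5/login?user=a@b.com"
sample_email = (
    "From: Support <support@example.com>\n"
    "Reply-To: attacker@evil.example\n"
    "Message-ID: <123@mail.evil.example>\n"
    "Subject: Verify your account\n"
    "\n"
    "Please confirm your password.\n"
)
print(url_features(sample_url))
print(header_features(sample_email))
```

The resulting dictionaries can be merged into one feature vector per email before being passed to a classifier.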
Model | K-Fold | Feature Selection | Hyperparameter Tuning |
Support Vector Machine | 95.00 | 95.95 | 96.99 |
XGBoost | 97.16 | 97.99 | 98.11 |
Random Forest | 97.11 | 98.15 | 98.50 |
Decision Tree | 97.12 | 97.67 | 97.32 |
Logistic Regression | 93.53 | 93.66 | 93.80 |
Comparison of K-fold, feature selection, and hyperparameter tuning results on the UCI dataset
RF achieves the highest score after feature selection (98.15) and after hyperparameter tuning (98.50), and the study identifies it as the most effective and precise technique for combating phishing attacks[5].
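The K-Fold column in the comparison above refers to k-fold cross-validation, in which the data is split into k parts and each part serves once as the test set. The sketch below shows the splitting and averaging logic only; the toy scorer and the labels are made up, and in practice the data would normally be shuffled or stratified before splitting.

```python
from collections import Counter


def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(samples, labels, k, train_and_score):
    """Average the score returned by `train_and_score` over the k folds."""
    scores = []
    for train, test in k_fold_indices(len(samples), k):
        scores.append(train_and_score(
            [samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in test], [labels[i] for i in test],
        ))
    return sum(scores) / len(scores)


def majority_baseline(train_x, train_y, test_x, test_y):
    """Toy scorer (hypothetical): always predict the most common training label."""
    guess = Counter(train_y).most_common(1)[0][0]
    return sum(y == guess for y in test_y) / len(test_y)


labels = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # 1 = phishing, 0 = legitimate (made-up)
samples = list(range(len(labels)))         # placeholder feature vectors
mean_accuracy = cross_validate(samples, labels, 5, majority_baseline)
print(f"mean accuracy over 5 folds: {mean_accuracy:.2f}")
```

Replacing `majority_baseline` with a function that trains and evaluates a real classifier gives the kind of per-model score reported in the K-Fold column.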
Hybrid model approach:
Performance of the hybrid framework in the proof of concept[6]:
Approach | Accuracy | F1-Score |
Hybrid model Framework | 97.44% | 96.56% |
The authors of [6] propose a hybrid framework that combines the results of several single-analysis methods. To do this, they employ a stacking function, which aggregates the predictions of every model into a single prediction. By testing various stacking functions on the proof of concept, they identify a particularly effective one for recognizing fraudulent websites with a hybrid architecture[6].
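To make the idea of a stacking function concrete, the sketch below aggregates per-model phishing probabilities in three different ways. These candidate functions, the weights, and the example probabilities are assumptions chosen for illustration; they are not the stacking functions evaluated in [6].

```python
def mean_stack(probs: dict) -> float:
    """Unweighted average of the per-model phishing probabilities."""
    return sum(probs.values()) / len(probs)


def max_stack(probs: dict) -> float:
    """Most suspicious single model decides."""
    return max(probs.values())


def weighted_stack(probs: dict, weights: dict) -> float:
    """Weighted average; weights could reflect how much each model is trusted."""
    total = sum(weights[name] for name in probs)
    return sum(probs[name] * weights[name] for name in probs) / total


def classify(score: float, threshold: float = 0.5) -> str:
    """Turn the aggregated score into a final label."""
    return "phishing" if score >= threshold else "legitimate"


# Hypothetical outputs of three single-analysis models for one sample.
model_probs = {"url": 0.80, "content": 0.35, "dom_tree": 0.30}
# Hypothetical trust weights, not taken from any study.
model_weights = {"url": 0.5, "content": 0.3, "dom_tree": 0.2}

results = {
    "mean": classify(mean_stack(model_probs)),
    "max": classify(max_stack(model_probs)),
    "weighted": classify(weighted_stack(model_probs, model_weights)),
}
print(results)
```

The same three model outputs lead to different final verdicts depending on the aggregation rule, which is why comparing several stacking functions on the same data is worthwhile.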
Performance of the individual URL, content, and DOM tree models[6]:
Model | Accuracy | Precision | Recall | F1-score | ROC-AUC |
URL | 95.41% | 93.15% | 94.51% | 93.82% | 94.96% |
Content | 93.64% | 95.29% | 88.57% | 91.81% | 93.97% |
DOM tree | 90.30% | 86.70% | 87.29% | 86.99% | 89.58% |
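The F1-score is the harmonic mean of precision and recall, so the F1 column of the table above can be checked against its Precision and Recall columns. The short sketch below performs that check with the reported values; the computed and reported figures agree to within 0.01 percentage points, which is consistent with rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from the table above.
reported = {
    "URL": (93.15, 94.51, 93.82),
    "Content": (95.29, 88.57, 91.81),
    "DOM tree": (86.70, 87.29, 86.99),
}

differences = {}
for model, (precision, recall, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    differences[model] = abs(f1 - f1_reported)
    print(f"{model}: computed F1 = {f1:.3f}%, reported F1 = {f1_reported:.2f}%")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the Content model's lower recall (88.57%) drags its F1 well below its precision (95.29%).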
Conclusion:
Phishing email detection is a critical challenge in cybersecurity, and hybrid machine learning approaches have considerably improved on traditional methods. By combining models such as RF and SVM, phishing detection systems can achieve high accuracy with few false positives. Future research could explore deep learning techniques and real-time deployment to further improve protection against phishing attacks.
References:
- S. Salloum, T. Gaber, S. Vadera, and K. Shaalan, “Phishing Email Detection Using Natural Language Processing Techniques: A Literature Survey,” Procedia Comput. Sci., vol. 189, pp. 19–28, Jan. 2021, doi: 10.1016/j.procs.2021.05.077.
- Muneer, “A Critical Review of Artificial Intelligence Based Approaches in Intrusion Detection: A Comprehensive Analysis,” Journal of Engineering, 2024. Accessed: Feb. 07, 2025. [Online]. Available: https://onlinelibrary.wiley.com/doi/full/10.1155/2024/3909173
- R. J. van Geest, G. Cascavilla, J. Hulstijn, and N. Zannone, “The applicability of a hybrid framework for automated phishing detection,” Comput. Secur., vol. 139, p. 103736, Apr. 2024, doi: 10.1016/j.cose.2024.103736.
- O. K. Sahingoz, E. Buber, O. Demir, and B. Diri, “Machine learning based phishing detection from URLs,” Expert Syst. Appl., vol. 117, pp. 345–357, Mar. 2019, doi: 10.1016/j.eswa.2018.09.029.
- “Detection of Phishing Attacks Using Machine Learning Techniques,” ResearchGate. Accessed: Feb. 07, 2025. [Online]. Available: https://www.researchgate.net/publication/382149371_DETECTION_OF_PHISHING_ATTACKS_USING_MACHINE_LEARNING_TECHNIQUES
- R. J. van Geest, G. Cascavilla, J. Hulstijn, and N. Zannone, “The applicability of a hybrid framework for automated phishing detection,” Comput. Secur., vol. 139, p. 103736, Apr. 2024, doi: 10.1016/j.cose.2024.103736.
- L. Lv, Z. Wu, L. Zhang, B. B. Gupta, and Z. Tian, “An edge-AI based forecasting approach for improving smart microgrid efficiency,” IEEE Transactions on Industrial Informatics, vol. 18, no. 11, pp. 7946–7954, 2022.
- F. Mirsadeghi, M. K. Rafsanjani, and B. B. Gupta, “A trust infrastructure based authentication method for clustered vehicular ad hoc networks,” Peer-to-Peer Networking and Applications, vol. 14, pp. 2537–2553, 2021.
- M. Rahaman, “Foundations of Phishing Detection Using Deep Learning: A Review of Current Techniques,” Insights2Techinfo, p. 1, 2024.
Cite As
Bharath G. (2025) Hybrid Machine Learning Approaches for Phishing Email Detection, Insights2Techinfo, pp.1