Hybrid Machine Learning Approaches for Phishing Email Detection

By: Gonipalli Bharath, Vel Tech University, Chennai, India; International Center for AI and Cyber Security Research and Innovations, Asia University, Taiwan. Email: gonipallibharath@gmail.com

Abstract:

Phishing has emerged as one of the most common cyber-attacks against individuals and organizations. Phishing techniques have changed considerably over time, which makes them difficult for traditional rule-based methods to handle. Hybrid machine learning models, which combine different algorithms, have proved effective in improving the accuracy of phishing email detection. This article discusses hybrid models involving Random Forest and Support Vector Machine and compares them with other models. By leveraging URL-based, email header, and text content features, hybrid models can classify phishing emails with higher accuracy. The article outlines the trend toward hybrid models for phishing detection and supports it with key performance metrics that show improvement over traditional approaches.

Introduction:

Phishing is a type of cyber-attack in which attackers obtain sensitive information through deception by impersonating legitimate entities. Although email filtering has evolved, phishing attacks still manage to evade traditional detection mechanisms. Machine learning (ML) has contributed substantially to automating phishing email detection, and hybrid approaches that incorporate multiple ML models perform better than single models[1]. This article discusses the use of hybrid ML models for phishing email detection, their advantages, and their performance in real-world applications.

Machine Learning Techniques in Phishing Detection:

Machine learning models used in phishing detection fall into supervised and unsupervised learning techniques. Among them, supervised methods such as Decision Trees, Naive Bayes, and Neural Networks show high accuracy in classifying phishing emails. However, relying on a single ML model has limitations, which is why hybrid approaches have gained attention[2].

Traditional Models for Phishing Detection:

Random Forest (RF): An ensemble learning technique that improves classification by aggregating the outputs of many decision trees.

Support Vector Machine (SVM): A robust algorithm for binary classification tasks, used in phishing email detection.

Naive Bayes (NB): A lightweight probabilistic model for spam and phishing email classification.

XGBoost (Extreme Gradient Boosting): A powerful ensemble learning algorithm based on gradient boosting. It handles missing values efficiently and uses regularization to reduce overfitting. It is one of the most widely used algorithms for classification and regression on structured data because of its speed and accuracy.

Decision Tree (DT): A tree-based supervised learning algorithm that splits data into branches according to feature values to reach a decision. It is simple, interpretable, and effective in classification tasks. However, it usually suffers from overfitting when deep trees are grown.

Logistic Regression (LR): A statistical model used for binary classification problems, which estimates the probability of an outcome using the logistic function. Though simple, LR performs well on linearly separable data and serves as a competitive baseline model in many classification tasks.
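To make the logistic function mentioned above concrete, here is a minimal sketch of how a Logistic Regression model turns a weighted sum of features into a phishing probability. The two features, the weights, and the bias are hypothetical values chosen for illustration only; a real model would learn them from labeled emails.

```python
import math

def logistic(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def phishing_probability(features, weights, bias):
    """Logistic Regression output: logistic function of the weighted feature sum."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return logistic(z)

# Hypothetical, hand-picked parameters (not learned, not from the article).
# Features: [number of links in the email, 1 if the sender domain mismatches else 0]
weights = [0.8, 2.0]
bias = -3.0

p = phishing_probability([3, 1], weights, bias)
label = "phishing" if p >= 0.5 else "legitimate"
print(f"probability={p:.3f} label={label}")
```

A score of zero maps to a probability of 0.5, which is why 0.5 is the usual decision threshold for this kind of model.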

Hybrid Machine Learning Models:

Hybrid models combine several classifiers with the aim of improving detection accuracy and minimizing false positives. One proposed hybrid framework exploits synergies between individual classifiers to build a stronger combined classifier. This integration allows for better phishing detection by optimizing feature selection and decision-making processes[3].
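One simple way to combine several classifiers is weighted soft voting, where each model's predicted phishing probability is averaged into a single score. The sketch below is only an illustration of that idea, not the framework from [3]; the model names, probabilities, and weights are hypothetical.

```python
def soft_vote(probabilities, weights=None, threshold=0.5):
    """Combine per-model phishing probabilities into one weighted average.

    probabilities: dict mapping a model name to its predicted probability.
    weights: optional dict mapping a model name to its voting weight.
    Returns (combined_score, is_phishing).
    """
    names = list(probabilities)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal vote by default
    total = sum(weights[name] for name in names)
    score = sum(weights[name] * probabilities[name] for name in names) / total
    return score, score >= threshold

# Hypothetical outputs of three base models for one email.
probs = {"url_model": 0.9, "text_model": 0.4, "header_model": 0.1}

equal_score, equal_flag = soft_vote(probs)
weighted_score, weighted_flag = soft_vote(
    probs, {"url_model": 2.0, "text_model": 1.0, "header_model": 1.0}
)
print(equal_score, equal_flag)
print(weighted_score, weighted_flag)
```

With equal weights the email stays below the threshold, while giving the URL model a double vote flags it. The choice of weights therefore matters, which is one motivation for learning the combination from data instead of fixing it by hand.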

Feature Engineering for Phishing Email Detection:

Feature extraction is one of the major tasks in phishing email detection. The significant features used in phishing detection models are[4]:

URL-based features: URL length, use of special characters, use of IP addresses, and domain age.
Email header features: Authenticity of the sender’s email address, reply-to address, and message ID mismatch.
Text content features: Word frequency, TF-IDF vectorization, and sentiment analysis of the email body.
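Some of the URL and header features listed above can be extracted with the Python standard library alone. The following is a rough sketch under stated assumptions: the set of "special" characters, the feature names, and the sample email are hypothetical, and the domain comparisons are crude heuristics (legitimate mail can also use a different Message-ID domain). Domain age and text features are omitted because they need external data such as WHOIS records or a text corpus.

```python
import ipaddress
import re
from email import message_from_string
from email.utils import parseaddr
from urllib.parse import urlparse

# Assumed set of "special" characters; the article does not define one.
SPECIAL_CHARS = re.compile(r"[@\-_%=&?]")

def url_features(url: str) -> dict:
    """Extract simple URL-based features."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError for normal domain names
        uses_ip = True
    except ValueError:
        uses_ip = False
    return {
        "url_length": len(url),
        "special_char_count": len(SPECIAL_CHARS.findall(url)),
        "uses_ip_address": uses_ip,
    }

def domain_of(address: str) -> str:
    """Return the lowercased part after the last '@' (empty if absent)."""
    return address.rpartition("@")[2].lower() if "@" in address else ""

def header_features(raw_email: str) -> dict:
    """Compare the From domain with the Reply-To and Message-ID domains."""
    msg = message_from_string(raw_email)
    from_domain = domain_of(parseaddr(msg.get("From", ""))[1])
    reply_domain = domain_of(parseaddr(msg.get("Reply-To", ""))[1])
    msg_id_domain = domain_of(msg.get("Message-ID", "").strip().strip("<>"))
    return {
        "reply_to_mismatch": bool(reply_domain) and reply_domain != from_domain,
        "message_id_mismatch": bool(msg_id_domain) and msg_id_domain != from_domain,
    }

# Hypothetical sample inputs.
sample_url = "http://192.168.10.5/login?user=a@b.com"
sample_email = (
    "From: Support <support@bank.example>\n"
    "Reply-To: attacker@evil.example\n"
    "Message-ID: <abc123@mail.evil.example>\n"
    "Subject: Verify your account\n"
    "\n"
    "Please verify your account now.\n"
)

print(url_features(sample_url))
print(header_features(sample_email))
```

The resulting dictionaries can be flattened into a numeric feature vector and passed to any of the classifiers discussed earlier.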

Model	K-Fold	Feature Selection	Hyperparameter Tuning
Support Vector Machine	95.00	95.95	96.99
XGBoost	97.16	97.99	98.11
Random Forest	97.11	98.15	98.50
Decision Tree	97.12	97.67	97.32
Logistic Regression	93.53	93.66	93.80

Comparison of K-fold, feature selection, and hyperparameter tuning results on the UCI dataset

In this comparison, the Random Forest model emerges as the most effective and precise technique for combating phishing attacks, with the highest scores after feature selection (98.15) and hyperparameter tuning (98.50)[5].
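For readers unfamiliar with the K-fold column, the sketch below shows how K-fold cross-validation splits a dataset and averages the accuracy over folds. It is a generic illustration, not the evaluation code behind the table; the `fit`/`predict` interface and the majority-class baseline are hypothetical stand-ins for a real classifier.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def cross_val_accuracy(fit, predict, X, y, k=5):
    """Train on k-1 folds, test on the held-out fold, and average the accuracy."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

# Hypothetical baseline: always predict the majority class seen in training.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

X = [[i] for i in range(10)]  # toy feature vectors
y = [0] * 7 + [1] * 3         # 7 legitimate, 3 phishing
mean_accuracy = cross_val_accuracy(fit_majority, predict_majority, X, y, k=5)
print(mean_accuracy)
```

Hyperparameter tuning typically wraps this loop: each candidate setting is scored with cross-validation and the best-scoring setting is kept.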

Hybrid model approach:

Performance of the hybrid framework in the proof of concept[6]:

Approach	Accuracy	F1-Score
Hybrid model Framework	97.44%	96.56%

In [6], the authors propose a hybrid framework that combines the results of several single-analysis methods. To do this, they employ a stacking function, which aggregates the predictions of every model into a single prediction. By testing various stacking functions on the proof of concept, they identify a particularly effective stacking function for recognizing fraudulent websites with a hybrid architecture[6].
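The stacking idea described above can be sketched with a small meta-learner that takes the base models' predicted probabilities as input and learns how to weight them. This example assumes a logistic-regression stacker trained with plain gradient descent, which is one common choice; the source does not state which stacking function performed best in [6], and the base-model predictions below are made-up numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_stacker(base_preds, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights on the base models' probabilities."""
    n_models = len(base_preds[0])
    weights = [0.0] * n_models
    bias = 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for preds, y in zip(base_preds, labels):
            p = sigmoid(sum(w * x for w, x in zip(weights, preds)) + bias)
            err = p - y  # gradient of the log loss with respect to the score
            for j, x in enumerate(preds):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def stack_predict(weights, bias, preds):
    """Aggregate one sample's base predictions into a single probability."""
    return sigmoid(sum(w * x for w, x in zip(weights, preds)) + bias)

# Hypothetical held-out predictions from three base models
# (columns: URL model, content model, DOM-tree model).
base_preds = [
    [0.9, 0.8, 0.7], [0.8, 0.3, 0.6], [0.7, 0.9, 0.4],  # phishing
    [0.2, 0.1, 0.3], [0.3, 0.4, 0.2], [0.1, 0.2, 0.6],  # legitimate
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_stacker(base_preds, labels)
print(stack_predict(weights, bias, [0.85, 0.6, 0.5]))  # high URL score
print(stack_predict(weights, bias, [0.15, 0.2, 0.3]))  # low scores overall
```

In practice the stacker should be trained on predictions the base models made for data they were not trained on (for example, out-of-fold predictions), otherwise it tends to overtrust overfitted base models.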

Performance of each individual model implemented in the proof of concept[6]:

Model	Accuracy	Precision	Recall	F1-score	ROC-AUC
URL	95.41%	93.15%	94.51%	93.82%	94.96%
Content	93.64%	95.29%	88.57%	91.81%	93.97%
DOM tree	90.30%	86.70%	87.29%	86.99%	89.58%
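The metrics in these tables are related in a simple way, which the sketch below illustrates. The confusion-matrix counts are hypothetical; only the precision and recall values used in the final check come from the URL row above. ROC-AUC is omitted because it requires ranked prediction scores instead of counts.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }

# Hypothetical counts: 90 phishing emails caught, 10 false alarms,
# 5 phishing emails missed, 95 legitimate emails passed.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(metrics)

# Consistency check against the URL row: precision 93.15%, recall 94.51%.
url_f1 = f1_score(0.9315, 0.9451)
print(round(url_f1, 3))
```

The F1 value recomputed from the URL row's precision and recall comes out at roughly 93.8%, in line with the reported F1-score.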

Conclusion:

Phishing email detection is a critical challenge in cybersecurity, and hybrid machine learning approaches have considerably improved on traditional methods. By integrating models such as RF and SVM, phishing detection systems can achieve high accuracy with low false positives. Future research could explore deep learning techniques and real-time deployment to improve protection against phishing attacks.

References:

[1] S. Salloum, T. Gaber, S. Vadera, and K. Shaalan, “Phishing Email Detection Using Natural Language Processing Techniques: A Literature Survey,” Procedia Comput. Sci., vol. 189, pp. 19–28, Jan. 2021, doi: 10.1016/j.procs.2021.05.077.
[2] Muneer, “A Critical Review of Artificial Intelligence Based Approaches in Intrusion Detection: A Comprehensive Analysis,” Journal of Engineering, 2024. Accessed: Feb. 07, 2025. [Online]. Available: https://onlinelibrary.wiley.com/doi/full/10.1155/2024/3909173
[3] R. J. van Geest, G. Cascavilla, J. Hulstijn, and N. Zannone, “The applicability of a hybrid framework for automated phishing detection,” Comput. Secur., vol. 139, p. 103736, Apr. 2024, doi: 10.1016/j.cose.2024.103736.
[4] O. K. Sahingoz, E. Buber, O. Demir, and B. Diri, “Machine learning based phishing detection from URLs,” Expert Syst. Appl., vol. 117, pp. 345–357, Mar. 2019, doi: 10.1016/j.eswa.2018.09.029.
[5] “Detection of Phishing Attacks Using Machine Learning Techniques,” ResearchGate. Accessed: Feb. 07, 2025. [Online]. Available: https://www.researchgate.net/publication/382149371_DETECTION_OF_PHISHING_ATTACKS_USING_MACHINE_LEARNING_TECHNIQUES
[6] R. J. van Geest, G. Cascavilla, J. Hulstijn, and N. Zannone, “The applicability of a hybrid framework for automated phishing detection,” Comput. Secur., vol. 139, p. 103736, Apr. 2024, doi: 10.1016/j.cose.2024.103736.
[7] Lv, L., Wu, Z., Zhang, L., Gupta, B. B., & Tian, Z. (2022). An edge-AI based forecasting approach for improving smart microgrid efficiency. IEEE Transactions on Industrial Informatics, 18(11), 7946-7954.
[8] Mirsadeghi, F., Rafsanjani, M. K., & Gupta, B. B. (2021). A trust infrastructure based authentication method for clustered vehicular ad hoc networks. Peer-to-Peer Networking and Applications, 14, 2537-2553.
[9] Rahaman M (2024) Foundations of Phishing Detection Using Deep Learning: A Review of Current Techniques, Insights2Techinfo, pp.1

Cite As

Bharath G. (2025) Hybrid Machine Learning Approaches for Phishing Email Detection, Insights2Techinfo, pp.1
