By: Rekarius, CCRI, Asia University
Abstract
Imagine receiving an email from a sender claiming to be a specific party, such as a bank. The email instructs you to verify your account through a link it provides, which leads to a website that superficially resembles the bank's website. Without suspicion, you enter your credentials or personal information into the form on that site. The information you entered is collected by the phishing attackers and can be used to access your account and commit various crimes. This is one illustration of how phishing attacks can occur. This study discusses how machine learning approaches are used to detect phishing emails.
Keywords: Phishing Email, Phishing Detection, Machine Learning Algorithms
Introduction
The rapid development of online services and the Internet has regrettably been accompanied by growth in cyber-attacks, with phishing being one of the most common and effective attacks [4]. Imagine receiving an email from a sender claiming to be a specific party, such as a bank. The email instructs you to verify your account through a link it provides, which leads to a website that superficially resembles the bank's website. Without suspicion, you enter your credentials or personal information into the form on that site. The information you entered is collected by the phishing attackers and can be used to access your account and commit various crimes. This is one illustration of how phishing attacks can occur.

Methodology
Paper 1
This paper [1] uses the following methodology.
Data Collection
The project starts from a base dataset consisting of email bodies. Each email carries a class label: 1 for phishing emails and 0 for non-phishing emails.
Approach
Each email body is tokenized word by word, and a word cloud is used to visualize the words that appear in phishing emails. By interpreting the words that occur with different frequencies, the authors manually compiled a corpus of 100 phishing-related words. From this corpus a new dataset is created in which each phishing word is a feature, and the frequency of that word in an email body is the value of the feature in that email's row. Because the number of features is large, feature extraction and selection methods are used to mitigate the curse of dimensionality, and several reduction methods are compared to study their effect. Principal Component Analysis, Forward Feature Elimination, Backward Feature Elimination, Non-negative Matrix Factorization, and Recursive Feature Elimination were applied to the base dataset, along with cross-validation.
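The feature construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the five-word `PHISHING_WORDS` list stands in for the 100-word corpus, and the two sample emails are made up.

```python
import re
from collections import Counter

# Hypothetical stand-in for the manually built corpus of phishing-related words.
PHISHING_WORDS = ["verify", "account", "password", "urgent", "click"]


def tokenize(body: str) -> list[str]:
    """Lowercase the email body and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", body.lower())


def keyword_features(body: str, vocabulary: list[str] = PHISHING_WORDS) -> list[int]:
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(body))
    return [counts[word] for word in vocabulary]


# Made-up emails with labels: 1 = phishing, 0 = non-phishing.
emails = [
    ("Urgent: verify your account now. Click here to verify.", 1),
    ("Lunch at noon tomorrow?", 0),
]
X = [keyword_features(body) for body, _ in emails]
y = [label for _, label in emails]
```

Each row of `X` is one email and each column one phishing word, so a dimensionality reduction method such as PCA would operate on the columns of this matrix.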
Models used in this project
A variety of established machine learning approaches, namely Logistic Regression, Decision Tree, Support Vector Machine, Naive Bayes, and KNN, are employed to train models on the dataset. The performance of the models is shown in Table 1.
Table 1: Accuracy scores of the models
ML Model | Accuracy |
Logistic Regression | 0.97125 |
Decision Tree | 0.9515 |
SVM | 0.9667 |
Naive Bayes | 0.9515 |
KNN | 0.9304 |
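To make the accuracy figures concrete, the following sketch trains a small logistic regression classifier on keyword-count features and scores it by accuracy. It is an assumption-laden toy: the two-feature dataset, the learning rate, and the plain batch gradient descent are illustrative choices and are not taken from the paper, which does not describe its implementation here.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x) -> int:
    """Label an email as phishing (1) when the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy keyword counts (two hypothetical features per email); 1 = phishing.
X_train = [[2, 1], [1, 2], [3, 0], [0, 0], [1, 0], [0, 1]]
y_train = [1, 1, 1, 0, 0, 0]
w, b = train_logistic_regression(X_train, y_train)

X_test, y_test = [[4, 3], [0, 0]], [1, 0]
test_accuracy = accuracy(y_test, [predict(w, b, x) for x in X_test])
```

Accuracy here is the same quantity reported in Table 1: the share of emails whose predicted label matches the true label.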
Paper 2
This paper [2] uses the following methodology.
Data Pre-Processing
This study applied various pre-processing techniques to the emails, including email cleaning, stop-word and rare-word filtering, and tokenization.
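A rough sketch of these pre-processing steps is shown below. The stop-word list, the cleaning rules (stripping HTML tags and URLs), and the `min_count` threshold for rare words are assumptions made for illustration; the paper's exact rules are not given here.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "to", "your", "is", "and", "of", "in"}


def clean(email: str) -> str:
    """Strip HTML tags and URLs, then lowercase the text."""
    email = re.sub(r"<[^>]+>", " ", email)
    email = re.sub(r"https?://\S+", " ", email)
    return email.lower()


def preprocess(emails: list[str], min_count: int = 2) -> list[list[str]]:
    """Clean and tokenize each email, then drop stop words and rare words."""
    tokenized = [re.findall(r"[a-z]+", clean(e)) for e in emails]
    # Rare words are those seen fewer than min_count times in the whole corpus.
    counts = Counter(tok for doc in tokenized for tok in doc)
    return [
        [t for t in doc if t not in STOP_WORDS and counts[t] >= min_count]
        for doc in tokenized
    ]


docs = preprocess([
    "<p>Verify your account at http://example.test/login</p>",
    "Your account is locked. Verify the account today.",
])
```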
Graph Construction
After pre-processing the dataset, the next step is to construct a single large graph from the entire corpus, with the words and emails used as nodes. The edges that connect two word nodes depend on the co-occurrence information between the two words. The edges between a word and an email are constructed using the word's frequency in that email and the word's email frequency across the corpus.
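The graph construction can be sketched as follows. The exact weighting is an assumption: word-email edges use a plain TF-IDF product with `log(N / df)`, and word-word edges use raw sliding-window co-occurrence counts, whereas graph-based text classifiers often use a PMI score instead. The function name, the `email:<index>` node naming, and the window size are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def build_graph(docs: list[list[str]], window: int = 3) -> dict[tuple[str, str], float]:
    """Return weighted edges keyed by (node, node).

    Email nodes are named "email:<index>"; word nodes are the words themselves.
    """
    edges: dict[tuple[str, str], float] = {}
    n_docs = len(docs)
    # Email frequency: in how many emails each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))

    # Word-email edges: term frequency times inverse email frequency.
    for i, doc in enumerate(docs):
        for word, tf in Counter(doc).items():
            idf = math.log(n_docs / doc_freq[word])
            edges[(word, f"email:{i}")] = tf * idf

    # Word-word edges: number of sliding windows in which both words appear.
    cooc: Counter = Counter()
    for doc in docs:
        for start in range(max(1, len(doc) - window + 1)):
            unique = sorted(set(doc[start:start + window]))
            cooc.update(combinations(unique, 2))
    for pair, count in cooc.items():
        edges[pair] = float(count)
    return edges


docs = [["verify", "account", "now"], ["meeting", "account", "notes"]]
graph = build_graph(docs)
```

With this weighting, a word that appears in every email (here "account") gets a word-email weight of zero, which reflects that it does not help tell emails apart.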
The GCN classifier
In a Graph Convolutional Network (GCN), a single convolutional layer lets each node capture information only from its nearest neighbours. The more convolutional layers are used, the more information from the wider neighbourhood is integrated. Gathering information from neighbours proceeds in parallel for every node. With two layers, the information-gathering step over the nearest nodes is simply repeated, so each node also receives information that its neighbours collected in the first step. The model achieves 98.2% accuracy, 98.5% precision, 98.3% recall, and a 98.55% F-measure.
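The effect of stacking layers can be seen in a small sketch. It is deliberately simplified: it uses row-normalized averaging with self-loops and omits the learned weight matrices and non-linearities of a real GCN layer, so it shows only how far information travels per layer. The three-node path graph is a made-up example.

```python
def normalized_adjacency(adj: list[list[float]]) -> list[list[float]]:
    """Add self-loops and row-normalize so each row averages a node with its neighbours."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]


def propagate(a_hat, features):
    """One parameter-free propagation step: every node averages its neighbourhood."""
    n, d = len(features), len(features[0])
    return [
        [sum(a_hat[i][k] * features[k][j] for k in range(n)) for j in range(d)]
        for i in range(n)
    ]


# Path graph 0 - 1 - 2; only node 2 carries a signal initially.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
a_hat = normalized_adjacency(adj)
h0 = [[0.0], [0.0], [1.0]]
h1 = propagate(a_hat, h0)  # one layer: node 0 receives nothing from node 2
h2 = propagate(a_hat, h1)  # two layers: the signal from node 2 reaches node 0
```

After one step node 0 still holds zero, because node 2 is two hops away; after the second step it holds a positive value passed along by node 1.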
Table 2 compares related works in terms of accuracy, precision, recall, and F-measure.
Table 2: Comparison of related works
Reference | Technique | Accuracy | Precision | Recall | F-measure |
(Lai et al., 2015) | Text classification for phishing detection based on RCNN | 96.94% | – | – | – |
(Nguyen, Nguyen and Nguyen, 2018a) | Deep learning hierarchical long short-term memory networks (H-LSTMs) | 99% | 97% | 95% | 96% |
(Halgaš, Agrafiotis and Nurse, 2020) | Deep learning | 96.74% | 97.45% | 95.98% | 96.71% |
(Peng, Harris and Sawa, 2018a) | Machine learning algorithm and NLP | – | 95% | 91% | – |
(Bergholz et al., 2008) | Machine learning algorithm using semantic features | 99.88% | 100% | 98.93% | 99.46% |
(Fang et al., 2019) | Deep learning and NLP | 99.84% | 99.66% | 99.00% | 99.33% |
This paper’s model | GCN | 98.2% | 98.5% | 98.3% | 98.55% |
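For readers comparing the columns of Table 2, the F-measure is commonly defined as the harmonic mean of precision and recall. The sketch below assumes that standard (F1) definition, which the source does not state explicitly, and applies it to two rows of the table.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall, in percent, taken from two rows of Table 2.
halgas = f_measure(97.45, 95.98)
fang = f_measure(99.66, 99.00)
```

Both values round to the figures in the table (96.71 and 99.33). For the last row, the same formula applied to 98.5% and 98.3% gives about 98.4%, slightly below the reported 98.55%; the source does not state how that figure was computed.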
Conclusion
Both approaches examined here were effective in identifying phishing emails. In Paper 1, Logistic Regression produced the best accuracy (0.97125). In Paper 2, the GCN reached an accuracy of 98.2%.
References
- Shaik Mulinti Mustaq Ahammad, Tangudu Raviteja, Jami Koushik, Pamidi Venkata Dinesh, and Asha Ashok. Machine learning approach based phishing email text analysis (ml-pe-ta). In 2022 Third International Conference on Intelligent Computing Instrumentation and Control Technologies (ICICICT), pages 1087–1092, 2022.
- Areej Alhogail and Afrah Alsabih. Applying machine learning and natural language processing to detect phishing email. Computers & Security, 110:102414, 2021.
- Ammar Almomani, Brij B Gupta, Samer Atawneh, Andrew Meulenberg, and Eman Almomani. A survey of phishing email filtering techniques. IEEE Communications Surveys & Tutorials, 15(4):2070–2090, 2013.
- Lukáš Halgaš, Ioannis Agrafiotis, and Jason RC Nurse. Catching the phish: Detecting phishing attacks using recurrent neural networks (RNNs). In International Workshop on Information Security Applications, pages 219–233. Springer, 2019.
- Agrawal, D. P., Gupta, B. B., Yamaguchi, S., & Psannis, K. E. (2018). Recent Advances in Mobile Cloud Computing. Wireless Communications and Mobile Computing, 2018.
- Goyal, S., Kumar, S., Singh, S. K., Sarin, S., Priyanshu, Gupta, B. B., … & Colace, F. (2024). Synergistic application of neuro-fuzzy mechanisms in advanced neural networks for real-time stream data flux mitigation. Soft Computing, 28(20), 12425-12437.
- Panigrahi, R., Bele, N., Panigrahi, P. K., & Gupta, B. B. (2024). Features level sentiment mining in enterprise systems from informal text corpus using machine learning techniques. Enterprise Information Systems, 18(5), 2328186.
- Fette, I., Sadeh, N., & Tomasic, A. (2007, May). Learning to detect phishing emails. In Proceedings of the 16th international conference on World Wide Web (pp. 649-656).
- Salloum, S., Gaber, T., Vadera, S., & Shaalan, K. (2022). A systematic literature review on phishing email detection using natural language processing techniques. IEEE Access, 10, 65703-65727.
- Gangavarapu, T., Jaidhar, C. D., & Chanduka, B. (2020). Applicability of machine learning in spam and phishing email filtering: review and approaches. Artificial Intelligence Review, 53(7), 5019-5081.
Cite As
Rekarius (2025) Phishing Email Text Analysis Using Machine Learning Approaches, Insights2Techinfo, pp.1