By: Rekarius, CCRI, Asia University

Abstract

Imagine receiving an email from a sender claiming to be a specific party, such as a bank. The email instructs you to verify your account through a link it provides, which leads to a website that superficially resembles the bank's website. Without suspicion, you enter your credentials or personal information into the form on that site. The information you entered is collected by the phishing attackers and can be used to access your account and commit various crimes. This is one illustration of how phishing attacks can occur. This study discusses how machine learning approaches are used to detect phishing emails.

Keywords: Phishing Email, Phishing Detection, Machine Learning Algorithms

Introduction

The rapid development of online services and the Internet has regrettably been accompanied by growth in cyber-attacks, with phishing being one of the most common and effective attacks [4]. Imagine receiving an email from a sender claiming to be a specific party, such as a bank. The email instructs you to verify your account through a link it provides, which leads to a website that superficially resembles the bank's website. Without suspicion, you enter your credentials or personal information into the form on that site. The information you entered is collected by the phishing attackers and can be used to access your account and commit various crimes. This is one illustration of how phishing attacks can occur.

Methodology

Paper 1

This paper [1] uses the following methodology.

Data Collection

The project starts from a base dataset consisting of email bodies. Each email carries a class label: 1 for phishing emails and 0 for non-phishing emails.

Approach

Each email body is tokenized word by word, and a "module cloud" is then used to visualize the words that appear in phishing emails. By interpreting the words that occur at different frequencies, the authors manually built a corpus of 100 phishing-related words. From this corpus a new dataset is created in which each phishing word is a feature, and the value of that feature in a given row is the frequency of the word in the corresponding email body. Because this yields many features, feature reduction methods were applied to mitigate the curse of dimensionality and to study their effect. The methods applied to the base dataset, together with cross-validation, were Principal Component Analysis, Forward Feature Elimination, Backward Feature Elimination, Non-negative Matrix Factorization, and Recursive Feature Elimination.
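The feature construction described for Paper 1 can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the five-word PHISHING_WORDS list is a hypothetical stand-in for their 100-word corpus, the two emails are made up, and the function names are chosen for this example only.

```python
import re
from collections import Counter

# Hypothetical stand-in for the manually built 100-word phishing corpus.
PHISHING_WORDS = ["verify", "account", "password", "urgent", "click"]


def tokenize(body):
    """Lowercase the email body and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", body.lower())


def to_feature_row(body, vocabulary=PHISHING_WORDS):
    """Return the frequency of each vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(body))
    return [counts[word] for word in vocabulary]


# Made-up emails with the labelling scheme from the text: 1 = phishing, 0 = not.
emails = [
    ("Urgent: verify your account. Click here to verify your password.", 1),
    ("Lunch at noon? I will bring the slides for the meeting.", 0),
]

rows = [to_feature_row(body) for body, _ in emails]
labels = [label for _, label in emails]

for row, label in zip(rows, labels):
    print(label, row)
```

Each row has one column per corpus word, so a 100-word corpus gives 100 features per email, which is the dimensionality the reduction methods are then meant to shrink.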

Models used in this project

A variety of existing machine learning approaches, namely Logistic Regression, Decision Tree, Support Vector Machine (SVM), Naive Bayes, and K-Nearest Neighbours (KNN), are employed to train models on the dataset. The performance of the various machine learning models is shown in Table 1.

Table 1: Accuracy scores of the models

ML Model	Accuracy
Logistic Regression	0.97125
Decision Tree	0.9515
SVM	0.9667
Naive Bayes	0.9515
KNN	0.9304
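To make the training step more concrete, here is a small sketch of one of the listed models, logistic regression, fitted by plain batch gradient descent on word-frequency rows. It is only an illustration under stated assumptions: the toy rows and labels are made up, the learning rate and epoch count are arbitrary, and the paper itself does not say how its models were implemented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias with batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = sum(w * v for w, v in zip(weights, x)) + bias
            error = sigmoid(z) - y  # derivative of the log loss w.r.t. z
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(weights, bias, x):
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up frequency rows for three phishing words; 1 = phishing, 0 = not.
train_x = [[2, 1, 1], [3, 0, 2], [1, 2, 1], [0, 0, 0], [0, 1, 0], [1, 0, 0]]
train_y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(train_x, train_y)
train_acc = accuracy(train_y, [predict(weights, bias, x) for x in train_x])
print(train_acc)
```

The score printed here is accuracy on the toy training rows only; figures like those in Table 1 would normally be measured on held-out or cross-validated data.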

Paper 2

This paper [2] uses the following methodology.

Data Pre-Processing

This study applied various pre-processing techniques to the emails, including email cleaning, stop-word and rare-word filtering, and tokenization.
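A rough sketch of such a pipeline is shown below. The specific cleaning rules (stripping HTML tags and URLs), the short stop-word list, and the min_count threshold for rare words are all assumptions made for illustration; the paper's exact rules are not given here.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "is", "your", "in", "for"}


def clean(text):
    """Strip HTML tags first, then URLs, then lowercase the text."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    return text.lower()


def tokenize(text):
    return re.findall(r"[a-z]+", text)


def preprocess(emails, min_count=2):
    """Clean and tokenize each email, then drop stop words and rare words."""
    tokenized = [tokenize(clean(email)) for email in emails]
    totals = Counter(tok for toks in tokenized for tok in toks)
    return [
        [t for t in toks if t not in STOP_WORDS and totals[t] >= min_count]
        for toks in tokenized
    ]


emails = [
    "<p>Verify your account at http://example.test/login</p>",
    "Please verify the account details for the meeting",
]
processed = preprocess(emails)
print(processed)
```

Rare words are counted over the whole corpus rather than per email, so a word is kept only if it appears at least min_count times across all emails.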

Graph Construction

After pre-processing the dataset, the next step is to construct a single large graph from the entire corpus, with the words and the emails used as nodes. The edges that connect two word nodes depend on the co-occurrence information between the two words. The edges between a word and an email are constructed using the word's frequency in that email and the word's email frequency, that is, the number of emails in which the word appears.
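One way to picture this construction is the sketch below. It is illustrative only: raw sliding-window co-occurrence counts are assumed for the word-word edges, and a plain TF-IDF product is assumed for the word-email edges. The paper may weight both kinds of edge differently, and the window size and node naming are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def build_graph(docs, window=3):
    """Return a dict of weighted edges for a word-and-email graph.

    docs is a list of token lists, one per email. A node is either a word
    string or an ("email", index) tuple.
    """
    edges = {}

    # Word-word edges: count how often two distinct words share a window.
    for tokens in docs:
        for start in range(max(1, len(tokens) - window + 1)):
            span = sorted(set(tokens[start:start + window]))
            for a, b in combinations(span, 2):
                edges[(a, b)] = edges.get((a, b), 0) + 1

    # Word-email edges: term frequency times inverse email frequency.
    n_docs = len(docs)
    email_freq = Counter(word for tokens in docs for word in set(tokens))
    for i, tokens in enumerate(docs):
        for word, tf in Counter(tokens).items():
            idf = math.log(n_docs / email_freq[word])
            edges[(word, ("email", i))] = tf * idf
    return edges


# Made-up, already pre-processed emails.
docs = [["verify", "account", "now"], ["meeting", "account", "notes"]]
edges = build_graph(docs)

for edge, weight in sorted(edges.items(), key=str):
    print(edge, round(weight, 3))
```

With this weighting, a word that appears in every email (here "account") gets a word-email weight of zero, which reflects that it does not help distinguish one email from another.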

The GCN classifier

In a Graph Convolutional Network (GCN), a single convolutional layer lets each node capture information only from its nearest neighbours. The more convolutional layers are used, the more information from farther neighbours is integrated. The task of gathering information from neighbours proceeds in parallel for every node. With two layers, the same nearest-neighbour gathering step is simply repeated, so each node also receives information that its neighbours collected in the first layer. This model achieves 98.2% accuracy, 98.5% precision, 98.3% recall, and 98.55% F-measure.
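The effect of stacking layers can be seen in a deliberately simplified sketch. It keeps only the neighbour-gathering step and uses a plain mean over each node and its neighbours; the learned weight matrices, the non-linearity, and the exact normalisation of a real GCN layer are left out, and the three-node graph is made up.

```python
def aggregate(adjacency, values):
    """One round of gathering: average each node with its direct neighbours."""
    out = []
    for node, neighbours in enumerate(adjacency):
        group = [node] + neighbours  # include the node itself (self-loop)
        out.append(sum(values[n] for n in group) / len(group))
    return out


# Path graph 0 - 1 - 2; only node 2 starts with a non-zero signal.
adjacency = [[1], [0, 2], [1]]
signal = [0.0, 0.0, 1.0]

one_layer = aggregate(adjacency, signal)      # node 0 sees only nodes 0 and 1
two_layers = aggregate(adjacency, one_layer)  # node 0 now also feels node 2

print(one_layer[0], two_layers[0])
```

After one round node 0 is still at zero because node 2 is two hops away; after the second round it becomes positive, since node 1 has already absorbed part of node 2's signal.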

A comparison with related work in terms of accuracy, precision, recall, and F-measure is shown in Table 2.

Table 2: Comparison of related works

Reference	Technique	Accuracy	Precision	Recall	F-measure
(Lai et al., 2015)	Text classification for phishing detection based on RCNN	96.94%	–	–	–
(Nguyen, Nguyen and Nguyen, 2018a)	Deep learning hierarchical long short-term memory networks (H-LSTMs)	99%	97%	95%	96%
(Halgaš, Agrafiotis and Nurse, 2020)	Deep learning	96.74%	97.45%	95.98%	96.71%
(Peng, Harris and Sawa, 2018a)	Machine learning algorithm and NLP	–	95%	91%	–
(Bergholz et al., 2008)	Machine learning algorithm using semantic features	99.88%	100%	98.93%	99.46%
(Fang et al., 2019)	Deep learning and NLP	99.84%	99.66%	99.00%	99.33%
This paper's model	GCN	98.2%	98.5%	98.3%	98.55%
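For readers less familiar with the four metrics compared in Table 2, the sketch below computes them from a binary confusion matrix. The F-measure is taken here to be the balanced F1 score (the harmonic mean of precision and recall), which is an assumption, and the confusion counts are made up rather than taken from any cited paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 for a binary classifier.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives, with "positive" meaning a phishing email.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure


# Made-up confusion counts for 200 emails.
acc, prec, rec, f1 = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(f"{acc:.3f} {prec:.3f} {rec:.3f} {f1:.3f}")
```

Precision penalises legitimate emails flagged as phishing, while recall penalises phishing emails that slip through, which is why both are reported alongside accuracy.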

Conclusion

According to the two papers examined, the techniques employed were effective in identifying phishing emails. In Paper 1, logistic regression produced the best accuracy (0.97125). In Paper 2, the GCN reached an accuracy of 98.2%.

References

[1] Shaik Mulinti Mustaq Ahammad, Tangudu Raviteja, Jami Koushik, Pamidi Venkata Dinesh, and Asha Ashok. Machine learning approach based phishing email text analysis (ML-PE-TA). In 2022 Third International Conference on Intelligent Computing Instrumentation and Control Technologies (ICICICT), pages 1087–1092, 2022.
[2] Areej Alhogail and Afrah Alsabih. Applying machine learning and natural language processing to detect phishing email. Computers & Security, 110:102414, 2021.
[3] Ammar Almomani, Brij B Gupta, Samer Atawneh, Andrew Meulenberg, and Eman Almomani. A survey of phishing email filtering techniques. IEEE Communications Surveys & Tutorials, 15(4):2070–2090, 2013.
[4] Lukáš Halgaš, Ioannis Agrafiotis, and Jason RC Nurse. Catching the phish: Detecting phishing attacks using recurrent neural networks (RNNs). In International Workshop on Information Security Applications, pages 219–233. Springer, 2019.
[5] D. P. Agrawal, B. B. Gupta, S. Yamaguchi, and K. E. Psannis. Recent advances in mobile cloud computing. Wireless Communications and Mobile Computing, 2018.
[6] S. Goyal, S. Kumar, S. K. Singh, S. Sarin, Priyanshu, B. B. Gupta, …, and F. Colace. Synergistic application of neuro-fuzzy mechanisms in advanced neural networks for real-time stream data flux mitigation. Soft Computing, 28(20):12425–12437, 2024.
[7] R. Panigrahi, N. Bele, P. K. Panigrahi, and B. B. Gupta. Features level sentiment mining in enterprise systems from informal text corpus using machine learning techniques. Enterprise Information Systems, 18(5):2328186, 2024.
[8] I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In Proceedings of the 16th International Conference on World Wide Web, pages 649–656, May 2007.
[9] S. Salloum, T. Gaber, S. Vadera, and K. Shaalan. A systematic literature review on phishing email detection using natural language processing techniques. IEEE Access, 10:65703–65727, 2022.
[10] T. Gangavarapu, C. D. Jaidhar, and B. Chanduka. Applicability of machine learning in spam and phishing email filtering: review and approaches. Artificial Intelligence Review, 53(7):5019–5081, 2020.

Cite As

Rekarius (2025) Phishing Email Text Analysis Using Machine Learning Approaches, Insights2Techinfo, pp.1


