Phishing Email Text Analysis Using Machine Learning Approaches

By: Rekarius, CCRI, Asia University

Abstract

Imagine receiving an email from a sender claiming to be a trusted party, such as a bank. The email instructs you to verify your account through a link they provide, which leads to a website that superficially resembles the bank's website. Without suspicion, you enter your credentials or personal information into the form. That information is collected by the phishing attackers and can be used to access your account and commit various crimes. This is one illustration of how phishing attacks can occur. This study discusses how machine learning approaches are used to detect phishing emails.

Keywords: Phishing Email, Phishing Detection, Machine Learning Algorithms

Introduction

The rapid development of online services and the Internet has regrettably been accompanied by growth in cyber-attacks, with phishing being one of the most common and effective attacks [4]. Imagine receiving an email from a sender claiming to be a trusted party, such as a bank. The email instructs you to verify your account through a link they provide, which leads to a website that superficially resembles the bank's website. Without suspicion, you enter your credentials or personal information into the form. That information is collected by the phishing attackers and can be used to access your account and commit various crimes. This is one illustration of how phishing attacks can occur.

Figure 1: Life cycle of phishing [3]

Methodology

Paper 1

This paper [1] uses the following methodology.

Data Collection

The project starts from a base dataset consisting of email bodies. Each email carries a class label: 1 for phishing emails and 0 for non-phishing emails.

Approach

Each email body is tokenized word by word, and a "module cloud" (word cloud) is used to visualize the words that appear in phishing emails. By inspecting the words at different frequencies, the authors manually built a corpus of 100 phishing-related words. From this corpus a new dataset is created in which each phishing word is a feature, and the frequency of that word in an email body is the value of the feature in that email's row.

Because the number of features is large, feature reduction methods were applied to mitigate the curse of dimensionality, and several were compared to study their effect. Principal Component Analysis, Forward Feature Elimination, Backward Feature Elimination, Non-negative Matrix Factorization, and Recursive Feature Elimination were each applied to the base dataset, along with cross-validation.
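The feature construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the five-word `PHISHING_WORDS` list stands in for the 100-word corpus, and the helper names `tokenize` and `to_feature_row` and the two sample emails are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-in for the manually built 100-word phishing corpus.
PHISHING_WORDS = ["verify", "account", "password", "bank", "urgent"]

def tokenize(body):
    """Lowercase an email body and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", body.lower())

def to_feature_row(body, vocabulary=PHISHING_WORDS):
    """Return one dataset row: the frequency of each corpus word in the body."""
    counts = Counter(tokenize(body))
    return [counts[word] for word in vocabulary]

# Toy labelled emails (1 = phishing, 0 = non-phishing), invented for illustration.
emails = [
    ("Urgent: verify your bank account now to keep your account open", 1),
    ("Lunch at noon? The usual place near the office", 0),
]

X = [to_feature_row(body) for body, _ in emails]
y = [label for _, label in emails]

for row, label in zip(X, y):
    print(label, row)
```

Each row has one column per corpus word, so a 100-word corpus yields 100 features per email, which is what motivates the feature reduction step.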

Models used in this project

Several established machine learning models, namely Logistic Regression, Decision Tree, Support Vector Machine, Naive Bayes, and KNN, were trained on the dataset. The accuracy of each model is shown in Table 1.

Table 1: Accuracy scores of the models

| ML Model | Accuracy |
| --- | --- |
| Logistic Regression | 0.97125 |
| Decision Tree | 0.9515 |
| SVM | 0.9667 |
| Naive Bayes | 0.9515 |
| KNN | 0.9304 |
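As a rough illustration of how one of the models listed above is trained and scored, the sketch below fits a logistic regression classifier on word-frequency features and reports its accuracy. It is written from scratch with plain stochastic gradient descent so that it runs without external libraries; the paper does not state its implementation, and the toy data, learning rate, and epoch count are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for row, label in zip(X, y):
            pred = sigmoid(sum(w * v for w, v in zip(weights, row)) + bias)
            error = pred - label  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * v for w, v in zip(weights, row)]
            bias -= lr * error
    return weights, bias

def predict(weights, bias, row):
    """Classify as phishing (1) when the predicted probability is at least 0.5."""
    score = sigmoid(sum(w * v for w, v in zip(weights, row)) + bias)
    return 1 if score >= 0.5 else 0

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data: each column is the count of one hypothetical phishing word.
X_train = [[3, 2], [2, 1], [0, 0], [0, 1], [4, 0], [0, 0]]
y_train = [1, 1, 0, 0, 1, 0]

weights, bias = train_logistic_regression(X_train, y_train)
train_predictions = [predict(weights, bias, row) for row in X_train]
print("training accuracy:", accuracy(y_train, train_predictions))
```

In practice accuracy would be measured on held-out emails, for example through the cross-validation mentioned earlier, rather than on the training set used here.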

Paper 2

This paper [2] uses the following methodology.

Data Pre-Processing

This study applied several pre-processing techniques to the emails, including email cleaning, stop-word and rare-word filtering, and tokenization.
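A minimal sketch of these pre-processing steps is shown below. The cleaning rules, the tiny `STOP_WORDS` set, and the `min_count` threshold for rare words are all assumptions made for illustration; the paper does not specify them.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "to", "your", "is", "and", "of", "at"}

def clean(text):
    """Lowercase the text and strip HTML tags, URLs, and non-letter characters."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    return re.sub(r"[^a-z\s]", " ", text)       # digits and punctuation

def preprocess(emails, min_count=2):
    """Tokenize every email, then drop stop words and corpus-wide rare words."""
    tokenized = [clean(email).split() for email in emails]
    counts = Counter(tok for toks in tokenized for tok in toks)
    return [
        [t for t in toks if t not in STOP_WORDS and counts[t] >= min_count]
        for toks in tokenized
    ]

emails = [
    "<p>Verify your account at http://example.test/login</p>",
    "Your account is locked. Verify the account today!",
]
tokens = preprocess(emails)
print(tokens)
```

Rare-word filtering needs counts over the whole corpus, which is why tokenization happens for all emails before any token is dropped.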

Graph Construction

After pre-processing the dataset, the next step is to construct a single large graph from the entire corpus, with both words and emails as nodes. An edge between two word nodes depends on the co-occurrence information for those two words. An edge between a word and an email is built from the word's frequency in that email and the number of emails in which the word appears.
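The graph construction can be sketched as follows. The exact edge weights are assumptions: word-to-word edges here use raw counts of emails in which both words occur (the original may use a normalized co-occurrence measure), and word-to-email edges use a TF-IDF-style weight built from word frequency and email frequency. The function name `build_graph` and the `email_<i>` node labels are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def build_graph(tokenized_emails):
    """Return (nodes, edges) for a single graph over the whole corpus."""
    n_emails = len(tokenized_emails)
    vocab = sorted({t for toks in tokenized_emails for t in toks})
    # Number of emails each word appears in (its email frequency).
    email_freq = Counter(t for toks in tokenized_emails for t in set(toks))

    edges = {}

    # Word-word edges: count the emails in which both words occur.
    cooccurrence = Counter()
    for toks in tokenized_emails:
        for a, b in combinations(sorted(set(toks)), 2):
            cooccurrence[(a, b)] += 1
    for pair, count in cooccurrence.items():
        edges[pair] = float(count)

    # Word-email edges: word frequency scaled by inverse email frequency.
    for i, toks in enumerate(tokenized_emails):
        for word, freq in Counter(toks).items():
            idf = math.log(n_emails / email_freq[word]) + 1.0
            edges[(word, f"email_{i}")] = freq * idf

    nodes = vocab + [f"email_{i}" for i in range(n_emails)]
    return nodes, edges

corpus = [
    ["verify", "account"],
    ["account", "verify", "account"],
    ["lunch", "noon"],
]
nodes, edges = build_graph(corpus)
for edge, weight in sorted(edges.items()):
    print(edge, round(weight, 3))
```

Because emails are nodes in the same graph as words, classifying an email becomes a node classification problem.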

The GCN classifier

In a Graph Convolutional Network (GCN), a single convolutional layer lets each node capture information only from its nearest neighbours. The more convolutional layers are used, the more neighbourhood information is integrated. Gathering information from neighbours proceeds in parallel for every node. With two layers, the nearest-neighbour gathering step is repeated, so each node also receives information from its neighbours' neighbours. The model achieves 98.2% accuracy, 98.5% precision, 98.3% recall, and a 98.55% f-measure.
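The effect of stacking layers can be seen in a small sketch of the propagation step alone. It assumes the commonly used symmetric normalization of the adjacency matrix with self-loops, and it leaves out the trainable weight matrices and nonlinearities of a real GCN, so it illustrates only how far information travels, not the paper's classifier. The four-node path graph is invented for the example.

```python
import math

def normalized_adjacency(adj):
    """Add self-loops and apply symmetric degree normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]

def propagate(a_norm, features):
    """One gathering step: every node sums the weighted features of its neighbours."""
    n, d = len(a_norm), len(features[0])
    return [[sum(a_norm[i][k] * features[k][c] for k in range(n))
             for c in range(d)]
            for i in range(n)]

# Path graph 0 - 1 - 2 - 3, with a signal placed only on node 3.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
features = [[0.0], [0.0], [0.0], [1.0]]

a_norm = normalized_adjacency(adj)
one_layer = propagate(a_norm, features)
two_layers = propagate(a_norm, one_layer)

print("node 1 after one layer: ", one_layer[1][0])
print("node 1 after two layers:", two_layers[1][0])
```

After one step node 1 has received nothing from node 3, because they are two hops apart; after the second step the signal arrives, while node 0, three hops away, still sees none of it.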

Table 2 compares related works in terms of accuracy, precision, recall, and f-measure.

Table 2: Comparison of related works

| Reference | Technique | Accuracy | Precision | Recall | F-measure |
| --- | --- | --- | --- | --- | --- |
| (Lai et al., 2015) | Text classification for phishing detection based on RCNN | 96.94% | - | - | - |
| (Nguyen, Nguyen and Nguyen, 2018a) | Deep learning hierarchical long short-term memory networks (H-LSTMs) | 99% | 97% | 95% | 96% |
| (Halgaš, Agrafiotis and Nurse, 2020) | Deep learning | 96.74% | 97.45% | 95.98% | 96.71% |
| (Peng, Harris and Sawa, 2018a) | Machine learning algorithm and NLP | - | 95% | 91% | - |
| (Bergholz et al., 2008) | Machine learning algorithm using semantic features | 99.88% | 100% | 98.93% | 99.46% |
| (Fang et al., 2019) | Deep learning and NLP | 99.84% | 99.66% | 99.00% | 99.33% |
| This paper's model | GCN | 98.2% | 98.5% | 98.3% | 98.55% |
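The four metrics reported in Table 2 are all derived from the counts of correct and incorrect predictions. The sketch below uses the standard definitions, with phishing as the positive class and the f-measure taken as the harmonic mean of precision and recall; the sample labels are invented, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and f-measure for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

# Invented labels: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision penalizes legitimate emails flagged as phishing, while recall penalizes phishing emails that slip through, which is why works that report only accuracy are harder to compare.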

Conclusion

Both approaches examined here were effective at identifying phishing emails. In Paper 1, logistic regression produced the best accuracy (0.97125). In Paper 2, the GCN reached an accuracy of 98.2%.

References

  1. Shaik Mulinti Mustaq Ahammad, Tangudu Raviteja, Jami Koushik, Pamidi Venkata Dinesh, and Asha Ashok. Machine learning approach based phishing email text analysis (ml-pe-ta). In 2022 Third International Conference on Intelligent Computing Instrumentation and Control Technologies (ICICICT), pages 1087–1092, 2022.
  2. Areej Alhogail and Afrah Alsabih. Applying machine learning and natural language processing to detect phishing email. Computers & Security, 110:102414, 2021.
  3. Ammar Almomani, Brij B Gupta, Samer Atawneh, Andrew Meulenberg, and Eman Almomani. A survey of phishing email filtering techniques. IEEE Communications Surveys & Tutorials, 15(4):2070–2090, 2013.
  4. Lukáš Halgaš, Ioannis Agrafiotis, and Jason RC Nurse. Catching the phish: Detecting phishing attacks using recurrent neural networks (RNNs). In International Workshop on Information Security Applications, pages 219–233. Springer, 2019.
  5. Agrawal, D. P., Gupta, B. B., Yamaguchi, S., & Psannis, K. E. (2018). Recent Advances in Mobile Cloud Computing. Wireless Communications and Mobile Computing, 2018.
  6. Goyal, S., Kumar, S., Singh, S. K., Sarin, S., Priyanshu, Gupta, B. B., … & Colace, F. (2024). Synergistic application of neuro-fuzzy mechanisms in advanced neural networks for real-time stream data flux mitigation. Soft Computing, 28(20), 12425-12437.
  7. Panigrahi, R., Bele, N., Panigrahi, P. K., & Gupta, B. B. (2024). Features level sentiment mining in enterprise systems from informal text corpus using machine learning techniques. Enterprise Information Systems, 18(5), 2328186.
  8. Fette, I., Sadeh, N., & Tomasic, A. (2007, May). Learning to detect phishing emails. In Proceedings of the 16th international conference on World Wide Web (pp. 649-656).
  9. Salloum, S., Gaber, T., Vadera, S., & Shaalan, K. (2022). A systematic literature review on phishing email detection using natural language processing techniques. IEEE Access, 10, 65703-65727.
  10. Gangavarapu, T., Jaidhar, C. D., & Chanduka, B. (2020). Applicability of machine learning in spam and phishing email filtering: review and approaches. Artificial Intelligence Review, 53(7), 5019-5081.

Cite As

Rekarius (2025)  Phishing Email Text Analysis Using Machine Learning Approaches, Insights2Techinfo, pp.1
