Role of Natural Language Processing (NLP) in Email Phishing Detection

By: Rekarius; Asia University

Abstract

Phishing emails are among the most common attack vectors. Criminals send legitimate-looking email messages while posing as a trusted party, with the aim of stealing the victim's personal data, such as usernames, passwords, and other credentials. To detect phishing emails, NLP is used to process raw email messages that machines cannot interpret directly, through text preprocessing steps such as cleaning, stop-word removal, and tokenization. The preprocessed email data is then transformed into a numerical representation that serves as input to machine learning and deep learning models, which learn the word or sentence patterns that indicate phishing.

Keywords: Email Phishing, Natural Language Processing, Machine Learning, Deep Learning

Introduction

Email, or electronic mail, is a means of exchanging messages that is widely used for formal and professional purposes. Because of this role, malicious actors exploit email by impersonating official entities in order to deceive recipients and obtain personal information. This information is then used in ways that harm the victim, often resulting in financial loss and data loss. This type of crime is known as email phishing, an act of information theft that uses social engineering techniques to deceive victims through email.

To address this issue, relying on people's innate ability to judge whether an email is phishing is insufficient; an automated detection mechanism is needed. Natural Language Processing (NLP) is a subfield of computer science and artificial intelligence (AI) that uses machine learning to enable computers to understand and communicate with human language. NLP enables computers and digital devices to recognize, understand, and generate text and speech by combining computational linguistics, rule-based modeling of human language, along with statistical modeling, machine learning, and deep learning [15]. Therefore, in this article, we discuss the use of NLP in detecting phishing in email.

Background

In this section, we discuss the basic concepts of phishing emails, concepts of NLP, and existing research on the application of NLP for phishing detection.

Basic Concepts of Email Phishing

Phishing is a cybercrime carried out through social engineering with the goal of obtaining victims' personal data without arousing their suspicion. Because email is widely used for formal communication, perpetrators craft messages that appear legitimate and claim to come from entities such as universities, banks, government agencies, and other services. Victims who believe the email is legitimate click the included link without hesitation or follow its instructions, for example by calling a specified number. Once the victim does so, the initial loss occurs. The personal data obtained is then misused by the perpetrator for purposes such as draining the victim's account balance or stealing further data. Phishing email is one of the fastest-growing forms of cybercrime and calls for innovative solutions. Machine learning and deep learning are being widely explored to address it [7] [6]. More recently, Large Language Models (LLMs) have also been explored for this task [1]. LLMs are a category of foundation models trained on very large amounts of data, which makes them capable of understanding and generating natural language and other types of content to perform a wide range of tasks [8].

Basic Concepts of NLP

Natural Language Processing (NLP) is a subfield of computer science and artificial intelligence (AI) that uses machine learning to enable computers to understand and communicate with human language. NLP enables computers and digital devices to recognize, understand, and generate text and speech by combining computational linguistics, rule-based modeling of human language, along with statistical modeling, machine learning, and deep learning [15]. A typical NLP text-processing pipeline includes the following steps: tokenization, which breaks long text into smaller units such as words or phrases; lemmatization or stemming, which reduces inflected words to their root form; part-of-speech (POS) tagging, which identifies the grammatical function of each word, for example recognizing click as a verb; named entity recognition, which identifies entities such as organizations, domains, and email addresses; syntax analysis, which determines the relationships between words, such as subject, predicate, and object; semantic analysis, which interprets the meaning of words in context, for example understanding verify now as a signal of urgency; and sentiment and emotion analysis, which identifies the emotional tone of a text, such as whether it is threatening.

Methodology

In this section, we discuss how NLP is applied in phishing detection mechanisms, covering text preprocessing, feature extraction, semantic and sentiment analysis, contextual understanding, and integration with ML and DL models.

Text Preprocessing

To identify whether an email is phishing, NLP is first applied to data preprocessing. The email body text is cleaned and tokenized, and then stop words and rare words are removed (alhogail2021applying). During cleaning, non-alphanumeric and special characters, such as "?" and "!", are removed. During tokenization, each email text is broken down into individual words, referred to as tokens, which makes it easier for the model to analyze syntactic and semantic relationships within the text. Removing stop words is also important: these words help form a sentence but carry little meaning on their own.
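The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in any of the cited studies: the function name `preprocess` and the tiny `STOP_WORDS` set are assumptions made for the example, and rare-word removal is omitted because it requires corpus-level word counts.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch); real
# pipelines usually rely on a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "and", "are", "be", "been", "has", "have",
              "is", "of", "the", "to", "your"}


def preprocess(email_body: str) -> list[str]:
    """Clean, tokenize, and remove stop words from an email body."""
    text = email_body.lower()
    # Cleaning: replace everything except letters, digits, and whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Tokenization: split the cleaned text into word tokens.
    tokens = text.split()
    # Stop-word removal: drop words that carry little meaning on their own.
    return [tok for tok in tokens if tok not in STOP_WORDS]


tokens = preprocess(
    "Your account has been suspended! Please verify your password immediately."
)
print(tokens)
```

The surviving tokens (account, suspended, verify, password, and so on) are the kind of content words that the later feature extraction stage turns into numbers.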

Feature Extraction

After preprocessing, the email messages enter the feature extraction stage, where the text is converted into a numerical representation that serves as input to a machine learning or deep learning model. Traditional approaches that have been widely explored for this conversion include Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). For example, the studies in [9] and [12] use TF-IDF and BoW to convert email text into numerical features based on word frequencies.

Word embeddings have become widely adopted in existing studies because they capture richer linguistic information than frequency-based approaches. Models such as Word2Vec and GloVe map words to dense vectors in which words with similar meanings lie close to each other. The studies in [5] and [14] use the Gensim Word2Vec model to generate numeric word vectors. Table 1 illustrates how words with similar semantic meanings are represented.
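To make the frequency-based representations concrete, the following sketch computes BoW counts and TF-IDF weights for three toy tokenized emails. It assumes the textbook formulation (term frequency as count divided by document length, inverse document frequency as log(N / df)); libraries typically apply smoothing and normalization, so their numbers will differ. The function names and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return the sorted vocabulary and one raw-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        term_counts = Counter(doc)
        vectors.append([term_counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Weight each count by tf = count / len(doc) and idf = log(N / df)."""
    vocab, count_vectors = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = [sum(1 for row in count_vectors if row[j] > 0)
          for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weighted = []
    for doc, row in zip(docs, count_vectors):
        length = len(doc) or 1  # guard against empty documents
        weighted.append([(c / length) * w for c, w in zip(row, idf)])
    return vocab, weighted


# Toy tokenized emails (hypothetical data).
docs = [
    ["verify", "account", "password"],
    ["meeting", "schedule", "report"],
    ["verify", "account", "meeting"],
]
vocab, counts = bag_of_words(docs)
_, weights = tf_idf(docs)
print(vocab)
print(counts[0])
print([round(w, 3) for w in weights[0]])
```

In the first document, password receives a higher TF-IDF weight than account because it appears in only one of the three documents, which is the intuition behind down-weighting common terms.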

Table 1: Example of Word Embedding Representation

| Word | Context Type | Vector Representation (Example) |
|---|---|---|
| verify | Phishing-related | [0.85, 0.22, 0.44] |
| update | Phishing-related | [0.83, 0.19, 0.40] |
| account | Phishing-related | [0.88, 0.21, 0.46] |
| password | Phishing-related | [0.90, 0.24, 0.42] |
| urgent | Phishing-related | [0.80, 0.25, 0.39] |
| meeting | Legitimate-related | [0.22, 0.82, 0.33] |
| schedule | Legitimate-related | [0.25, 0.85, 0.35] |
| report | Legitimate-related | [0.28, 0.88, 0.37] |

From the table above, we can see that words such as verify, update, account, password, and urgent have a small distance between their vectors, meaning that these words are semantically close. Words such as meeting, schedule, and report are also close to each other but far from the first group, which shows that the semantic difference between the first group of words and the second group is quite large.
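One common way to quantify how close two word vectors are is cosine similarity. The sketch below applies it to four of the illustrative vectors from Table 1; the choice of cosine similarity as the measure is an assumption of this example, since the text only speaks of distance in general terms.

```python
import math

# Illustrative 3-dimensional vectors copied from Table 1; real Word2Vec or
# GloVe vectors usually have tens to hundreds of dimensions.
EMBEDDINGS = {
    "verify": [0.85, 0.22, 0.44],
    "password": [0.90, 0.24, 0.42],
    "meeting": [0.22, 0.82, 0.33],
    "schedule": [0.25, 0.85, 0.35],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


within = cosine_similarity(EMBEDDINGS["verify"], EMBEDDINGS["password"])
across = cosine_similarity(EMBEDDINGS["verify"], EMBEDDINGS["meeting"])
print(f"verify vs password: {within:.3f}")
print(f"verify vs meeting:  {across:.3f}")
```

The similarity within the phishing-related group comes out close to 1, while the similarity across the two groups is noticeably lower, which matches the grouping shown in the table.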

Transformer-based architectures such as BERT and RoBERTa are a more modern approach. These models capture bidirectional contextual relationships within a sentence, which makes them more effective at identifying fraudulent language patterns in email text. Table 2 illustrates how these models work. The raw email is first tokenized into tokens or subwords. The [CLS] token represents the entire sentence for classification, while [SEP] marks the end of the input. In the self-attention stage, the word verify attends to the words account, password, and suspended, which helps the model recognize that the sentence contains a threat.
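The self-attention step can be illustrated with a heavily simplified sketch. It assumes a single attention head and uses the token vectors directly as queries, keys, and values, whereas BERT and RoBERTa use learned projection matrices, multiple heads, and positional information. The three toy vectors are borrowed from Table 1 purely for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention with Q = K = V = embeddings."""
    dim = len(embeddings[0])
    weights = []
    for query in embeddings:
        # Score the query against every token, scaled by sqrt(dimension).
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in embeddings]
        weights.append(softmax(scores))
    # Each contextual vector is the attention-weighted average of all tokens.
    contextual = [
        [sum(w * vec[j] for w, vec in zip(row, embeddings))
         for j in range(dim)]
        for row in weights
    ]
    return weights, contextual


tokens = ["verify", "account", "meeting"]
vectors = [[0.85, 0.22, 0.44], [0.88, 0.21, 0.46], [0.22, 0.82, 0.33]]
weights, contextual = self_attention(vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

In this toy setup, verify assigns more attention weight to account than to meeting, which mirrors the behavior described above in miniature.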

Integration with ML and DL Models

After the raw email text has been preprocessed and transformed into meaningful numerical representations through NLP techniques, these features are fed into machine learning and deep learning models, which learn the word or sentence patterns that indicate potential phishing in email text.

Machine learning algorithms that have been widely explored for phishing email detection include Random Forest [2, 13], Naive Bayes [2], Support Vector Machine (SVM) [11], and Decision Tree (DT) [10]. Deep learning algorithms widely explored for the same task include Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Bidirectional GRU (BiGRU) [4], as well as Convolutional Neural Network (CNN) [3].
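As a small illustration of how such a model consumes token features, the sketch below implements a multinomial Naive Bayes classifier with add-one smoothing from scratch. It is not the implementation used in the cited studies: the class name `NaiveBayes` and the four training emails are hypothetical, and the labels follow the convention 1 = phishing, 0 = legitimate.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.vocab = {tok for doc in docs for tok in doc}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Start from the log prior of the class.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in doc:
                if tok not in self.vocab:
                    continue  # ignore words never seen during training
                count = self.word_counts[label][tok]
                score += math.log(
                    (count + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical tokenized training emails: 1 = phishing, 0 = legitimate.
train_docs = [
    ["verify", "account", "password", "urgent"],
    ["update", "password", "account"],
    ["meeting", "schedule", "report"],
    ["report", "meeting", "agenda"],
]
train_labels = [1, 1, 0, 0]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["urgent", "verify", "password"]))
print(model.predict(["meeting", "report"]))
```

With only four training emails this is a toy; in practice the same idea is applied to thousands of labeled emails, usually through an established machine learning library.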

Table 2: Illustration of Transformer-based Feature Extraction in Phishing Email Detection

| Step | Description | Example / Representation |
|---|---|---|
| Input Email | Raw email text to analyze | "Your account has been suspended. Please verify your password immediately." |
| Tokenization | Split text into tokens using BERT tokenizer | [CLS], Your, account, has, been, suspended, ., Please, verify, your, password, immediately, ., [SEP] |
| Self-Attention | Each token attends to all other tokens in the sentence | verify → account, password, suspended; suspended → account; password → verify, account |
| Contextual Embedding | Token embeddings capture semantic meaning and context | [CLS] = overall sentence embedding; verify = embedding capturing urgency and a credential request; suspended = embedding capturing threat |
| Classification | Final decision based on [CLS] embedding | Phishing Email (1) / Legitimate Email (0) |

Conclusion

Based on the literature review, NLP plays a very important role in phishing email detection: it preprocesses raw data that machines cannot interpret directly and converts it into a numeric vector representation that can be used to train machine learning or deep learning models to recognize phishing patterns in email text.

References

  1. Aya Salama Abdelhady and Mohamed Amr Shaker. Comprehensive email spam detection using large language models: Exploring the limits of LLM-based and traditional methods. In 2025 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA), pages 1–6. IEEE, 2025.
  2. Rifat Ahamed Fahim, Md. Shohel Arman, Ishrat Sultana, Nusrat Tasnim, Kazi Rifat Ahmed, and Imran Mahmud. Phishguard: Leveraging NLP and machine learning for email phishing detection. In 2024 International Conference on Big Data Analytics in Bioinformatics (DABCon), pages 01–06, 2024.
  3. Reem Alotaibi, Isra Al-Turaiki, and Fatimah Alakeel. Mitigating email phishing attacks using convolutional neural networks. In 2020 3rd International Conference on Computer Applications & Information Security (ICCAIS), pages 1–6. IEEE, 2020.
  4. Eduardo Benavides-Astudillo, Walter Fuertes, Sandra Sanchez-Gordon, Daniel Nuñez-Agurto, and Germán Rodríguez-Galán. A phishing-attack-detection model using natural language processing and deep learning. Applied Sciences, 13(9):5275, 2023.
  5. Esteban Castillo, Sreekar Dhaduvai, Peng Liu, Kartik-Singh Thakur, Adam Dalton, and Tomek Strzalkowski. Email threat detection using distinct neural network approaches. In Archna Bhatia and Samira Shaikh, editors, Proceedings for the First International Workshop on Social Threats in Online Conversations: Understanding and Management, pages 48–55, Marseille, France, May 2020. European Language Resources Association.
  6. Edafe Maxwell Damatie, Amna Eleyan, and Tarek Bejaoui. Real-time email phishing detection using a custom DistilBERT model. In 2024 International Symposium on Networks, Computers and Communications (ISNCC), pages 1–6. IEEE, 2024.
  7. Hajar Fares, Nouhayla Mouakkal, Youssef Baddi, and Nirmin Hajraoui. Robust email phishing detection using machine learning and deep learning approach. International Journal of Communication Networks and Information Security, 16(3):91–108, 2024.
  8. IBM. Apa itu LLM? [What is an LLM?], 2024. Accessed on 16 October 2025.
  9. Pallavi Jain, Shivang Singh, and Chaitanya Kumar Saxena. Detecting email spam with NLP: A machine learning approach. In 2024 IEEE International Conference on Computing, Power and Communication Technologies (IC2PCT), volume 5, pages 393–398, 2024.
  10. Akash Junnarkar, Siddhant Adhikari, Jainam Fagania, Priya Chimurkar, and Deepak Karia. E-mail spam classification via machine learning and natural language processing. In 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), pages 693–699, 2021.
  11. A. Kumar, J. M. Chatterjee, and V. G. Díaz. A novel hybrid approach of SVM combined with NLP and probabilistic neural network for email phishing. International Journal of Electrical and Computer Engineering (IJECE), 10(1):486–493, Feb 2020.
  12. Somesha M. and Alwyn R. Pais. Classification of phishing email using word embedding and machine learning techniques. Journal of Cyber Security and Mobility, 11(3):279–320, 2022.
  13. Radhika Rajoju, Vangala Sathvika, Guthikonda Narayana Sai Smaran, Chettipelli Tejashwini, and Gangasani Aditya Reddy. Text phishing detection system using random forest algorithm. In 2024 3rd International Conference on Applied Artificial Intelligence and Computing (ICAAIC), pages 1332–1339. IEEE, 2024.
  14. Nafiz Rifat, Mostofa Ahsan, Md. Chowdhury, and Rahul Gomes. BERT against social engineering attack: Phishing text detection. In 2022 IEEE International Conference on Electro Information Technology (eIT), pages 1–6, 2022.
  15. Cole Stryker and Jim Holdsworth. Apa itu NLP? [What is NLP?], 2025. IBM Think. Accessed on 15 October 2025.
  16. Agrawal, D. P., Gupta, B. B., Yamaguchi, S., & Psannis, K. E. (2018). Recent Advances in Mobile Cloud Computing. Wireless Communications and Mobile Computing, 2018.
  17. Goyal, S., Kumar, S., Singh, S. K., Sarin, S., Priyanshu, Gupta, B. B., … & Colace, F. (2024). Synergistic application of neuro-fuzzy mechanisms in advanced neural networks for real-time stream data flux mitigation. Soft Computing, 28(20), 12425-12437.
  18. Panigrahi, R., Bele, N., Panigrahi, P. K., & Gupta, B. B. (2024). Features level sentiment mining in enterprise systems from informal text corpus using machine learning techniques. Enterprise Information Systems, 18(5), 2328186.

Cite As

Rekarius (2025) Role of Natural Language Processing (NLP) in Email Phishing Detection, Insights2Techinfo, pp.1
