Deep Learning-Based Phishing Detection

By: Rekarius, CCRI, Asia University, Taiwan

Abstract

Shopping, banking, and chatting are all now served over the internet. The growth of e-services increases attackers’ opportunities to compromise the security of internet users, with phishing being the most prevalent threat. This study discusses deep learning techniques that have been proposed for phishing detection.

Keywords: Phishing Detection, Deep Learning-based, Artificial Neural Network, Convolutional Neural Networks, Recurrent Neural Network, Bidirectional Recurrent Neural Network, Attention Networks, Long Short-Term Memory, LSTM-CNN

Introduction

Sectors with intense financial activity, such as shopping and banking, are now served over the internet [3]. Life has become faster and more accessible because of the evolution of communication technologies and digitalization [1]. The growth of e-services increases attackers’ opportunities to compromise the security of internet users, with phishing being the most prevalent threat. Phishing takes several forms: it can be delivered via electronic mail (e-mail), SMS (Short Message Service), or URL (Uniform Resource Locator), to name a few.

Phishing can be defined as an attempt to steal valuable information, such as user IDs, passwords, and debit or credit card details, for harmful purposes, with the attacker disguised as a genuine organization [2]. It can also be defined as a type of cyberattack that uses electronic communication channels such as SMS, e-mail, and phone calls to convey socially engineered messages that lead people to hand over their credentials, credit card numbers, passwords, and similar details for the attacker’s benefit. Such activities persuade an ordinary website user to enter personal details on a fraudulent website that acts as a hidden passage between the user and the attacker. Most phishing attacks rely on e-mails and websites designed to look exactly like those of genuine organizations in order to prompt users into disclosing financial or personal information, which the attacker can then use for personal gain.

Volkamer et al. [4] emphasize that computer users may fall for phishing for the following reasons:

• Limited knowledge of URLs and their structure,

• Not knowing which web pages are trustworthy,

• Hidden URLs and redirections, which mean that the original website address is hidden from users and not displayed in the message,

• Workload, which leads users to skip checking web pages and URLs or to open them accidentally,

• Lack of awareness of phishing, so that users cannot distinguish legitimate web pages from phishing ones.

This study reviews deep learning-based methods that have been proposed for phishing detection.

Method

Method 1

The study in [3] collected its dataset from internet sources.

    1. Dataset Collection

Phishing URLs were collected from “www.phishtank.com” and legitimate URLs from “commoncrawl.org”. The resulting dataset was divided into three parts to allow comparable performance measurement: 70% for training, 20% for validation, and 10% for testing.
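A 70/20/10 split of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `split_dataset` helper, the fixed seed, and the placeholder URLs are hypothetical and are not taken from [3].

```python
import random

def split_dataset(samples, train_frac=0.7, val_frac=0.2, seed=42):
    """Shuffle the samples and split them into train/validation/test parts.

    The test part receives whatever remains after train and validation.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # fixed seed for a repeatable split
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Hypothetical labelled URLs: (url, label) with 1 = phishing, 0 = legitimate.
urls = [(f"http://example.com/page{i}", i % 2) for i in range(100)]
train, val, test = split_dataset(urls)
print(len(train), len(val), len(test))
```

Shuffling before slicing matters here because the phishing and legitimate URLs come from two different sources and would otherwise end up in separate parts.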

    2. Vectorization

Attackers employ various tactics in phishing attacks, some of which make detection difficult. In particular, they rely on subtle alterations that are hard for people to notice and that also complicate word-based analysis: a word-based model would have to be trained on a corpus containing a large volume of nonsensical words. Word-based analysis also introduces language dependence. For these reasons, the authors of [3] used a character-based embedding approach for vectorization.
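As a rough illustration of character-based vectorization, the sketch below maps each character of a URL to an integer index and pads the result to a fixed length, which is the usual input format for an embedding layer. The alphabet, the reserved indices, and the `encode_url` helper are assumptions made for this example, not details reported in [3].

```python
import string

# Assumed alphabet: lowercase letters, digits, and punctuation common in URLs.
ALPHABET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
# Index 0 is reserved for padding and index 1 for unknown characters.
PAD, UNKNOWN = 0, 1
CHAR_TO_INDEX = {ch: i + 2 for i, ch in enumerate(ALPHABET)}

def encode_url(url, max_len=200):
    """Map a URL to a fixed-length list of integer character indices."""
    indices = [CHAR_TO_INDEX.get(ch, UNKNOWN) for ch in url.lower()[:max_len]]
    return indices + [PAD] * (max_len - len(indices))

encoded = encode_url("http://paypa1-login.example.com", max_len=40)
print(encoded)
```

Because every character gets its own index, a look-alike such as "paypa1" is encoded without needing a dictionary entry for the misspelled word.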

    3. Algorithms

An Artificial Neural Network (ANN) is characterized by acyclic connections between its units. In this approach, the architecture consists mainly of dense layers. Each node computes a linear combination of its inputs, and the parameters are then updated based on the output of the loss function.
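The two steps described above, a linear calculation in each node followed by a parameter update driven by the loss, can be shown on a single unit. The sketch below assumes a sigmoid activation, binary cross-entropy loss, and plain gradient descent; these choices and the feature values are illustrative assumptions rather than the configuration used in [3].

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(weights, bias, features, label, learning_rate=0.1):
    """One gradient-descent update of a single sigmoid unit on one sample."""
    # Linear calculation performed by the node.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    prediction = sigmoid(z)
    # Binary cross-entropy loss for this sample.
    loss = -(label * math.log(prediction) + (1 - label) * math.log(1 - prediction))
    # Gradient of the loss with respect to z.
    error = prediction - label
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias, loss

weights, bias = [0.0, 0.0, 0.0], 0.0
features, label = [1.0, 0.5, -0.2], 1   # hypothetical feature vector, 1 = phishing
losses = []
for _ in range(20):
    weights, bias, loss = train_step(weights, bias, features, label)
    losses.append(loss)
print(losses[0], losses[-1])
```

A dense layer repeats this calculation for many units at once, and a full network chains several such layers.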

Convolutional Neural Networks (CNNs) belong to the family of deep learning algorithms and are designed to require minimal preprocessing. They are a form of multilayer perceptron whose convolution operations can be pictured as sliding-window functions applied to a matrix. In word-based approaches, the window moves across the words of the input. Each convolutional filter responds when it detects a specific pattern, such as “I hate” or “very good”, so a CNN can identify the pattern regardless of its position in the sentence. Character-based approaches apply the same idea to recognize malicious patterns in the URLs used in phishing attacks.
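The sliding-window idea can be illustrated with a one-dimensional convolution over one-hot encoded characters. In the sketch below the kernel is written by hand so that it responds to the substring "login"; in a trained CNN the kernel values are learned, and the alphabet and example URLs here are hypothetical.

```python
def one_hot(ch, alphabet):
    return [1.0 if ch == a else 0.0 for a in alphabet]

def conv1d(sequence, kernel):
    """Slide the kernel over the sequence (no padding, stride 1)."""
    width = len(kernel)
    responses = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Dot product between the window and the kernel.
        responses.append(sum(
            x * w
            for row, kernel_row in zip(window, kernel)
            for x, w in zip(row, kernel_row)
        ))
    return responses

alphabet = "abcdefghijklmnopqrstuvwxyz0123456789./:-"
# Hand-crafted kernel that responds most strongly to the substring "login".
kernel = [one_hot(ch, alphabet) for ch in "login"]

def max_response(url):
    """Max-over-time pooling: keep only the strongest response in the URL."""
    sequence = [one_hot(ch, alphabet) for ch in url]
    return max(conv1d(sequence, kernel))

print(max_response("login.example.com"))
print(max_response("example.com/secure/login"))
print(max_response("example.com/about"))
```

The first two URLs give the same maximum response even though "login" appears at different positions, which is the position independence described above.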

A Recurrent Neural Network (RNN) is another type of deep learning model, in which the connections between nodes form a directed graph along a sequence.

The Bidirectional Recurrent Neural Network (BRNN) architecture splits the neurons of a regular RNN into two directions: one for the positive (forward) direction and one for the negative (backward) direction. Traditional RNNs have a unidirectional information flow, meaning that they only consider information from preceding words when processing a word; a BRNN also takes the following words into account.

In contrast, Attention Networks (ATTs) prioritize comprehending the entire context rather than focusing on isolated components. “Attention” here refers to the conscious directing of the mind toward a specific object or aspect.
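One common way to realize this idea is to score every position of the sequence against a query vector, turn the scores into weights with a softmax, and form a weighted sum. The sketch below uses scaled dot-product scoring as an assumption; the exact attention variant used in [3] is not described here, and the state and query values are hypothetical.

```python
import math

def softmax(scores):
    shift = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(states, query):
    """Weight each state by its similarity to the query and return the weighted sum."""
    scale = math.sqrt(len(query))
    scores = [sum(s * q for s, q in zip(state, query)) / scale for state in states]
    weights = softmax(scores)
    context = [sum(w * state[d] for w, state in zip(weights, states))
               for d in range(len(query))]
    return weights, context

# Hypothetical 2-dimensional hidden states for four positions of a URL.
states = [[0.1, 0.0], [0.9, 0.2], [0.0, 0.1], [0.2, 0.8]]
query = [1.0, 0.0]   # hypothetical query vector; learned in a real model
weights, context = attention_pool(states, query)
print(weights, context)
```

Every position contributes to the context vector, but positions that match the query more closely receive larger weights.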

Method 2

The paper in [1] proposed a system for detecting phishing URLs. The framework of the proposed model, which consists of four stages (data preprocessing, feature extraction, model training, and evaluation), is shown in Figure 1.

Figure 1: Proposed system flowchart

    1. Dataset Preparation and Preprocessing

The authors used appropriate and consistent data so that the system’s training would be robust. The dataset of URL features contained 20,000 records with 80 features; because this is a large number of features, the SelectKBest method was used to keep the 30 best features. In the data preprocessing stage, the dataset was checked for null values and each feature was scaled to a given range using the MinMaxScaler method. The obtained dataset after preprocessing was individually taken into account during
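The null check and the min-max scaling step can be sketched without any library, as below. The `count_missing` and `min_max_scale` helpers and the three example features are hypothetical; scikit-learn's MinMaxScaler performs the same linear rescaling (to the range 0 to 1 by default), and the SelectKBest feature selection is not reproduced in this sketch.

```python
def count_missing(rows):
    """Count missing (None) entries per column."""
    return [sum(1 for row in rows if row[col] is None) for col in range(len(rows[0]))]

def min_max_scale(rows, low=0.0, high=1.0):
    """Rescale every column linearly so that its values span [low, high]."""
    scaled_columns = []
    for column in zip(*rows):
        col_min, col_max = min(column), max(column)
        span = col_max - col_min
        scaled_columns.append([
            # A constant column has no spread, so map it to the lower bound.
            low if span == 0 else low + (value - col_min) / span * (high - low)
            for value in column
        ])
    return [list(row) for row in zip(*scaled_columns)]

# Hypothetical feature rows: [url_length, number_of_dots, number_of_digits]
rows = [[54, 3, 0], [120, 7, 12], [23, 1, 2], [87, 4, 5]]
missing = count_missing(rows)
scaled = min_max_scale(rows)
print(missing)
print(scaled)
```

Scaling matters for features such as URL length, whose raw values are much larger than counts of dots or digits and would otherwise dominate the training signal.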
    2. Training and Testing

The dataset was divided into 80% for training and 20% for testing.
    3. Deep Learning Approaches

Long short-term memory (LSTM) is a type of recurrent neural network (RNN) in which each neuron is replaced by a memory cell that maintains an internal state and uses gates to control what is stored, updated, and output. The workflow of LSTM for classifying a URL starts after loading, preprocessing, and splitting the dataset. The LSTM model starts with the input layer, which uses a 79-length vector, followed by the LSTM layer, which includes 128 neurons and acts as the model’s memory. After the LSTM layer, a dense output layer with a sigmoid function provides the labels.
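To make the memory cell concrete, the sketch below implements one LSTM time step with a scalar input and a scalar state, followed by a sigmoid output unit. It is a deliberately tiny illustration: the model in [1] uses a 79-length input and 128 LSTM units, and all weights and input values below are hypothetical rather than trained.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of an LSTM cell with scalar input and scalar state."""
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    forget = gate("forget", sigmoid)          # how much of the old cell state to keep
    write = gate("input", sigmoid)            # how much of the candidate to store
    candidate = gate("candidate", math.tanh)  # proposed new content
    output = gate("output", sigmoid)          # how much of the cell state to expose
    c = forget * c_prev + write * candidate
    h = output * math.tanh(c)
    return h, c

# Hypothetical weights (w_x, w_h, bias) per gate; a trained model learns these.
params = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, -0.2, 0.0),
    "candidate": (1.0, 0.3, 0.0),
    "output": (0.4, 0.2, 0.0),
}

h, c = 0.0, 0.0
for x in [0.2, -0.5, 0.9]:   # hypothetical scaled feature values
    h, c = lstm_step(x, h, c, params)

# A dense sigmoid unit maps the last hidden state to a phishing probability.
probability = sigmoid(1.5 * h + 0.1)
print(h, c, probability)
```

The cell state `c` is the internal memory: the forget gate decides how much of it survives each step, which is what lets an LSTM carry information across long sequences.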

A convolutional neural network (CNN) is a discriminative architecture that works effectively on grid-based two-dimensional data, including images and videos. In terms of time delay, the CNN outperforms the neural network (NN). The CNN workflow for classifying a URL starts by fetching the labeled URL training data, which is then divided at random into training and test sets. Once the training and test data were prepared, the model was trained after defining the CNN architecture, including the input, output, and other layers. After each convolution, a max-pooling layer was added to capture the essential elements of that convolution and convert them into a feature vector. Dropout regularization was then added so that the model did not overfit.

The LSTM-CNN model consists of CNN layers that extract features from the input data and LSTM layers that predict sequences. The workflow starts with preprocessing the dataset, which is split into training and test sets and normalized before being fed into the model. The data is then passed through the CNN and LSTM layers, followed by a dense layer intended to avoid overfitting, and finally the model classifies the output using a sigmoid function. Performance results can be seen in Table 1.

Table 1: Performance results for Method 2.

| Evaluation Metric | LSTM (%) | CNN (%) | LSTM-CNN (%) |
| --- | --- | --- | --- |
| Accuracy | 96.8 | 99.2 | 97.6 |
| Precision | 95.9 | 99.0 | 96.9 |
| Recall | 97.5 | 99.2 | 98.2 |
| F1-score | 96.8 | 99.2 | 97.6 |
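For readers unfamiliar with the four metrics in Table 1, the sketch below shows how they are computed from the counts of a binary confusion matrix, treating phishing as the positive class. The counts are hypothetical and chosen only for illustration; they are not the confusion matrices behind the results in [1].

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # share of predicted phishing URLs that are phishing
    recall = tp / (tp + fn)      # share of actual phishing URLs that were detected
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: tp = phishing caught, fp = legitimate flagged as phishing,
# tn = legitimate passed, fn = phishing missed.
metrics = classification_metrics(tp=90, fp=10, tn=95, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.1f}%")
```

In phishing detection, recall reflects how many phishing URLs slip through, while precision reflects how often legitimate sites are wrongly blocked, so both are usually reported alongside accuracy.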

Conclusion

In the study that used Method 1, the Convolutional Neural Network (CNN) was the most successful of the deep learning architectures tested, achieving an accuracy of 98.74%. In the study that used Method 2, the LSTM, CNN, and LSTM-CNN algorithms were proposed to detect and classify website URLs as either phishing or legitimate. Based on the evaluation of the proposed system, the detection of phishing websites achieved excellent results. The CNN algorithm outperformed LSTM-CNN and LSTM in terms of accuracy, reaching 99.2%, while LSTM-CNN and LSTM achieved accuracies of 97.6% and 96.8%, respectively.

References

  1. Zainab Alshingiti, Rabeah Alaqel, Jalal Al-Muhtadi, Qazi Emad Ul Haq, Kashif Saleem, and Muhammad Hamza Faheem. A deep learning-based phishing detection system using CNN, LSTM, and LSTM-CNN. Electronics, 12(1):232, 2023.
  2. M Hiransha, Nidhin A Unnithan, R Vinayakumar, K Soman, and ADR Verma. Deep learning based phishing e-mail detection. In Proc. 1st AntiPhishing Shared Pilot 4th ACM Int. Workshop Secur. Privacy Anal. (IWSPA), pages 1–5. Tempe, AZ, USA, 2018.
  3. Ozgur Koray Sahingoz, Ebubekir Buber, and Emin Kugu. Dephides: Deep learning based phishing detection system. IEEE Access, 12:8052–8070, 2024.
  4. Melanie Volkamer, Karen Renaud, Benjamin Reinheimer, and Alexandra Kunz. User experiences of TORPEDO: Tooltip-powered phishing email detection. Computers & Security, 71:100–113, 2017.
  5. Panigrahi, R., Bele, N., Panigrahi, P. K., & Gupta, B. B. (2024). Features level sentiment mining in enterprise systems from informal text corpus using machine learning techniques. Enterprise Information Systems, 18(5), 2328186.
  6. Gupta, B. B., Gaurav, A., Chui, K. T., & Arya, V. (2024, January). Deep learning-based facial emotion detection in the metaverse. In 2024 IEEE International Conference on Consumer Electronics (ICCE) (pp. 1-6). IEEE.
  7. Gaurav, A., Gupta, B. B., & Chui, K. T. (2022). Edge computing-based DDoS attack detection for intelligent transportation systems. In Cyber Security, Privacy and Networking: Proceedings of ICSPN 2021 (pp. 175-184). Singapore: Springer Nature Singapore.
  8. Sai, K. M., Gupta, B. B., Hsu, C. H., & Peraković, D. (2021, December). Lightweight Intrusion Detection System In IoT Networks Using Raspberry pi 3b+. In SysCom (pp. 43-51).
  9. Birthriya, S. K., Ahlawat, P., & Jain, A. K. (2025). Detection and prevention of spear phishing attacks: A comprehensive survey. Computers & Security, 104317.
  10. Jabir, R., Le, J., & Nguyen, C. (2025). Phishing Attacks in the Age of Generative Artificial Intelligence: A Systematic Review of Human Factors. AI, 6(8), 174.
  11. Chinta, P. C. R., Moore, C. S., Karaka, L. M., Sakuru, M., Bodepudi, V., & Maka, S. R. (2025). Building an Intelligent Phishing Email Detection System Using Machine Learning and Feature Engineering. European Journal of Applied Science, Engineering and Technology, 3(2), 41-54.

Cite As

Rius R. (2025) Deep Learning-Based Phishing Detection, Insights2Techinfo, pp.1
