Deep Learning Based URL Phishing Attack Detection

By: Rekarius; CCRI, Asia University, Taiwan

Abstract

The widespread use of the Internet has led to an increase in data stored on networked devices. Attackers target these data because they are of high importance. Deep learning models, especially neural networks, can be used to automatically extract complicated features from massive datasets of phishing and non-phishing data, which can result in more reliable detection systems. The reviewed study demonstrated that, in terms of accuracy, flexibility, and resistance to changing phishing strategies, deep learning-based models performed better than conventional machine learning techniques. Related research likewise reports that a BiLSTM-attention-based deep learning model is effective, accurate, and efficient in detecting phishing URLs, outperforming machine learning methods.

Keywords: URL Phishing Detection, Deep Learning Model

Introduction

The widespread use of the Internet has led to an increase in data stored on networked devices. Attackers target these data because they are of high importance. Phishing is a type of cybercrime that attempts to trick unaware users into revealing sensitive and valuable personal information. This can include social media connections, login credentials, financial account information, usernames, and passwords, which the adversary can then use for malicious activities such as identity theft and scams. One kind of attack is performed using phishing domains. These domains acquire confidential data without consent by tricking users into visiting a fraudulent website that imitates a legitimate one and looks exactly the same [2]. The web page then asks for personal details such as login credentials or banking details, and when users enter this information it is delivered to the attacker. The result is a security breach: the attacker now has access to sensitive data that can be used to steal identities and to act while impersonating the original user. This study presents deep learning-based approaches to classifying website URLs as phishing or legitimate.

Methodology

Method 1: BiLSTM-Attention Model

Data Preprocessing

The dataset underwent several preprocessing steps: column names were standardized to url and class, irrelevant entries were removed, and missing data were filled. Common prefixes (https://, http://) were removed as they are non-discriminative. The class column was label-encoded, assigning phishing and malicious as 1, and legitimate and benign as 0. Dataset statistics such as total URLs, class distribution, and URL length metrics (max, min, average) were also calculated [1].
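The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the pipeline used in [1]: the sample rows are invented, the helper names are hypothetical, and rows with a missing URL are simply dropped here, whereas the study reports filling missing data.

```python
from statistics import mean

# Hypothetical raw rows; the URLs and labels are made up for illustration.
raw_rows = [
    {"url": "https://example.com/login", "class": "legitimate"},
    {"url": "http://paypa1-secure.example.net/verify", "class": "phishing"},
    {"url": "example.org/docs", "class": "benign"},
    {"url": "http://free-gift.example.biz/claim?id=1", "class": "malicious"},
    {"url": None, "class": "phishing"},  # missing URL
]

# Label encoding described in the text: phishing/malicious -> 1, legitimate/benign -> 0.
LABELS = {"phishing": 1, "malicious": 1, "legitimate": 0, "benign": 0}
PREFIXES = ("https://", "http://")


def strip_scheme(url):
    """Remove a leading https:// or http:// prefix, if present."""
    for prefix in PREFIXES:
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


def preprocess(rows):
    """Return (url, label) pairs with prefixes removed and classes encoded."""
    cleaned = []
    for row in rows:
        url, label = row.get("url"), row.get("class")
        if not url or label not in LABELS:
            continue  # assumption: unusable rows are dropped rather than filled
        cleaned.append((strip_scheme(url.strip()), LABELS[label]))
    return cleaned


def describe(pairs):
    """Compute the dataset statistics mentioned in the text."""
    lengths = [len(url) for url, _ in pairs]
    return {
        "total": len(pairs),
        "phishing": sum(label for _, label in pairs),
        "legitimate": sum(1 - label for _, label in pairs),
        "max_len": max(lengths),
        "min_len": min(lengths),
        "avg_len": mean(lengths),
    }


pairs = preprocess(raw_rows)
stats = describe(pairs)
```

Checking the class distribution in `stats` before training is a cheap way to spot an imbalanced dataset early.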

Feature Extraction

Two types of features were extracted from URLs:

Character-level features: Each character was encoded using a predefined dictionary and mapped into dense vectors via a character embedding layer (input shape: 300, output: 300×128).

Word-level features: URLs were tokenized using delimiters (e.g., /, ., -), and tokens were embedded into dense vector representations via a word embedding layer.
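To make the two feature types concrete, the sketch below turns a URL into the integer sequences that the character and word embedding layers would consume. The sequence length of 300 comes from the text; the character dictionary, the reserved padding and unknown indices, the delimiter set, and the word sequence length of 20 are all assumptions made for illustration.

```python
import re
import string

MAX_CHARS = 300   # character sequence length stated in the text
PAD, UNK = 0, 1   # assumed reserved indices for padding and unknown symbols

# Assumed character dictionary: ASCII letters, digits, and punctuation.
CHAR_VOCAB = {
    ch: i + 2
    for i, ch in enumerate(string.ascii_letters + string.digits + string.punctuation)
}


def encode_chars(url, max_len=MAX_CHARS):
    """Map each character to an integer id, then truncate or pad to max_len."""
    ids = [CHAR_VOCAB.get(ch, UNK) for ch in url[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def tokenize(url):
    """Split a URL on common delimiters and drop empty tokens."""
    return [tok for tok in re.split(r"[/.\-?=&_:]+", url) if tok]


def build_word_vocab(urls):
    """Assign an integer id to every distinct token seen in the given URLs."""
    vocab = {}
    for url in urls:
        for tok in tokenize(url):
            vocab.setdefault(tok.lower(), len(vocab) + 2)
    return vocab


def encode_words(url, vocab, max_len=20):
    """Map tokens to ids (unknown tokens become UNK), then pad to max_len."""
    ids = [vocab.get(tok.lower(), UNK) for tok in tokenize(url)[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


urls = ["paypa1-secure.example.net/verify", "example.com/login"]
word_vocab = build_word_vocab(urls)
char_ids = encode_chars(urls[0])
word_ids = encode_words(urls[0], word_vocab)
```

The embedding layers then look up a dense vector for each id, which is how an input of length 300 becomes a 300×128 matrix.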

BiLSTM Layer

BiLSTM is an extension of LSTM that processes a sequence in both the forward and the backward direction [1]. The model processes sequential data and learns the sequential dependencies within it, which makes it suitable for URL data. Because it operates in both directions, the model can extract more information from the data [3].

Attention Layer

The attention layer calculates an attention weight for each input so that specific inputs are given priority. Consequently, the model prioritises the processing of crucial information. Furthermore, the attention layer enhances the performance of the model by preventing the loss of information in long sequences.
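As a rough illustration of how attention weights prioritise inputs, the sketch below scores each time step against a query vector, normalises the scores with a softmax, and returns the weighted sum of the hidden states. The dot-product scoring and the fixed query are assumptions; the exact attention formulation used in the cited studies is not specified here.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, query):
    """Return attention weights and the weighted sum of the hidden states."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, context


# Three time steps with two features each; the values are illustrative.
hidden_states = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1]]
query = [1.0, 1.0]  # hypothetical query; in a trained model this is learned
weights, context = attend(hidden_states, query)
```

Here the second time step receives the largest weight because it has the highest score, so it contributes most to `context`.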

Max Pooling and Dropout

Max pooling selected the most prominent features from previous layers, while dropout randomly deactivated neurons to reduce overfitting.
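Both operations are simple enough to sketch directly. The example assumes global max pooling over the time dimension and the common "inverted" dropout formulation; the text does not state the pooling window or the dropout rate, so both are illustrative choices.

```python
import random


def global_max_pool(sequence):
    """Take the maximum over time steps for each feature dimension."""
    return [max(column) for column in zip(*sequence)]


def dropout(values, rate, training=True, rng=None):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)  # dropout is disabled at inference time
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


# Three time steps with two features each; the values are illustrative.
sequence = [[0.1, 0.7], [0.9, 0.2], [0.3, 0.4]]
pooled = global_max_pool(sequence)
dropped = dropout(pooled, rate=0.5, rng=random.Random(0))
```

Rescaling the surviving values by `1 / keep` keeps the expected activation unchanged, so no adjustment is needed when dropout is switched off for prediction.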

Concatenation and Classification

Character-based and word-based feature outputs were processed separately through BiLSTM, attention, max pooling, and dropout layers. Their outputs were concatenated (shape: 512) and passed to a fully connected layer with 64 neurons for higher-level representation, followed by classification.
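The two-branch architecture described above could be assembled in Keras roughly as follows. This is a sketch under stated assumptions, not the implementation from [1]: the vocabulary sizes, word sequence length, dropout rate, dot-product self-attention, and sigmoid output are all assumed, and the 128 LSTM units per direction were chosen only because they reproduce the stated concatenated width of 512.

```python
from tensorflow import keras
from tensorflow.keras import layers

CHAR_LEN, CHAR_VOCAB_SIZE = 300, 100    # 300 is from the text; vocabulary size is assumed
WORD_LEN, WORD_VOCAB_SIZE = 20, 10_000  # both assumed
EMBED_DIM = 128                         # matches the 300x128 embedding output in the text
LSTM_UNITS = 128                        # assumed: 2 directions x 128 units x 2 branches = 512


def branch(inputs, vocab_size, name):
    """Embedding -> BiLSTM -> attention -> max pooling -> dropout for one input type."""
    x = layers.Embedding(vocab_size, EMBED_DIM, name=f"{name}_embedding")(inputs)
    x = layers.Bidirectional(
        layers.LSTM(LSTM_UNITS, return_sequences=True), name=f"{name}_bilstm"
    )(x)
    # Dot-product self-attention: the sequence attends to itself (assumed variant).
    x = layers.Attention(name=f"{name}_attention")([x, x])
    x = layers.GlobalMaxPooling1D(name=f"{name}_maxpool")(x)
    return layers.Dropout(0.5, name=f"{name}_dropout")(x)  # rate is assumed


def build_model():
    char_in = keras.Input(shape=(CHAR_LEN,), dtype="int32", name="char_ids")
    word_in = keras.Input(shape=(WORD_LEN,), dtype="int32", name="word_ids")
    merged = layers.Concatenate(name="merged")([
        branch(char_in, CHAR_VOCAB_SIZE, "char"),
        branch(word_in, WORD_VOCAB_SIZE, "word"),
    ])
    x = layers.Dense(64, activation="relu", name="dense_64")(merged)
    out = layers.Dense(1, activation="sigmoid", name="phishing_prob")(x)
    model = keras.Model([char_in, word_in], out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_model()
```

Calling `model.summary()` lists the layer shapes, which makes it easy to confirm that the concatenated vector has 512 units before the 64-neuron dense layer.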

Method 2: FCNN Model

The study in [2] uses the PhiUSIIL Phishing URL dataset to train and develop an FCNN-based model for phishing detection. MinMax normalization was used to preprocess the dataset. The dataset includes 235795 instances and 56 features, from which a few columns were dropped because they contained redundant data. The dropped columns are ‘FILENAME’, ‘URL’, ‘Domain’, ‘Title’, ‘URLLength’, ‘DomainTitleMatchScore’, and ‘URLSimilarityIndex’. The model was trained on the remaining features to increase its detection capability and accuracy.
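The column dropping and MinMax normalization can be illustrated on a toy table. Min-max scaling maps each feature to the range [0, 1] with x' = (x - min) / (max - min). The rows below are invented, the retained feature names are illustrative, and the helper names are hypothetical; only the list of dropped columns comes from the text.

```python
# Columns dropped in the study, as listed in the text.
DROPPED = {
    "FILENAME", "URL", "Domain", "Title", "URLLength",
    "DomainTitleMatchScore", "URLSimilarityIndex",
}


def drop_columns(rows, dropped=DROPPED):
    """Remove the listed columns from every row."""
    return [{k: v for k, v in row.items() if k not in dropped} for row in rows]


def min_max_scale(rows):
    """Rescale every numeric column to [0, 1] using x' = (x - min) / (max - min)."""
    columns = list(rows[0].keys())
    lo = {c: min(r[c] for r in rows) for c in columns}
    hi = {c: max(r[c] for r in rows) for c in columns}
    return [
        {
            # A constant column has max == min; map it to 0.0 to avoid dividing by zero.
            c: 0.0 if hi[c] == lo[c] else (r[c] - lo[c]) / (hi[c] - lo[c])
            for c in columns
        }
        for r in rows
    ]


# Hypothetical rows; the values are made up for illustration.
rows = [
    {"URL": "a.example", "URLLength": 9, "NoOfSubDomain": 1, "NoOfImage": 10},
    {"URL": "b.example", "URLLength": 9, "NoOfSubDomain": 3, "NoOfImage": 0},
    {"URL": "c.example", "URLLength": 9, "NoOfSubDomain": 2, "NoOfImage": 40},
]
scaled = min_max_scale(drop_columns(rows))
```

In practice the minimum and maximum are usually computed on the training split only and then reused for the test split, so that no information from the test data leaks into preprocessing.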

Algorithm Implemented

This research uses MinMax normalization to preprocess the data and then applies the FCNN algorithm to increase the accuracy of phishing URL detection. Min-max normalization, sometimes referred to as feature scaling, applies a linear transformation to the original data. An FCNN (Fully Connected Neural Network) is a class of neural network in which every neuron in a layer is connected to every neuron in the next layer. FCNNs are frequently used for structured data that lacks a distinct spatial organization, such as tabular data. An FCNN has a layered structure, as represented in Figure 1:

    1. Input Layer: The first layer, which receives the input features of the data; its size equals the number of features in the dataset.
    2. Hidden Layers: The layers that transform the data. They apply activations, regularization (such as dropout or L2 regularization), and other transformations.
    3. Output Layer: The last layer, which produces the desired result, such as a continuous value in regression or class probabilities in a classification problem.

Figure 1: Structure of FCNN

The operation of this model, shown in Figure 2, follows the steps below:
      1. Collect the PhiUSIIL dataset from UCI.
      2. Preprocess the data and drop unnecessary columns.
      3. Split the data 80-20 into training and testing sets.
      4. Construct the FCNN model.
      5. Use the model to predict URLs.
      6. Evaluate the model and compare its accuracy with other models.

Figure 2: Flowchart of Model
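The split, prediction, and evaluation steps listed above can be sketched without any framework. The example is illustrative only: the data are synthetic, the network has a single hidden neuron, and its weights are set by hand rather than learned, so it shows the forward pass of a fully connected network and the accuracy calculation, not the trained model from [2].

```python
import math
import random


def train_test_split(samples, test_ratio=0.2, seed=0):
    """Shuffle the samples and split them, 80-20 by default."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_ratio)))
    return shuffled[:cut], shuffled[cut:]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: every input feeds every output neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def predict(features, params):
    """Forward pass: hidden layer with ReLU, output neuron with sigmoid."""
    hidden = dense(features, params["w1"], params["b1"], relu)
    (prob,) = dense(hidden, params["w2"], params["b2"], sigmoid)
    return 1 if prob >= 0.5 else 0


def accuracy(samples, params):
    """Fraction of samples whose predicted class matches the label."""
    hits = sum(predict(x, params) == y for x, y in samples)
    return hits / len(samples)


# Synthetic data: the label is 1 when the first (scaled) feature is at least 0.6.
samples = [([i / 10, 0.0], 1 if i >= 6 else 0) for i in range(10)]
train, test = train_test_split(samples)

# Hand-set weights standing in for a trained model (an assumption of this sketch).
params = {"w1": [[1.0, 0.0]], "b1": [-0.5], "w2": [[10.0]], "b2": [-0.05]}
test_accuracy = accuracy(test, params)
```

A real model would learn `params` from the training split with backpropagation; the test split is held back so that the reported accuracy reflects unseen URLs.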

Conclusion

Deep learning models, especially neural networks, can be used to automatically extract complicated features from massive datasets of phishing and non-phishing data, which can result in more reliable detection systems. The reviewed study demonstrated that, in terms of accuracy, flexibility, and resistance to changing phishing strategies, deep learning-based models performed better than conventional machine learning techniques. Related research likewise reports that a BiLSTM-attention-based deep learning model is effective, accurate, and efficient in detecting phishing URLs, outperforming machine learning methods.

References

  1. Öznur Şifa Akçam, Adem Tekerek, and Mehmet Tekerek. Development of bilstm deep learning model to detect url-based phishing attacks. Computers and Electrical Engineering, 123:110212, 2025.
  2. Anish Rawla, Shreya Singh, Md Daniyal, and Preeti Dubey. Detection of phishing attacks in phiusiil dataset using deep learning. Procedia Computer Science, 259:543–552, 2025.
  3. Erzhou Zhu, Qixiang Yuan, Zhile Chen, Xuejian Li, and Xianyong Fang. Ccbla: a lightweight phishing detection model based on cnn, bilstm, and attention mechanism. Cognitive Computation, 15(4):1320–1333, 2023.
  4. Gupta, B. B., Gaurav, A., Chui, K. T., & Arya, V. (2024, January). Deep learning-based facial emotion detection in the metaverse. In 2024 IEEE International Conference on Consumer Electronics (ICCE) (pp. 1-6). IEEE.
  5. Gaurav, A., Gupta, B. B., & Chui, K. T. (2022). Edge computing-based DDoS attack detection for intelligent transportation systems. In Cyber Security, Privacy and Networking: Proceedings of ICSPN 2021 (pp. 175-184). Singapore: Springer Nature Singapore.
  6. Sai, K. M., Gupta, B. B., Hsu, C. H., & Peraković, D. (2021, December). Lightweight Intrusion Detection System In IoT Networks Using Raspberry pi 3b+. In SysCom (pp. 43-51).
  7. Khonji, M., Iraqi, Y., & Jones, A. (2013). Phishing detection: a literature survey. IEEE Communications Surveys & Tutorials, 15(4), 2091-2121.
  8. Aleroud, A., & Zhou, L. (2017). Phishing environments, techniques, and countermeasures: A survey. Computers & Security, 68, 160-196.
  9. Dhamija, R., Tygar, J. D., & Hearst, M. (2006, April). Why phishing works. In Proceedings of the SIGCHI conference on Human Factors in computing systems (pp. 581-590).
  10. Jansson, K., & von Solms, R. (2013). Phishing for phishing awareness. Behaviour & information technology, 32(6), 584-593.

Cite As

Rekarius (2025) Deep Learning Based URL Phishing Attack Detection, Insights2Techinfo, pp.1
