Phishing Attacks Detection Based on Machine Learning Approach

By: Reka Kerja, CCRI, Asia University, Taiwan

Abstract

The shift of everyday activities such as banking, shopping, and entertainment from conventional to online platforms has increased the risk of cyberattacks such as phishing. This study takes a machine learning approach and considers three methods: Support Vector Machine (SVM), Random Forest, and Artificial Neural Network. Machine learning is a powerful tool for detecting patterns in data, making it possible to identify common phishing traits and thereby recognize phishing websites.

Keywords: Phishing Detection, Machine Learning, Support Vector Machine, Random Forest, Artificial Neural Network

Introduction

The shift in activities such as shopping, banking, and entertainment from traditional methods to online platforms has increased the risk of cyberattacks. Crimes that once required physically stealing from banks or stores are now also carried out online. Phishing is one of the most prevalent types of cybercrime today: perpetrators create fake websites that imitate legitimate ones and distribute the URLs of these fake pages through email, direct messages, and other channels. These fake web pages often contain forms that prompt victims to enter personal information such as usernames, IDs, passwords, and bank account details. The stolen information is then used to commit theft, which may involve money, company data, and other valuable assets. Although software companies launch new anti-phishing products that use blacklists, heuristics, visual, and machine learning-based approaches, these products cannot prevent all phishing attacks [3]. In this paper, a real-time anti-phishing system based on machine learning is proposed.

Method

The method used in this study is machine-learning based with the following approaches:

Support Vector Machine

Based on [2], the SVM approach starts by collecting input data in the form of URLs. The collected URLs consist of both legitimate and phishing URLs. Feature extraction is then performed on the input data, with the features divided into several types:

Lexical: Features based on the URL’s text structure (such as the number of special characters, number of digits, URL length, etc.).

Host: Information related to the domain and hosting, such as domain age, IP address, whether it uses HTTPS, and so on.

Word vectors: Vector representations of words in the URL using embedding techniques to convert text into numerical form.
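As an illustration of the lexical feature type listed above, the following minimal sketch computes a few text-structure features from a URL using only the Python standard library. The function name `lexical_features`, the particular features chosen, and the example URL are assumptions made for illustration; they are not the feature set used in [2].

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Return a few text-structure features of a URL (illustrative selection)."""
    host = urlparse(url).hostname or ""
    return {
        "url_length": len(url),
        "num_digits": sum(ch.isdigit() for ch in url),
        # Every character that is neither a letter nor a digit counts as special.
        "num_special": sum(not ch.isalnum() for ch in url),
        "num_dots_in_host": host.count("."),
        "has_at_symbol": "@" in url,
    }


# Hypothetical URL, used only to exercise the function.
features = lexical_features("http://secure-login.example.com/verify?id=12345")
print(features)
```

Host-based features such as domain age would require external lookups and are left out of this sketch.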

After the feature extraction process, SVM is used to classify whether a URL is phishing or legitimate. The process of this approach is shown in Figure 1.
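The classification step can be sketched as a linear SVM trained by subgradient descent on the regularized hinge loss. This is a simplified stand-in written in plain Python, not the implementation from [2]: the two features, the toy data, and the hyperparameters `lam`, `lr`, and `epochs` are all hypothetical, and a real system would more likely rely on an established library.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by full-batch subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regularisation term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                # hinge term is active for this sample
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (phishing) or -1 (legitimate) from the sign of the decision value."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, already-scaled features: [url_length / 100, num_digits / 10]
X = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9],   # phishing-like URLs, label +1
     [0.2, 0.1], [0.3, 0.0], [0.1, 0.2]]   # legitimate-like URLs, label -1
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

The sketch uses a linear decision boundary for brevity; whether [2] uses a linear or kernel SVM is not stated here.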

Figure 1: Phishing Detection Process

Random Forest

Based on [4], Random Forest, as its name implies, consists of a large number of individual decision trees that act as a group to decide the output. Each tree in the forest produces a class prediction, and the final result is the class predicted by the most trees. Random Forest performs well because the trees protect each other from individual errors: although some trees may predict the wrong answer, many other trees will correct the final prediction, so as a group the trees move in the right direction. The main drawback of Random Forest is the lack of reproducibility, because the process of forest construction is random. In addition, the final model and its results are difficult to interpret, because the model involves many independent decision trees [1].
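The voting idea can be sketched with a deliberately tiny forest whose "trees" are one-split stumps. Everything here is a simplified assumption for illustration: real Random Forests grow full decision trees, and the function names, toy data, and labels below are hypothetical. The sketch also shows that seeding the random number generator makes one particular run repeatable, although the construction itself remains random.

```python
import random
from collections import Counter


def train_stump(X, y, rng):
    """Fit a one-split 'tree' on a bootstrap sample using one randomly chosen feature."""
    n = len(X)
    rows = [rng.randrange(n) for _ in range(n)]   # bootstrap: draw rows with replacement
    feature = rng.randrange(len(X[0]))            # random feature selection
    counts = {}
    for i in rows:
        counts.setdefault(X[i][feature], Counter())[y[i]] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    default = Counter(y[i] for i in rows).most_common(1)[0][0]
    return feature, rule, default


def train_forest(X, y, n_trees=25, seed=0):
    """Build n_trees stumps from a single seeded random number generator."""
    rng = random.Random(seed)
    return [train_stump(X, y, rng) for _ in range(n_trees)]


def forest_predict(forest, x):
    """Each stump votes for a class; the most common vote is the prediction."""
    votes = Counter(rule.get(x[feature], default) for feature, rule, default in forest)
    return votes.most_common(1)[0][0]


# Hand-built forest: the third stump is wrong for this input, but it is outvoted.
hand_forest = [
    (0, {1: "legitimate", -1: "phishing"}, "legitimate"),
    (1, {1: "legitimate", -1: "phishing"}, "legitimate"),
    (2, {1: "phishing", -1: "legitimate"}, "phishing"),
]
vote = forest_predict(hand_forest, [1, 1, 1])

# Hypothetical toy data with features encoded as -1 or 1.
X = [[1, 1, 1], [1, 1, -1], [1, 1, 1], [-1, -1, 1], [-1, -1, -1], [-1, -1, 1]]
y = ["legitimate", "legitimate", "legitimate", "phishing", "phishing", "phishing"]
same_forest = train_forest(X, y, seed=0) == train_forest(X, y, seed=0)
print(vote, same_forest)
```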

Artificial Neural Networks

Artificial neural networks (ANNs) are a learning model roughly inspired by biological neural networks. These models are multilayered, with each layer containing several processing units called neurons. Each neuron receives its input from its adjacent layers and computes its output using its weights and a non-linear function called the activation function. In feed-forward neural networks, data flows from the first layer to the last layer. Different layers may perform different transformations on their input. The weights of the neurons are set randomly at the start of training and are gradually adjusted by gradient descent to approach the optimal solution. The power of neural networks comes from the non-linearity of the hidden nodes.

The dataset used contains about 11,000 sample websites, of which 10% were used in the testing phase. Each website is labeled either legitimate or phishing. Each feature used in this dataset is shown in Table 1.
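The training procedure described above can be sketched as a small feed-forward network with one hidden layer, random initial weights, and full-batch gradient descent. The network size, learning rate, number of epochs, and the synthetic data are assumptions chosen for illustration; they are not the architecture or dataset used in this study. The sketch also holds out 10% of the samples for testing.

```python
import math
import random


def init_network(n_in, n_hidden, rng):
    """Set the weights randomly at the start of training."""
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(net, x):
    """Feed-forward pass: input -> tanh hidden layer -> sigmoid output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["W1"], net["b1"])]
    z = sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"]
    return hidden, 1.0 / (1.0 + math.exp(-z))


def loss(net, X, y):
    """Mean binary cross-entropy over a set of samples."""
    total = 0.0
    for x, t in zip(X, y):
        _, p = forward(net, x)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(X)


def gradient_step(net, X, y, lr=0.1):
    """One full-batch gradient descent update, with gradients from backpropagation."""
    n = len(X)
    gW1 = [[0.0] * len(X[0]) for _ in net["W1"]]
    gb1 = [0.0] * len(net["b1"])
    gW2 = [0.0] * len(net["W2"])
    gb2 = 0.0
    for x, t in zip(X, y):
        hidden, p = forward(net, x)
        dz = (p - t) / n                          # gradient at the output pre-activation
        gb2 += dz
        for j, h in enumerate(hidden):
            gW2[j] += dz * h
            dh = dz * net["W2"][j] * (1 - h * h)  # back through the tanh activation
            gb1[j] += dh
            for i, xi in enumerate(x):
                gW1[j][i] += dh * xi
    for j in range(len(gW2)):
        net["W2"][j] -= lr * gW2[j]
        net["b1"][j] -= lr * gb1[j]
        for i in range(len(gW1[j])):
            net["W1"][j][i] -= lr * gW1[j][i]
    net["b2"] -= lr * gb2


# Synthetic stand-in data: three features in {-1, 1}; label 1 when most are positive.
rng = random.Random(0)
X = [[rng.choice([-1, 1]) for _ in range(3)] for _ in range(200)]
y = [1 if sum(x) > 0 else 0 for x in X]

# Hold out 10% of the (already randomly generated) samples for testing.
split = len(X) - len(X) // 10
X_train, y_train, X_test, y_test = X[:split], y[:split], X[split:], y[split:]

net = init_network(3, 4, rng)
initial_loss = loss(net, X_train, y_train)
for _ in range(200):
    gradient_step(net, X_train, y_train)
final_loss = loss(net, X_train, y_train)
print(initial_loss, final_loss, loss(net, X_test, y_test))
```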

Table 1: Description of Dataset (Part 1 and Part 2)

| Features | Mean | Std | Features | Mean | Std |
|---|---|---|---|---|---|
| Having IP Address | 0.3137 | 0.9495 | Links in Tags | -0.1181 | 0.7709 |
| URL Length | -0.6831 | 0.7660 | SFH | -0.5957 | 0.7591 |
| Shortening Service | 0.7387 | 0.6739 | Submitting To Email | 0.6566 | 0.7720 |
| Having @ Symbol | 0.7005 | 0.7135 | Abnormal URL | 0.7052 | 0.7089 |
| Double Slash Redirecting | 0.7414 | 0.6710 | Website Redirect Count | 0.1156 | 0.3138 |
| Prefix Suffix | 0.7349 | 0.6783 | On Mouse over | 0.7620 | 0.6474 |
| Having Sub Domain | 0.0639 | 0.9185 | RightClick | 0.9128 | 0.4059 |
| SSL Final State | 0.2509 | 0.9118 | PopUpWindow | 0.6133 | 0.7088 |
| Domain Reg Length | -0.3637 | 0.9446 | IFrame | 0.8169 | 0.5767 |
| Favicon | 0.6385 | 0.7777 | Age of Domain | 0.6165 | 0.7931 |
| Port | 0.7282 | 0.6885 | DNS Record | 0.3771 | 0.9222 |
| HTTPS Token | 0.6750 | 0.7372 | Web Traffic | 0.2872 | 0.8277 |
| Request URL | 0.1867 | 0.9824 | Page Rank | -0.4836 | 0.8725 |
| URL of Anchor | -0.0765 | 0.7151 | Google Index | 0.7215 | 0.6923 |
| | | | Links Pointing to Page | 0.3440 | 0.5699 |
| | | | Statistical Report | 0.7195 | 0.6494 |
| | | | Result | 0.1138 | 0.9935 |
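The Mean and Std columns can be reproduced for any feature column with the standard library, as in the sketch below. The column values are made up for illustration, and the use of the population standard deviation is an assumption, since the text does not say which variant was used. For a column that takes only the values -1 and 1, the standard deviation equals sqrt(1 - mean^2), and the Having IP Address and Result rows of Table 1 are consistent with that relation.

```python
import math
import statistics

# Made-up values for one feature column, encoded as -1 or 1.
column = [1, 1, -1, 1, -1, 1, 1, -1]

mean = statistics.fmean(column)
std = statistics.pstdev(column)   # population standard deviation (assumed variant)
print(round(mean, 4), round(std, 4))

# For a column containing only -1 and 1: std = sqrt(1 - mean^2).
identity_holds = math.isclose(std, math.sqrt(1 - mean ** 2))

# Applying the same relation to two means reported in Table 1.
std_result = round(math.sqrt(1 - 0.1138 ** 2), 4)      # Result row
std_having_ip = round(math.sqrt(1 - 0.3137 ** 2), 4)   # Having IP Address row
print(identity_holds, std_result, std_having_ip)
```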

Conclusion

Machine learning methods have proven to be a powerful tool for detecting patterns in data, making it possible to identify some common phishing traits and thereby recognize phishing websites. The machine learning approaches that can be applied include Support Vector Machine, Random Forest, and Artificial Neural Network.

References

  1. Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  2. Junaid Rashid, Toqeer Mahmood, Muhammad Wasif Nisar, and Tahira Nazir. Phishing detection using machine learning technique. In 2020 first international conference of smart systems and emerging technologies (SMARTTECH), pages 43–46. IEEE, 2020.
  3. Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, and Banu Diri. Machine learning based phishing detection from urls. Expert Systems with Applications, 117:345–357, 2019.
  4. Vahid Shahrivari, Mohammad Mahdi Darabi, and Mohammad Izadi. Phishing detection using machine learning techniques. arXiv preprint arXiv:2009.11116, 2020.
  5. Gupta, B. B., Gaurav, A., Chui, K. T., & Arya, V. (2024, January). Deep learning-based facial emotion detection in the metaverse. In 2024 IEEE International Conference on Consumer Electronics (ICCE) (pp. 1-6). IEEE.
  6. Gaurav, A., Gupta, B. B., & Chui, K. T. (2022). Edge computing-based DDoS attack detection for intelligent transportation systems. In Cyber Security, Privacy and Networking: Proceedings of ICSPN 2021 (pp. 175-184). Singapore: Springer Nature Singapore.
  7. Sai, K. M., Gupta, B. B., Hsu, C. H., & Peraković, D. (2021, December). Lightweight Intrusion Detection System In IoT Networks Using Raspberry pi 3b+. In SysCom (pp. 43-51).
  8. Das Guptta, S., Shahriar, K. T., Alqahtani, H., Alsalman, D., & Sarker, I. H. (2024). Modeling hybrid feature-based phishing websites detection using machine learning techniques. Annals of Data Science, 11(1), 217-242.
  9. Bhavani, P. A., Chalamala, M., Likhitha, P. S., & Sai, C. P. S. (2022). Phishing websites detection using machine learning (September 2, 2022).
  10. Patil, V., Thakkar, P., Shah, C., Bhat, T., & Godse, S. P. (2018, August). Detection and prevention of phishing websites using machine learning approach. In 2018 Fourth international conference on computing communication control and automation (ICCUBEA) (pp. 1-5). IEEE.

Cite As

Kerja R. (2025) Phishing Attacks Detection Based on Machine Learning Approach, Insights2Techinfo, pp.1
