By: Reka Kerja, CCRI, Asia University, Taiwan

Abstract

The shift of everyday activities such as banking, shopping, and entertainment from conventional channels to online, internet-based services has increased the risk of cyberattacks such as phishing. This study considers three machine-learning approaches: Support Vector Machine (SVM), Random Forest, and Artificial Neural Network. Machine learning is a powerful tool for detecting patterns in data, making it possible to identify common phishing traits and thereby recognize phishing websites.

Keywords: Phishing Detection, Machine Learning, Support Vector Machine, Random Forest, Artificial Neural Network

Introduction

The shift in activities such as shopping, banking, and entertainment from traditional methods to online, internet-based platforms has increased the risk of cyberattacks. Instead of stealing from banks or stores physically, criminals now also operate online. Phishing is one of the most prevalent types of cybercrime today: perpetrators create fake websites that imitate legitimate ones and distribute the URLs of these fake pages through email, direct messages, and other means. These fake web pages often contain forms that prompt victims to enter personal information such as usernames, IDs, passwords, and bank account details. The stolen information is then used to commit theft, which may involve money, company data, and other valuable assets. Although software companies launch new anti-phishing products that use blacklists, heuristics, and visual and machine learning-based approaches, these products cannot prevent all phishing attacks [3]. This paper proposes a real-time anti-phishing system based on machine learning.

Method

The method used in this study is machine-learning based with the following approaches:

Support Vector Machine

Following [2], the SVM-based approach starts by collecting input data in the form of URLs, both legitimate and phishing. Feature extraction is then performed on the input data, producing several types of features:

Lexical: Features based on the URL’s text structure (such as the number of special characters, number of digits, URL length, etc.).

Host: Information related to the domain and hosting, such as domain age, IP address, whether it uses HTTPS, and so on.

Word vectors: Vector representations of words in the URL using embedding techniques to convert text into numerical form.
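As an illustration of the lexical feature type, the following is a minimal sketch that turns a URL string into a few numeric features. The function name, the choice of features, and the example URL are assumptions made for this sketch and are not taken from [2]; host features and word vectors would need external lookups or an embedding model and are left out.

```python
from urllib.parse import urlparse


def lexical_features(url):
    """Return a few text-structure features of a URL as a dict of numbers."""
    host = urlparse(url).hostname or ""
    return {
        "url_length": len(url),
        "num_digits": sum(ch.isdigit() for ch in url),
        # Every character that is neither a letter nor a digit counts as special.
        "num_special_chars": sum(not ch.isalnum() for ch in url),
        "num_dots_in_host": host.count("."),
        "has_at_symbol": int("@" in url),
    }


# Hypothetical URL used only to show the output shape.
example = lexical_features("http://secure-login.example.com/verify?id=12345")
print(example)
```

Each URL becomes one fixed-length numeric vector, which is the form a classifier such as SVM expects.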

After the feature extraction process, SVM is used to classify whether a URL is phishing or legitimate. The process of this approach is shown in Figure 1.
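The classification step can be sketched with a small linear SVM. This is only an illustration under several assumptions: the source does not state the kernel or the training procedure, so a linear model trained by full-batch subgradient descent on the regularized hinge loss is used here; the two feature columns are hypothetical, already scaled values; and the label encoding (+1 for phishing, -1 for legitimate) is chosen for the example.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample violates the margin, hinge term is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (phishing) or -1 (legitimate) from the sign of the score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, hand-made feature vectors (assumed values, not real measurements).
X = [[0.9, 0.8], [0.8, 0.9], [1.0, 0.7], [0.2, 0.1], [0.1, 0.3], [0.3, 0.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

In practice a library implementation with a tuned kernel and regularization strength would normally replace this hand-written loop.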

Random Forest

Based on [4], Random Forest, as its name implies, contains a large number of individual decision trees that act as a group to decide the output. Each tree in a random forest produces a class prediction, and the result is the class predicted by the most trees. Random Forest performs well because the trees protect each other from individual errors: although some trees may predict the wrong answer, many other trees will correct the final prediction, so as a group the trees move in the right direction. The main drawback of Random Forest is the lack of reproducibility, because the process of forest construction is random. In addition, the final model and its results are difficult to interpret, because they involve many independent decision trees [1].
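The majority vote and the role of randomness can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of [1] or [4] in full: each "tree" is a one-split stump on a single randomly chosen feature, the toy data and all function names are assumptions, and labels are encoded as -1 and +1. The sketch also shows that fixing the seed of the random generator is one common way to make a particular run repeatable.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Pick the threshold on one feature that misclassifies the fewest samples."""
    best = None
    for threshold in sorted({row[feature] for row in X}):
        for left_label in (-1, 1):
            preds = [left_label if row[feature] <= threshold else -left_label
                     for row in X]
            errors = sum(p != t for p, t in zip(preds, y))
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    return feature, best[1], best[2]


def stump_predict(stump, row):
    feature, threshold, left_label = stump
    return left_label if row[feature] <= threshold else -left_label


def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)  # all randomness comes from this generator
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        feature = rng.randrange(len(X[0]))                    # random feature
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def forest_predict(forest, row):
    """Return the class predicted by the most trees."""
    votes = Counter(stump_predict(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy data with features in {-1, 0, 1} (assumed values).
X = [[1, 1], [1, 0], [0, 1], [-1, -1], [-1, 0], [0, -1]]
y = [1, 1, 1, -1, -1, -1]
forest = fit_forest(X, y, n_trees=25, seed=42)
print([forest_predict(forest, row) for row in X])
```

An odd number of trees avoids tied votes when there are only two classes.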

Artificial Neural Networks

Artificial neural networks (ANNs) are a learning model roughly inspired by biological neural networks. These models are multilayered, each layer containing several processing units called neurons. Each neuron receives its input from its adjacent layers and computes its output using its weights and a non-linear function called the activation function. In feed-forward neural networks, such as the one in Figure 3, data flows from the first layer to the last layer. Different layers may perform different transformations on their input. The weights of the neurons are set randomly at the start of training and are gradually adjusted by gradient descent to approach the optimal solution. The power of neural networks comes from the non-linearity of the hidden nodes. The dataset used contains about 11,000 sample websites, of which 10% were used in the testing phase. Each website is marked as either legitimate or phishing. Each feature used in this dataset is shown in Table 1.
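The forward pass and the gradient-descent weight update described in this section can be sketched with a very small network. Everything specific here is an assumption for illustration: one hidden layer with tanh activation, a sigmoid output, cross-entropy loss, labels encoded as 0 and 1, and a toy dataset with three features. The source does not state the architecture that was actually used.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_inputs, n_hidden, seed=0):
    rng = random.Random(seed)  # weights start at small random values
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(net, x):
    """Data flows input -> hidden layer (tanh) -> output neuron (sigmoid)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["W1"], net["b1"])]
    output = sigmoid(sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"])
    return hidden, output


def loss(net, X, y):
    """Mean cross-entropy between predictions and 0/1 labels."""
    total = 0.0
    for x, t in zip(X, y):
        _, p = forward(net, x)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(X)


def gradient_step(net, X, y, lr=0.1):
    """One full-batch gradient descent update using backpropagation."""
    n = len(X)
    gW1 = [[0.0] * len(row) for row in net["W1"]]
    gb1 = [0.0] * len(net["b1"])
    gW2 = [0.0] * len(net["W2"])
    gb2 = 0.0
    for x, t in zip(X, y):
        hidden, p = forward(net, x)
        d_out = (p - t) / n  # gradient at the output pre-activation
        gb2 += d_out
        for j, h in enumerate(hidden):
            gW2[j] += d_out * h
            d_hid = d_out * net["W2"][j] * (1 - h * h)  # tanh derivative
            gb1[j] += d_hid
            for i, xi in enumerate(x):
                gW1[j][i] += d_hid * xi
    for j in range(len(net["W1"])):
        for i in range(len(net["W1"][j])):
            net["W1"][j][i] -= lr * gW1[j][i]
        net["b1"][j] -= lr * gb1[j]
        net["W2"][j] -= lr * gW2[j]
    net["b2"] -= lr * gb2


# Toy samples with features in {-1, 0, 1}; label 1 when the first feature is positive.
X = [[1, 1, -1], [1, -1, 1], [-1, -1, -1], [-1, 1, -1], [1, 0, 1], [-1, 0, -1]]
y = [1, 1, 0, 0, 1, 0]

net = init_network(n_inputs=3, n_hidden=4)
before = loss(net, X, y)
for _ in range(200):
    gradient_step(net, X, y)
after = loss(net, X, y)
print(before, after)
```

The loss printed after training should be lower than the one printed before, which is the gradual adjustment of the weights that the text describes.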

Table 1: Description of Dataset (Part 1 and Part 2)

Features	Mean	Std	Features	Mean	Std
Having IP Address	0.3137	0.9495	Links in Tags	-0.1181	0.7709
URL Length	-0.6831	0.7660	SFH	-0.5957	0.7591
Shortening Service	0.7387	0.6739	Submitting To Email	0.6566	0.7720
Having @ Symbol	0.7005	0.7135	Abnormal URL	0.7052	0.7089
Double Slash Redirecting	0.7414	0.6710	Website Redirect Count	0.1156	0.3138
Prefix Suffix	0.7349	0.6783	On Mouse over	0.7620	0.6474
Having Sub Domain	0.0639	0.9185	RightClick	0.9128	0.4059
SSL Final State	0.2509	0.9118	PopUpWindow	0.6133	0.7088
Domain Reg Length	-0.3637	0.9446	IFrame	0.8169	0.5767
Favicon	0.6385	0.7777	Age of Domain	0.6165	0.7931
Port	0.7282	0.6885	DNS Record	0.3771	0.9222
HTTPS Token	0.6750	0.7372	Web Traffic	0.2872	0.8277
Request URL	0.1867	0.9824	Page Rank	-0.4836	0.8725
URL of Anchor	-0.0765	0.7151	Google Index	0.7215	0.6923
			Links Pointing to Page	0.3440	0.5699
			Statistical Report	0.7195	0.6494
			Result	0.1138	0.9935
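The Mean and Std columns can be read together for any feature that takes only two values. The sketch below assumes that the Result label is encoded as -1 or +1, which the table does not state explicitly; under that assumption the share of +1 samples is (1 + mean) / 2 and the standard deviation is sqrt(1 - mean^2). The function name is chosen for this example.

```python
import math


def binary_column_summary(mean):
    """For a column whose only values are -1 and +1, derive share of +1 and std."""
    share_positive = (1 + mean) / 2
    std = math.sqrt(1 - mean ** 2)
    return share_positive, std


# Mean of the Result column as reported in Table 1.
share, std = binary_column_summary(0.1138)
print(round(share, 4), round(std, 4))  # prints: 0.5569 0.9935
```

The derived standard deviation agrees with the 0.9935 reported for Result, which is consistent with a two-valued label in which roughly 56% of the samples carry the value +1. The table itself does not say which class +1 denotes.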

Conclusion

Machine learning methods have proven to be a powerful tool for detecting patterns in data, making it possible to identify some common phishing traits and thereby recognize phishing websites. The machine learning approaches that can be applied include Support Vector Machine, Random Forest, and Artificial Neural Network.

References

[1] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[2] Junaid Rashid, Toqeer Mahmood, Muhammad Wasif Nisar, and Tahira Nazir. Phishing detection using machine learning technique. In 2020 First International Conference of Smart Systems and Emerging Technologies (SMARTTECH), pages 43–46. IEEE, 2020.
[3] Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, and Banu Diri. Machine learning based phishing detection from URLs. Expert Systems with Applications, 117:345–357, 2019.
[4] Vahid Shahrivari, Mohammad Mahdi Darabi, and Mohammad Izadi. Phishing detection using machine learning techniques. arXiv preprint arXiv:2009.11116, 2020.
[5] Gupta, B. B., Gaurav, A., Chui, K. T., & Arya, V. (2024, January). Deep learning-based facial emotion detection in the metaverse. In 2024 IEEE International Conference on Consumer Electronics (ICCE) (pp. 1-6). IEEE.
[6] Gaurav, A., Gupta, B. B., & Chui, K. T. (2022). Edge computing-based DDoS attack detection for intelligent transportation systems. In Cyber Security, Privacy and Networking: Proceedings of ICSPN 2021 (pp. 175-184). Singapore: Springer Nature Singapore.
[7] Sai, K. M., Gupta, B. B., Hsu, C. H., & Peraković, D. (2021, December). Lightweight Intrusion Detection System In IoT Networks Using Raspberry pi 3b+. In SysCom (pp. 43-51).
[8] Das Guptta, S., Shahriar, K. T., Alqahtani, H., Alsalman, D., & Sarker, I. H. (2024). Modeling hybrid feature-based phishing websites detection using machine learning techniques. Annals of Data Science, 11(1), 217-242.
[9] Bhavani, P. A., Chalamala, M., Likhitha, P. S., & Sai, C. P. S. (2022, September 2). Phishing websites detection using machine learning.
[10] Patil, V., Thakkar, P., Shah, C., Bhat, T., & Godse, S. P. (2018, August). Detection and prevention of phishing websites using machine learning approach. In 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA) (pp. 1-5). IEEE.

Cite As

Kerja R. (2025) Phishing Attacks Detection Based on Machine Learning Approach, Insights2Techinfo, pp.1
