By: Reka Rius, CCRI, Asia University, Taiwan

Abstract

Most people are aware of the dangers of phishing attacks. Phishing is an illegal attempt to obtain someone’s personal information, which can then be used for malicious purposes. Many studies have focused on detecting phishing URLs by extracting features from web pages. However, this approach often requires significant time and storage resources. It would therefore be more efficient to detect phishing URLs using only URL-based features, without accessing the content of the web page. This article discusses several URL feature extraction techniques that can be used to detect phishing URLs.

Keywords: URL phishing detection, lexical features, character n-grams, character embeddings

Introduction

Phishing is an attempt by attackers to steal an individual’s private information without the victim’s knowledge. Many people fall victim to phishing attacks by accessing phishing URLs that appear to be legitimate. Figure 1 illustrates how a phishing attack occurs and causes loss to the victim. Attackers use many channels to obtain an individual’s private information, such as social media, email, websites, and Telegram.

Numerous anti-phishing solutions have been implemented to stop such activities; however, many individuals still become victims of these attacks in their daily lives. Distinguishing between a phishing site and a legitimate, secure website remains a significant challenge.

This study discusses several URL feature extraction techniques used to detect whether a URL is legitimate or a phishing attempt.

Related Work

The authors of [1] used the character N-gram technique to extract features from URLs. Character N-grams are overlapping sequences of n consecutive characters taken from the URL, where n ranges from 1 to 10. For example, the first three bigrams of the URL ‘example.com’ are ex, xa, and am. The feature set contains all substrings of length up to N: when N = 3, the features extracted from ‘example.com’ include e, x, a, m, ex, xa, am, exa, xam, and so on. In that study, character N-grams are considered important because: (i) phishing URLs exhibit different characteristics from legitimate URLs, and character N-grams, which operate at the character level, are suited to capturing those patterns; (ii) they capture recurring patterns, since the researchers found that phishing URLs tend to contain recurring character sequences; (iii) they detect the presence of suspicious file indicators; and (iv) they capture the randomness of a URL, as N-grams can identify repeated unusual character combinations. The study applies online learning algorithms, particularly Adaptive Regularization of Weights (AROW) and Confidence-Weighted (CW) learning.
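The N-gram extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the cited study; the function name `char_ngrams` and the parameter `max_n` are assumptions introduced here, and the sketch follows the ‘example.com’ example by collecting every substring of length 1 up to `max_n`.

```python
from collections import Counter


def char_ngrams(url: str, max_n: int = 3) -> list[str]:
    """Return all overlapping character n-grams of length 1..max_n.

    Hypothetical helper for illustration; grams are listed by increasing
    length, and within each length in order of position in the URL.
    """
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of width n over the URL, one character at a time.
        for i in range(len(url) - n + 1):
            grams.append(url[i:i + n])
    return grams


grams = char_ngrams("example.com", max_n=3)

# The first three bigrams match the example in the text.
bigrams = [g for g in grams if len(g) == 2]
print(bigrams[:3])  # ['ex', 'xa', 'am']

# One possible feature representation: n-gram -> occurrence count.
features = Counter(grams)
```

Counting occurrences is only one way to turn the grams into features; a binary presence indicator or a hashed vector would work with the same extraction step.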

Studies [2, 3] used lexical features extracted from URLs. The researchers used Support Vector Machines (SVM) to analyze the performance of each feature combination and ranked the features by their contribution to phishing classification, as shown in Table 1.

The authors of [4] proposed converting each URL into component-wise character embeddings learned with the SkipGram language model, and then training machine learning classifiers such as XGBoost, Logistic Regression, and Random Forest on this embedding representation [5-10].
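To make the embedding idea more concrete, the sketch below shows only the data-preparation step: splitting a URL into components and generating the (center, context) character pairs that a SkipGram model is trained on. Learning the embedding vectors themselves is omitted. The component split (scheme, host, path, query), the window size, and the function names are assumptions made for illustration and may differ from the cited work.

```python
from urllib.parse import urlsplit


def url_components(url: str) -> dict[str, str]:
    """Split a URL into coarse components (an assumed decomposition)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "query": parts.query,
    }


def skipgram_pairs(text: str, window: int = 2) -> list[tuple[str, str]]:
    """Return (center, context) character pairs within a fixed window."""
    pairs = []
    for i, center in enumerate(text):
        lo = max(0, i - window)
        hi = min(len(text), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a character is not its own context
                pairs.append((center, text[j]))
    return pairs


# Build training pairs separately for each component of one URL.
components = url_components("http://example.com/login")
pairs_by_component = {
    name: skipgram_pairs(text, window=2) for name, text in components.items()
}
```

Generating the pairs per component keeps characters from, say, the host and the path from appearing in each other's context windows, which is one plausible reading of "component-wise".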

Table 1: Ranking of lexical features by contribution to phishing classification

Rank	Feature Name
1	URL Length
2	Symbol to Character Ratio
3	Number of Suspicious Symbols
4	Path-to-URL Length Ratio
5	Number of Suspicious Keywords
6	Protocol Used
7	Number of Hyphens (-)
8	Presence of Symbol at the Last Character
9	Redirection Occurred
10	Presence of ’@’ Symbol
11	Number of Forward Slashes (/)
12	Presence of IP Address
13	Number of Question Marks (?)
14	Number of Subdomains
15	Presence of ’www’
16	Presence of the Word ’http’ in the URL
17	Presence of Port Number
18	Presence of Unicode Characters
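Several of the features in Table 1 can be computed directly from the URL string. The sketch below uses Python's standard `urllib.parse` and `ipaddress` modules; the function name, the dictionary keys, and the exact definition of each feature (for example, counting subdomains as host labels minus two) are assumptions made here, not the definitions used in the cited studies.

```python
import ipaddress
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Compute a subset of the Table 1 features (assumed definitions)."""
    parts = urlsplit(url)
    host = parts.hostname or ""

    # The host counts as an IP address if it parses as IPv4 or IPv6.
    try:
        ipaddress.ip_address(host)
        has_ip = True
    except ValueError:
        has_ip = False

    # Accessing .port raises ValueError when the port is malformed.
    try:
        has_port = parts.port is not None
    except ValueError:
        has_port = False

    return {
        "url_length": len(url),
        "path_to_url_length_ratio": len(parts.path) / len(url) if url else 0.0,
        "protocol": parts.scheme,
        "num_hyphens": url.count("-"),
        "has_at_symbol": "@" in url,
        "num_forward_slashes": url.count("/"),
        "has_ip_address": has_ip,
        "num_question_marks": url.count("?"),
        # Naive count: ignores multi-label suffixes such as "co.uk".
        "num_subdomains": 0 if has_ip else max(len(host.split(".")) - 2, 0),
        "has_www": host.startswith("www."),
        "has_port": has_port,
    }


example = lexical_features("http://login.secure-bank.example.com:8080/a/b?x=1")
```

Features such as suspicious keywords or redirection need a keyword list or a network request and are left out of this sketch.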

Conclusion

This article presents a literature review of URL feature extraction techniques for detecting phishing attacks. The review identified several techniques that can be employed to detect phishing URLs, including character n-grams, lexical features, and character embeddings.

References

[1] Bireswar Banik and Abhijit Sarma. Phishing URL detection system based on URL features using SVM. International Journal of Electronics and Applied Research, 5:40–55, 2018.
[2] Brij B Gupta, Krishna Yadav, Imran Razzak, Konstantinos Psannis, Arcangelo Castiglione, and Xiaojun Chang. A novel approach for phishing URLs detection using lexical based machine learning in a real-time environment. Computer Communications, 175:47–57, 2021.
[3] Rakesh Verma and Avisha Das. What’s in a URL: Fast feature extraction and malicious URL detection. In Proceedings of the 3rd ACM on International Workshop on Security and Privacy Analytics, pages 55–63, 2017.
[4] Huaping Yuan, Zhenguo Yang, Xu Chen, Yukun Li, and Wenyin Liu. URL2Vec: URL modeling with character embeddings for fast and accurate phishing website detection. In 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom), pages 265–272. IEEE, 2018.
[5] Gupta, S., & Gupta, B. B. (2018). XSS-secure as a service for the platforms of online social network-based multimedia web applications in cloud. Multimedia Tools and Applications, 77(4), 4829-4861.
[6] Gupta, B. B., Misra, M., & Joshi, R. C. (2012). An ISP level solution to combat DDoS attacks using combined statistical based approach. arXiv preprint arXiv:1203.2400.
[7] Khonji, M., Iraqi, Y., & Jones, A. (2013). Phishing detection: a literature survey. IEEE Communications Surveys & Tutorials, 15(4), 2091-2121.
[8] Aleroud, A., & Zhou, L. (2017). Phishing environments, techniques, and countermeasures: A survey. Computers & Security, 68, 160-196.
[9] Jain, A. K., & Gupta, B. B. (2018). Rule-based framework for detection of smishing messages in mobile environment. Procedia Computer Science, 125, 617-623.
[10] Jagatic, T. N., Johnson, N. A., Jakobsson, M., & Menczer, F. (2007). Social phishing. Communications of the ACM, 50(10), 94-100.

Cite As

Rius R. (2025) The Application of URL Feature Extraction Techniques in Phishing Detection, Insights2Techinfo, pp.1
