The Application of URL Feature Extraction Techniques in Phishing Detection

By: Reka Rius, CCRI, Asia University, Taiwan

Abstract

Most people are aware of the dangers of phishing attacks, which are illegal attempts to obtain someone’s personal information for malicious purposes. Many studies have focused on detecting phishing URLs by extracting features from web page content. However, this approach requires fetching and storing each page, which costs significant time and storage. It is therefore more efficient to detect phishing URLs using only URL-based features, without accessing the content of the web page. This article discusses several URL feature extraction techniques that can be used for this purpose.

Keywords: URL phishing detection, lexical features, character n-grams, character embeddings

Introduction

Phishing is an attempt by attackers to steal an individual’s private information without the victim’s knowledge. Many people fall victim to phishing attacks by accessing phishing URLs that appear to be legitimate. Figure 1 illustrates how a phishing attack occurs and causes loss to the victim. Attackers use many channels to obtain private information, including social media, email, websites, and messaging services such as Telegram.

Numerous anti-phishing solutions have been implemented to stop such activities; however, many individuals still become victims of these attacks in their daily lives. Distinguishing between a phishing site and a legitimate, secure website remains a significant challenge.

This study discusses several URL feature extraction techniques used to detect whether a URL is legitimate or a phishing attempt.

Figure 1: Phishing attack diagram

Related Work

One study [1] used the character N-gram technique to extract features from URLs. Character N-grams are overlapping sequences of N consecutive characters taken from the URL, with N ranging from 1 to 10. For example, the first three bigrams of the URL ‘example.com’ are ex, xa, and am. The feature set consists of all substrings of length up to N. For example, when N = 3, the features extracted from ‘example.com’ include e, x, a, m, ex, xa, am, exa, xam, and so on. In that study, character N-grams are considered important because: (i) phishing URLs exhibit different characteristics from legitimate URLs, and N-grams, which operate at the character level, are well suited to capturing these differences; (ii) they capture recurring patterns, since the researchers found that phishing URLs tend to repeat certain character sequences; (iii) they detect the presence of suspicious file indicators; and (iv) they capture the randomness of a URL, since N-grams can identify repeated unusual character combinations. The study applies online learning algorithms, particularly Adaptive Regularization of Weights (AROW) and Confidence-Weighted (CW) learning.
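The N-gram extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the cited study's implementation; the function names `char_ngrams` and `ngram_features` are hypothetical, and the sketch assumes the URL is processed as a plain string with no normalization.

```python
from typing import List, Set


def char_ngrams(url: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of length exactly n, in order."""
    return [url[i:i + n] for i in range(len(url) - n + 1)]


def ngram_features(url: str, max_n: int) -> Set[str]:
    """Return every distinct substring of length 1..max_n (the feature set)."""
    features: Set[str] = set()
    for n in range(1, max_n + 1):
        features.update(char_ngrams(url, n))
    return features


url = "example.com"
first_bigrams = char_ngrams(url, 2)[:3]   # the first three bigrams of the URL
features = ngram_features(url, 3)         # all substrings of length 1, 2, and 3
```

In practice each distinct N-gram would be mapped to a column of a sparse feature vector before being passed to a classifier; that mapping step is omitted here.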

Studies [2, 3] utilized lexical features extracted from URLs. In this technique, researchers used Support Vector Machines (SVM) to analyze the performance of each feature combination and ranked the features by their contribution to phishing classification, as shown in Table 1.

Yuan et al. [4] proposed converting each URL into component-wise character embeddings learned using the Skip-gram language model, and then training machine learning classifiers such as XGBoost, Logistic Regression, and Random Forest on this embedding representation [5-10].
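To make the Skip-gram idea concrete, the sketch below shows only the data-preparation step: splitting a URL into components and generating (center, context) character pairs, which are the training examples a Skip-gram model learns from. The component split (scheme, host, path), the window size, and the function names are assumptions made for illustration and are not taken from the cited work. Training the actual embeddings would require a separate library, for example gensim's `Word2Vec` with `sg=1`.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlsplit


def url_components(url: str) -> Dict[str, str]:
    """Split a URL into components (assumed split: scheme, host, path)."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.netloc, "path": parts.path}


def skipgram_pairs(text: str, window: int = 2) -> List[Tuple[str, str]]:
    """Pair each character with its neighbours at most `window` positions away."""
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(text):
        lo = max(0, i - window)
        hi = min(len(text), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a character is not its own context
                pairs.append((center, text[j]))
    return pairs


components = url_components("http://example.com/login")
# One list of training pairs per component, so each component is modelled separately.
pairs_by_component = {
    name: skipgram_pairs(text, window=1) for name, text in components.items()
}
```

Keeping the pairs separate per component reflects the component-wise framing: the same character can then receive a different embedding depending on whether it appears in the host or in the path.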

Table 1: Feature Rankings

Rank  Feature Name
1     URL Length
2     Symbol to Character Ratio
3     Number of Suspicious Symbols
4     Path-to-URL Length Ratio
5     Number of Suspicious Keywords
6     Protocol Used
7     Number of Hyphens (-)
8     Presence of Symbol at the Last Character
9     Redirection Occurred
10    Presence of ’@’ Symbol
11    Number of Forward Slashes (/)
12    Presence of IP Address
13    Number of Question Marks (?)
14    Number of Subdomains
15    Presence of ’www’
16    Presence of the Word ’http’ in the URL
17    Presence of Port Number
18    Presence of Unicode Characters
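Several of the features in Table 1 can be computed directly from the URL string. The following Python sketch covers a subset of them using only the standard library. The exact feature definitions used in the cited studies are not given here, so each computation below (for example, treating "Presence of Unicode Characters" as "contains any non-ASCII character") is an assumption, and the function and key names are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(host: str) -> bool:
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url: str) -> dict:
    """Compute a subset of the Table 1 features (definitions are assumptions)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        has_port = parts.port is not None
    except ValueError:
        has_port = True  # a port was written but is malformed; still count it
    return {
        "url_length": len(url),
        "path_to_url_length_ratio": len(parts.path) / len(url) if url else 0.0,
        "protocol": parts.scheme,
        "num_hyphens": url.count("-"),
        "has_at_symbol": "@" in url,
        "num_forward_slashes": url.count("/"),
        "has_ip_address": host_is_ip(host),
        "num_question_marks": url.count("?"),
        "has_www": host.startswith("www."),
        "has_port": has_port,
        "has_non_ascii": not url.isascii(),
    }


# Hypothetical example URL, chosen to trigger several features at once.
example = lexical_features("http://192.168.0.1:8080/secure-login/update?user=a@b.com")
```

Features that depend on external knowledge, such as the lists of suspicious symbols and keywords or whether a redirection occurred, are left out because they cannot be derived from the URL string without further assumptions.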

Conclusion

This article presents a literature review of URL feature extraction techniques for detecting phishing attacks. The review identified several techniques that can be employed to detect phishing URLs, including character n-grams, lexical features, and character embeddings.

References

  1. Bireswar Banik and Abhijit Sarma. Phishing url detection system based on url features using svm. International Journal of Electronics and Applied Research, 5:40–55, 2018.
  2. Brij B Gupta, Krishna Yadav, Imran Razzak, Konstantinos Psannis, Arcangelo Castiglione, and Xiaojun Chang. A novel approach for phishing urls detection using lexical based machine learning in a real-time environment. Computer Communications, 175:47–57, 2021.
  3. Rakesh Verma and Avisha Das. What’s in a url: Fast feature extraction and malicious url detection. In Proceedings of the 3rd ACM on International Workshop on Security and Privacy Analytics, pages 55–63, 2017.
  4. Huaping Yuan, Zhenguo Yang, Xu Chen, Yukun Li, and Wenyin Liu. Url2vec: Url modeling with character embeddings for fast and accurate phishing website detection. In 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom), pages 265–272. IEEE, 2018.
  5. Gupta, S., & Gupta, B. B. (2018). XSS-secure as a service for the platforms of online social network-based multimedia web applications in cloud. Multimedia Tools and Applications, 77(4), 4829-4861.
  6. Gupta, B. B., Misra, M., & Joshi, R. C. (2012). An ISP level solution to combat DDoS attacks using combined statistical based approach. arXiv preprint arXiv:1203.2400.
  7. Khonji, M., Iraqi, Y., & Jones, A. (2013). Phishing detection: a literature survey. IEEE Communications Surveys & Tutorials, 15(4), 2091-2121.
  8. Aleroud, A., & Zhou, L. (2017). Phishing environments, techniques, and countermeasures: A survey. Computers & Security, 68, 160-196.
  9. Jain, A. K., & Gupta, B. B. (2018). Rule-based framework for detection of smishing messages in mobile environment. Procedia Computer Science, 125, 617-623.
  10. Jagatic, T. N., Johnson, N. A., Jakobsson, M., & Menczer, F. (2007). Social phishing. Communications of the ACM, 50(10), 94-100.

Cite As

Rius R. (2025) The Application of URL Feature Extraction Techniques in Phishing Detection, Insights2Techingo, pp.1
