By: Reka Kerja, CCRI, Asia University, Taiwan
Abstract
Phishing attacks that mimic legitimate web pages can be detected with several approaches. Page-based similarity fails if the attacker uses images or embedded objects instead of HTML text. Image-based similarity also has drawbacks: it requires images to have the same aspect ratio, and it is limited when segmenting pages with complex backgrounds. For the visual similarity classification method, experiments show that the wHash mechanism with a color histogram is more accurate than the commonly used perceptual hashing (pHash) mechanism. The accuracies of the SIFT technique are 98.14%, 98.61%, and 99.95% for the Microsoft, Dropbox, and Bank of America data, respectively. Additionally, a qualitative analysis of the successful cases of VisualPhishNet shows that the network identified easy phishing pages (highly similar to pages in training) and, more importantly, phishing pages that were partially copied, obfuscated, or unseen.
Keywords: Visual Similarity-Based, Phishing Detection
Introduction
Phishing pages impersonate legitimate websites without permission [7]. Attackers replicate authentic sites of financial services or social media (e.g., PayPal, Facebook), copying visual elements (e.g., logos and layouts) to trick users into revealing sensitive credentials [5].
Previously proposed approaches, such as page-based similarity and image-based similarity, have significant weaknesses. Page-based similarity approaches fail if attackers use images or embedded objects instead of HTML text [14]. They are also vulnerable to code obfuscation techniques, in which different code produces similar rendered images [4, 6].
Image-based similarity approaches have their own disadvantages: they require the images to have the same aspect ratio [6], and they are limited when segmenting pages with complex backgrounds [2]. Some of these approaches also assume a fixed location for the website logo, an assumption that attackers can bypass. This article discusses visual similarity-based phishing detection.
Method
This section focuses on the methods used in the visual similarity approach, as follows:
Visual Similarity Classification
In this method, the similarity levels are categorized into three groups: very similar, locally similar, and non-imitating. Figure 1 shows the visual similarity classification phishing detection process.
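The three-way categorization can be sketched as a simple cascade. This is only an illustration under stated assumptions: the function and field names are hypothetical, the rule that both the contour similarity α and the color similarity β must pass is an assumption (the source does not say how they are combined), and the default thresholds are borrowed from the Microsoft rows of Tables 1 and 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Illustrative defaults taken from the Microsoft rows of Tables 1 and 2.
    contour: float = 0.85   # minimum contour similarity (alpha)
    color: float = 0.78     # minimum color similarity (beta)
    match_points: int = 3   # minimum number of local feature matches


def classify(contour_sim, color_sim, match_points, t=Thresholds()):
    """Return one of the three similarity categories.

    Assumption: a page is 'very similar' only if BOTH whole-page scores
    pass; otherwise local feature matches are checked.
    """
    if contour_sim >= t.contour and color_sim >= t.color:
        return "very similar"
    if match_points >= t.match_points:
        return "locally similar"
    return "non-imitating"


if __name__ == "__main__":
    print(classify(0.90, 0.80, 0))   # whole-page scores pass
    print(classify(0.50, 0.50, 4))   # only local matches pass
    print(classify(0.50, 0.50, 1))   # nothing passes
```

Running the cheap whole-page check first means the more expensive local feature matching is only needed for pages that are not near-copies.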

Very Similar Cases
Locality-Sensitive Hashing (LSH) is used to detect pages that have a high level of similarity. In cases of very high similarity, screenshots of the whole webpages must be compared. To speed up the comparison without losing accuracy, an LSH method based on the wavelet hashing (wHash) mechanism with a color histogram is proposed in [3]. Table 1 summarizes the similarity metrics used for phishing detection.
Table 1: Summary of similarity metrics used for phishing detection
No. | Target Website | Similarity Type | Threshold | Mechanism | Number of Websites |
1 | Microsoft | Contour similarity α | ≥ 0.85 | wHash | 1661 |
2 | Microsoft | Color similarity β | ≥ 0.78 | wHash | 1661 |
3 | Microsoft | pHash similarity | ≥ 0.65 | pHash | 1661 |
4 | Dropbox | Color similarity β | ≥ 0.94 | wHash | 1843 |
5 | Dropbox | pHash similarity | ≥ 0.78 | pHash | 1843 |
6 | Bank of America | Contour similarity α | ≥ 0.85 | wHash | 1867 |
7 | Bank of America | Color similarity β | ≥ 0.78 | wHash | 1867 |
8 | Bank of America | pHash similarity | ≥ 0.75 | pHash | 1867 |
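To make the two whole-page scores more concrete, here is a minimal pure-Python sketch of a wavelet-style hash and a color histogram comparison. It is not the implementation from [3]: it assumes a square grayscale grid whose side is a power-of-two multiple of the hash size, keeps only the Haar low-frequency band (2x2 block means, which are proportional to the Haar approximation coefficients), and assumes that hash similarity is the fraction of equal bits and that histogram similarity is histogram intersection. All function names are hypothetical.

```python
from statistics import median


def haar_approx(img):
    """One Haar level: keep the low-frequency band as 2x2 block means."""
    n = len(img) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1]
              + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(n)] for r in range(n)]


def wavelet_hash(gray, hash_size=8):
    """Reduce a square grayscale grid to hash_size x hash_size bits."""
    band = gray
    while len(band) > hash_size:
        band = haar_approx(band)
    flat = [v for row in band for v in row]
    med = median(flat)
    # One bit per coefficient: 1 if above the median, else 0.
    return [1 if v > med else 0 for v in flat]


def hash_similarity(h1, h2):
    """Fraction of equal bits, i.e. 1 - normalized Hamming distance."""
    return sum(1 for a, b in zip(h1, h2) if a == b) / len(h1)


def color_histogram(pixels, bins=4):
    """Normalized RGB histogram with bins**3 cells; pixels are (r, g, b)."""
    hist = [0.0] * (bins ** 3)
    count = 0
    for r, g, b in pixels:
        idx = ((r * bins // 256) * bins + (g * bins // 256)) * bins \
              + (b * bins // 256)
        hist[idx] += 1
        count += 1
    return [h / count for h in hist]


def histogram_similarity(h1, h2):
    """Histogram intersection: 1.0 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


if __name__ == "__main__":
    page = [[r * 16 + c for c in range(16)] for r in range(16)]
    inverted = [[255 - v for v in row] for row in page]
    print(hash_similarity(wavelet_hash(page), wavelet_hash(page)))
    print(hash_similarity(wavelet_hash(page), wavelet_hash(inverted)))
```

Scores produced this way would then be compared against per-target thresholds such as those in Table 1.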
Locally Similar Cases
Scale-Invariant Feature Transform (SIFT) is used to detect pages whose similarity level is categorized as locally similar.
The SIFT technique is based on features of local appearance at points of interest on an object. Such image features are independent of the image's size and rotation [3]. Table 2 shows the SIFT performance on an unbalanced dataset.
Table 2: Overall performance using unbalanced dataset
Target Webpage | Imitation Webpages | Match Points | Detected Webpages | Accuracy | Precision |
Microsoft | 393 | 3 | 363 | 98.14% | 99.17% |
Dropbox | 207 | 3 | 180 | 98.61% | 100.00% |
Bank of America | 152 | 11 | 151 | 99.95% | 100.00% |
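The "Match Points" column can be read as a minimum number of matched local features. The sketch below counts matches between two sets of descriptors and compares the count with such a minimum. It assumes the descriptors have already been extracted by a SIFT implementation (not shown here), and it uses Lowe's ratio test with a ratio of 0.75 to filter ambiguous matches; both the ratio test and the function names are assumptions, since the source does not describe the matching step.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def count_good_matches(query_desc, target_desc, ratio=0.75):
    """Count query descriptors with an unambiguous nearest neighbour.

    A match is kept only if the nearest target descriptor is clearly
    closer than the second nearest (Lowe's ratio test).
    """
    good = 0
    for q in query_desc:
        dists = sorted(euclidean(q, t) for t in target_desc)
        if len(dists) >= 2 and dists[0] < ratio * dists[1]:
            good += 1
    return good


def is_locally_similar(query_desc, target_desc, min_match_points=3,
                       ratio=0.75):
    """True if enough local features of the target appear in the query."""
    return count_good_matches(query_desc, target_desc,
                              ratio) >= min_match_points


if __name__ == "__main__":
    # Toy 2-D descriptors; real SIFT descriptors have 128 dimensions.
    logo = [[0, 0], [10, 0], [0, 10], [10, 10]]
    suspicious = [[0.1, 0], [10, 0.1], [0, 9.9]]
    print(is_locally_similar(suspicious, logo))
```

A lower minimum (such as 3 for Microsoft and Dropbox in Table 2) makes detection more permissive, while a higher one (11 for Bank of America) demands more shared local features before a page is flagged.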
Similarity Learning Based on Deep Learning
Visual similarity-based phishing detection checks whether the visited web page has a high visual resemblance to one of the trusted websites despite having a different domain. If the visited page is not sufficiently similar to any site in the trusted list, it is classified as a legitimate page with a valid identity [1]. Similarity learning based on deep learning uses a triplet network to detect visual similarity between websites. The dataset used is VisualPhish (155 websites with 9,363 screenshots) [1]. Table 3 shows the performance of VisualPhishNet compared to other methods.
Table 3: Performance of VisualPhishNet compared with prior methods and alternative baselines [1]
Method | Top-1 Match | ROC Area |
VisualPhishNet | 81.03% | 0.9879 |
VGG16 | 51.32% | 0.8134 |
ResNet50 | 52.21% | 0.7008 |
ORB | 24.9% | 0.6922 |
HOG | 27.61% | 0.58 |
SURF | 6.55% | 0.488 |
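The triplet objective and the trusted-list decision described in this section can be sketched as follows. This is a simplified illustration, not the VisualPhishNet code: embeddings are plain lists of floats that a trained network is assumed to have produced, the Euclidean distance, margin, and threshold values are assumptions, and the names `trusted` and `trusted_domains` are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the same-site screenshot (positive) is closer to the
    anchor than the other-site screenshot (negative) by at least margin."""
    return max(0.0, distance(anchor, positive)
               - distance(anchor, negative) + margin)


def match_trusted(embedding, trusted, threshold):
    """Return the closest trusted site, or None if nothing is close enough.

    trusted maps a site name to a list of screenshot embeddings.
    """
    best_site, best_dist = None, float("inf")
    for site, embeddings in trusted.items():
        for e in embeddings:
            d = distance(embedding, e)
            if d < best_dist:
                best_site, best_dist = site, d
    return best_site if best_dist <= threshold else None


def classify_page(embedding, domain, trusted, trusted_domains, threshold):
    """Phishing = looks like a trusted site but is served from another domain."""
    site = match_trusted(embedding, trusted, threshold)
    if site is None or domain in trusted_domains[site]:
        return "legitimate"
    return "phishing"


if __name__ == "__main__":
    trusted = {"bank": [[0.0, 0.0]], "mail": [[5.0, 5.0]]}
    domains = {"bank": {"bank.example"}, "mail": {"mail.example"}}
    print(classify_page([0.1, 0.0], "evil.example", trusted, domains, 1.0))
    print(classify_page([0.1, 0.0], "bank.example", trusted, domains, 1.0))
```

The Top-1 Match column in Table 3 corresponds to how often the closest trusted site found this way is the site the phishing page actually targets.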
Conclusion
A qualitative analysis of the successful cases of VisualPhishNet shows that the network identified easy phishing pages (highly similar to pages in training) and, more importantly, phishing pages that were partially copied, obfuscated, or unseen. In the ‘very similar’ case, the wHash mechanism with the color histogram is both more accurate and more stable than the pHash mechanism. In the ‘locally similar’ case, logo detection with the SIFT technique is a suitable choice. The study in [3] also adds a cache to reduce detection time, increasing detection speed by up to 4.6 times. In a complete test with imbalanced data, the accuracies for the Microsoft, Dropbox, and Bank of America data were 98.14%, 98.61%, and 99.95%, respectively. However, the performance difference is not obvious in a complete test with balanced data. Threshold setting and processing speed remain topics for future work.
References
- Sahar Abdelnabi, Katharina Krombholz, and Mario Fritz. Visualphishnet: Zero-day phishing website detection by visual similarity. In Proceedings of the 2020 ACM SIGSAC conference on computer and communications security, pages 1681–1698, 2020.
- Ahmet Selman Bozkir and Ebru Akcapinar Sezer. Use of hog descriptors in phishing detection. In 2016 4th International Symposium on Digital Forensic and Security (ISDFS), pages 148–153. IEEE, 2016.
- Jiann-Liang Chen, Yi-Wei Ma, and Kuan-Lung Huang. Intelligent visual similarity-based phishing websites detection. Symmetry, 12(10):1681, 2020.
- Anthony Y Fu, Liu Wenyin, and Xiaotie Deng. Detecting phishing web pages with visual similarity assessment based on earth mover’s distance (emd). IEEE transactions on dependable and secure computing, 3(4):301–311, 2006.
- Fujiao Ji, Kiho Lee, Hyungjoon Koo, Wenhao You, Euijin Choo, Hyoungshick Kim, and Doowon Kim. Evaluating the effectiveness and robustness of visual similarity-based phishing detection models. arXiv preprint arXiv:2405.19598, 2024.
- Ieng-Fat Lam, Wei-Cheng Xiao, Szu-Chi Wang, and Kuan-Ta Chen. Counteracting phishing page polymorphism: An image layout analysis approach. In International conference on information security and assurance, pages 270–279. Springer, 2009.
- Colin Whittaker, Brian Ryner, and Marria Nazif. Large-scale automatic classification of phishing pages. In Ndss, volume 10, page 2010, 2010.
- Sedik, A., Maleh, Y., El Banby, G. M., Khalaf, A. A., Abd El-Samie, F. E., Gupta, B. B., … & Abd El-Latif, A. A. (2022). AI-enabled digital forgery analysis and crucial interactions monitoring in smart communities. Technological Forecasting and Social Change, 177, 121555.
- Agrawal, D. P., Gupta, B. B., Yamaguchi, S., & Psannis, K. E. (2018). Recent Advances in Mobile Cloud Computing. Wireless Communications and Mobile Computing, 2018.
- Goyal, S., Kumar, S., Singh, S. K., Sarin, S., Priyanshu, Gupta, B. B., … & Colace, F. (2024). Synergistic application of neuro-fuzzy mechanisms in advanced neural networks for real-time stream data flux mitigation. Soft Computing, 28(20), 12425-12437.
- Kulkarni, A. D., & Brown III, L. L. (2019). Phishing websites detection using machine learning.
- Kumar, J., Santhanavijayan, A., Janet, B., Rajendran, B., & Bindhumadhava, B. S. (2020, January). Phishing website classification and detection using machine learning. In 2020 international conference on computer communication and informatics (ICCCI) (pp. 1-6). IEEE.
- Zamir, A., Khan, H. U., Iqbal, T., Yousaf, N., Aslam, F., Anjum, A., & Hamdani, M. (2020). Phishing web site detection using diverse machine learning algorithms. The Electronic Library, 38(1), 65-80.
Cite As
Kerja R. (2025) Visual Similarity-Based Phishing Websites Detection, Insights2Techinfo, pp.1