A Literature Review on Quishing: QR Code Malicious URL Detection Using Machine Learning

By: Rekarius, CCRI, Asia University, Taiwan

Abstract

The increasing use of QR codes in everyday life, for example in transactions, website access, and other services, has created opportunities for phishing attackers to exploit them. This type of phishing attack via QR codes is known as quishing. This review discusses machine learning methodologies for detecting quishing. The best-performing model (XGBoost) achieved an AUC score of 0.9106, indicating that direct QR code-based detection is a feasible and effective approach. The findings from the applied methods show that the structural features of QR codes are strongly correlated with phishing risks.

Keywords: Quishing Detection, Feature Selection, Machine Learning

Introduction

QR (Quick Response) codes are two-dimensional matrix barcodes used to encode information [1]. In recent years, they have increasingly found their way into everyday life and have become an integral part of modern digital interactions, facilitating seamless access to websites, payments, authentication systems, and other services. QR codes are cheap to produce and easy to deploy, which has made them a medium of choice in billboard advertising for reaching potential customers. One of the most common use cases is URL encoding, which makes information instantly available. Despite this broad range of advantages, QR codes have also been misused as an attack vector by social engineers. Attackers encode malicious links that lead to phishing sites; this is known as quishing, or QR code-based phishing. This review discusses methods applied in quishing detection.

Method

Method 1

In the first study [3], the authors created their own dataset, derived from the PhishStorm dataset, by selecting 10,000 samples evenly divided into 5,000 legitimate URLs and 5,000 phishing URLs. They then used the Python library 'qrcode' to generate the QR codes corresponding to those URLs.
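The dataset construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the helper names `balanced_sample` and `matrix_to_features`, the label convention (1 = phishing, 0 = legitimate), and the representation of a QR code as a square matrix of booleans are all assumptions made here. In practice the matrix would come from a QR encoder such as the 'qrcode' library mentioned above; that step is left out so the sketch has no third-party dependencies.

```python
import random


def balanced_sample(records, per_class, seed=0):
    """Draw `per_class` URLs of each label from (url, label) pairs.

    Assumed label convention: 1 = phishing, 0 = legitimate.
    """
    rng = random.Random(seed)
    sample = []
    for label in (0, 1):
        urls = [url for url, lab in records if lab == label]
        # rng.sample raises ValueError if a class has fewer than per_class URLs
        sample.extend((url, label) for url in rng.sample(urls, per_class))
    rng.shuffle(sample)
    return sample


def matrix_to_features(matrix):
    """Flatten a square QR module matrix (rows of booleans) into 0/1 features."""
    side = len(matrix)
    if any(len(row) != side for row in matrix):
        raise ValueError("expected a square matrix")
    # Row-major order: feature index = row * side + column
    return [1 if cell else 0 for row in matrix for cell in row]
```

With `per_class=5000` the first helper would produce the 5,000/5,000 split described in the study, and a 69×69 matrix would yield 4,761 binary pixel features per QR code.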

    1. Models

The authors employ multiple machine learning models, including Logistic Regression, Decision Tree, Naïve Bayes, Random Forest, LightGBM, and XGBoost. Hyperparameter tuning was performed using a randomized search with 10-fold cross-validation on the training set. The optimal hyperparameters for each model are summarized in Table 1 to facilitate reproducibility.

Table 1: Hyperparameters of the chosen models

Model                 Hyperparameters
Logistic Regression   'C': 0.1, 'solver': 'liblinear'
Decision Tree         'max_depth': 3, 'min_samples_leaf': 1
Random Forest         'max_depth': 20, 'n_estimators': 100
LightGBM              'learning_rate': 0.1, 'n_estimators': 200
XGBoost               'learning_rate': 0.2, 'n_estimators': 150
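To make the tuning procedure more concrete, the following is a minimal, dependency-free sketch of a randomized search with k-fold cross-validation. It is not the authors' implementation, which is not shown in the reviewed paper. The function names, the `score_fn(params, train_idx, val_idx)` callback, and the number of sampled settings are hypothetical choices made for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def randomized_search(score_fn, param_space, n_samples, n_iter=5, k=10, seed=0):
    """Sample n_iter random settings and keep the best mean validation score.

    score_fn(params, train_idx, val_idx) is a caller-supplied (hypothetical)
    callback that trains on train_idx and returns a score on val_idx.
    """
    rng = random.Random(seed)
    folds = k_fold_indices(n_samples, k, seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently and at random
        params = {name: rng.choice(values) for name, values in param_space.items()}
        fold_scores = []
        for i, val_idx in enumerate(folds):
            # Every fold except the i-th forms the training split
            train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
            fold_scores.append(score_fn(params, train_idx, val_idx))
        mean_score = sum(fold_scores) / len(folds)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

Unlike an exhaustive grid search, only `n_iter` randomly drawn settings are evaluated, each on all k folds, which keeps the cost at `n_iter * k` model fits regardless of how large the search space is.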

    2. Metrics

  • Accuracy: This metric measures the overall proportion of correctly classified instances, providing a straightforward assessment of model performance.
  • Precision: Precision quantifies the proportion of true positive predictions among all positive predictions, indicating the model's ability to minimize false positives.
  • Recall: Also known as sensitivity, recall measures the proportion of actual positives that are correctly identified by the model, reflecting its capability to detect positive cases.
  • F1-score: As the harmonic mean of precision and recall, the F1-score offers a balanced evaluation of the model's performance, particularly in scenarios where class distribution is uneven.
  • Area Under the ROC Curve (AUC): The AUC metric evaluates the model's overall ability to discriminate between classes across different threshold settings, providing a comprehensive measure of classification performance.

It is worth noting that all metrics except AUC are reported based on the default classification threshold of 0.5.
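The metric definitions above can be written out in a few lines of plain Python. The sketch below is illustrative only: the function names are made up here, scores at or above the threshold are assumed to count as the positive (phishing) class, and AUC is computed by the pairwise-ranking definition (the probability that a random positive is scored above a random negative, with ties counted as one half), which is simple but quadratic in the number of samples.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, precision, recall and F1 at a fixed decision threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0  # assumed tie rule
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, y_score):
    """Threshold-free AUC: fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that `roc_auc` never looks at the threshold, which is why AUC summarizes ranking quality across all thresholds while the other four metrics depend on the 0.5 cut-off.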

Experiment 1: Training and testing models

The models were trained using the best hyperparameters obtained from 10-fold cross-validation and then tested on the test dataset. The results showed that most models performed well, except for Naïve Bayes, which had a low AUC (0.6531). XGBoost, LightGBM, and Random Forest demonstrated the best performance with AUC scores above 0.89. LightGBM and XGBoost achieved the highest accuracy (0.8293 and 0.8258, respectively) and the highest F1-scores (0.8214 and 0.8184), indicating a good balance between precision and recall. Table 2 presents the computed metrics for each model.

Table 2: Performance metrics on the testing set

Model                      Accuracy   Precision   Recall   F1-Score   AUC
Logistic Regression        0.7983     0.8129      0.7621   0.7867     0.8737
Decision Tree Classifier   0.7578     0.7358      0.7856   0.7599     0.8138
Random Forest Classifier   0.7993     0.8570      0.7067   0.7746     0.8908
Gaussian NB                0.6376     0.8680      0.3036   0.4498     0.6531
XGBoost Classifier         0.8258     0.8332      0.8041   0.8184     0.9083
LGBM Classifier            0.8293     0.8394      0.8041   0.8214     0.9106

Experiment 2: Deriving Feature Importance

Based on the results of Experiment 1, the authors select the top three performing models (Random Forest, LightGBM, and XGBoost) and analyze their feature importance. To visualize the impact of individual pixels on the classification decision, they plot the feature importance values on a 69×69 grid, corresponding to the QR code size.
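As a rough illustration of how per-pixel importances map back onto the QR code, the sketch below reshapes a flat importance vector into a square grid and lists the most influential pixel positions. It assumes the features were flattened in row-major order, which the reviewed paper does not state explicitly; the function names are hypothetical.

```python
def importance_grid(importances, side=69):
    """Reshape a flat importance vector into a side x side grid (row-major)."""
    if len(importances) != side * side:
        raise ValueError("expected side * side importance values")
    return [list(importances[r * side:(r + 1) * side]) for r in range(side)]


def hottest_pixels(grid, top=5):
    """Return (row, column) positions of the `top` most important pixels."""
    cells = [(value, r, c) for r, row in enumerate(grid) for c, value in enumerate(row)]
    cells.sort(key=lambda cell: cell[0], reverse=True)
    return [(r, c) for _, r, c in cells[:top]]
```

The resulting grid can be passed to any plotting library as a heatmap, so that regions of the QR code with high importance stand out visually.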

Experiment 3: Feature Selection

In this experiment, the authors performed feature selection based on the most important features identified by each of the three best-performing models: Random Forest, LightGBM, and XGBoost. They then retrained all models using only these selected features and compared their performance to the original versions without feature selection. The results show that applying feature selection consistently improves or maintains model performance. This suggests that many pixels in the original QR code images are non-informative and do not contribute to phishing detection.
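A simple way to picture this step is to keep only the highest-ranked pixels and drop the rest before retraining. The sketch below uses a top-k rule, which is an assumption: this review does not state the exact selection criterion the authors applied, and the function names are hypothetical.

```python
def top_k_indices(importances, k):
    """Indices of the k most important features, returned in ascending order."""
    ranked = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return sorted(ranked[:k])


def select_features(rows, keep):
    """Project every feature row onto the retained feature indices."""
    return [[row[i] for i in keep] for row in rows]
```

Each model would then be retrained on the reduced rows and compared against its counterpart trained on all pixels.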

Method 2

The proposed system employs a machine learning-based approach to detect malicious URLs embedded in QR codes [2]. Initially, a comprehensive dataset of both benign and malicious URLs is collected from various sources, ensuring diverse and up-to-date data. The data undergoes preprocessing, including tokenization, stop-word removal, and feature extraction using natural language processing (NLP) techniques. Domain-specific characteristics such as URL length, entropy, presence of special characters, and subdomains are also extracted. Machine learning models like Random Forest and Support Vector Machine (SVM) are trained on this processed dataset to classify URLs as benign or malicious. The trained model is integrated into a QR code scanner for real-time URL legitimacy predictions. When a QR code is scanned, the extracted URL is analyzed, and users receive notifications regarding its safety. The model continuously updates with new threats, ensuring adaptability and robustness against evolving cyber-attacks, providing enhanced protection for users. The methodology used is illustrated in the diagram shown in Figure 1.

Figure 1: Methodology Flowchart
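The domain-specific URL characteristics mentioned for Method 2 (length, entropy, special characters, subdomains) could be extracted along the following lines. This is a sketch under stated assumptions rather than the system described in [2]: the function names are made up, "special characters" are taken to mean any non-alphanumeric character, and the subdomain count naively treats the last two host labels as the registered domain, which is wrong for multi-part suffixes such as "co.uk".

```python
import math
from collections import Counter
from urllib.parse import urlparse


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def url_features(url):
    """Lexical features of a URL decoded from a QR code."""
    host = urlparse(url).hostname or ""
    labels = host.split(".") if host else []
    return {
        "length": len(url),
        "entropy": shannon_entropy(url),
        # Assumption: every non-alphanumeric character counts as "special"
        "special_chars": sum(1 for ch in url if not ch.isalnum()),
        # Naive assumption: the last two labels are the registered domain
        "subdomains": max(len(labels) - 2, 0),
    }
```

A feature dictionary like this would be computed for each URL and fed, together with any NLP-derived features, to a classifier such as Random Forest or SVM.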

Conclusion

The study employing Method 1 introduced a novel approach to QR code-based phishing (quishing) detection that directly analyzes QR code structure and pixel patterns without relying on URL extraction. Using machine learning models trained on a newly created dataset, its authors demonstrate that QR-centric detection is both feasible and effective. Feature selection further improves or maintains model performance by identifying the most informative regions of the QR code while filtering out non-essential pixels. The study employing Method 2 concludes that machine learning (ML) and artificial intelligence (AI) provide a more robust solution: models like Random Forest and Support Vector Machine (SVM) analyze large datasets, extract features using natural language processing (NLP) techniques, and deliver real-time malicious URL detection.

References

  1. Katharina Krombholz, Peter Frühwirt, Peter Kieseberg, Ioannis Kapsalis, Markus Huber, and Edgar Weippl. QR code security: A survey of attacks and challenges for usable security. In International Conference on Human Aspects of Information Security, Privacy, and Trust, pages 79–90. Springer, 2014.
  2. J Ratna Kumarti, Sri Latha Kambala, Sireesha Paila, Gopichand Kunkudu, and Vinuhya Arithi Chekuri. Qr code malicious url detection system using machine learning. International Journal of Communication Networks and Information Security, 17(3):198–202, 2025.
  3. Fouad Trad and Ali Chehab. Detecting quishing attacks with machine learning techniques through QR code analysis. arXiv preprint arXiv:2505.03451, 2025.
  4. Gupta, B. B., Gaurav, A., Arya, V., & Alhalabi, W. (2024). The evolution of intellectual property rights in metaverse based Industry 4.0 paradigms. International Entrepreneurship and Management Journal, 20(2), 1111-1126.
  5. Zhang, T., Zhang, Z., Zhao, K., Gupta, B. B., & Arya, V. (2023). A lightweight cross-domain authentication protocol for trusted access to industrial internet. International Journal on Semantic Web and Information Systems (IJSWIS), 19(1), 1-25.
  6. Jain, D. K., Eyre, Y. G. M., Kumar, A., Gupta, B. B., & Kotecha, K. (2024). Knowledge-based data processing for multilingual natural language analysis. ACM Transactions on Asian and Low-Resource Language Information Processing, 23(5), 1-16.
  7. Amoah, G. A., & Hayfron-Acquah, J. B. (2022). QR Code security: mitigating the issue of quishing (QR Code Phishing). International Journal of Computer Applications, 184(33), 34-39.
  8. Bichnigauri, A., Kartvelishvili, I., Shonia, L., Bichnigauri, D., & Gudadze, O. (2023). Unveiling Quishing: The Dark Side Of Qr Codes In Cyber Attacks. თავდაცვა და მეცნიერება, (2).
  9. Trad, F., & Chehab, A. (2025). Detecting Quishing Attacks with Machine Learning Techniques Through QR Code Analysis. arXiv preprint arXiv:2505.03451.

Cite As

Rius R. (2025) A Literature Review on Quishing: QR Code Malicious URL Detection Using Machine Learning, Insights2Techinfo, pp.1
