By: Rekarius, CCRI, Asia University, Taiwan

Abstract

Phishing attacks have become one of the most damaging forms of cyberattack, responsible for a large percentage of losses, such as financial theft and data breaches, that affect companies and individuals. This study reviews three papers that apply AI-driven approaches to phishing attack detection. These approaches employ Reinforcement Learning, Machine Learning techniques such as Logistic Regression, Random Forest, Support Vector Machine (SVM), CatBoost, and XGBoost, as well as Feedforward Deep Neural Networks (FC-DNN), Long Short-Term Memory (LSTM), and Inception-based Convolutional Networks (ID CNN).

Keywords: Phishing Detection, AI-driven Approaches Detection, Reinforcement Learning, Machine Learning Techniques, Deep Learning

Introduction

The massive global use of the internet today has opened up opportunities for various cybercrimes. Phishing has established itself as one of the most pervasive and damaging forms of cyberattack, responsible for a large percentage of data breaches, financial theft, and personal information compromises [3]. Email phishing attacks are among the most prevalent and dangerous types of cybercrime [2]. Several solutions have been developed for detecting phishing attacks. However, they mainly rely on a blacklist approach, which is ineffective at detecting zero-day phishing attacks. This study reviews three papers that apply AI-driven approaches to phishing attack detection. These approaches employ Reinforcement Learning, Machine Learning techniques such as Logistic Regression, Random Forest, Support Vector Machine (SVM), CatBoost, and XGBoost, as well as Feedforward Deep Neural Networks (FC-DNN), Long Short-Term Memory (LSTM), and Inception-based Convolutional Networks (ID CNN).

Proposed Methodology

This section summarises the methodologies proposed in the three selected papers on AI-driven phishing detection.

Reinforcement Learning

The reinforcement learning approach was proposed to address the shortcomings of traditional phishing detection strategies such as rule-based methods and heuristic approaches.

Data Collection and Feature Extraction

Two datasets were used in this method:

1. Dataset 1 (Real-World Dataset): A dataset consisting of 5000 emails, evenly distributed with 2500 phishing emails (sourced from PhishTank and OpenPhish archives) and 2500 benign emails (collected from public corporate email traffic and open datasets). Dataset 1 was split using an 80/20 ratio (4000 for training and 1000 for testing);
2. Dataset 2 (Synthetic Phishing Dataset): A synthetic dataset containing 1000 phishing emails generated using templates based on real-world phishing attack patterns, including domain spoofing, urgent calls-to-action, credential harvesting requests, and misleading hyperlinks. Dataset 2 was reserved exclusively for external validation without prior model exposure.

The authors focused on extracting both content-based and header-based features, including the following:

URL Features: Length of URLs, number of subdomains, presence of IP addresses, and suspicious domain patterns;

HTML Structure: Presence of embedded scripts, hidden form fields, and iframe usage;

Sender Reputation: Mismatched “From” and “Reply-To” addresses, domain age, and SPF/DKIM validation status;

Keyword Patterns: Occurrence of phishing-related keywords like “urgent”, “verify your account”, and “password reset”.
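The feature groups above can be illustrated with a minimal Python sketch. It is not the extraction code from the reviewed paper: the function names, the keyword list (copied from the examples above), and the naive subdomain count are all assumptions made for illustration, and HTML structure, domain age, and SPF/DKIM checks are left out.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list, taken from the examples in the text above.
PHISHING_KEYWORDS = ("urgent", "verify your account", "password reset")

# Matches a dotted-quad host such as 192.168.0.1 (no range validation).
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_features(url: str) -> dict:
    """Return simple lexical features for a single URL."""
    host = urlparse(url).hostname or ""
    labels = host.split(".") if host else []
    is_ip = bool(IPV4_RE.match(host))
    # Naive assumption: the last two labels form the registered domain,
    # which is wrong for suffixes such as "co.uk".
    subdomains = 0 if is_ip else max(len(labels) - 2, 0)
    return {
        "url_length": len(url),
        "num_subdomains": subdomains,
        "has_ip_host": is_ip,
    }


def keyword_counts(body: str) -> dict:
    """Count case-insensitive occurrences of each phishing keyword."""
    text = body.lower()
    return {kw: text.count(kw) for kw in PHISHING_KEYWORDS}


def reply_to_mismatch(from_addr: str, reply_to: str) -> bool:
    """True when the From and Reply-To domains differ."""
    def domain(addr: str) -> str:
        return addr.rsplit("@", 1)[-1].strip().lower()
    return domain(from_addr) != domain(reply_to)
```

Each function returns plain Python values, so the results can be concatenated into one numeric feature vector per email before being passed to a classifier.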

Reinforcement Learning Model

Their proposed phishing detection system is based on the Deep Q-Network (DQN) architecture, a widely adopted RL technique that combines Q-Learning with deep neural networks to handle large, complex feature spaces effectively [6].

Agent-Environment Interaction

In the RL framework, the phishing detection model operates as an agent interacting with an environment (the dataset). The environment presents emails with various features, and the agent's task is to classify each as either phishing or benign. The interaction is formalized as a Markov Decision Process (MDP), defined by the tuple (S, A, R, P, γ), where the following variables are used: S = Set of states (email features); A = Set of actions (phishing, benign); R = Reward function; P = State transition probability; γ = Discount factor for future rewards.
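The agent-environment loop described above can be sketched as a small Python class. This is an illustrative toy and not the environment from the reviewed paper: the class name `EmailEnv`, the +1/-1 rewards, and the rule that the next state is simply the next email in a shuffled order (independent of the action taken) are all assumptions.

```python
import random
from dataclasses import dataclass, field

PHISHING, BENIGN = 1, 0  # the two actions, also used as labels


@dataclass
class EmailEnv:
    """Toy episodic environment that presents one email per step.

    `emails` is a list of (feature_vector, label) pairs. The +1/-1 rewards
    are placeholders, not the values used in the reviewed paper.
    """
    emails: list
    seed: int = 0
    _order: list = field(default_factory=list, init=False)
    _pos: int = field(default=0, init=False)

    def reset(self):
        """Shuffle the emails and return the first state."""
        rng = random.Random(self.seed)
        self._order = list(range(len(self.emails)))
        rng.shuffle(self._order)
        self._pos = 0
        return self._state()

    def _state(self):
        return self.emails[self._order[self._pos]][0]

    def step(self, action: int):
        """Score the action, then move on to the next email."""
        label = self.emails[self._order[self._pos]][1]
        reward = 1.0 if action == label else -1.0
        self._pos += 1
        done = self._pos >= len(self.emails)
        next_state = None if done else self._state()
        return next_state, reward, done
```

An agent would call `reset()` once per episode and then `step(action)` until `done` is true, accumulating the rewards it receives.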

Q-Learning Algorithm

The model was trained for over 10,000 iterations to ensure convergence. The reward system was defined over the four possible classification outcomes:

TP (True Positive): Correctly identified phishing email

TN (True Negative): Correctly identified legitimate email

FP (False Positive): Legitimate email misclassified as phishing

FN (False Negative): Phishing email misclassified as legitimate
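As a rough illustration of how such outcome-based rewards feed into Q-learning, the sketch below maps each prediction to one of the four outcomes and applies a single tabular Q-learning update. Tabular Q-learning is used here in place of the DQN for brevity, and the reward values, learning rate `alpha`, and discount `gamma` are hypothetical, since the numeric values are not given above.

```python
from collections import defaultdict

PHISHING, BENIGN = 1, 0

# Hypothetical reward values; only the four outcomes are given in the text.
REWARDS = {"TP": 1.0, "TN": 1.0, "FP": -1.0, "FN": -2.0}


def outcome(action: int, label: int) -> str:
    """Map a (predicted, true) pair to TP, TN, FP, or FN."""
    if action == PHISHING:
        return "TP" if label == PHISHING else "FP"
    return "TN" if label == BENIGN else "FN"


def new_q_table():
    """Q-table with zero-initialised values for both actions."""
    return defaultdict(lambda: {PHISHING: 0.0, BENIGN: 0.0})


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q += alpha * (TD target - Q)."""
    # A terminal transition (next_state is None) has no future value.
    best_next = 0.0 if next_state is None else max(q[next_state].values())
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]
```

Penalising false negatives more heavily than false positives, as in this hypothetical table, is one way to bias an agent toward catching phishing emails; a DQN replaces the table with a neural network but uses the same target.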

Table 1 below summarizes the phishing detection accuracy and false positive rates during the training phase across different models.

Table 1: Comparative performance metrics (training phase)

Model	Accuracy (%)	False Positive Rate (%)
RL-Based DQN	95	2
SVM	85	12
Random Forest	87	10
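For reference, the two metrics in Table 1 follow directly from confusion-matrix counts. The short Python sketch below shows the arithmetic; the counts are hypothetical and chosen only to illustrate the formulas, not taken from the reviewed paper.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all emails classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of legitimate emails wrongly flagged as phishing."""
    return fp / (fp + tn)


# Hypothetical counts for 1,000 emails (500 phishing, 500 legitimate).
counts = {"tp": 460, "fn": 40, "tn": 490, "fp": 10}

acc = accuracy(**counts)
fpr = false_positive_rate(counts["fp"], counts["tn"])
print(f"accuracy={acc:.1%}, false positive rate={fpr:.1%}")
# prints: accuracy=95.0%, false positive rate=2.0%
```

Note that the false positive rate is computed over legitimate emails only, so a model can combine high accuracy with a high false positive rate when legitimate emails are a minority of the data.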

Linking Back to Previous Strategies

Unlike rule-based systems that rely on static thresholds [4], or heuristic models prone to high false positives [1], the RL-based approach learns from real-time feedback, dynamically adjusting its decision-making process. It also overcomes the data-dependency issues seen in supervised ML techniques [5], offering the following:

Real-Time Adaptation: Continuous learning from new phishing patterns;
Lower False Positive Rates: Optimized through tailored reward mechanisms;
Scalability: Effective across diverse phishing attack vectors.

This methodological framework demonstrates how RL bridges the gaps left by traditional phishing detection models.

Machine Learning Techniques 1

The methodology in this paper is based on a systematic approach to data collection and preparation for machine learning applications. The data were gathered from two sources, both obtained via Kaggle. The first dataset consists of 18,650 emails, with 7,328 categorised as phishing attacks and 11,322 as safe. The second dataset consists of 5,128 emails, with 2,868 categorised as safe and 2,239 as phishing. The datasets' main features are "Email Text" and "Email Type", where the email text is used as input and the email type (safe or phishing) is used as the classification label.

The datasets were split into training and testing subsets. The first dataset uses an 80:20 ratio, giving 14,920 emails for training and 3,730 for testing. The second is reported with 3,597 emails for training and 1,531 for testing, which corresponds to roughly 70:30 rather than 80:20. In both cases the model is trained on a substantial portion of the data while still being evaluated on unseen examples to measure generalisation ability.

Data preprocessing is a crucial step in building a robust phishing email detection system. This study carefully handles null values by removing incomplete data entries to maintain the integrity of the dataset. Text cleaning techniques, such as removal of special characters, stop word removal, and lemmatisation, are applied to standardise email content. Feature extraction and engineering are then used to convert the text data into structured numerical representations, making it suitable for machine learning models.
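The preprocessing steps described above can be sketched in plain Python as follows. This is a simplified illustration and not the pipeline from the reviewed paper: the stop-word list is a tiny hypothetical sample, lemmatisation is omitted because it needs a language resource such as a lemmatiser library, and a simple count vector stands in for whatever feature representation the authors used.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "is", "your", "and", "of", "in", "for"}


def drop_nulls(rows):
    """Keep only rows whose text and label are both present and non-empty."""
    return [(text, label) for text, label in rows if text and label]


def clean_text(text: str) -> list:
    """Lowercase, replace non-letters with spaces, and drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def bag_of_words(tokens, vocabulary):
    """Count vector over a fixed vocabulary (term-frequency features)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]
```

For example, `clean_text("Verify your ACCOUNT now!!!")` yields `["verify", "account", "now"]`, which `bag_of_words` then turns into a fixed-length numeric vector suitable for the classifiers listed in this section.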

The machine learning models used in the email categorisation system were selected on the basis of empirical evidence. CatBoost, Support Vector Machine (SVM), Logistic Regression, Random Forest, and XGBoost were chosen for their strong performance in classification tasks.

Machine Learning Techniques 2

This section describes how the authors derived their proposed features for predicting phishing PDC web pages. It also outlines the system architecture of their prediction model.

Phishing web page prediction features

First, the authors of [7] describe how they identified PDC web pages, that is, web pages that collect users' personal data. From their observations, PDC web pages usually contain at least one word or phrase (termed a PDC phrase) in their structure and content that relates to the specific personal data being collected. Differentiating PDC from non-PDC web pages matters because it avoids running predictions on web pages that pose no phishing threat. This avoids degrading the user experience on non-PDC web pages and prevents false positives that would block access to them, with significant consequences for users and website owners (e.g., denial of service and loss of revenue).

To determine the common PDC phrases used by PDC web pages, the authors investigated 100 sample phishing and legitimate web pages that capture data, from which they obtained a list of 43 PDC phrases (indicated in Table 2).

Table 2: Common PDC Phrases in PDC Web Pages

Category	PDC Phrases
Login Credentials	Username, Login, Sign in, Sign up, Create an account, Forgot password, Forgotten your password, Reset password, Remember me, Passcode, Secret key, Security key, Security number, ID
Social Media Login	Log in with Facebook, Sign in with Facebook, Log in with Twitter, Sign in with Twitter, Log in with Google, Sign in with Google
Account and Membership	Customer number, Membership number, Account number, Account
Financial Information	Debit card number, Credit card number, Card number, Cardholder, Billing information, Billing address, Security code, Expiry date
Other Personal Data	Email, PIN, Date of birth, Birth date, Phone
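A minimal Python sketch of how PDC phrases like those in Table 2 might be used to flag a page as PDC is given below. It is an assumption-laden illustration, not the detection logic from [7]: the phrase list is a subset of Table 2, the function names are hypothetical, and matching is done on plain page text with word boundaries rather than on the page structure.

```python
import re

# Subset of the phrases in Table 2 (lowercased); extend as needed.
PDC_PHRASES = [
    "username", "login", "sign in", "forgot password", "reset password",
    "account number", "credit card number", "card number", "security code",
    "expiry date", "email", "pin", "date of birth", "phone",
]


def find_pdc_phrases(page_text: str) -> list:
    """Return the PDC phrases that occur as whole words in the text."""
    text = page_text.lower()
    return [
        phrase for phrase in PDC_PHRASES
        if re.search(r"\b" + re.escape(phrase) + r"\b", text)
    ]


def is_pdc_page(page_text: str) -> bool:
    """Treat a page as PDC when at least one PDC phrase is present."""
    return bool(find_pdc_phrases(page_text))
```

The word-boundary match keeps short phrases such as "pin" from firing inside unrelated words like "shopping". Overlapping phrases are both reported, so text containing "credit card number" also matches "card number".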

System architecture of the prediction model

Training Process

The prediction model, based on the features proposed above, is built using a six-step process (illustrated in Figure 1 as steps 1 to 6).

Figure 1: Training Process

Prediction Process

The process of predicting a new web page requested by a user is shown in Figure 2.

Figure 2: Prediction Process

Conclusion

Based on the studies conducted in these three papers, it is concluded that the reinforcement learning-based phishing detection model uses Deep Q-Network (DQN) to improve adaptability, accuracy, and continuous learning, achieving 95% accuracy, 96% precision, 94% recall, 2% false positive rate, and AUC of 0.92. The XGBoost model achieves an outstanding accuracy of 98%, with 95% precision and 99% recall for phishing emails, demonstrating strong capabilities in phishing email classification. The CatBoost model achieves 98% accuracy, with 98% recall for phishing emails and 96% precision, making it highly efficient in identifying phishing attempts while maintaining minimal classification errors.

References

[1] Carlo Marcelo Revoredo da Silva, Eduardo Luzeiro Feitosa, and Vinicius Cardoso Garcia. Heuristic-based strategy for phishing prediction: A survey of URL-based approach. Computers & Security, 88:101613, 2020.
[2] Opeyemi Isaiah Enitan. An AI-powered approach to real-time phishing detection for cybersecurity. International Journal, 12(6), 2023.
[3] Haidar Jabbar and Samir Al-Janabi. AI-driven phishing detection: Enhancing cybersecurity with reinforcement learning. Journal of Cybersecurity and Privacy, 5(2):26, 2025.
[4] Haidar Jabbar, Samir Al-Janabi, and Francis Syms. AI-integrated cyber security risk management framework for IT projects. In 2024 International Jordanian Cybersecurity Conference (IJCC), pages 76–81. IEEE, 2024.
[5] V Santhana Lakshmi and MS Vijaya. Efficient prediction of phishing websites using supervised learning algorithms. Procedia Engineering, 30:798–805, 2012.
[6] Seow Wooi Liew, Nor Fazlida Mohd Sani, Mohd Taufik Abdullah, Razali Yaakob, and Mohd Yunus Sharum. An effective security alert mechanism for real-time phishing tweet detection on Twitter. Computers & Security, 83:201–207, 2019.
[7] Thomas Nagunwa. AI-driven approach for robust real-time detection of zero-day phishing websites. International Journal of Information and Computer Security, 23(1):79–118, 2024.
[8] Arya, V., Gaurav, A., Gupta, B. B., Hsu, C. H., & Baghban, H. (2022, December). Detection of malicious node in VANETs using digital twin. In International Conference on Big Data Intelligence and Computing (pp. 204-212). Singapore: Springer Nature Singapore.
[9] Lu, Y., Guo, Y., Liu, R. W., Chui, K. T., & Gupta, B. B. (2022). GradDT: Gradient-guided despeckling transformer for industrial imaging sensors. IEEE Transactions on Industrial Informatics, 19(2), 2238-2248.
[10] Sedik, A., Maleh, Y., El Banby, G. M., Khalaf, A. A., Abd El-Samie, F. E., Gupta, B. B., … & Abd El-Latif, A. A. (2022). AI-enabled digital forgery analysis and crucial interactions monitoring in smart communities. Technological Forecasting and Social Change, 177, 121555.

Cite As

Rekarius (2025) AI-Driven Real-Time Detection of Zero-Day Phishing Threats for Enhanced Cybersecurity, Insights2Techinfo, pp.1
