Evaluating the Efficacy of Phishing Detection Models in Multi-Lingual Environments

By: Mosiur Rahaman, International Center for AI and Cyber Security Research and Innovations, Asia University, Taiwan

Abstract: Phishing attacks are a big problem in the digital world right now, and attackers are always changing how they trick people. Many phishing detection models have been made, but not much research has been done on how well they work in settings with multiple languages. This study looks at how well different phishing detection models work in different languages. It points out the problems and suggests ways to make them work better on a world scale. The study shows that present models work well with English, but they are less accurate when used with other languages. We show that using language-specific features along with cross-lingual transfer learning leads to better recognition rates across several languages.

Keywords: Phishing attacks, Multi-Lingual, Languages

Introduction:

Phishing, which is when someone tries to get private information by pretending to be someone or something they’re not, is a widespread issue that affects people and businesses all over the world. As internet use spreads around the world, phishing attacks are no longer just aimed at English-speaking users. They now target people who speak a wide range of languages. Because of this, it is now more important than ever to have good phishing detection models that work well in settings with multiple languages. The purpose of this study is to look at how well current phishing detection models work in different languages and suggest a new way to make them work better in a global setting.

Literature Review:

There has been a lot of study into finding phishing emails, and many models have been made using machine learning (ML), natural language processing (NLP), and deep learning. Support vector machines, neural networks, decision trees, and ensemble methods are some of the most well-known models. It is assumed that most of these models are linguistically similar because they are trained and tested mostly on English datasets.

Classical machine learning models, like Random Forests and Support Vector Machines, have been shown to be good at finding phishing attempts. As an example, [1] used Random Forests to find fake emails very accurately. But their study only looked at statistics in English. Textual features are analysed by NLP-based models to find phishing material. Studies like [2] used NLP to tell the difference between real emails and fake emails. Even though they work, these models aren’t always reliable in settings with more than one language.

Recent progress in deep learning has led to the creation of more advanced models that can spot scams. In [3] suggested using a convolutional neural network (CNN)-based method, which worked amazingly well. Even so, it’s still not clear how well their model works in languages other than English. When you’re dealing with people who speak more than one language, it can be hard to spot phishing because different countries use different phishing techniques and languages.

Approach:

We present a fresh method combining language-specific characteristics and cross-lingual transfer learning to solve the constraints of current models in multi-lingual situations is shown in Figure 1, Using our approach entails.

A screenshot of a phone

Description automatically generated
Figure 1: Language and Cross-lingual transfer Learning

Feature Extraction Specific to Languages:

Include syntactic and semantic aspects particular to every language. Using language models such as BERT (Bidirectional Encoder Representations from Transformers) refined for languages to capture subtle linguistic patterns [4].

Learning via Cross-Lingual Transfer:

Using knowledge from high-resource languages (e.g., English) applied by transfer learning approaches to enhance model performance in low-resource languages.To enable cross-lingual knowledge and representation using models including mBERT (multilingual BERT) and XLM-R (Cross-lingual Language Model – RoBERTa) [5].

Architecture of hybrid models:

Combining deep learning methods with conventional ML models generates a hybrid model with advantages from both approaches. using an ensemble approach to combine forecasts from several models, hence improving general detection accuracy. Investigate the mathematical and theoretical foundations of every component of our suggested phishing detection approach for multi-lingual situations to validate its effectiveness:

  1. Language-Specific Feature Extraction

Allow to stand for the input text in each language. Formalized as language-specific feature extraction, is:

is the set of characteristics unique to language and .

Dependency parsing offers a tree structure for syntactic characteristics that helps us. Let represent syntactic elements derived from .Regarding semantic aspects, we use word embeddings such those produced by BERT, which offer context-aware representations:

Consequently, the overall feature set is:

Theoretical Reasoning:

By means of self-attention strategies, language models such as BERT effectively capture deep semantic meanings and grammatical patterns. guarantees that BERT learns linguistic patterns particular to a given language by fine-tuning it for that language. Accurately capturing subtleties that generic models could overlook depends on this fine-tuning.

  1. Cross-lingual Transfer Learning: Mathematical Expression

Let be a high-resource language model (such as English BERT) and let be a low-resource language model. Transfer learning can be stated as transferring parameters from to .

Where stands for the fine-tuning changes particular to the low-resource language.

  1. Hybrid Model Architectural Design Mathematical Interpretation

Think of two models: a deep learning model and a conventional model . Combining these under the hybrid model .

Usually found by validation, where and are weighting factors balancing the contributions of every model.

Strategy:

The shared prediction is if :

Justification theoretically:

Combining deep learning models with conventional ML models lets the hybrid model take use of both. While deep learning models capture complicated, non-linear relationships, traditional models shine at catching explicit patterns and rules. Supported by the principle of shared learning, which holds that combined predictors generally outperform individual ones, the shared technique also enhances robustness and accuracy by averaging out individual model flaws.

Conclusion:


Global safety depends on being able to spot phishing attacks in places where people speak different languages. Because of differences in language and culture, our evaluation shows that current models often don’t work as well in other languages as they do in English. Finding things much more quickly in a lot of different languages is much easier with the suggested way, which combines language-specific features with cross-lingual transfer learning. If you want to make phishing detection even better in today’s diverse and globalized digital world, you should keep working on these models and looking into more linguistic traits.

Reference:

  1. A. Safi and S. Singh, “A systematic literature review on phishing website detection techniques,” Journal of King Saud University – Computer and Information Sciences, vol. 35, no. 2, pp. 590–611, Feb. 2023, doi: 10.1016/j.jksuci.2023.01.004.
  2. R. Alanazi and S. Alanazi, “A hybrid NLP and domain validation technique for disposable email detection,” Alexandria Engineering Journal, vol. 102, pp. 200–210, Sep. 2024, doi: 10.1016/j.aej.2024.05.068.
  3. S. Bhatia, A. Devi, R. Alsuwailem, and A. Mashat, “Convolutional Neural Network Based Real Time Arabic Speech Recognition to Arabic Braille for Hearing and Visually Impaired,” Frontiers in Public Health, vol. 10, May 2022, doi: 10.3389/fpubh.2022.898355.
  4. C. M. Greco and A. Tagarelli, “Bringing order into the realm of Transformer-based language models for artificial intelligence and law,” Artif Intell Law, Nov. 2023, doi: 10.1007/s10506-023-09374-7.
  5. Jain, A. K., et al. (2022). A content and URL analysis‐based efficient approach to detect smishing SMS in intelligent systems. International Journal of Intelligent Systems, 37(12), 11117-11141.
  6. Almomani, A., et al. (2022). Phishing website detection with semantic features based on machine learning classifiers: a comparative study. International Journal on Semantic Web and Information Systems (IJSWIS), 18(1), 1-24.
  7. Gupta, B. B., et al. (2022). Artificial intelligence empowered emails classifier for Internet of Things based systems in industry 4.0. Wireless networks, 28(1), 493-503.
  8. Jain, A. K., & Gupta, B. B. (2022). A survey of phishing attack techniques, defence mechanisms and open research challenges. Enterprise Information Systems, 16(4), 527-565.

Cite As

Rahaman M, (2024) Evaluating the Efficacy of Phishing Detection Models in Multi-Lingual Environments, Insights2Techinfo, pp.1

71270cookie-checkEvaluating the Efficacy of Phishing Detection Models in Multi-Lingual Environments
Share this:

Leave a Reply

Your email address will not be published.