Feature Engineering for Smishing Detection: Analyzing SMS-Based Threats

By: Gonipalli Bharath, Vel Tech University, Chennai, India, & International Center for AI and Cyber Security Research and Innovations, Asia University, Taiwan, Gmail: gonipallibharath@gmail.com

Abstract:

Smishing, or SMS phishing, is an emerging cyber threat in which attackers use spoofed SMS messages to trick users into divulging sensitive information. Smishing messages are difficult to detect because they are short and attack tactics change quickly. This paper discusses feature engineering methods for improving smishing detection using natural language processing (NLP), metadata, and behavioral patterns. We provide a systematic approach to selecting and extracting pertinent features, along with a comparative evaluation of machine learning algorithms. The research highlights the value of engineered features in improving classification accuracy and suggests an effective feature set for robust smishing detection.

Introduction:

The growth of mobile communication has given attackers new avenues to target users through smishing. In contrast to conventional phishing attacks delivered by email, smishing uses text messages, usually impersonating official entities. The messages include malicious links, fake requests, or prompts to disclose personal information. Smishing attacks are particularly risky because SMS channels lack the sophisticated filtering that email has. Current detection techniques are mostly rule-based or blacklist-based and do not generalize to new attack patterns[1]. A successful smishing detection system therefore needs to be machine learning-based and to rely on carefully crafted features for classification efficiency.

Methodology:

The proposed approach consists of several phases, from data collection through feature engineering to model testing. Data collection is the first phase: publicly available datasets of labeled SMS messages, both legitimate and smishing, are used[2]. Synthetic smishing messages are also created to add diversity to the datasets. The text is then preprocessed and cleaned, which includes stopword removal, lowercasing, and removal of redundant characters, to normalize the data prior to feature extraction[3].
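The cleaning steps named above (lowercasing, removal of redundant characters, stopword removal) can be sketched as follows. This is a minimal illustration and not the exact pipeline used in this work: the stopword list is a small hypothetical sample, and "redundant characters" is assumed here to mean anything other than letters, digits, and whitespace.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stopword list used only for illustration.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for", "your", "you"}


def preprocess(message: str) -> List[str]:
    """Lowercase a message, strip non-alphanumeric characters, and drop stopwords."""
    text = message.lower()
    # Assumption: every character that is not a letter, digit, or whitespace is redundant.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # split() with no arguments also collapses repeated whitespace.
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]


tokens = preprocess("URGENT!!! Your account is LOCKED, verify now")
print(tokens)
```

Because this cleaning destroys URL structure and punctuation counts, features that depend on them would need to be computed from the raw message before this step.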

Feature engineering is an important step that converts raw data into useful input features to improve the performance of machine learning models. Several types of features are extracted. Text features cover message attributes such as message length, word frequency, special character presence, TF-IDF (Term Frequency-Inverse Document Frequency), and n-gram analysis. URL and domain features cover the presence of links, domain reputation, and URL structure[4]. Metadata features examine sender details, message timing, and linguistic complexity. Behavioral features capture user behavior, including click-through and response behavior. Contextual features, such as the presence of financial or urgency keywords and sentiment analysis, are also incorporated to capture more subtle cues of fraudulent messages.
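A few of the text, URL, and contextual features described above can be computed directly from the raw message, as in the sketch below. The keyword sets, the URL pattern, and the feature names are assumptions made for illustration. Domain reputation, TF-IDF, metadata, and behavioral features need external data or a fitted vocabulary and are left out.

```python
import re
from typing import Dict

# Assumed URL pattern: anything starting with http://, https://, or www.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

# Hypothetical keyword sets; a real system would curate or learn these.
URGENCY_WORDS = {"urgent", "immediately", "now", "expires", "suspended", "verify"}
FINANCIAL_WORDS = {"bank", "account", "payment", "refund", "card", "prize"}


def extract_features(message: str) -> Dict[str, float]:
    """Compute simple handcrafted features from one raw SMS message."""
    words = re.findall(r"[a-z0-9']+", message.lower())
    urls = URL_RE.findall(message)
    return {
        "length": float(len(message)),
        "word_count": float(len(words)),
        "digit_count": float(sum(ch.isdigit() for ch in message)),
        # Characters that are neither alphanumeric nor whitespace.
        "special_char_count": float(
            sum(not ch.isalnum() and not ch.isspace() for ch in message)
        ),
        "has_url": float(bool(urls)),
        "url_count": float(len(urls)),
        "urgency_hits": float(sum(w in URGENCY_WORDS for w in words)),
        "financial_hits": float(sum(w in FINANCIAL_WORDS for w in words)),
    }


example = extract_features("URGENT: verify your bank account now at http://bit.ly/x1")
print(example["has_url"], example["urgency_hits"], example["financial_hits"])
```

Each message becomes a fixed-length numeric record, which is the form that tree-based and kernel-based classifiers expect.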

Following feature extraction, feature selection methods such as mutual information, chi-square tests, and recursive feature elimination (RFE) are used to keep the most discriminative features. This step is intended to yield a feature subset that improves classification performance and reduces computation time[5]. The dataset is divided into training and testing sets through stratified sampling so that both sets preserve the ratio of smishing to legitimate messages.
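To make two of these steps concrete, the sketch below scores a single binary feature with the Pearson chi-square statistic for a 2x2 contingency table and performs a stratified train/test split. It is an illustration under simplifying assumptions (binary features, binary labels, hypothetical function names), not the selection procedure used in this work. Library implementations of chi-square scoring may compute the statistic differently.

```python
import random
from typing import List, Sequence, Tuple


def chi_square_binary(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Pearson chi-square statistic for a binary feature against a binary label."""
    pairs = list(zip(feature, labels))
    a = sum(1 for f, y in pairs if f == 1 and y == 1)  # feature present, smishing
    b = sum(1 for f, y in pairs if f == 1 and y == 0)  # feature present, legitimate
    c = sum(1 for f, y in pairs if f == 0 and y == 1)  # feature absent, smishing
    d = sum(1 for f, y in pairs if f == 0 and y == 0)  # feature absent, legitimate
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a constant feature or a single class carries no signal
    return n * (a * d - b * c) ** 2 / denom


def stratified_split(
    labels: Sequence[int], test_fraction: float = 0.25, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)
    train_idx: List[int] = []
    test_idx: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 4 + [0] * 12                # 4 smishing, 12 legitimate messages
has_url = [1, 1, 1, 0] + [1] + [0] * 11    # toy binary feature
print(chi_square_binary(has_url, labels))
train, test = stratified_split(labels)
print(len(train), len(test))
```

A higher chi-square score indicates a stronger association between the feature and the label, so features can be ranked by score and the weakest ones dropped.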

Classifiers like Random Forest, Support Vector Machine (SVM), and Long Short-Term Memory (LSTM) networks are trained and tested. Grid search with cross-validation is used to tune hyperparameters and enhance the performance of the models. The performance of various models is assessed using accuracy, precision, recall, and F1-score. The final step is to incorporate the trained model into a real-time detection system, where the model is able to classify incoming SMS messages and flag potential smishing attempts for manual investigation.
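The four evaluation metrics mentioned above follow directly from the confusion counts. The sketch below treats smishing as the positive class (label 1); the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this illustration.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
```

Because legitimate messages usually outnumber smishing messages, precision and recall on the smishing class tend to be more informative than accuracy alone.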

Proposed Smishing Detection Model:

The machine learning classifiers are applied after the feature extraction phase. Random Forest, Support Vector Machine (SVM), and Long Short-Term Memory (LSTM) networks serve as the models under evaluation[6]. Random Forest is used because it resists overfitting and handles heterogeneous features together, SVM works effectively in high-dimensional feature spaces, and LSTM captures sequential patterns in text. The flowchart below shows the sequence of feature engineering and detection steps:

Fig(1) Flow chart of phishing detection

Observations and Perspectives:

The accuracy of the detection system depends substantially on the selection of appropriate features. The Support Vector Machine performs best on established text features, whereas Long Short-Term Memory networks perform better on sequential text patterns. Random Forest handles all feature types well while maintaining balanced detection performance, with high precision and recall[6]. Detection improves when behavioral features are added, because attackers rely on predictable interaction patterns. Models that include metadata features outperform text-only models, which shows the importance of analyzing sender and URL information.

Conclusion:

Feature engineering is vital to improving smishing detection models. Combining text-based, URL-related, metadata, and behavioral features allows the detection system to better discriminate between legitimate and fraudulent SMS messages. The analysis indicates that detection systems perform best when machine learning and deep learning methods are combined in one system. Future research will concentrate on developing real-time mobile smishing detection systems and making them more interpretable for end users.

References:

  1. R. J, K. B, K. VKG, A. R, and A. A, “Phishing Attack Detector and Awareness Generator,” in 2024 International Conference on Power, Energy, Control and Transmission Systems (ICPECTS), Oct. 2024, pp. 1–6. doi: 10.1109/ICPECTS62210.2024.10780168.
  2. H. Xu, A. Qadir, and S. Sadiq, “Enhancing Mobile Cybersecurity: Smishing Detection Using Ensemble Learning and Smote,” Jun. 25, 2024, Social Science Research Network, Rochester, NY: 4875342. doi: 10.2139/ssrn.4875342.
  3. M. K. Mehmood, H. Arshad, M. Alawida, and A. Mehmood, “Enhancing Smishing Detection: A Deep Learning Approach for Improved Accuracy and Reduced False Positives,” IEEE Access, vol. 12, pp. 137176–137193, 2024, doi: 10.1109/ACCESS.2024.3463871.
  4. Y. Tashtoush, M. Alajlouni, F. Albalas, and O. Darwish, “Exploring low-level statistical features of n-grams in phishing URLs: a comparative analysis with high-level features,” Clust. Comput., vol. 27, no. 10, pp. 13717–13736, Dec. 2024, doi: 10.1007/s10586-024-04655-5.
  5. “Feature selection techniques for machine learning: a survey of more than two decades of research | Knowledge and Information Systems.” Accessed: Feb. 12, 2025. [Online]. Available: https://link.springer.com/article/10.1007/s10115-023-02010-5
  6. M. Zolfaghari and M. R. Golabi, “Modeling and predicting the electricity production in hydropower using conjunction of wavelet transform, long short-term memory and random forest models,” Renew. Energy, vol. 170, pp. 1367–1381, Jun. 2021, doi: 10.1016/j.renene.2021.02.017.
  7. Sahoo, S. R., & Gupta, B. B. (2019). Hybrid approach for detection of malicious profiles in twitter. Computers & Electrical Engineering, 76, 65-81.
  8. Zou, L., Sun, J., Gao, M., Wan, W., & Gupta, B. B. (2019). A novel coverless information hiding method based on the average pixel value of the sub-images. Multimedia Tools and Applications, 78, 7965-7980.
  9. Katiyar A. (2024) Social Engineering Phishing Detection, Insights2Techinfo, pp.1

Cite As

Bharath G. (2024) Feature Engineering for Smishing Detection: Analyzing SMS-Based Threats, Insights2Techinfo, pp.1
