By: Reka Kerja, CCRI, Asia University, Taiwan

Abstract

Telegram is an instant messaging application with a large user base that can be a target for cybercrime, such as spam containing malicious links. This ar- ticle, using a literature review, compiles five machine learning algorithms that can be used to detect spam on Telegram. The dataset used was obtained from Kaggle. The dataset contains 20,000 messages, which can be classified as spam or malicious (70-30%). Five machine learning models were applied: Extreme Gradient Boosting (XGB), Light Gradient Boosting Machine (LGBM), Cat- Boosting, Support Vector Machine (SVM), and K-Nearest Neighbors (KNN). Simulation results show that the SVM algorithm performs very well, achiev- ing 94% accuracy, while the KNN algorithm performs the lowest, with 73% accuracy.

Keywords Spam, Telegram, Extreme Gradient Boosting, Light Gradient Boosting Machine, CatBoosting, Support Vector Machine, K-Nearest Neigh- bors.

Introduction

Continuously receiving messages from unknown sources can be mentally drain- ing and may interfere with users ability to notice or focus on important com- munications. In addition, these messages often contain harmful phishing links. Unfortunately, few platforms particularly messaging applications are equipped with dedicated spam filtering systems, despite the critical importance of such tools in preventing potential harm or losses. Consequently, the modern world faces the challenge of monitoring incoming text streams in social networks and messengers [4]. In this article, five machine learning algorithms used to de- tecting the spam on the telegram platform are analyzed: Extreme Gradient Boosting, Light Gradient Boosting Machine, CatBoosting Ensemble Method, Support Vector Machine, and K-Nearest Neighbors.

Literature analysis

Machine Learning Algorithms

Machine learning (ML) is a branch of mathematics, computing and statistics which deals with the design of algorithms that can learn [2]. Five (5) different ML algorithms were employed in this study which include XGB, LGBM, Cat- Boosting Ensemble Method, Support Vector Machine, and K-Nearest Neighbor.

Materials and Methods

Dataset Description

The Telegram spam dataset used in this study was obtained from Kaggle. The dataset contains 20,000 messages which can be classified into spam or ham (70- 30%) [3].

Data Preprocessing

Techniques were employed to remove noise from the dataset that could affect the system’s accuracy and perform data cleansing. The techniques include Tokeniza- tion, normalization, removing repeated chars, removing punctuations, removing stop words. Pre-processing cleans and normalises the text data to ensure that it is in a consistent format [3].

Feature Extraction

Extracting suitable features is the first step in employing machine learning al- gorithm for classification problems. This facilitates the identification of spam

contents and their transformation into numerical feature vectors. Content fea- tures that represent the text included in the Telegram messages were extracted. Language features are used, namely Term-Frequency-based (TFIDF). TFIDF is the most popular feature extraction technique. It normally transforms all sen- tences as a vector of term frequencies (TF) and assigns a score for each word in the text based on the number of times its occurrence and the probability it can be found in texts. The relative importance of a term in a document compared to other words in the corpus can be shown by TF-IDF [3].

Extreme Gradient Boosting

XGB is a tree-based ensemble algorithm which employs a gradient boosting machine learning technique for accomplishing regression and classification tasks. XGB uses level0algorithms to grow trees. It is different from the RF algorithm in the way it grows, orders, and combines the results. It employs a variety of algorithms for split finding. Trees grow in leaf-wise manner when histogram is used. The method works by bucketing features values into group of bins to construct features in histogram. The splitting is performed on the bins instead of on the features. The bucket bins are constructed before each tree is built. As a result, it speeds up the training which in turn reduces the computation complexity [1].

Light Gradient Boosting Machine (LGBM)

LightGBM is essentially an ensemble algorithm which combines the predictions of multiple decision trees by adding them together to make the final prediction that generalizes better [3].

CatBoosting Ensemble Method

The CatBoost algorithm, also known as categorical boosting, belongs to the gra- dient boosting family within the domain of machine learning. The method was specifically designed to effectively handle categorical data, making it suitable for both quantitative and qualitative variables [3].

Support Vector Machine

SVM algorithm is employed for both linear and nonlinear data classification. It uses a nonlinear mapping to convert the primary training set into an upper-level size. SVM explores for the linear optimal separating hyperplane in this new size as a decision border by which the tuples of one class from another are being split [3].

K-Nearest Neigbour (K-NN)

The k-NN algorithm is a classic classification model based on the principle of nearest learning examples in the feature space. It learns by analogy which

means the comparison of a provided test tuple with training tuples which are similar. These tuples must be the closest ones to the unknown tuple. A distance metric like Euclidean distance describes the “closeness”. To classify k-nearest neighbour, the tuple that is not known is selected as the most common class among its k- nearest neighbor [3].

Performence Metrics

The detailed results of the performance metrics are presented in the Table 1.

Table 1: Classifiers’ performance

Classifier	Precision	Recall	F-Measure	Accuracy
XGBoost (spam)	0.94	0.70	0.80	0.90
LGBM (spam)	0.94	0.69	0.80	0.90
CatBoost (spam)	0.94	0.69	0.80	0.90
SVM (spam)	0.92	0.87	0.89	0.94
K-NN (spam)	0.54	0.43	0.48	0.73

Conclusion

According to the performance metrics obtained, each algorithm can be sorted based on the performance shown in detecting spam, as follows: SVM, CatBoost, LGBM, and K-NN.

References

Fatimah Alzamzami, Mohamad Hoda, and Abdulmotaleb El Saddik. Light gradient boosting machine for general sentiment classification on short texts: a comparative evaluation. IEEE access, 8:101840–101858, 2020.
Emmanuel Gbenga Dada, David Opeoluwa Oyewola, and Joseph Hurcha Yakubu. Power consumption prediction in urban areas using machine learn- ing as a strategy towards smart cities. Arid Zone Journal of Basic and Applied Research (AJBAR), 1(1):11–24, 2022.
Abubakar Hassan, Yusuf Ayuba, Mohammed Aji Wajiro, and Muham- mad Zaharadeen Ahmad. Machine learning algorithms for telegram spam filtering. FUDMA JOURNAL OF SCIENCES, 8(6):170–176, 2024.
Sharifah Md Yasin and Iqbal Hadi Azmi. Email spam filtering technique: challenges and solutions. Journal of Theoretical and Applied Information Technology, 101(13):5130–5138, 2023.
Arya, V., Gaurav, A., Gupta, B. B., Hsu, C. H., & Baghban, H. (2022, December). Detection of malicious node in vanets using digital twin. In International Conference on Big Data Intelligence and Computing (pp. 204-212). Singapore: Springer Nature Singapore.
Lu, Y., Guo, Y., Liu, R. W., Chui, K. T., & Gupta, B. B. (2022). GradDT: Gradient-guided despeckling transformer for industrial imaging sensors. IEEE Transactions on Industrial Informatics, 19(2), 2238-2248.
Sedik, A., Maleh, Y., El Banby, G. M., Khalaf, A. A., Abd El-Samie, F. E., Gupta, B. B., … & Abd El-Latif, A. A. (2022). AI-enabled digital forgery analysis and crucial interactions monitoring in smart communities. Technological Forecasting and Social Change, 177, 121555.
Crawford, M., Khoshgoftaar, T. M., Prusa, J. D., Richter, A. N., & Al Najada, H. (2015). Survey of review spam detection using machine learning techniques. Journal of Big Data, 2(1), 23.
Jindal, N., & Liu, B. (2007, May). Review spam detection. In Proceedings of the 16th international conference on World Wide Web (pp. 1189-1190).
Markines, B., Cattuto, C., & Menczer, F. (2009, April). Social spam detection. In Proceedings of the 5th international workshop on adversarial information retrieval on the web (pp. 41-48).

Cite As

Kerja R. (2025) Comparative Analysis Machine Learning Algorithms for Spam Detection on the Telegram Platform, Insights2Techinfo, pp.1

868200cookie-checkComparative Analysis Machine Learning Algorithms for Spam Detection on the Telegram Platform

Post Views: 7

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Comparative Analysis Machine Learning Algorithms for Spam Detection on the Telegram Platform

Abstract

Introduction

Literature analysis

Machine Learning Algorithms

Materials and Methods

Dataset Description

Data Preprocessing

Feature Extraction

Extreme Gradient Boosting

Light Gradient Boosting Machine (LGBM)

CatBoosting Ensemble Method

Support Vector Machine

K-Nearest Neigbour (K-NN)

Performence Metrics

Conclusion

References

Cite As

Leave a Reply Cancel reply

Detecting and Preventing Phishing Attacks in IoT-Based Smart Healthcare Systems

Data-Driven Insights into Rare Disease Diagnosis and Treatment with AI

Genetic Algorithms and Data Analytics for Cybersecurity in Phishing and Blockchain Systems

Machine Learning in Biometric Security Systems

The Role of AI and Machine Learning in Cloud Storage

How AI is Revolutionizing Cyber Forensics

Face Morphing in Identity Fraud: Threats to Authentication Systems

Full Body Re-enactment: Controlling Human Motion with Neural Networks

Comparative Analysis Machine Learning Algorithms for Spam Detection on the Telegram Platform

Voice Cloning with AI: The Rise of Synthetic Speech

Phishing on Mobile Platforms: Attack Methods and Solutions