By: Reka Rius, CCRI, Asia University, Taiwan

Abstract

This study examines voice-based phishing attacks, also known as vishing, a form of social engineering fraud conducted through voice communication to deceive potential victims into disclosing sensitive information. It discusses a voice analysis system and the Emoti-Shing model for detecting vishing attacks based on voice and human biological features, particularly emotions, which are difficult to disguise or falsify.

Keywords: Vishing Detection, Voice Analysis, Emoti-Shing

Introduction

Smartphones facilitate almost all aspects of people’s lives, enabling activities such as banking, buying and selling, and socializing anytime and anywhere. On the other hand, criminals continue to seek vulnerabilities in the services we use and spread threats by exploiting existing weaknesses. Many efforts have been made to protect user data and information by securing hardware, software (platforms), and procedures [3]. Criminals are therefore increasingly interested in going through the human component of the information system (people) to penetrate it [1]. They use social engineering (SE) to manipulate human emotions and exploit the human tendency to trust in order to steal users’ data and lure them into financial loss. In this study, we examine the use of a voice analysis system and the Emoti-Shing model for the detection of vishing.

Method

The methods discussed in this study fall into two approaches:

Voice Analysis System

This study [2] proposes the following method:

Data Preprocessing

1. The text labels are converted to numerical values.
2. The labels are then converted to numpy arrays to fit deep learning models.
3. The data is divided into training and testing sets: 80% is used for training and 20% for testing.
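The steps above can be sketched roughly as follows. This is only an illustration: the sample transcripts, the `ham`/`spam` label names, and the `split_80_20` helper are assumptions and do not come from the study, and plain Python lists stand in for the numpy arrays it mentions so that the example has no dependencies.

```python
import random

# Hypothetical labeled call transcripts; the study's dataset is not shown here.
samples = [
    ("your account is locked confirm your pin now", "spam"),
    ("hi are we still meeting for lunch today", "ham"),
    ("this is the bank security team read me your card number", "spam"),
    ("can you call me back after work", "ham"),
    ("you won a prize pay the fee to claim it", "spam"),
    ("the parcel arrives tomorrow morning", "ham"),
    ("your tax refund is pending share your password", "spam"),
    ("happy birthday see you tonight", "ham"),
    ("urgent transfer needed to keep your account open", "spam"),
    ("thanks for the notes from class", "ham"),
]

LABEL_TO_ID = {"ham": 0, "spam": 1}  # assumed label names


def encode_labels(labels):
    """Step 1: map text labels to numerical values."""
    return [LABEL_TO_ID[label] for label in labels]


def split_80_20(texts, labels, seed=42):
    """Step 3: shuffle, then keep 80% for training and 20% for testing."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    cut = (len(indices) * 8) // 10
    train_idx, test_idx = indices[:cut], indices[cut:]
    return (
        [texts[i] for i in train_idx],
        [labels[i] for i in train_idx],
        [texts[i] for i in test_idx],
        [labels[i] for i in test_idx],
    )


texts = [text for text, _ in samples]
labels = encode_labels([label for _, label in samples])
train_texts, train_labels, test_texts, test_labels = split_80_20(texts, labels)
```

With ten samples this yields eight training and two testing examples. A fixed seed is used only to make the split reproducible.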

Tokenization

The hyper-parameters used in the Tokenizer object are the number of words and the OOV token.

Number of words: The number of unique words to be loaded from the training and testing data. In this paper, 500 words were selected (the vocabulary size).
OOV token: An out-of-vocabulary token is added to the word index of the corpus used to build the model. It replaces out-of-vocabulary words, i.e. words that are not in the corpus, during text-to-sequence calls.
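To make the role of the two hyper-parameters concrete, here is a toy word-level tokenizer. It is a simplified stand-in written for this explanation, not the library implementation used in the study; the class name `SimpleTokenizer`, the `<OOV>` string, and the example sentences are all assumptions.

```python
from collections import Counter


class SimpleTokenizer:
    """Toy tokenizer illustrating a vocabulary limit and an OOV token."""

    def __init__(self, num_words=500, oov_token="<OOV>"):
        self.num_words = num_words
        self.oov_token = oov_token
        self.word_index = {}

    def fit_on_texts(self, texts):
        counts = Counter(word for text in texts for word in text.lower().split())
        # Index 0 is left free for padding; index 1 is the OOV token.
        self.word_index = {self.oov_token: 1}
        # More frequent words receive smaller indices.
        for rank, (word, _) in enumerate(counts.most_common(), start=2):
            self.word_index[word] = rank

    def texts_to_sequences(self, texts):
        oov_id = self.word_index[self.oov_token]
        sequences = []
        for text in texts:
            ids = []
            for word in text.lower().split():
                idx = self.word_index.get(word, oov_id)
                # Words ranked beyond the vocabulary limit also become OOV.
                ids.append(idx if idx < self.num_words else oov_id)
            sequences.append(ids)
        return sequences


tokenizer = SimpleTokenizer(num_words=500, oov_token="<OOV>")
tokenizer.fit_on_texts(["please confirm your bank pin", "your bank called today"])
encoded = tokenizer.texts_to_sequences(["confirm your password"])
```

Here "password" never appeared in the fitted corpus, so it is encoded with the OOV index instead of being silently dropped.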

Sequencing and Padding

Once tokenization is done, each sentence is represented as a sequence of numbers using the text-to-sequence call of the tokenizer object. The sequences are then padded so that they all have the same length. Sequencing and padding are performed for both training and testing data. For example, suppose that before padding the first sequence is 27 words long and the second is 24. Once padding is applied, both sequences have a length of 50.
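The 27-word and 24-word example can be reproduced with a small sketch. The helper name `pad_to_length` is hypothetical, and padding with zeros at the end of the sequence (rather than at the start) is an assumption, since the text does not say which side is padded.

```python
def pad_to_length(sequences, maxlen=50, pad_value=0):
    """Truncate or right-pad every sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq[:maxlen])  # drop anything beyond maxlen
        padded.append(seq + [pad_value] * (maxlen - len(seq)))
    return padded


first = list(range(1, 28))   # stands in for a 27-word sentence
second = list(range(1, 25))  # stands in for a 24-word sentence
padded = pad_to_length([first, second], maxlen=50)
```

Both rows of `padded` now have 50 entries, with the original token ids followed by zeros.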

Training the model

The datasets are trained on different models to determine which gives the best results. The architectures chosen are the Dense Spam Detection architecture, the Long Short-Term Memory (LSTM) architecture, and the Bidirectional LSTM Spam Detection architecture.

Dense Spam Detection Architecture:

This is a sequential model, which means that the layers are stacked in sequential order.
The embedding layer converts each word into an N-dimensional vector of real numbers.
The pooling layer reduces the number of model parameters.
Next is the dense layer, which is a layer in a neural network where each neuron receives input from all neurons in the previous layer.
The final layer is again a dense layer, this time with a sigmoid activation function, which is used in models that output probabilities.
After that, the model is compiled using the Adam optimizer.
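The layer order described above can be illustrated with a forward pass written in plain Python. This is a sketch under several assumptions: the embedding size (16), the hidden width (24), average pooling, and the ReLU activation are not stated in the text, the weights are random and untrained, biases are omitted, and training with the Adam optimizer is not shown.

```python
import math
import random

rng = random.Random(0)
VOCAB_SIZE = 500    # vocabulary size given in the text
EMBED_DIM = 16      # assumed
HIDDEN_UNITS = 24   # assumed


def random_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


embedding = random_matrix(VOCAB_SIZE, EMBED_DIM)
w_hidden = random_matrix(EMBED_DIM, HIDDEN_UNITS)
w_out = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN_UNITS)]


def forward(token_ids):
    """Embedding -> average pooling -> dense (ReLU) -> dense (sigmoid)."""
    # Embedding layer: look up one vector per token id.
    vectors = [embedding[t] for t in token_ids]
    # Pooling layer: average over the sequence, leaving one vector per input.
    pooled = [sum(column) / len(vectors) for column in zip(*vectors)]
    # Dense layer: every unit sees every pooled feature.
    hidden = [
        max(0.0, sum(p * w_hidden[i][j] for i, p in enumerate(pooled)))
        for j in range(HIDDEN_UNITS)
    ]
    # Output layer: a single sigmoid unit producing a probability.
    logit = sum(h * w for h, w in zip(hidden, w_out))
    return 1.0 / (1.0 + math.exp(-logit))


prob = forward([5, 2, 1] + [0] * 47)  # a padded sequence of length 50
```

Because the weights are untrained, `prob` is not meaningful as a prediction; the point is only how data flows through the four layers.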

Long Short-Term Memory (LSTM) Model:

A recurrent neural network (RNN) is a sequential network used for persistent memory. The Long Short-Term Memory network is an advanced RNN that allows information to persist and is able to handle the vanishing gradient problem faced by standard RNNs.
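A single LSTM step can be sketched with scalar inputs to show how the gates let information persist. This follows the standard LSTM cell equations rather than anything specific to the study; the parameter names in `params` and their shared value of 0.5 are arbitrary placeholders.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step with scalar input, hidden state, and cell state."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    # The cell state is updated additively, which helps gradients survive
    # across many time steps.
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Placeholder parameters; a trained model would learn these values.
params = {
    name: 0.5
    for name in ("wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg")
}

h, c = 0.0, 0.0
for x in [0.2, -0.1, 0.4]:
    h, c = lstm_step(x, h, c, params)
```

The forget gate `f` decides how much of the previous cell state is kept, which is what allows earlier inputs to influence later outputs.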

Bi-directional Long Short-Term Memory (BiLSTM) Model

A bidirectional recurrent neural network (RNN) is essentially two separate RNNs joined together. At each time step, this structure gives the network both backward and forward information about the sequence.
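The idea of joining two RNNs can be shown with a small sketch. For brevity it uses plain tanh recurrent cells instead of LSTM cells, and the function names and weight values are assumptions chosen for illustration.

```python
import math


def run_rnn(inputs, w_in, w_rec):
    """Simple tanh RNN over scalar inputs; returns the state at each step."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional(inputs):
    """Run one RNN left-to-right and a second one right-to-left."""
    forward = run_rnn(inputs, w_in=0.5, w_rec=0.8)
    # The backward RNN reads the reversed sequence; its outputs are reversed
    # again so that position t lines up with the forward state at t.
    backward = run_rnn(inputs[::-1], w_in=0.3, w_rec=0.6)[::-1]
    return list(zip(forward, backward))


states = bidirectional([1.0, 0.0, 0.0, -1.0])
```

Each element of `states` pairs a summary of everything before that position with a summary of everything after it.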

BERT model

BERT stands for Bidirectional Encoder Representations from Transformers. By jointly conditioning on both left and right context, it pre-trains deep bidirectional representations from unlabeled text. The pre-trained BERT model is then fine-tuned with one additional output layer to produce state-of-the-art models for a wide range of NLP tasks. Table 1 shows the validation and testing accuracy results of the models.

Table 1: Model Validation and Accuracy Testing

Model type	Validation Loss	Accuracy
BiLSTM	0.18	92%
LSTM	0.33	89%
BERT	0.16	94%

Emoti-Shing

This study [3] proposes the following method:

Level of Analysis

The analysis focuses on direct conversations between the attacker (scammer) and the potential victim, specifically on the stages of:

-Relationship Development

-Attack Execution

The goal is to detect the victim’s vulnerability to vishing attacks in real-time. The stages of social engineering attacks are illustrated in Figure 1 below.

Figure 1: Social Engineering attack stages

Emotion Extraction from Voice

-Uses human voice analysis to identify vocal attributes: pitch, timbre, loudness, and intonation.

-Separates linguistic content (words) from paralinguistic content (emotions, mood, speaker states).

-Focuses on emotions manipulated by the scammer: neutral, anger, fear, excitement.
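Two of the vocal attributes listed above, loudness and pitch, can be estimated from raw samples with very little code. The sketch below is not the study's extraction pipeline: it uses root-mean-square energy as a loudness proxy and a basic autocorrelation search for pitch, and the 8000 Hz sample rate, the 75 to 400 Hz search range, and the synthetic test tone are all assumptions. Timbre and intonation are not covered.

```python
import math

SAMPLE_RATE = 8000  # Hz, assumed


def rms_loudness(frame):
    """Root-mean-square amplitude, a simple proxy for loudness."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def estimate_pitch(frame, sample_rate=SAMPLE_RATE, fmin=75, fmax=400):
    """Return the frequency whose lag maximizes the autocorrelation."""
    min_lag = sample_rate // fmax
    max_lag = sample_rate // fmin
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic 200 Hz tone with amplitude 0.5, standing in for a voiced frame.
frame = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(800)]
pitch = estimate_pitch(frame)
loudness = rms_loudness(frame)
```

Real speech is noisier than a pure tone, so a practical system would need windowing, voicing detection, and smoothing across frames before feeding such features to an emotion classifier.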

Formulation of Victim Vulnerability States

1. Hidden Markov Model

-Hidden states: victim vulnerability states (V1, V2, V3)

-Observations: Emotions emitted by the victim (neutral, anger, fear, excitement)

-Transition probabilities (A): Likelihood of moving between states

-Emission probabilities (B): Likelihood of the victim emitting a particular emotion in each state

-The HMM is used to predict the victim’s vulnerability in real-time.
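How such an HMM can track vulnerability as a call unfolds is illustrated below with the forward (filtering) recursion. All numbers in `PI`, `A`, and `B` are invented for illustration and are not the values from the study, and reading V1 to V3 as increasing vulnerability is likewise an assumption.

```python
STATES = ["V1", "V2", "V3"]
EMOTIONS = ["neutral", "anger", "fear", "excitement"]

# Illustrative parameters only; not taken from the Emoti-Shing study.
PI = [0.80, 0.15, 0.05]            # initial state distribution
A = [                              # A[i][j] = P(next state j | state i)
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.05, 0.15, 0.80],
]
B = [                              # B[j][k] = P(emotion k | state j)
    [0.70, 0.10, 0.10, 0.10],
    [0.30, 0.20, 0.30, 0.20],
    [0.10, 0.10, 0.40, 0.40],
]


def filter_states(observed):
    """Return P(state at t | emotions observed up to t) for every t."""
    n = len(STATES)
    beliefs, belief = [], PI
    for t, emotion in enumerate(observed):
        k = EMOTIONS.index(emotion)
        if t == 0:
            predicted = PI
        else:
            # Propagate the previous belief through the transition matrix.
            predicted = [sum(belief[i] * A[i][j] for i in range(n)) for j in range(n)]
        # Weight by how likely each state is to emit the observed emotion.
        unnormalized = [predicted[j] * B[j][k] for j in range(n)]
        total = sum(unnormalized)
        belief = [u / total for u in unnormalized]
        beliefs.append(belief)
    return beliefs


beliefs = filter_states(["neutral", "fear", "excitement", "fear"])
most_likely = STATES[max(range(len(STATES)), key=lambda j: beliefs[-1][j])]
```

Because each belief depends only on the previous belief and the newest emotion, the estimate can be updated after every utterance, which is what makes real-time use plausible.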

Implementation

-Implemented using R programming language and RStudio IDE

-Transition matrix (A) and emission matrix (B) are computed using mathematical formulas based on the literature and recorded scam calls

-The model predicts victim state and potential success of the attack
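The formulas used in the study are not reproduced here. As one common alternative, the two matrices can be estimated by counting transitions and emissions in annotated calls, as in the sketch below. It is written in Python rather than R for consistency with the other examples; the annotated `calls`, the add-one smoothing, and the helper name `estimate_matrix` are all assumptions.

```python
from collections import Counter

STATES = ["V1", "V2", "V3"]
EMOTIONS = ["neutral", "anger", "fear", "excitement"]


def estimate_matrix(pairs, row_labels, col_labels, smoothing=1.0):
    """Row-normalized counts of (row, column) pairs with additive smoothing."""
    counts = Counter(pairs)
    matrix = []
    for r in row_labels:
        row = [counts[(r, c)] + smoothing for c in col_labels]
        total = sum(row)
        matrix.append([value / total for value in row])
    return matrix


# Hypothetical annotated calls: one (state, emotion) pair per time step.
calls = [
    [("V1", "neutral"), ("V1", "neutral"), ("V2", "fear"), ("V3", "fear")],
    [("V1", "neutral"), ("V2", "anger"), ("V2", "fear"), ("V3", "excitement")],
]

# Consecutive states within a call give the transition pairs.
transitions = [(a[0], b[0]) for call in calls for a, b in zip(call, call[1:])]
emissions = [pair for call in calls for pair in call]

A = estimate_matrix(transitions, STATES, STATES)
B = estimate_matrix(emissions, STATES, EMOTIONS)
```

Smoothing keeps unseen transitions and emissions from getting a probability of exactly zero, which matters when only a few recorded calls are available.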

Conclusion

The Emoti-Shing study concludes that it is possible to track changes in the vulnerability state of a potential victim and to indicate whether the conversation they are involved in is likely to be a scam. The voice analysis study concludes, based on model accuracy tests, that the approach is effective at detecting fake calls and protecting users of the implemented application.

References

[1] Kevin D. Mitnick and William L. Simon. The Art of Deception: Controlling the Human Element of Security. John Wiley & Sons, Hoboken, NJ, USA, 2003.
[2] Devishree Naidu. Voice analysis system for detection of vishing using deep learning. International journal of health sciences, (I):10457–10466, 2022.
[3] Virgile Simé, Franklin Tchakounté, Blaise Omer Yenké, Duplex Elvis Houpa Danga, Magnuss Dufe Ngoran, Jean Louis Kedieng Ebongue Fendji, et al. Emoti-shing: Detecting vishing attacks by learning emotion dynamics through hidden markov models. Journal of Intelligent Learning Systems and Applications, 16(3):274–315, 2024.
[4] Gupta, B. B., Gaurav, A., Chui, K. T., & Arya, V. (2024, January). Deep learning-based facial emotion detection in the metaverse. In 2024 IEEE International Conference on Consumer Electronics (ICCE) (pp. 1-6). IEEE.
[5] Gaurav, A., Gupta, B. B., & Chui, K. T. (2022). Edge computing-based DDoS attack detection for intelligent transportation systems. In Cyber Security, Privacy and Networking: Proceedings of ICSPN 2021 (pp. 175-184). Singapore: Springer Nature Singapore.
[6] Sai, K. M., Gupta, B. B., Hsu, C. H., & Peraković, D. (2021, December). Lightweight Intrusion Detection System In IoT Networks Using Raspberry pi 3b+. In SysCom (pp. 43-51).
[7] Griffin, S. E., & Rackley, C. C. (2008, September). Vishing. In Proceedings of the 5th annual conference on Information security curriculum development (pp. 33-35).
[8] Yeboah-Boateng, E. O., & Amanor, P. M. (2014). Phishing, SMiShing & Vishing: an assessment of threats against mobile devices. Journal of Emerging Trends in Computing and Information Sciences, 5(4), 297-307.
[9] Ashfaq, S., Chandre, P., Pathan, S., Mande, U., Nimbalkar, M., & Mahalle, P. (2023, June). Defending against vishing attacks: A comprehensive review for prevention and mitigation techniques. In International Conference on Recent Developments in Cyber Security (pp. 411-422). Singapore: Springer Nature Singapore.
[10] Jones, K. S., Armstrong, M. E., Tornblad, M. K., & Siami Namin, A. (2021). How social engineers use persuasion principles during vishing attacks. Information & Computer Security, 29(2), 314-331.

Cite As

Rekarius (2025) Vishing Detection using Deep Learning and Hidden Markov Models, Insights2Techinfo, pp.1


