By: Reka Rius, CCRI, Asia University, Taiwan
Abstract
This study examines voice-based phishing attacks, also known as vishing, which are a form of social engineering fraud conducted through voice communication to deceive potential victims into disclosing sensitive information. This study discusses a voice analysis system and the Emoti-Shing model for detecting vishing attacks based on voice and human biological features, particularly emotions, which are difficult to disguise or falsify.
Keywords: Vishing Detection, Voice Analysis, Emoti-Shing
Introduction
Smartphones, a technology that facilitates almost all aspects of people’s lives, enable activities such as banking anytime and anywhere, buying and selling, and socializing. On the other hand, criminals continue to seek vulnerabilities in the services we use and spread threats by exploiting existing weaknesses. A lot of work is being done to protect users’ data and information by securing hardware, software (platforms), and procedures [3]. Criminals are therefore becoming more and more interested in going through the human component of the information system (people) to penetrate it [1]. They use social engineering (SE) to manipulate human emotions and exploit the human tendency to trust in order to steal users’ data and lure them into financial loss. In this study, we examine the use of a voice analysis system and the Emoti-Shing model for detecting vishing.
Method
The methods discussed in this study fall into two groups:
Voice Analysis System
This study [2] proposes the following method:
Data Preprocessing
1. The text labels are converted to numerical values.
2. The labels are then converted to NumPy arrays so that they fit deep learning models.
3. The data is divided into training and testing sets: 80% of the total is used for training and 20% for testing.
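The preprocessing steps can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code from [2]: the label names, the toy transcripts, and the helper names `encode_labels` and `train_test_split` are assumptions, and the conversion to NumPy arrays is left out to keep the example dependency-free.

```python
import random

def encode_labels(labels):
    """Map each distinct text label to an integer, in sorted label order."""
    mapping = {name: idx for idx, name in enumerate(sorted(set(labels)))}
    return [mapping[name] for name in labels], mapping

def train_test_split(texts, labels, test_fraction=0.2, seed=42):
    """Shuffle the paired data and hold out test_fraction of it for testing."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    return (pick(texts, train_idx), pick(texts, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Hypothetical toy transcripts; the real dataset in [2] is not reproduced here.
texts = [f"call transcript {i}" for i in range(10)]
raw_labels = ["fraud", "normal"] * 5

labels, mapping = encode_labels(raw_labels)
x_train, x_test, y_train, y_test = train_test_split(texts, labels)
print(mapping)                    # {'fraud': 0, 'normal': 1}
print(len(x_train), len(x_test))  # 8 2
```

Fixing the shuffle seed keeps the split reproducible between runs, which makes later model comparisons easier to interpret.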
Tokenization
The hyper-parameters used in the Tokenizer object are the number of words and the OOV token.
- Number of words: the number of unique words to be loaded from the training and testing data. The study selects 500 words (the vocabulary size).
- OOV token: an out-of-vocabulary token is added to the word index built from the corpus. It replaces out-of-vocabulary words, i.e. words that are not in the corpus, during texts_to_sequences calls.
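To make the two hyper-parameters concrete, here is a toy word-level tokenizer written in plain Python. It only imitates the behavior described above and is not the tokenizer used in [2]; the class name `SimpleTokenizer`, the `<OOV>` marker, and the sample sentences are assumptions made for illustration.

```python
from collections import Counter

class SimpleTokenizer:
    """Toy word-level tokenizer with a vocabulary cap and an OOV token."""

    def __init__(self, num_words=500, oov_token="<OOV>"):
        self.num_words = num_words
        self.oov_token = oov_token
        self.word_index = {}

    def fit_on_texts(self, texts):
        counts = Counter(word for text in texts for word in text.lower().split())
        # Index 0 is reserved for padding and index 1 for the OOV token.
        self.word_index = {self.oov_token: 1}
        for rank, (word, _) in enumerate(counts.most_common(), start=2):
            self.word_index[word] = rank

    def texts_to_sequences(self, texts):
        oov = self.word_index[self.oov_token]
        sequences = []
        for text in texts:
            ids = [self.word_index.get(word, oov) for word in text.lower().split()]
            # Words ranked outside the vocabulary cap are also mapped to OOV.
            sequences.append([i if i < self.num_words else oov for i in ids])
        return sequences

tokenizer = SimpleTokenizer(num_words=500)
tokenizer.fit_on_texts(["your account is blocked", "please verify your account"])
sequences = tokenizer.texts_to_sequences(["verify your password"])
print(sequences)  # [[7, 2, 1]]
```

The word "password" never appeared in the fitted texts, so it is encoded as index 1, the OOV token, instead of being dropped.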
Sequencing and Padding
Once tokenization is done, each sentence is converted to a sequence of numbers using texts_to_sequences from the tokenizer object. The sequences are then padded so that they all have the same length. Sequencing and padding are performed for both the training and the testing data. For example, if the first sequence is 27 words long and the second is 24 before padding, both have a length of 50 once padding is applied.
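The padding step for the 27-word and 24-word example can be sketched like this. The helper `pad_sequences` below is a hypothetical plain-Python function, and padding at the end of each sequence with zeros is an assumption: libraries such as Keras pad at the start by default, and the source does not say which side [2] uses.

```python
def pad_sequences(sequences, maxlen=50, pad_value=0):
    """Right-pad (or truncate) each sequence of token ids to exactly maxlen."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])
        padded.append(trimmed + [pad_value] * (maxlen - len(trimmed)))
    return padded

first = list(range(1, 28))   # 27 token ids
second = list(range(1, 25))  # 24 token ids
padded = pad_sequences([first, second], maxlen=50)
print([len(seq) for seq in padded])  # [50, 50]
```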
Training the model
The authors train several models on the dataset and compare them to find which gives the best results. The architectures chosen are the Dense Spam Detection Architecture, the Long Short-Term Memory (LSTM) layer architecture, and the Bidirectional LSTM Spam Detection Architecture. A BERT model is also evaluated.
Dense Spam Detection Architecture:
- This is a sequential model, which means that the layers are stacked one after another.
- The embedding layer converts each word into an N-dimensional vector of real numbers.
- The pooling layer reduces the number of model parameters.
- Next is a dense layer, in which each neuron receives input from all neurons in the previous layer.
- The final layer is again a dense layer, this time with a sigmoid function, which is used in models that output probabilities.
- After that, the model is compiled using the Adam optimizer.
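The layer stack above can be illustrated with a forward pass in plain Python. This is only a sketch with random, untrained weights: the layer sizes, the ReLU activation, and the use of global average pooling are assumptions, and the training step with the Adam optimizer is omitted.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def dense(vector, weights, bias, activation):
    """Fully connected layer: every output unit sees every input value."""
    return [activation(sum(v * w for v, w in zip(vector, unit_weights)) + b)
            for unit_weights, b in zip(weights, bias)]

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(token_ids, params):
    # Embedding: look up one vector per token id.
    embedded = [params["embedding"][i] for i in token_ids]
    # Global average pooling: collapse the sequence into one vector.
    dim = len(embedded[0])
    pooled = [sum(vec[d] for vec in embedded) / len(embedded) for d in range(dim)]
    # Hidden dense layer, then a single sigmoid unit that outputs a probability.
    hidden = dense(pooled, params["w_hidden"], params["b_hidden"], relu)
    return dense(hidden, params["w_out"], params["b_out"], sigmoid)[0]

# Assumed sizes for illustration only: vocabulary 500, embedding 16, hidden 24.
rng = random.Random(0)
params = {
    "embedding": make_matrix(500, 16, rng),
    "w_hidden": make_matrix(24, 16, rng),  # one row of weights per hidden unit
    "b_hidden": [0.0] * 24,
    "w_out": make_matrix(1, 24, rng),
    "b_out": [0.0],
}
probability = predict([7, 2, 1] + [0] * 47, params)
print(0.0 < probability < 1.0)  # True
```

Because the pooling layer averages over the sequence, the dense layers receive a fixed 16-value vector regardless of the sequence length, which is how pooling keeps the parameter count down.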
Long Short-Term Memory (LSTM) Model:
A Long Short-Term Memory (LSTM) network is an advanced recurrent neural network (RNN), a sequential network that allows information to persist. RNNs are used where persistent memory is needed, and the LSTM is designed to handle the vanishing gradient problem that standard RNNs face.
Bi-directional Long Short-Term Memory (BiLSTM) Model
A bidirectional recurrent neural network (RNN) is simply two separate RNNs joined together. At each time step, this structure gives the network information about the sequence from both the forward and the backward direction.
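The idea of joining two RNNs can be shown with a small plain-Python sketch. To keep it short it uses a one-unit tanh RNN cell rather than a full LSTM cell, and the weights and inputs are hypothetical values chosen only for illustration.

```python
import math

def rnn_pass(inputs, w_in, w_rec):
    """Run a one-unit tanh RNN over inputs and return the state at each step."""
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states

def bidirectional(inputs, fwd_weights, bwd_weights):
    """Pair each step's forward state with its backward state."""
    forward = rnn_pass(inputs, *fwd_weights)
    # The backward RNN reads the reversed sequence; flip its states back
    # so that position t lines up with position t of the forward pass.
    backward = rnn_pass(inputs[::-1], *bwd_weights)[::-1]
    return list(zip(forward, backward))

# Hypothetical weights and inputs, chosen only to make the example concrete.
outputs = bidirectional([0.5, -1.0, 0.25], (0.8, 0.3), (0.6, 0.4))
print(len(outputs), len(outputs[0]))  # 3 2
```

Each time step yields a pair: the first value summarizes everything before that step, and the second summarizes everything after it.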
BERT model
BERT stands for Bidirectional Encoder Representations from Transformers. By jointly conditioning on both left and right context, it pre-trains deep bidirectional representations from unlabeled text. The pre-trained BERT model is then fine-tuned with one additional output layer to produce state-of-the-art models for a wide range of NLP tasks. Table 1 shows the validation and testing accuracy results of the models.
Table 1: Model Validation and Accuracy Testing
Model type | Validation Loss | Accuracy |
BiLSTM | 0.18 | 92% |
LSTM | 0.33 | 89% |
BERT | 0.16 | 94% |
Emoti-Shing
This study [3] proposes the following method:
Level of Analysis
The analysis focuses on direct conversations between the attacker (scammer) and the potential victim, specifically on the stages of:
- Relationship Development
- Attack Execution
The goal is to detect the victim’s vulnerability to vishing attacks in real-time. The stages of social engineering attacks are illustrated in Figure 1 below.

Emotion Extraction from Voice
- Uses human voice analysis to identify vocal attributes: pitch, timbre, loudness, and intonation.
- Separates linguistic content (words) from paralinguistic content (emotions, mood, speaker states).
- Focuses on the emotions manipulated by the scammer: neutral, anger, fear, excitement.
Formulation of Victim Vulnerability States
The vulnerability states are modeled with a Hidden Markov Model (HMM):
- Hidden states: victim vulnerability states (V1, V2, V3)
- Observations: emotions emitted by the victim (neutral, anger, fear, excitement)
- Transition probabilities (A): likelihood of moving between states
- Emission probabilities (B): likelihood of the victim emitting a particular emotion in each state
- The HMM is used to predict the victim’s vulnerability in real time.
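A minimal sketch of how such an HMM could track the victim state as emotions arrive is the normalized forward algorithm shown below. All numbers in `INITIAL`, `A`, and `B` are hypothetical placeholders, not the values derived in [3], and the sample call is invented; the source implementation is in R, while Python is used here for illustration only.

```python
STATES = ["V1", "V2", "V3"]
EMOTIONS = ["neutral", "anger", "fear", "excitement"]

# Hypothetical probabilities for illustration; they are not the values from [3].
INITIAL = [0.8, 0.15, 0.05]
A = [  # A[i][j] = P(next state j | current state i)
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.05, 0.15, 0.8],
]
B = [  # B[i][k] = P(emotion k | state i)
    [0.7, 0.1, 0.1, 0.1],
    [0.3, 0.2, 0.3, 0.2],
    [0.1, 0.1, 0.5, 0.3],
]

def filter_states(observed_emotions):
    """Forward algorithm with normalisation: P(state | emotions so far) per step."""
    beliefs, belief = [], None
    for emotion in observed_emotions:
        k = EMOTIONS.index(emotion)
        if belief is None:
            prior = INITIAL
        else:
            prior = [sum(belief[i] * A[i][j] for i in range(len(STATES)))
                     for j in range(len(STATES))]
        unnormalised = [prior[j] * B[j][k] for j in range(len(STATES))]
        total = sum(unnormalised)
        belief = [value / total for value in unnormalised]
        beliefs.append(belief)
    return beliefs

call = ["neutral", "neutral", "fear", "fear", "excitement"]
beliefs = filter_states(call)
for emotion, belief in zip(call, beliefs):
    likely = STATES[belief.index(max(belief))]
    print(f"{emotion:<10} -> {likely}  {[round(p, 2) for p in belief]}")
```

Because each update needs only the previous belief and the newest emotion, the estimate can be refreshed after every utterance, which suits the real-time setting described above.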
Implementation
- Implemented using the R programming language and the RStudio IDE
- The transition matrix (A) and the emission matrix (B) are computed using mathematical formulas based on the literature and recorded scam calls
- The model predicts the victim’s state and the potential success of the attack
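The source does not reproduce the formulas used for A and B. As one common way to obtain such matrices, the sketch below estimates them as relative frequencies from calls whose states have been annotated by hand. This is a swapped-in technique, not the authors' method; the annotated calls and the helper names are hypothetical, and Python is used instead of R purely for illustration.

```python
from collections import Counter

def row_normalise(counts, rows, cols):
    """Turn co-occurrence counts into a row-stochastic matrix."""
    matrix = []
    for r in rows:
        total = sum(counts[(r, c)] for c in cols)
        # Fall back to a uniform row if a state was never observed.
        matrix.append([counts[(r, c)] / total if total else 1.0 / len(cols)
                       for c in cols])
    return matrix

def estimate_matrices(annotated_calls, states, emotions):
    """Estimate A and B as relative frequencies from (state, emotion) sequences."""
    transitions, emissions = Counter(), Counter()
    for call in annotated_calls:
        for state, emotion in call:
            emissions[(state, emotion)] += 1
        for (prev, _), (nxt, _) in zip(call, call[1:]):
            transitions[(prev, nxt)] += 1
    return (row_normalise(transitions, states, states),
            row_normalise(emissions, states, emotions))

# Hypothetical hand-annotated calls; real annotations would come from recordings.
calls = [
    [("V1", "neutral"), ("V1", "neutral"), ("V2", "fear"), ("V3", "fear")],
    [("V1", "neutral"), ("V2", "anger"), ("V2", "fear"), ("V3", "excitement")],
]
A_est, B_est = estimate_matrices(calls, ["V1", "V2", "V3"],
                                 ["neutral", "anger", "fear", "excitement"])
print(A_est[0])  # estimated transition probabilities out of V1
```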
Conclusion
The Emoti-Shing study concludes that the proposed model makes it possible to track changes in the vulnerability state of a potential victim and to say whether the conversation they are involved in is likely to be a scam. The voice analysis study concludes, based on its model accuracy tests, that the approach is effective at detecting fake calls and protecting users of the application in which it is implemented.
References
- Kevin D. Mitnick and William L. Simon. The Art of Deception: Controlling the Human Element of Security. John Wiley & Sons, Hoboken, NJ, USA, 2003.
- Devishree Naidu. Voice analysis system for detection of vishing using deep learning. International journal of health sciences, (I):10457–10466, 2022.
- Virgile Simé, Franklin Tchakounté, Blaise Omer Yenké, Duplex Elvis Houpa Danga, Magnuss Dufe Ngoran, Jean Louis Kedieng Ebongue Fendji, et al. Emoti-shing: Detecting vishing attacks by learning emotion dynamics through hidden Markov models. Journal of Intelligent Learning Systems and Applications, 16(3):274–315, 2024.
- Gupta, B. B., Gaurav, A., Chui, K. T., & Arya, V. (2024, January). Deep learning-based facial emotion detection in the metaverse. In 2024 IEEE International Conference on Consumer Electronics (ICCE) (pp. 1-6). IEEE.
- Gaurav, A., Gupta, B. B., & Chui, K. T. (2022). Edge computing-based DDoS attack detection for intelligent transportation systems. In Cyber Security, Privacy and Networking: Proceedings of ICSPN 2021 (pp. 175-184). Singapore: Springer Nature Singapore.
- Sai, K. M., Gupta, B. B., Hsu, C. H., & Peraković, D. (2021, December). Lightweight Intrusion Detection System In IoT Networks Using Raspberry pi 3b+. In SysCom (pp. 43-51).
- Griffin, S. E., & Rackley, C. C. (2008, September). Vishing. In Proceedings of the 5th annual conference on Information security curriculum development (pp. 33-35).
- Yeboah-Boateng, E. O., & Amanor, P. M. (2014). Phishing, SMiShing & Vishing: an assessment of threats against mobile devices. Journal of Emerging Trends in Computing and Information Sciences, 5(4), 297-307.
- Ashfaq, S., Chandre, P., Pathan, S., Mande, U., Nimbalkar, M., & Mahalle, P. (2023, June). Defending against vishing attacks: A comprehensive review for prevention and mitigation techniques. In International Conference on Recent Developments in Cyber Security (pp. 411-422). Singapore: Springer Nature Singapore.
- Jones, K. S., Armstrong, M. E., Tornblad, M. K., & Siami Namin, A. (2021). How social engineers use persuasion principles during vishing attacks. Information & Computer Security, 29(2), 314-331.
Cite As
Rekarius (2025) Vishing Detection using Deep Learning and Hidden Markov Models, Insights2Techinfo, pp.1