By: C S Nakul Kalyan; Asia University

Abstract

The emergence of Artificial Intelligence (AI) has paved the way for the creation of realistic cloned voices, which have been used in fraudulent calls. In this study, we will discuss the growing threat of AI-generated voices in phishing scams, where a multilayered approach is proposed. The framework combines features such as threat modeling, multi-modal feature analysis, and detection of real or fake voice calls. Apart from detecting manipulations in speeches, this framework also employs and implements countermeasure strategies, such as technical defenses, regulatory actions, and public awareness initiatives for these types of manipulated voices in scam calls. in this study, we will go through the usage of this framework, and it also states the potential of AI-generated voice phishing to make the traditional fraud into automated digital fraud.

Keywords

AI-generated Voices, Voice cloning, Deepfake detection. Voice phishing, Scam automation.

Introduction

The rapid development of AI has led to the creation of features such as voice synthesis and cloning techniques. Apart from using these techniques in customer service and entertainment, these technologies provide a new threat, and AI- generated voice phishing has become a problem, allowing attackers to mimic trusted entities and spread misinformation about them. The Voice-enabled AI bots have the capacity to indulge in scam activities such as financial fraud and identity theft, which can be done on their own [1]. These threats have increased, such as open-source technologies for the deepfake creation and voice cloning have made the work easier for attackers to implement voice-based fraud [2][4]. To counter back for these fraudulent activities, in this article, we will go through the multi-modal detection framework that has a combination of threat modeling, multi-modal analysis, and deepfake voice detection, which will be used to determine whether the calls are scams or not [3][5].

Proposed Methodology

This study proposes a multi-modal framework to detect scam calls that have been generated by AI. This integrated method contains threat modeling, multi- modal analysis, deepfake detection, and countermeasures, which are stated be- low in Figure 1:

Figure 1: Overall Framework Architecture

Threat Modeling and Data Simulation

Scam call Scenarios

The common scenarios for scam calls will be done in places such as bank account fraud, IRS impersonation scams, crypto-transfer fraud, and identity theft. In real-world cases, attackers will be targeting these domains to get the sensitive information or money.

Dataset Construction

The datasets have been collected from speech datasets such as GRID, TED Talks, and News interviews. and artificially generated scam calls. To clone the Scammer voice from the audio samples, voice cloning tools such as Real-Time Voice Cloning (RTVC), OpenVoice, and F5-TTS were used, which enables the multilingual and cross-accent Voice replication. The Dataset composition and voice cloning tools have been presented in Table 1 below:

Table 1: Dataset Composition and Voice Cloning Tools

Data Source	Sample Count	Duration (Hours)	Language/Accent	Voice Cloning Tool	Purpose
GRID Corpus	1,200	15.5	English (British)	RTVC	Baseline legitimate speech
TED Talks	800	22.3	English (Multi-accent)	OpenVoice	Natural conversation patterns
News Interviews	600	12.8	English (American)	F5-TTS	Professional speech samples
Synthetic Scam Calls	2,400	18.7	Multilingual	RTVC + OpenVoice	AI-generated fraud scenarios
Partially Manipulated	900	8.2	English + Others	F5-TTS	Hybrid real-synthetic content
Total Dataset	5,900	77.5	Mixed	Various	Complete framework testing

Annotation Process

In the process of annotation, each audio sample has been analyzed and divided into categories such as real, deepfake, and partially manipulated voices. For the accurate analysis, the additional meta data have been annotated, such as language, urgency cues, and contextual content, which will be used to support the analysis process. These annotation results can compare the human and AI-generated phishing strategies.

Multi-modal Feature Analysis

Audio Feature Extraction

Automatic Speech Recognition (ASR) models were used to process the speech signals and convert the audio into phoneme-level. Here, the patterns such as pitch fluctuation, prosody, and timing irregularities were extracted from the samples. The AI-generated voices will be showing some inconsistencies, such as a monotonous or dull tone of voice or overly smooth transitions, where we can determine whether the voices are real or AI-generated by extracting these patterns. The audio feature comparison for the Real and AI-generated voices has been shown in Figure 2 below:

Figure 2: Audio Feature Comparison

Social Engineering Vector Detection

Here, the verbal and persuasive aspects of the scam calls have been analyzed, where the indicators such as creation of urgency, manipulating emotions, using authority names and demanding, and other personalized scams have been ex- amined [5]. The AI-generated scam scripts have been using traditional phishing attempts, where the fake voices frequently use these types of above-mentioned ways to manipulate others, and it does this with improved coherence and flu- ency.

Cross-Modal Verification

The cross-modal verification is used to combine the audio, contextualize the metadata, and interaction patterns. For ex, a scammer claiming to represent a bank should be cross-validated with the caller ID and the behavioral patterns, and the mismatches in the cross-validation have been identified when there are differences between the fake voice profile and the fake identity which the scammer uses, so we can detect the scammer easily.

Classification and evaluation Framework

Authenticity Classification

With the corresponding extracted features, the audios were classified into three types such as:

Legitimate Communication:

The audio files that contain unique voice and context will be placed under the legitimate communication category.

Suspicious AI-assisted Communication:

The audio files, which are partially manipulated, will be under the suspicious AI-assisted communication category.

Fully AI-generated scam calls:

The deepfake audio files will be placed under this category.

Deepfake Specific detection

This detection framework specifically targets the deepfake indicators, which include:

Overly Smooth Speech Patterns:

The voice samples that lack the human-like disfluencies will be placed under this category.

Multilingual Inconsistencies:

The multilingual inconsistency is said to occur when the sample’s cloned voice tries to attempts the cross-language translation.

Response timing mismatches:

When there is a mismatch, such as unnatural delays or instant replies that are not humanly possible, these cannot be present in natural speech. So these kind of samples will be placed under this mismatching category.

Evaluation Metrics

By utilizing both the metrics, such as standard and scam-specific, the performance has been evaluated here. The standard metrics, which include Accuracy, precision, Recall, and F1-score have been calculated, and scam-specific metrics such as:

Phishing Success Rate Reduction (PSRR):

The phishing success rate reduction is the ability to stop the scams which is going to happening before the victims share Sensitive data.

Voice Clone Identification Rate (VCIR):

Here the fake cloned voices will be correctly identified within the scam scenarios, where the scam call would be detected. The Evaluation metrics are shown in Table 2 below:

Table 2: Performance Evaluation Results

Classification Category	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	PSRR (%)	VCIR (%)
Legitimate Communication	94.2	92.8	95.1	93.9	N/A	N/A
Suspicious AI-assisted	87.6	85.4	89.3	87.3	78.5	82.1
Fully AI-generated Scam	91.8	93.2	90.7	91.9	89.2	91.5
Overall Framework	91.2	90.5	91.7	91.0	83.8	86.8

Note: PSRR – Phishing Success Rate Reduction; VCIR – Voice Clone Identification Rate

Countermeasure Integration

Technical Defences

Various defense algorithms were integrated into this framework, including Voice spoofing detection systems, real-time caller verification systems, anomaly-based monitoring, etc. These are the technical defenses that have been applied in this integrated framework.

Policy and Regulation

The policy and regulation state the importance of the legal and regulatory interventions, which are taken from international initiatives such as the UK Online Safety Act and U.S legislation addressing the misuse of deepfakes [4].

Conclusion

The AI-generated voices have enabled a new path for phishing attacks by allowing the realistic imitation and scalable automation. For detecting these kinds of forgeries, this article proposed an integrated methodology, which is the combi- nation of threat modeling, multi-modal feature analysis, and authenticity classification, where this framework can support the detection of cloned voices and the manipulation patterns, as well as it can build a strong defense against them, and it provides the technical solutions and regulatory actions against this mis- uses. This study also states the need for proactive defenses against AI-generated voice phishing by bringing together the experimental findings, modalities, and threat analysis, which is done by the integrated tools. Addressing these kinds of difficulties is essential to guard individuals, organizations against the new forms of cyber-enabled fraud.

References

Richard Fang, Dylan Bowman, and Daniel Kang. Voice-enabled ai agents can perform common scams. arXiv preprint arXiv:2410.15650, 2024.
Genesis Gregorious Genelza. A systematic literature review on ai voice cloning generator: A game-changer or a threat? Journal of Emerging Tech- nologies, 4(2):54–61, 2024.
Mumtaz Hussain and Tariq Rahim Soomro. Ai-powered cybercrime: A survey on emerging threats, tools & techniques for countermeasures. In 2024 26th International Multi-Topic Conference (INMIC), pages 1–6. IEEE, 2024.
Elton Chizindu Mpi. Digital rights: How fraudsters exploit artificial intelli- gence for fraud and deception.
Jingru Yu, Yi Yu, Xuhong Wang, Yilun Lin, Manzhi Yang, Yu Qiao, and Fei-Yue Wang. The shadow of fraud: The emerging danger of ai-powered social engineering and its possible cure. arXiv preprint arXiv:2407.15912, 2024.
Sai, K. M., Gupta, B. B., Hsu, C. H., & Peraković, D. (2021, December). Lightweight Intrusion Detection System In IoT Networks Using Raspberry pi 3b+. In SysCom (pp. 43-51).
Keesari, T., Goyal, M. K., Gupta, B., Kumar, N., Roy, A., Sinha, U. K., … & Goyal, R. K. (2021). Big data and environmental sustainability based integrated framework for isotope hydrology applications in India. Environmental Technology & Innovation, 24, 101889.
Gupta, B. B., & Gupta, D. (Eds.). (2020). Handbook of Research on Multimedia Cyber Security. IGI Global.

Cite As

Kalyan C S N (2025) AI-Generated Voices in Scam Calls: Phishing Gets a New Voice, Insights2Techinfo, pp.1

890700cookie-checkAI-Generated Voices in Scam Calls: Phishing Gets a New Voice

Post Views: 66

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.