AI-Generated Voices in Scam Calls: Phishing Gets a New Voice

By: C S Nakul Kalyan; Asia University

Abstract

The emergence of Artificial Intelligence (AI) has paved the way for realistic cloned voices, which are increasingly used in fraudulent calls. This study examines the growing threat of AI-generated voices in phishing scams and proposes a multilayered detection framework. The framework combines threat modeling, multi-modal feature analysis, and classification of voice calls as real or fake. Beyond detecting manipulated speech, the framework also incorporates countermeasure strategies, including technical defenses, regulatory actions, and public awareness initiatives against manipulated voices in scam calls. The study describes how this framework is applied and highlights the potential of AI-generated voice phishing to turn traditional fraud into automated digital fraud.

Keywords

AI-generated voices, Voice cloning, Deepfake detection, Voice phishing, Scam automation.

Introduction

The rapid development of AI has enabled voice synthesis and cloning techniques. Beyond their use in customer service and entertainment, these technologies pose a new threat: AI-generated voice phishing allows attackers to mimic trusted entities and spread misinformation in their name. Voice-enabled AI bots can carry out scam activities such as financial fraud and identity theft on their own [1]. The threat has grown as open-source tools for deepfake creation and voice cloning have made it easier for attackers to carry out voice-based fraud [2][4]. To counter these fraudulent activities, this article presents a multi-modal detection framework that combines threat modeling, multi-modal analysis, and deepfake voice detection to determine whether a call is a scam [3][5].

Proposed Methodology

This study proposes a multi-modal framework to detect scam calls generated by AI. The integrated method combines threat modeling, multi-modal analysis, deepfake detection, and countermeasures, as shown in Figure 1:

Figure 1: Overall Framework Architecture

Threat Modeling and Data Simulation

Scam call Scenarios

Common scam call scenarios include bank account fraud, IRS impersonation scams, crypto-transfer fraud, and identity theft. In real-world cases, attackers target these domains to obtain sensitive information or money.

Dataset Construction

The dataset combines legitimate speech corpora, such as GRID, TED Talks, and news interviews, with artificially generated scam calls. To clone scammer voices from the audio samples, voice cloning tools such as Real-Time Voice Cloning (RTVC), OpenVoice, and F5-TTS were used, enabling multilingual and cross-accent voice replication. The dataset composition and voice cloning tools are presented in Table 1 below:

Table 1: Dataset Composition and Voice Cloning Tools

| Data Source           | Sample Count | Duration (Hours) | Language/Accent        | Voice Cloning Tool | Purpose                       |
|-----------------------|--------------|------------------|------------------------|--------------------|-------------------------------|
| GRID Corpus           | 1,200        | 15.5             | English (British)      | RTVC               | Baseline legitimate speech    |
| TED Talks             | 800          | 22.3             | English (Multi-accent) | OpenVoice          | Natural conversation patterns |
| News Interviews       | 600          | 12.8             | English (American)     | F5-TTS             | Professional speech samples   |
| Synthetic Scam Calls  | 2,400        | 18.7             | Multilingual           | RTVC + OpenVoice   | AI-generated fraud scenarios  |
| Partially Manipulated | 900          | 8.2              | English + Others       | F5-TTS             | Hybrid real-synthetic content |
| Total Dataset         | 5,900        | 77.5             | Mixed                  | Various            | Complete framework testing    |
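As an illustration of how such a corpus can be organized for experiments, the following minimal Python sketch builds a manifest of audio samples with their source, label, and cloning tool. The directory layout, file paths, and the `build_manifest` helper are hypothetical and only mirror the categories in Table 1.

```python
import csv
from pathlib import Path

# Hypothetical mapping from data source to its label and cloning tool,
# mirroring the categories listed in Table 1.
SOURCES = {
    "grid_corpus":           {"label": "real",     "tool": "RTVC"},
    "ted_talks":             {"label": "real",     "tool": "OpenVoice"},
    "news_interviews":       {"label": "real",     "tool": "F5-TTS"},
    "synthetic_scam_calls":  {"label": "deepfake", "tool": "RTVC+OpenVoice"},
    "partially_manipulated": {"label": "partial",  "tool": "F5-TTS"},
}

def build_manifest(root: str, out_csv: str) -> None:
    """Walk a (hypothetical) directory tree of WAV files and write a manifest CSV."""
    rows = []
    for source, meta in SOURCES.items():
        for wav in Path(root, source).glob("*.wav"):
            rows.append({"path": str(wav), "source": source,
                         "label": meta["label"], "tool": meta["tool"]})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "source", "label", "tool"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    build_manifest("data/", "dataset_manifest.csv")
```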

Annotation Process

During annotation, each audio sample was analyzed and assigned to one of three categories: real, deepfake, or partially manipulated. To support more accurate analysis, additional metadata such as language, urgency cues, and contextual content were also annotated. These annotations make it possible to compare human and AI-generated phishing strategies.
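A minimal sketch of how such an annotation record could be represented is shown below; the `Annotation` class and its field names are assumptions for illustration, not the study's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical annotation record for one audio sample, covering the label and
# metadata fields described above (language, urgency cues, contextual content).
@dataclass
class Annotation:
    sample_id: str
    label: str                      # "real", "deepfake", or "partially_manipulated"
    language: str
    urgency_cues: List[str] = field(default_factory=list)
    context: str = ""               # e.g. "bank account fraud", "IRS impersonation"

example = Annotation(
    sample_id="scam_0042",
    label="deepfake",
    language="en-US",
    urgency_cues=["act now", "account will be closed"],
    context="bank account fraud",
)
print(example)
```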

Multi-modal Feature Analysis

Audio Feature Extraction

Automatic Speech Recognition (ASR) models were used to process the speech signals and convert the audio into phoneme-level representations. Patterns such as pitch fluctuation, prosody, and timing irregularities were then extracted from the samples. AI-generated voices tend to show inconsistencies such as a monotonous, dull tone or overly smooth transitions, so extracting these patterns helps determine whether a voice is real or AI-generated. The audio feature comparison for real and AI-generated voices is shown in Figure 2 below:

Figure 2: Audio Feature Comparison
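The following sketch shows one way such low-level features (pitch contour and spectral statistics) might be extracted with the librosa library; it is a minimal example under assumed settings, not the exact feature set used in the study.

```python
import numpy as np
import librosa

def extract_features(wav_path: str) -> dict:
    """Extract simple pitch and spectral statistics from one audio file."""
    y, sr = librosa.load(wav_path, sr=16000)

    # Fundamental frequency (pitch) contour; unvoiced frames come back as NaN.
    f0, voiced_flag, voiced_probs = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0 = f0[~np.isnan(f0)]

    # MFCCs summarize the spectral envelope of the voice.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    return {
        "pitch_mean": float(np.mean(f0)) if f0.size else 0.0,
        # A very low pitch standard deviation can hint at a monotonous, synthetic voice.
        "pitch_std": float(np.std(f0)) if f0.size else 0.0,
        "mfcc_mean": mfcc.mean(axis=1),
        "mfcc_std": mfcc.std(axis=1),
    }
```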

Social Engineering Vector Detection

Here, the verbal and persuasive aspects of scam calls were analyzed. Indicators such as the creation of urgency, emotional manipulation, appeals to authority combined with demands, and other personalized tactics were examined [5]. AI-generated scam scripts reuse traditional phishing tactics: the fake voices frequently employ the manipulation techniques mentioned above, but with improved coherence and fluency.
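A simple rule-based sketch of how transcribed call text could be scanned for such persuasion cues is given below; the cue lists and the `score_social_engineering` function are illustrative assumptions, not the study's actual detector.

```python
import re

# Hypothetical keyword patterns for common persuasion cues in phishing scripts.
CUES = {
    "urgency":   [r"\bimmediately\b", r"\bright now\b", r"\bwithin 24 hours\b"],
    "authority": [r"\bIRS\b", r"\byour bank\b", r"\bfraud department\b"],
    "demand":    [r"\bverify your account\b", r"\bconfirm your (PIN|password)\b",
                  r"\btransfer\b.*\b(bitcoin|crypto)\b"],
}

def score_social_engineering(transcript: str) -> dict:
    """Count how many cue patterns of each type appear in a call transcript."""
    return {
        cue: sum(bool(re.search(p, transcript, flags=re.IGNORECASE)) for p in patterns)
        for cue, patterns in CUES.items()
    }

print(score_social_engineering(
    "This is the fraud department of your bank. Verify your account immediately."
))
```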

Cross-Modal Verification

Cross-modal verification combines the audio with contextual metadata and interaction patterns. For example, a caller claiming to represent a bank should be cross-validated against the caller ID and behavioral patterns; mismatches between the synthetic voice profile and the claimed identity reveal the deception and make the scammer easier to detect.
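The sketch below shows one possible way to fuse these signals into a single decision; the thresholds, field names, and decision rules are assumptions made for illustration.

```python
# Hypothetical fusion of an audio-based deepfake score, a caller-ID check, and
# social-engineering cue counts into a single cross-modal verdict.
def cross_modal_verdict(deepfake_score: float,
                        claimed_org: str,
                        caller_id_org: str,
                        cue_counts: dict) -> str:
    """Return 'legitimate', 'suspicious', or 'scam' from combined evidence."""
    identity_mismatch = claimed_org.lower() != caller_id_org.lower()
    cue_total = sum(cue_counts.values())

    if deepfake_score > 0.8 and (identity_mismatch or cue_total >= 2):
        return "scam"
    if deepfake_score > 0.5 or identity_mismatch or cue_total >= 1:
        return "suspicious"
    return "legitimate"

print(cross_modal_verdict(0.91, "First National Bank", "Unknown VoIP",
                          {"urgency": 1, "authority": 2, "demand": 1}))
```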

Classification and Evaluation Framework

Authenticity Classification

Based on the extracted features, the audio samples were classified into three types (a minimal classifier sketch follows this list):

Legitimate Communication:

Audio files with an authentic voice and a consistent context are placed under the legitimate communication category.

Suspicious AI-assisted Communication:

Audio files that are partially manipulated fall under the suspicious AI-assisted communication category.

Fully AI-generated Scam Calls:

Fully deepfake audio files are placed under this category.
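As a sketch of how extracted features could be mapped to these three classes, the example below trains a standard classifier on placeholder data; the choice of a random forest and the feature layout are assumptions, not necessarily the study's model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

LABELS = ["legitimate", "suspicious_ai_assisted", "fully_ai_generated"]

# Placeholder feature matrix: in practice each row would hold per-call features
# such as pitch statistics, MFCC summaries, and social-engineering cue counts.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))       # synthetic stand-in features
y = rng.integers(0, 3, size=300)     # synthetic stand-in labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

pred = clf.predict(X_test[:1])[0]
print("Predicted class:", LABELS[pred])
```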

Deepfake-Specific Detection

This part of the framework specifically targets deepfake indicators, which include the following (a heuristic sketch follows this list):

Overly Smooth Speech Patterns:

Voice samples that lack human-like disfluencies are placed under this category.

Multilingual Inconsistencies:

A multilingual inconsistency occurs when a cloned voice attempts cross-language speech or translation.

Response Timing Mismatches:

Mismatches such as unnatural delays or impossibly fast replies do not occur in natural speech, so samples that exhibit them are placed under this category.
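Below is a rough sketch of how two of these indicators, disfluency rate and response timing, might be checked; the filler-word list and the thresholds are illustrative assumptions.

```python
import re

FILLERS = re.compile(r"\b(uh|um|erm|you know|I mean)\b", re.IGNORECASE)

def disfluency_rate(transcript: str) -> float:
    """Fraction of words that are fillers; very low values suggest overly smooth, synthetic speech."""
    words = transcript.split()
    return len(FILLERS.findall(transcript)) / max(len(words), 1)

def timing_suspicious(response_delays_sec: list) -> bool:
    """Flag impossibly fast replies or unnaturally long pauses between turns."""
    return any(d < 0.2 or d > 8.0 for d in response_delays_sec)

transcript = "Your account has been compromised. Please verify your card number now."
print("disfluency rate:", disfluency_rate(transcript))           # 0.0 -> suspiciously smooth
print("timing suspicious:", timing_suspicious([0.05, 1.2, 0.8]))  # True (0.05 s reply)
```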

Evaluation Metrics

Performance was evaluated using both standard and scam-specific metrics. The standard metrics include Accuracy, Precision, Recall, and F1-score; the scam-specific metrics are:

Phishing Success Rate Reduction (PSRR):

The phishing success rate reduction measures the framework's ability to stop scams before victims share sensitive data.

Voice Clone Identification Rate (VCIR):

The voice clone identification rate measures how often fake cloned voices are correctly identified within scam scenarios, so that the scam call can be detected. The evaluation results are shown in Table 2 below:

Table 2: Performance Evaluation Results

| Classification Category  | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | PSRR (%) | VCIR (%) |
|--------------------------|--------------|---------------|------------|--------------|----------|----------|
| Legitimate Communication | 94.2         | 92.8          | 95.1       | 93.9         | N/A      | N/A      |
| Suspicious AI-assisted   | 87.6         | 85.4          | 89.3       | 87.3         | 78.5     | 82.1     |
| Fully AI-generated Scam  | 91.8         | 93.2          | 90.7       | 91.9         | 89.2     | 91.5     |
| Overall Framework        | 91.2         | 90.5          | 91.7       | 91.0         | 83.8     | 86.8     |

Note: PSRR – Phishing Success Rate Reduction; VCIR – Voice Clone Identification Rate
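For readers who want to reproduce the standard metrics on their own labels, the sketch below computes them with scikit-learn; the PSRR and VCIR functions are plausible interpretations of the definitions above, not necessarily the exact formulas used in the study, and the labels are placeholder data.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical ground-truth and predicted labels
# (0 = legitimate, 1 = suspicious AI-assisted, 2 = fully AI-generated).
y_true = [0, 0, 1, 2, 2, 1, 0, 2, 1, 2]
y_pred = [0, 0, 1, 2, 1, 1, 0, 2, 1, 2]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy={acc:.3f} Precision={prec:.3f} Recall={rec:.3f} F1={f1:.3f}")

# Assumed PSRR: fraction of scam attempts blocked before sensitive data was shared.
def psrr(blocked_before_disclosure: int, total_scam_attempts: int) -> float:
    return blocked_before_disclosure / max(total_scam_attempts, 1)

# Assumed VCIR: fraction of cloned voices correctly identified in scam scenarios.
def vcir(clones_identified: int, total_cloned_voices: int) -> float:
    return clones_identified / max(total_cloned_voices, 1)

print("PSRR:", psrr(84, 100), "VCIR:", vcir(87, 100))
```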

Countermeasure Integration

Technical Defenses

Various defense mechanisms are integrated into this framework, including voice spoofing detection systems, real-time caller verification, and anomaly-based monitoring. These are the technical defenses applied within the integrated framework.

Policy and Regulation

This component highlights the importance of legal and regulatory interventions, drawing on international initiatives such as the UK Online Safety Act and U.S. legislation addressing the misuse of deepfakes [4].

Conclusion

AI-generated voices have opened a new path for phishing attacks by enabling realistic imitation and scalable automation. To detect these forgeries, this article proposed an integrated methodology combining threat modeling, multi-modal feature analysis, and authenticity classification. The framework supports the detection of cloned voices and manipulation patterns, builds a strong defense against them, and pairs technical solutions with regulatory actions against this misuse. The study also underlines the need for proactive defenses against AI-generated voice phishing by bringing together experimental findings, modalities, and threat analysis through the integrated tools. Addressing these challenges is essential to protect individuals and organizations against new forms of cyber-enabled fraud.

References

  1. Richard Fang, Dylan Bowman, and Daniel Kang. Voice-enabled AI agents can perform common scams. arXiv preprint arXiv:2410.15650, 2024.
  2. Genesis Gregorious Genelza. A systematic literature review on AI voice cloning generator: A game-changer or a threat? Journal of Emerging Technologies, 4(2):54–61, 2024.
  3. Mumtaz Hussain and Tariq Rahim Soomro. AI-powered cybercrime: A survey on emerging threats, tools & techniques for countermeasures. In 2024 26th International Multi-Topic Conference (INMIC), pages 1–6. IEEE, 2024.
  4. Elton Chizindu Mpi. Digital rights: How fraudsters exploit artificial intelligence for fraud and deception.
  5. Jingru Yu, Yi Yu, Xuhong Wang, Yilun Lin, Manzhi Yang, Yu Qiao, and Fei-Yue Wang. The shadow of fraud: The emerging danger of AI-powered social engineering and its possible cure. arXiv preprint arXiv:2407.15912, 2024.
  6. Sai, K. M., Gupta, B. B., Hsu, C. H., & Peraković, D. (2021, December). Lightweight Intrusion Detection System In IoT Networks Using Raspberry pi 3b+. In SysCom (pp. 43-51).
  7. Keesari, T., Goyal, M. K., Gupta, B., Kumar, N., Roy, A., Sinha, U. K., … & Goyal, R. K. (2021). Big data and environmental sustainability based integrated framework for isotope hydrology applications in India. Environmental Technology & Innovation, 24, 101889.
  8. Gupta, B. B., & Gupta, D. (Eds.). (2020). Handbook of Research on Multimedia Cyber Security. IGI Global.

Cite As

Kalyan C S N (2025) AI-Generated Voices in Scam Calls: Phishing Gets a New Voice, Insights2Techinfo, pp.1
