By: C S Nakul Kalyan, Asia University

Abstract

The development of generative AI models for text-to-video synthesis poses threats, including the creation of realistic, manipulated media and the potential for spreading fake misinformation by Scammers who utilize technologies such as deepfakes. To overcome these kinds of misuse of technologies, a two-part framework has been proposed, which can generate videos and it can perform deepfake text detection, where it can detect the manipulated text in the audio. The first part of the framework is the process of text-to-video combination, where it uses Latent Diffusion Model (LDM), which is used for the low-level creation of temporally consistent video frames, and Large Language Model (LLM), which is used to direct the high-level contents. The second part of the frame- work uses a discriminative model to detect fake machine-generated text (MGT) to reduce the misuse that is taking place by using this technology. To detect the set of features, such as linguistic, semantic, stylometric, sentiment, and emoji- based characteristics, this framework uses a hybrid deep learning architecture that adds a Convolutional Neural Network (CNN) with an LSTM Network. To check the text detection model’s performance, metrics such as Accuracy, Precision, Recall, and F1-Score have been used.

Keywords

Text-to-Video synthesis, Hybrid deep learning, deepfake detection, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Machine- generated text (MGT) detection.

Introduction

The quick development of Artificial Intelligence (AI) has enabled the technology to create highly realistic media by using text-to-video synthesis [4]. This technology uses advanced generative models to make the text into visual con- tent. This advanced technology of creating realistic text-to-visual content poses crucial threats, such as the generation of realistic deepfake media and spreading misinformation about someone by generating fake content regarding the particular individual or groups [5]. To overcome these threats, this study introduces a two-part framework for deep-fake text identification and video production. For text-to-video generation, the pipeline uses a large language model (LLM) and uses the Latent Diffusion Model (LDM) to generate the videos from the texts [4]. The other part is a robust detection model, which is used to detect the machine-generated Text (MGT) [1][5]. In this article, we will go through the creation of a Hybrid deep learning architecture, which will create the videos and detect Machine-generated Text (MGT). The complete overall system architecture has been shown in Figure 1 below:

Proposed Methodology

This methodology presents a dual framework architecture where they can create high-quality, realistic-looking videos from text prompts, and they can detect deepfake text or videos that have been generated from the text. Here, the proposed method uses advanced generative and discriminative AI models for both the creation of realistic videos and the detection of deepfake content.

Data Collection and Preparation

This dual-component model has been trained on a carefully prepared dataset, such as for the text-to-video synthesis module. The dataset that has been used here is a vareity of high-resolution and time corresponding, which covers a wide range of visual concepts, actions, and temporal dynamics for converting text-to- video. All data is pre-processed to ensure that there is consistent video resolution, frame rate, and text-to-frame alignment. For the deepfake text detection module, the dataset contains a balanced combination of machine-generated text (MGT) and human-written text (HWT) [5]. The MGT has been generated by using Large Language Models (LLMs), which guarantees having more deepfake text patterns, and the HWT is taken from various domains to capture the linguistic library [1]. The characteristics and data composition have been shown in Table 1 below:

Table 1: Dataset Composition and Characteristics

Dataset Component	Type	Source	Size	Resolution/Format	Purpose
Text-to-Video Training Data	Video-Text Pairs	Multi-domain corpus	50,000 pairs	1080p, 30fps, MP4	Video generation training
Human-Written Text (HWT)	Natural Text	News articles, blogs, social media	25,000 samples	Text sequences (50-500 words)	Baseline for detection
Machine-Generated Text (MGT)	Synthetic Text	GPT-4, ChatGPT, Bard outputs	25,000 samples	Text sequences (50-500 words)	Deepfake detection training
Validation Set	Mixed	HWT + MGT	10,000 samples	Balanced 50:50 ratio	Model validation
Test Set	Mixed	HWT + MGT	5,000 samples	Balanced 50:50 ratio	Final evaluation
Video Quality Assessment	Reference Videos	Professional content	1,000 videos	4K, various lengths	Quality benchmarking

Text-to-video Synthesis

The main purpose of Text-to-video Synthesis is to convert the written content into logical video clips [4]. This will be done by the multi-stage frameworks, which use generative models such as:

Text Encoding

The Text encoding is done by a Strong Language Model (LLM) in which trans- lates the high level semantic information from the text to a frame-by-frame plan or a script for the video. This step is crucial to maintain the narrative flow.

Video Generation

The main video generation will be done by using the Latent Diffusion Models (LLMs), in which they can work in a compressed latent space, which will be used to reduce the computational cost, and they are particularly used for high-resolution video synthesis [4]. By doing the diffusion process with the encoded text prompt, the model can create the frames that capture the textual description. This modal ensures high visual quality and frame consistency for the generated videos.

Audio Integration

The video will be combined with the synthesized audio to give a fully engaged experience. Text-to-Speech models, which have been trained on a variety of voices and tones, will be used to convert the text prompt into a natural-sounding voice-over. The sound effects and Background music can be added based on matching the video content, where it will increase the realism of the video output.

Deepfake Text Detection

To detect Machine-generated Text (MGT), the deepfake detection module uses a discriminative model [1][5]. A hybrid model that can be used to detect the linguistic features and contextual embeddings can be used to overcome this binary classification problem [2].

Feature Extraction

To detect the manipulated content, the model will extract some features from the text, which are [1]:

Linguistic Features:

Here, the features such as grammatical structure, syntax, and word choice have been extracted mainly to identify the patterns of machine-generated text (MGT) [1][2].

Semantic Features:

Contextualized embeddings from the pretrained models, like BERT, have been used to extract the deeper meaning, relationships, and connections between the words [1].

Stylometric Features:

These features will analyze and extract the writing style, sentence length variation, and word frequency, etc [1].

Sentiment and Emoji Features:

This feature analyzes the emotional tone of the voice and the usage of emojis, where it can easily identify the irregularities which has been produced by the machine-generated Text (MGT) [1].

Model Architecture

A hybrid deep learning framework, which is a combination of a convolutional neural network (CNN) and long-Short-term memory (LSTM), will be used for classification purposes [1]. The CNN will be used to extract the local n-gram- like characters from the text data, where the LSTM will be used to capture the temporal and sequential relationships between the texts. The combination of these 2 frameworks will enable the model to use both the local and long-range contextual information, which will increase the accuracy and robustness of the detection system.

System Integration and Evaluation

The two parts, such as text detection and video synthesis, will be combined into a single unified framework. This process starts with the text input, which is given by the user, where the text is analyzed with the deepfake detection module [5]. If the entered text is human-written, then it proceeds with the next step of text-to-video synthesis, where the process of video generation will take place. If the text entered is fake, the system can flag the content and alert the user, or it can take any other predefined steps to stop the spread of false information [3][5].

The Hybrid CNN-LSTM Architecture has been shown in Figure 2 below:

The overall system performance will be evaluated using the metrics such as:

Deepfake Text Detection

The performance of the deepfake text detection model will be measured by metrics such as accuracy, precision, recall, and F1-score, and it can also identify the effectiveness of the model in detecting human and machine-generated content [1][2][5].

Video Synthesis

Here, the quality of the generated video will be assessed based on standard quantitative measures and human perceptions [4]. It also evaluates the quality based on visual fidelity, temporal coherence, and the alignment of the generated video content with the original text prompt. The Overall performance Evaluation results have been shown in Table 2 Below:

Table 2: Performance Evaluation Results

Model/Component	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Processing Time (ms)
Deepfake Text Detection
CNN-only baseline	84.2	83.7	84.8	84.2	12.3
LSTM-only baseline	87.1	86.9	87.4	87.1	18.7
Hybrid CNN-LSTM (Proposed)	92.6	92.3	92.9	92.6	15.2
BERT-based classifier	90.4	90.1	90.7	90.4	45.6
Text-to-Video Synthesis Quality
Visual Fidelity Score	88.7	–	–	–	2,340
Temporal Coherence Score	91.2	–	–	–	2,340
Text-Video Alignment Score	89.4	–	–	–	2,340
Overall System Performance	90.1	91.2	89.6	90.4	2,355

Conclusion

In this study, we have gone through a single framework for the important task of deepfake text detection and the production of synthetic video content. By using a multi-stage framework that combines a text-to-video pipeline with a hybrid deep learning model for classifying machine-generated text (MGT), we have built an approach against the threats which is posed by advanced AI. This method uses LLMs for directing the content and LDMs for generating the video content. The hybrid CNN-LSTM model performed well to differentiate between human-written and machine-generated text with being trained on a wide range of linguistic, semantic, stylometric, sentiment, and emoji variables. The performance metrics such as Accuracy, Precision, Recall, and F1-score has been shown a effective results where they can be used for practical use.

References

Alicia Tsui Ying Chong, Hui Na Chua, Muhammed Basheer Jasser, and Richard TK Wong. Bot or human? detection of deepfake text with semantic, emoji, sentiment and linguistic features. In 2023 IEEE 13th International Conference on System Engineering and Technology (ICSET), pages 205–210. IEEE, 2023.
Adamya Gaur, Sanjay Kumar Singh, and Pranshu Saxena. Performance analysis of deepfake text detection techniques on social-media. In 2024 Inter- national Conference on Distributed Computing and Optimization Techniques (ICDCOT), pages 1–6. IEEE, 2024.
Jun Jang and Thai Le. The art of deepfake text obfuscation: Lessons learned from a manual, first-person perspective. In 2024 International Conference on Electrical, Computer and Energy Technologies (ICECET, pages 1–6. IEEE, 2024.
P Indira Priya, M Manju, R Anitha, R Harsavardhini, and S Sri Harini. Transforming text to video: Leveraging advanced generative ai techniques.In 2024 Asian Conference on Intelligent Technologies (ACOIT), pages 1–6. IEEE, 2024.
Jiameng Pu, Zain Sarwar, Sifat Muhammad Abdullah, Abdullah Rehman, Yoonjin Kim, Parantapa Bhattacharya, Mobin Javed, and Bimal Viswanath. Deepfake text detection: Limitations and opportunities. In 2023 IEEE sym- posium on security and privacy (SP), pages 1613–1630. IEEE, 2023.
Gupta, B. B., Misra, M., & Joshi, R. C. (2012). An ISP level solution to combat DDoS attacks using combined statistical based approach. arXiv preprint arXiv:1203.2400.
Jain, A. K., & Gupta, B. B. (2018). Rule-based framework for detection of smishing messages in mobile environment. Procedia Computer Science, 125, 617-623.
Chhabra, M., Gupta, B., & Almomani, A. (2013). A novel solution to handle DDOS attack in MANET. Journal of Information Security Vol. 4 No. 3 (2013) , Article ID: 34631 , 15 pages

Cite As

Kalyan C S N (2025) Text-to-Video Synthesis: Creating Deepfakes from Words Alone, Insights2Techinfo, pp.1

891500cookie-checkText-to-Video Synthesis: Creating Deepfakes from Words Alone

Post Views: 11

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Text-to-Video Synthesis: Creating Deepfakes from Words Alone

Abstract

Keywords

Introduction

Proposed Methodology

Data Collection and Preparation

Text-to-video Synthesis

Text Encoding

Video Generation

Audio Integration

Deepfake Text Detection

Feature Extraction

Linguistic Features:

Semantic Features:

Stylometric Features:

Sentiment and Emoji Features:

Model Architecture

System Integration and Evaluation

Deepfake Text Detection

Video Synthesis

Conclusion

References

Cite As

Leave a Reply Cancel reply

Detecting and Preventing Phishing Attacks in IoT-Based Smart Healthcare Systems

Data-Driven Insights into Rare Disease Diagnosis and Treatment with AI

Genetic Algorithms and Data Analytics for Cybersecurity in Phishing and Blockchain Systems

Machine Learning in Biometric Security Systems

The Role of AI and Machine Learning in Cloud Storage

How AI is Revolutionizing Cyber Forensics

Role of Natural Language Processing (NLP) in Email Phishing Detection

LLM-Based Phishing Detection: URL Phishing and Voice Phishing

Beyond the Code: A Systematic Review of Psychological Techniques in Phishing Attacks

Improving Generalization in Phishing URL Detection: A Review of Multimodal BERT and BERT-PhishFinder Models

QR code Phishing URL Detection Based on Lightweight Deep Learning Model