Realistic Video Synthesis from Audio using GAN
2025, International Journal for Research in Applied Science & Engineering Technology (IJRASET)
https://doi.org/10.22214/IJRASET.2025.73064
Abstract
Realistic video generation from audio input is a challenging and emerging problem at the intersection of natural language processing, computer vision, and generative modeling. Automatically generating coherent, visually compelling video from raw audio has promising applications in media creation, virtual education, assistive technologies, and entertainment. Manual video creation remains time-consuming and skill-intensive, while automated solutions often lack semantic alignment and visual realism. To address this gap, this project proposes an end-to-end pipeline that synthesizes realistic video content from audio input using Generative Adversarial Networks (GANs). The system first transcribes the user's audio with OpenAI's Whisper ASR model. A language model (e.g., Groq LLaMA or OpenAI GPT) then extracts a meaningful textual description from the transcript. This script is used to generate key search terms for visual content retrieval, and high-quality imagery is sourced from the Pexels API. Narration is synthesized with Edge TTS, and synchronized subtitles are created. The images are compiled into a dynamic video using MoviePy, and visual quality is further enhanced with Real-ESRGAN super-resolution. The final output is a short, high-resolution, contextually accurate video with natural narration and relevant imagery. This work demonstrates the effectiveness of combining audio processing, NLP, GAN-based enhancement, and open content APIs to automate realistic video generation from scratch.
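The scene-planning and subtitle-timing steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Scene`, `plan_scenes`, `to_srt`, and `longest_word` are hypothetical, the keyword function is a toy stand-in for the language-model step, and durations are estimated from word count rather than measured from the synthesized speech. No calls to Whisper, Pexels, Edge TTS, MoviePy, or Real-ESRGAN are made.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scene:
    text: str     # narration sentence shown as the subtitle
    query: str    # search term that would be sent to an image API
    start: float  # subtitle start time, in seconds
    end: float    # subtitle end time, in seconds


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def longest_word(sentence: str) -> str:
    """Toy stand-in for the LLM keyword step: pick the longest word."""
    return max(sentence.split(), key=len).lower()


def plan_scenes(
    script: str,
    keyword_fn: Callable[[str], str],
    words_per_second: float = 2.5,  # assumed speaking rate, not from the paper
) -> List[Scene]:
    """Split a script into sentences; give each a search query and a time slot."""
    normalized = script.replace("?", ".").replace("!", ".")
    sentences = [s.strip() for s in normalized.split(".") if s.strip()]
    scenes: List[Scene] = []
    t = 0.0
    for sentence in sentences:
        # Estimated duration; a real pipeline would use the TTS audio length.
        duration = max(1.0, len(sentence.split()) / words_per_second)
        scenes.append(Scene(sentence, keyword_fn(sentence), t, t + duration))
        t += duration
    return scenes


def to_srt(scenes: List[Scene]) -> str:
    """Render the planned scenes as SubRip (SRT) subtitle text."""
    blocks = [
        f"{i}\n{srt_timestamp(sc.start)} --> {srt_timestamp(sc.end)}\n{sc.text}\n"
        for i, sc in enumerate(scenes, start=1)
    ]
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = plan_scenes("Solar panels convert sunlight. They power homes.", longest_word)
    for scene in demo:
        print(scene.query, scene.start, scene.end)
    print(to_srt(demo))
```

In a full pipeline, each `Scene.query` would drive image retrieval and each `start`/`end` pair would set how long the retrieved image stays on screen, so the subtitles, narration, and imagery share one timeline.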
References (8)
- A. Radford et al., "Robust speech recognition via large-scale weak supervision," arXiv, 2022. Available: https://arxiv.org/abs/2212.04356
- C. Ledig et al., "Photo-realistic single image super-resolution using a generative adversarial network," in CVPR Proceedings, 2017, pp. 4681-4690.
- X. Wang et al., "Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data," in ICCV Workshops, 2021.
- Y. Wu et al., "TA2V: Generating aligned video from audio and text using diffusion models," IEEE Transactions on Multimedia, 2024.
- Groq Inc., "Groq API for fast inference using LLMs (LLaMA/GPT)," 2024. [Online]. Available: https://groq.com
- MoviePy Developers, "MoviePy: A Python library for editing video programmatically," 2023. [Online]. Available: https://zulko.github.io/moviepy/
- OpenAI, "Whisper: Open-source speech-to-text system," 2023. [Online]. Available: https://github.com/openai/whisper
- Pexels, "Pexels API: Access to royalty-free images and videos," 2023. [Online]. Available: https://www.pexels.com/api/