Impact Factor (2025): 6.9
DOI Prefix: 10.47001/IRJIET
Vol 9 No 2025 (2025): Volume 9, Special Issue of ICCIS-2025 May 2025 | Pages: 206-213
International Research Journal of Innovations in Engineering and Technology
OPEN ACCESS | Research Article | Published Date: 11-06-2025
The seamless transformation of visual content into descriptive text and naturalistic speech, termed Vision-to-Voice, represents a significant interdisciplinary advance at the intersection of computer vision, natural language processing (NLP), and speech synthesis. This paper presents an end-to-end Vision-to-Voice pipeline encompassing visual scene understanding, semantic description generation, and high-quality speech synthesis, enabling AI systems to narrate visual content for human users. The proposed methodology integrates Transformer-based image captioning with context-aware linguistic augmentation and neural vocoders trained for expressive speech synthesis, producing fluent audio descriptions of visual content. While individual advances in image captioning and text-to-speech (TTS) are well documented, their fusion into an end-to-end, real-time system raises distinct research and engineering challenges, including context preservation across modalities, linguistic fluency, and audio naturalness. This paper addresses these gaps through a unified encoder-decoder captioning module with Bahdanau attention, followed by a Tacotron 2-based Mel-spectrogram generation module and a HiFi-GAN-based waveform synthesis module. Experiments on standard datasets, including Flickr8K and LJSpeech, demonstrate the efficacy of the proposed system in terms of caption quality (BLEU) and audio naturalness (Mean Opinion Score, MOS). The Vision-to-Voice system has promising applications in assistive technologies, multimedia enrichment, and automated video annotation, contributing to both academic research and real-world accessibility solutions.
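The three-stage pipeline named in the abstract (encoder-decoder captioning → Tacotron 2 Mel-spectrogram generation → HiFi-GAN waveform synthesis) can be sketched structurally as below. This is a minimal illustrative skeleton only: every function body, name, and parameter (e.g. `frames_per_char`, `hop_length`) is an assumption standing in for the trained models, not the authors' implementation.

```python
# Structural sketch of the Vision-to-Voice pipeline: each stage is a stub
# whose interface mirrors the module it stands in for. All internals are
# placeholders; real stages would wrap trained neural models.
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineOutput:
    caption: str           # text produced by the captioning module
    mel_frames: int        # Mel-spectrogram length (Tacotron 2 stage)
    waveform: List[float]  # audio samples (HiFi-GAN vocoder stage)


def caption_image(image_id: str) -> str:
    """Stand-in for the encoder-decoder captioner with Bahdanau attention."""
    return f"a description of {image_id}"


def text_to_mel(caption: str, frames_per_char: int = 5) -> int:
    """Stand-in for Tacotron 2: caption length drives spectrogram length."""
    return len(caption) * frames_per_char


def mel_to_waveform(mel_frames: int, hop_length: int = 256) -> List[float]:
    """Stand-in for HiFi-GAN: one hop of samples per spectrogram frame."""
    return [0.0] * (mel_frames * hop_length)


def vision_to_voice(image_id: str) -> PipelineOutput:
    """Chain the three modules end to end, as the paper's system does."""
    caption = caption_image(image_id)
    mel = text_to_mel(caption)
    audio = mel_to_waveform(mel)
    return PipelineOutput(caption, mel, audio)
```

The design point this sketch captures is that the stages are loosely coupled through intermediate representations (caption text, then a spectrogram), so each module can be trained and evaluated independently before being composed.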
Keywords: NLG (Natural Language Generation), LLMs (Large Language Models), Perplexity, Text Coherence
P. Jayanth, K. Lakshmi Sree, K. Karthik Kumar Reddy, G. Om Prakash, & G. Reddy Prasad. (2025). Vision-to-Voice: AI for generating Description & Audio of Visual Content. In Proceedings of the Second International Conference on Computing and Intelligent Systems (ICCIS-2025), published in IRJIET, Volume 9, Special Issue ICCIS-2025, pp. 206-213. DOI: https://doi.org/10.47001/IRJIET/2025.ICCIS-202533
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.