The
seamless transformation of visual content into descriptive text and naturalistic
speech, termed Vision-to-Voice, represents a significant interdisciplinary
advancement at the intersection of computer vision, natural language processing
(NLP), and speech synthesis. This paper explores the development of an
end-to-end Vision-to-Voice pipeline, encompassing visual scene understanding,
semantic description generation, and high-quality speech synthesis, thereby
enabling AI systems to narrate visual content for human users. The proposed
methodology integrates Transformer-based image captioning models with
context-aware linguistic augmentation and neural vocoders trained for
expressive speech synthesis, yielding fluent and natural audio descriptions
of visual content. While individual advancements in image captioning and text-to-speech (TTS)
are well documented, their seamless fusion into an end-to-end, real-time system
presents unique research and engineering challenges, including context
preservation across modalities, maintaining linguistic fluency, and ensuring
audio naturalness. This paper addresses these gaps through a unified
encoder-decoder captioning module with Bahdanau Attention, followed by a
Tacotron 2-based Mel-spectrogram generation module and HiFi-GAN-based waveform
synthesis module. Extensive experimentation and evaluations using standard
datasets, including Flickr8K and LJSpeech, demonstrate the efficacy of the
proposed system in terms of caption quality (BLEU) and audio naturalness
(MOS). The Vision-to-Voice system holds promising applications in assistive
technologies, multimedia enrichment, and automated video annotation systems,
thereby contributing to both academic research and real-world accessibility
solutions.
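
As an informal illustration of the pipeline summarized above, the following sketch chains the three stages (caption generation, Tacotron 2-style mel-spectrogram prediction, and HiFi-GAN-style vocoding) into a single narration call. The model objects `caption_model`, `tacotron2`, and `hifigan` are hypothetical placeholders standing in for the paper's trained modules, not the authors' actual implementation; any callables with matching input and output conventions would fit this interface.

```python
# Minimal sketch of a three-stage Vision-to-Voice chain (assumed interfaces,
# not the paper's code): image -> caption text -> mel spectrogram -> waveform.
import torch

@torch.no_grad()
def vision_to_voice(image: torch.Tensor, caption_model, tacotron2, hifigan):
    """Narrate one image and return both the caption and the synthesized audio."""
    caption = caption_model(image)   # str, e.g. "a dog runs across a grassy field"
    mel = tacotron2(caption)         # torch.Tensor, shape (n_mels, frames)
    waveform = hifigan(mel)          # torch.Tensor of raw audio samples
    return caption, waveform
```

Keeping each stage behind a plain callable interface is one way to preserve the end-to-end flow described in the abstract while allowing the captioner, acoustic model, or vocoder to be swapped independently.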