The
seamless transformation of visual content into descriptive text and naturalistic
speech, termed Vision-to-Voice, represents a significant interdisciplinary
advancement at the intersection of computer vision, natural language processing
(NLP), and speech synthesis. This paper explores the development of an
end-to-end Vision-to-Voice pipeline, encompassing visual scene understanding,
semantic description generation, and high-quality speech synthesis, thereby
enabling AI systems to narrate visual content for human users. The proposed
methodology integrates Transformer-based image captioning models with
context-aware linguistic augmentation and neural vocoders trained for
expressive speech synthesis, yielding fluent and natural audio descriptions
of visual content. While individual advancements in image captioning and text-to-speech (TTS)
are well documented, their seamless fusion into an end-to-end, real-time system
presents unique research and engineering challenges, including context
preservation across modalities, maintaining linguistic fluency, and ensuring
audio naturalness. This paper addresses these gaps through a unified
encoder-decoder captioning module with Bahdanau Attention, followed by a
Tacotron 2-based Mel-spectrogram generation module and HiFi-GAN-based waveform
synthesis module. Extensive experimentation and evaluations using standard
datasets, including Flickr8K and LJSpeech, demonstrate the efficacy of the
proposed system in terms of caption quality (BLEU) and audio naturalness
(MOS). The Vision-to-Voice system holds promising applications in assistive
technologies, multimedia enrichment, and automated video annotation systems,
thereby contributing to both academic research and real-world accessibility
solutions.
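
As an informal illustration of the pipeline summarized above, the following sketch chains the three stages (caption generation, Tacotron 2-style mel-spectrogram prediction, and HiFi-GAN-style vocoding) into a single narration call. The model objects `caption_model`, `tacotron2`, and `hifigan` are hypothetical placeholders standing in for the paper's trained modules, not the authors' actual implementation; any callables with matching input and output conventions would fit this interface.

```python
# Minimal sketch of a three-stage Vision-to-Voice chain (assumed interfaces,
# not the paper's code): image -> caption text -> mel spectrogram -> waveform.
import torch

@torch.no_grad()
def vision_to_voice(image: torch.Tensor, caption_model, tacotron2, hifigan):
    """Narrate one image and return both the caption and the synthesized audio."""
    caption = caption_model(image)   # str, e.g. "a dog runs across a grassy field"
    mel = tacotron2(caption)         # torch.Tensor, shape (n_mels, frames)
    waveform = hifigan(mel)          # torch.Tensor of raw audio samples
    return caption, waveform
```

Keeping each stage behind a plain callable interface is one way to preserve the end-to-end flow described in the abstract while allowing the captioner, acoustic model, or vocoder to be swapped independently.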