Impact Factor (2025): 6.9
DOI Prefix: 10.47001/IRJIET
Recent advances in neural Text-to-Speech (TTS) systems have produced speech with high
intelligibility and naturalness, yet most deployed systems still sound
emotionally neutral. This lack of affective expressiveness limits user
engagement and degrades the quality of human–computer interaction, especially
in applications such as virtual assistants, audiobooks, education, and
accessibility technologies. This work proposes an emotion-infused TTS framework
that extends a Tacotron-based sequence-to-sequence architecture with explicit
emotion conditioning. The system leverages the Emotional Speech Database (ESD)
to model five emotional categories—neutral, happy, angry, sad, and surprise—and
incorporates emotion vectors alongside text embeddings in the encoder–decoder
pipeline. Mel-spectrograms predicted by the model are converted to waveforms
using the Griffin–Lim algorithm. Experimental training on the English-language
emotional subsets of ESD demonstrates stable convergence of mel-spectrogram
reconstruction loss and the capability to synthesize perceptually distinct
emotional speech, as observed through qualitative waveform and spectrogram
analysis. A web-based interface is further developed to enable end-user
interaction, allowing text input or file upload with selectable emotional
style. The proposed system indicates that explicit emotion conditioning can
enhance the expressiveness of neural TTS without sacrificing
intelligibility, and it provides a practical foundation for emotionally aware
human–machine communication.
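
As a minimal illustration of the emotion conditioning described above, the sketch below appends an emotion vector to each time step of a sequence of text embeddings. It is not the authors' implementation: the abstract does not say how the emotion vectors are built or combined, so the one-hot encoding, the concatenation strategy, the function names (`emotion_one_hot`, `condition_on_emotion`), and the toy dimensions are all assumptions made for illustration. Plain Python lists are used instead of a tensor library only to keep the example dependency-free.

```python
# Hypothetical sketch: names, encoding, and dimensions are illustrative
# assumptions, not details taken from the paper.
EMOTIONS = ["neutral", "happy", "angry", "sad", "surprise"]  # the five ESD categories


def emotion_one_hot(emotion):
    """Return a one-hot vector for one of the five emotion categories."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    return [1.0 if name == emotion else 0.0 for name in EMOTIONS]


def condition_on_emotion(text_embeddings, emotion):
    """Append the emotion vector to every time step of the text embeddings.

    text_embeddings: list of T vectors, each of length D.
    Returns a list of T vectors, each of length D + len(EMOTIONS).
    """
    emotion_vec = emotion_one_hot(emotion)
    return [list(step) + emotion_vec for step in text_embeddings]


# Toy input: 3 time steps with 4-dimensional embeddings (placeholder values).
text_embeddings = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
conditioned = condition_on_emotion(text_embeddings, "happy")
print(len(conditioned), len(conditioned[0]))  # prints: 3 9
```

In a trained model the one-hot vector would more likely be replaced by a learned emotion embedding, and the conditioned sequence would feed the encoder or decoder of the Tacotron-style network; both choices are left open here.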
Country: India
IRJIET, Volume 9, Issue 12, December 2025, pp. 192-198