Through the simultaneous analysis of voice signals and facial expressions, multimodal emotion recognition seeks to understand individual behavior. Feature fusion is essential to this process, as it allows information to be represented more richly across modalities. However, multimodal systems frequently suffer from temporal misalignment between modalities and from overfitting caused by high-dimensional feature spaces. To address these problems, an attention mechanism is developed that enables the network to automatically focus on the most informative local features; the network applies this mechanism to both audio-visual feature integration and temporal modeling. The two primary contributions of this work are: first, a multi-head self-attention mechanism fuses the audio and video features, reducing the influence of prior assumptions on the fusion process; and second, a bidirectional gated recurrent unit (BiGRU) models the temporal dynamics of the fused features, with autocorrelation coefficients along the time dimension serving as attention weights. Experimental results show that the proposed attention-based approach significantly improves multimodal emotion recognition accuracy.
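The abstract does not give the exact architecture of the fusion step, so the following is a minimal PyTorch sketch of the first contribution under stated assumptions: both modalities are projected to a shared dimension and concatenated along time, and a standard multi-head self-attention layer weighs tokens from both streams jointly. The class name `AttentionFusion` and the sizes `d_model=256`, `n_heads=4`, and the example feature dimensions are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuses audio and video features with multi-head self-attention (sketch)."""

    def __init__(self, audio_dim: int, video_dim: int,
                 d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Project each modality into a shared feature space before attention.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_a, audio_dim); video: (batch, T_v, video_dim)
        a = self.audio_proj(audio)
        v = self.video_proj(video)
        # Concatenating along time lets attention weigh tokens from both
        # modalities jointly, rather than via a hand-crafted fusion rule.
        x = torch.cat([a, v], dim=1)     # (batch, T_a + T_v, d_model)
        fused, _ = self.attn(x, x, x)    # multi-head self-attention
        return fused

# Example: 40-dim audio frames and 512-dim face embeddings, 50 steps each.
fusion = AttentionFusion(audio_dim=40, video_dim=512)
out = fusion(torch.randn(8, 50, 40), torch.randn(8, 50, 512))  # (8, 100, 256)
```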
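The abstract likewise does not specify how the autocorrelation coefficients become attention weights in the second contribution. One plausible reading, sketched below, computes lag-wise autocorrelation coefficients of the fused sequence, normalizes them with a softmax, and uses them to pool the BiGRU outputs over time. The class `TemporalBiGRU`, the hidden size, and the seven-class output head are all illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalBiGRU(nn.Module):
    """BiGRU over fused features with autocorrelation-based temporal attention."""

    def __init__(self, d_model: int = 256, hidden: int = 128, n_classes: int = 7):
        super().__init__()
        self.bigru = nn.GRU(d_model, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    @staticmethod
    def autocorr_weights(x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d). Compute lag-k autocorrelation coefficients of the
        # mean feature signal, then softmax them into weights over time steps.
        # (Assumption: lag index is mapped one-to-one onto time steps.)
        s = x.mean(dim=-1)                       # (batch, T)
        s = s - s.mean(dim=1, keepdim=True)      # center the signal
        T = s.size(1)
        denom = (s * s).sum(dim=1, keepdim=True) + 1e-8
        coeffs = torch.stack(
            [(s[:, :T - k] * s[:, k:]).sum(dim=1) for k in range(T)], dim=1
        ) / denom                                # (batch, T), lags 0..T-1
        return F.softmax(coeffs, dim=1)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        h, _ = self.bigru(fused)                 # (batch, T, 2*hidden)
        w = self.autocorr_weights(fused)         # (batch, T)
        pooled = (h * w.unsqueeze(-1)).sum(dim=1)  # attention-weighted pooling
        return self.classifier(pooled)

# Example: classify the fused sequence produced by the fusion sketch above.
model = TemporalBiGRU()
logits = model(torch.randn(8, 100, 256))  # (8, 7)
```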