Abstract: To address the limited performance of current multimodal emotion recognition methods, a dual-source emotion recognition model based on the Transformer and enhanced information fusion is proposed. The model consists of audio and video encoding modules and a dual-source enhanced feature fusion module. The video encoding branch uses MobileViTv2 to extract the spatial features of each video frame and embeds a residual structure in the Transformer encoder to strengthen the extraction of short-term associated semantic information across frames. A dimensionality matcher is built into the audio feature extraction part, which avoids potential heterogeneity gaps and improves the robustness of model training. A low-parameter cross-modal attention mechanism is introduced in the fusion of audio and video features to enhance the feature fusion ability from two perspectives. Comparison and ablation experiments demonstrate the effectiveness of the proposed method on multimodal emotion recognition tasks.
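As a rough illustration of what a bidirectional ("two perspectives") cross-modal attention fusion could look like, the PyTorch sketch below attends audio to video and video to audio with a single shared attention module, one plausible way to keep the parameter count low. This is not the paper's exact design; the class name, dimensions, weight sharing, and pooling are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's exact design): a
# low-parameter cross-modal attention block that attends in both
# directions (audio -> video and video -> audio) and concatenates
# the pooled results.
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # One attention module reused for both directions keeps the
        # parameter count low (an assumed reading of "low-parameter").
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_a, dim), video: (batch, T_v, dim)
        # Audio queries attend over video features, and vice versa.
        a2v, _ = self.attn(query=audio, key=video, value=video)
        v2a, _ = self.attn(query=video, key=audio, value=audio)
        audio_enh = self.norm_a(audio + a2v)  # residual connection + norm
        video_enh = self.norm_v(video + v2a)
        # Pool over time and concatenate the two enhanced views.
        return torch.cat(
            [audio_enh.mean(dim=1), video_enh.mean(dim=1)], dim=-1
        )  # (batch, 2 * dim)

if __name__ == "__main__":
    fusion = CrossModalAttentionFusion(dim=256, num_heads=4)
    audio = torch.randn(2, 50, 256)   # e.g. 50 audio frames
    video = torch.randn(2, 16, 256)   # e.g. 16 video frames
    print(fusion(audio, video).shape)  # torch.Size([2, 512])
```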