Abstract: Cross-modal emotion recognition aims to perceive human emotions from data in different modalities. Most current research still focuses on a single modality, overlooking the complementary information carried by the others. This paper proposes a cross-modal emotion recognition method based on knowledge distillation that improves recognition accuracy by integrating information from the speech and text modalities. Specifically, the method uses the pre-trained text model RoBERTa as the teacher and transfers its high-quality textual emotion representations to a lightweight speech student model through feature distillation. In addition, a bidirectional distillation objective is employed, allowing the teacher and student models to transfer knowledge to each other. Experimental results show that the proposed method achieves superior performance on the IEMOCAP and MELD datasets.
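To make the training objective concrete, the following is a minimal PyTorch sketch of feature distillation combined with a bidirectional logit-level term, in the spirit of the method described above. All module names, dimensions (e.g., 80-dimensional mel inputs, RoBERTa's 768-dimensional hidden size), the temperature, and the loss weighting are illustrative assumptions; the abstract does not specify implementation details.

```python
# Sketch of feature distillation from a text teacher to a speech student,
# plus a bidirectional KL term between teacher and student logits.
# Dimensions and hyperparameters are assumptions, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechStudent(nn.Module):
    """Lightweight speech encoder standing in for the student model."""
    def __init__(self, in_dim=80, hid_dim=256, num_classes=6):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hid_dim, batch_first=True)
        self.classifier = nn.Linear(hid_dim, num_classes)

    def forward(self, x):
        _, h = self.encoder(x)      # h: (1, batch, hid_dim)
        feat = h.squeeze(0)         # utterance-level speech feature
        return feat, self.classifier(feat)

def distillation_losses(s_feat, s_logits, t_feat, t_logits, proj, tau=2.0):
    """Feature distillation (MSE on projected student features) plus a
    bidirectional KL divergence so each model's logits inform the other."""
    feat_loss = F.mse_loss(proj(s_feat), t_feat)
    kl_s_to_t = F.kl_div(F.log_softmax(s_logits / tau, dim=-1),
                         F.softmax(t_logits / tau, dim=-1),
                         reduction="batchmean") * tau ** 2
    kl_t_to_s = F.kl_div(F.log_softmax(t_logits / tau, dim=-1),
                         F.softmax(s_logits / tau, dim=-1),
                         reduction="batchmean") * tau ** 2
    return feat_loss, kl_s_to_t + kl_t_to_s

# Usage with random tensors standing in for mel features and for
# teacher outputs (768 matches roberta-base's hidden size).
student = SpeechStudent()
proj = nn.Linear(256, 768)              # map student features to teacher space
speech = torch.randn(4, 100, 80)        # (batch, frames, mel bins)
t_feat, t_logits = torch.randn(4, 768), torch.randn(4, 6)
s_feat, s_logits = student(speech)
feat_loss, kl_loss = distillation_losses(s_feat, s_logits, t_feat, t_logits, proj)
loss = F.cross_entropy(s_logits, torch.randint(0, 6, (4,))) + feat_loss + kl_loss
loss.backward()
```

In a full bidirectional setup, the teacher's logits would also receive gradients so both models adapt to each other; here the teacher outputs are fixed tensors purely for illustration.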