Abstract: A speech emotion recognition method based on amplitude filtering and a hierarchical feature fusion strategy is proposed to address the low accuracy of speech emotion recognition on multi-language joint datasets. The method first applies amplitude filtering to the amplitude distribution of the Mel spectrogram, enlarging the differences between similar amplitudes so that high-frequency components receive a strong gain and low-frequency components a weak gain. At the same time, by multiplying probabilities, it reduces the differences between widely separated amplitudes in the Mel spectrogram, revealing detail in the mid-frequency components. On this basis, the method uses rectangular convolution kernels to extract the temporal dynamic features of the audio signal, generating dynamic feature maps of the Mel spectrogram that serve as inputs to the hierarchical feature fusion strategy. This strategy compresses the feature maps to extract temporal dynamic features at different scales and from different depths. The proposed method achieves a classification accuracy of 84.44% on the multi-language joint dataset CER.
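To make the front end of the pipeline concrete, a minimal sketch follows. The abstract does not give the filtering function, the probability-multiplication formula, or the kernel shape, so the sigmoid gain curve, the p(1 - p) probability product, the 1x9 rectangular kernel, and names such as amplitude_filter and temporal_dynamics below are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of amplitude filtering on a Mel spectrogram followed
# by a rectangular temporal convolution; all formulas here are assumptions.
import numpy as np
import librosa
from scipy.signal import convolve2d

def amplitude_filter(mel_db, k=8.0):
    """Illustrative amplitude filtering (assumed form, not the paper's)."""
    # Normalize amplitudes to [0, 1] so they can be read as probabilities.
    p = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    # A steep sigmoid enlarges differences between similar amplitudes:
    # high-amplitude bins get a strong gain, low-amplitude bins a weak one.
    expanded = 1.0 / (1.0 + np.exp(-k * (p - 0.5)))
    # "Probability multiplication" (assumption): the product p * (1 - p) of
    # complementary probabilities peaks at mid-range amplitudes, shrinking
    # the gap between very high and very low bins and surfacing mid detail.
    mid = 4.0 * expanded * (1.0 - expanded)
    # Blend the gain-expanded and mid-emphasised maps into one output.
    return 0.5 * (expanded + mid)

def temporal_dynamics(mel, width=9):
    """Rectangular (1 x width) kernel along the time axis, yielding a
    dynamic feature map as the deviation from the local temporal mean."""
    kernel = np.ones((1, width)) / width
    smoothed = convolve2d(mel, kernel, mode="same", boundary="symm")
    return mel - smoothed

# Usage on a bundled example clip.
y, sr = librosa.load(librosa.ex("trumpet"))
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
mel_db = librosa.power_to_db(mel, ref=np.max)
dyn = temporal_dynamics(amplitude_filter(mel_db))
print(dyn.shape)  # (64, n_frames) dynamic feature map
```

In this sketch the dynamic feature map would then feed the hierarchical feature fusion stage; the abstract only states that this stage compresses the maps at multiple scales and depths, so no fusion code is attempted here.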