Abstract: Rumors that combine text and images are more misleading and harmful than text-only rumors, making multimodal rumor detection an emerging research focus. However, most existing methods simply concatenate unimodal features without considering inter-modality relationships. This paper therefore proposes a multimodal rumor detection method that uses dual pre-trained transformers (BERT and ViT) to extract textual and visual features, respectively, fuses them with a cross-attention mechanism, and combines the result with text semantic features extracted by a text CNN. The resulting multimodal fusion features are then fed into a rumor detection module for classification. Because the pre-trained encoders have been trained on large-scale datasets, they provide stronger feature representations. By modeling the relationship between modalities, the method fuses multimodal features more effectively and improves rumor detection performance. As the core of the model, the cross-attention mechanism dynamically adjusts the weights of words by combining information from the text and image modalities. Experiments on two public benchmark datasets (Twitter and Weibo) validate the effectiveness of the proposed method for rumor detection.
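
The following is a minimal sketch of the architecture described above, assuming a PyTorch implementation with Hugging Face encoders; the class name, hidden sizes, kernel sizes, and pooling choices are illustrative assumptions rather than the authors' released code.

```python
# Hypothetical sketch: BERT + ViT encoders, cross attention (text attends to
# image patches), a text CNN branch, and a fusion classifier.
import torch
import torch.nn as nn
from transformers import BertModel, ViTModel

class CrossModalRumorDetector(nn.Module):
    def __init__(self, hidden=768, num_heads=8, kernel_sizes=(3, 4, 5), n_filters=100):
        super().__init__()
        # Dual pre-trained encoders: BERT for text tokens, ViT for image patches.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        # Cross attention with text tokens as queries and image patches as keys/values.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        # Text CNN branch over BERT token embeddings for local semantic features.
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes]
        )
        fusion_dim = hidden + n_filters * len(kernel_sizes)
        self.classifier = nn.Sequential(
            nn.Linear(fusion_dim, 256), nn.ReLU(), nn.Linear(256, 2)  # rumor / non-rumor
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        text = self.bert(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        image = self.vit(pixel_values=pixel_values).last_hidden_state
        # Image-aware text representation via cross attention, pooled over tokens.
        attended, _ = self.cross_attn(query=text, key=image, value=image)
        cross_feat = attended.mean(dim=1)
        # Text CNN features: convolve over the token dimension, max-pool each map.
        t = text.transpose(1, 2)                       # (B, hidden, seq_len)
        cnn_feat = torch.cat([conv(t).max(dim=2).values for conv in self.convs], dim=1)
        # Concatenate cross-modal and text-CNN features, then classify.
        fused = torch.cat([cross_feat, cnn_feat], dim=1)
        return self.classifier(fused)                  # logits for rumor classification
```

In this sketch the cross-attention weights determine how much each word is influenced by the image patches, which corresponds to the abstract's claim that word weights are adjusted dynamically by combining text and image information.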