Abstract: The high-precision diagnostic performance of the vision Transformer network depends on adequate training data. This work uses the strength of convolutional networks in extracting local features to construct a feature extraction layer that describes both local and global features of faults, improving the noise robustness of the diagnostic model. First, a convolutional network module is introduced to extract local fault features by converting the raw vibration signal into feature vectors that the Transformer network can accept directly. Then, these local features are combined with the global information generated by the multi-head self-attention mechanism of the Transformer network to construct feature vectors that describe both local and global features of the fault. Finally, in the prediction layer of the Transformer network, an efficient channel attention mechanism automatically filters the feature vectors according to their contribution. Fault diagnosis results on the Case Western Reserve University (CWRU) bearing dataset show that the improved Transformer bearing fault diagnosis model achieves an accuracy of 90.21% under noise interference at a signal-to-noise ratio of -4 dB, a 13.2% improvement in accuracy over the original Transformer model, demonstrating excellent diagnostic performance in noisy environments.
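
To make the -4 dB test condition concrete, the following minimal sketch shows one common way to corrupt a vibration signal at a target signal-to-noise ratio, using SNR_dB = 10 log10(P_signal / P_noise). It assumes additive white Gaussian noise, which the abstract does not specify. The function names (`signal_power`, `add_noise_at_snr`, `measured_snr_db`) and the synthetic sine wave standing in for a bearing signal are hypothetical illustrations, not the authors' procedure.

```python
import math
import random


def signal_power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(signal, snr_db, rng=None):
    """Return (noisy_signal, noise) with the noise scaled to the target SNR.

    Assumption: additive white Gaussian noise; the source does not state
    which noise model was used.
    """
    rng = rng or random.Random()
    # Noise power implied by SNR_dB = 10 * log10(P_signal / P_noise).
    p_noise = signal_power(signal) / (10 ** (snr_db / 10))
    raw = [rng.gauss(0.0, 1.0) for _ in signal]
    # Rescale so the empirical noise power equals the target exactly.
    scale = math.sqrt(p_noise / signal_power(raw))
    noise = [scale * n for n in raw]
    noisy = [s + n for s, n in zip(signal, noise)]
    return noisy, noise


def measured_snr_db(signal, noise):
    """Empirical SNR in decibels for a known signal/noise pair."""
    return 10 * math.log10(signal_power(signal) / signal_power(noise))


# Synthetic stand-in for a vibration segment (not CWRU data).
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 50 * t / 1000) for t in range(2048)]
noisy, noise = add_noise_at_snr(clean, -4.0, rng)
```

At -4 dB the noise power is 10^0.4, roughly 2.5 times the signal power, so the fault signature is buried under noise stronger than itself. That is why this setting is a demanding test of noise robustness.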