Abstract: To improve 3D human pose prediction from monocular 2D poses, we propose a model that combines a Transformer with semantic graph convolution. The model consists of four components: a Transformer encoding network, a semantic graph convolutional encoding network, a pose coordinate prediction module, and a pose coordinate error regression module. The Transformer encoding network captures global features across joints to strengthen pose-level correlations, while the semantic graph convolutional encoding network extracts local joint features to strengthen correlations between joints. The pose coordinate prediction and error regression modules fuse the global and local joint features, improving 3D pose accuracy. Experimental results on the Human3.6M dataset show that the model achieves an MPJPE of 32.7 mm and a PA-MPJPE of 25.9 mm, improvements of 3.82% and 1.14%, respectively, over the baseline method.
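For readers unfamiliar with the two reported metrics, the following is a minimal NumPy sketch of how MPJPE and PA-MPJPE are commonly computed for a single pose. It is not the authors' evaluation code: the function names `mpjpe`, `procrustes_align`, and `pa_mpjpe`, and the assumed `(J, 3)` array layout in millimetres, are illustrative choices.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))


def procrustes_align(pred, gt):
    """Align pred (J, 3) to gt (J, 3) with a similarity transform
    (scale, rotation, translation) found by orthogonal Procrustes analysis."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g            # center both poses
    U, S, Vt = np.linalg.svd(p.T @ g)        # SVD of the 3x3 cross-covariance
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                       # optimal rotation
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()  # optimal isotropic scale
    return scale * p @ R.T + mu_g


def pa_mpjpe(pred, gt):
    """MPJPE after rigid alignment (Procrustes-aligned MPJPE)."""
    return mpjpe(procrustes_align(pred, gt), gt)


# Hypothetical example: 17 random joints, prediction off by a similarity transform.
rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3)) * 100.0
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
pred = 2.0 * gt @ Rz.T + np.array([10.0, -5.0, 3.0])
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```

In this toy case the prediction differs from the ground truth only by a scale, rotation, and translation, so PA-MPJPE is numerically zero while MPJPE stays large. This is why PA-MPJPE is usually read as a measure of pose shape error independent of global position and orientation.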