Yang Liu, Zhiyong Zhang
March 2026
Pattern Recognition Vol. 171
DOI: 10.1016/j.patcog.2025.112239
The paper addresses two limitations in 3D human pose estimation that contribute to depth ambiguity: transformers underutilize spatio-temporal body structure features, and graph convolutional networks model spatio-temporal interactions at inadequate granularity.
The study introduces the Spatio-Temporal GraphFormer (STGFormer) framework, which combines a Spatio-Temporal Criss-cross Graph attention mechanism with a dual-path Modulated Hop-wise Regular GCN to process the temporal and spatial dimensions independently.
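The criss-cross idea of attending along the spatial (joints) and temporal (frames) axes as two independent paths can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' implementation: the function names (`axis_attention`, `criss_cross_st_attention`), the simple additive fusion of the two paths, and the single-head unprojected attention are all assumptions made for illustration.

```python
# Hypothetical sketch of criss-cross spatio-temporal attention:
# the spatial path attends over joints within each frame, and the
# temporal path attends over frames for each joint; the two paths
# are computed independently and fused by addition (an assumption).
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    # x: (T, J, C) pose feature tensor; attend along `axis` (0=time, 1=joints).
    x_m = np.moveaxis(x, axis, -2)                    # (..., L, C)
    scores = x_m @ np.swapaxes(x_m, -1, -2)           # (..., L, L) dot-product scores
    attn = softmax(scores / np.sqrt(x.shape[-1]), axis=-1)
    out = attn @ x_m                                  # weighted sum over the axis
    return np.moveaxis(out, -2, axis)                 # restore (T, J, C) layout

def criss_cross_st_attention(x):
    spatial = axis_attention(x, axis=1)   # joints within each frame
    temporal = axis_attention(x, axis=0)  # each joint across frames
    return spatial + temporal             # fuse the two independent paths

T, J, C = 9, 17, 32   # frames, joints (Human3.6M uses 17), feature channels
x = np.random.randn(T, J, C)
y = criss_cross_st_attention(x)
print(y.shape)  # (9, 17, 32)
```

Compared with full spatio-temporal attention over all T x J tokens, factoring attention into the two axes reduces the score matrices from (TJ)^2 to T·J^2 + J·T^2 entries, which is the usual motivation for criss-cross designs.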
The proposed method achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets, demonstrating superior capacity for learning body structure and capturing long-range dependencies compared with transformer-based, GCN-based, and hybrid methods.