Yang Liu, Zhiyong Zhang
March 2026
Pattern Recognition Vol. 171
DOI: 10.1016/j.patcog.2025.112239
The paper addresses two limitations in 3D human pose estimation that contribute to depth ambiguity: transformers underutilize spatio-temporal body structure features, and graph convolutional networks model spatio-temporal interactions at inadequate granularity.
The study introduces the Spatio-Temporal GraphFormer (STGFormer) framework, which combines a Spatio-Temporal Criss-cross Graph attention mechanism with a dual-path Modulated Hop-wise Regular GCN to process the temporal and spatial dimensions independently.
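The criss-cross idea of attending along the spatial (joints) and temporal (frames) axes as two independent paths can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' implementation: the function names (`axis_attention`, `criss_cross_st_attention`), the simple additive fusion of the two paths, and the single-head unprojected attention are all assumptions made for illustration.

```python
# Hypothetical sketch of criss-cross spatio-temporal attention:
# the spatial path attends over joints within each frame, and the
# temporal path attends over frames for each joint; the two paths
# are computed independently and fused by addition (an assumption).
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    # x: (T, J, C) pose feature tensor; attend along `axis` (0=time, 1=joints).
    x_m = np.moveaxis(x, axis, -2)                    # (..., L, C)
    scores = x_m @ np.swapaxes(x_m, -1, -2)           # (..., L, L) dot-product scores
    attn = softmax(scores / np.sqrt(x.shape[-1]), axis=-1)
    out = attn @ x_m                                  # weighted sum over the axis
    return np.moveaxis(out, -2, axis)                 # restore (T, J, C) layout

def criss_cross_st_attention(x):
    spatial = axis_attention(x, axis=1)   # joints within each frame
    temporal = axis_attention(x, axis=0)  # each joint across frames
    return spatial + temporal             # fuse the two independent paths

T, J, C = 9, 17, 32   # frames, joints (Human3.6M uses 17), feature channels
x = np.random.randn(T, J, C)
y = criss_cross_st_attention(x)
print(y.shape)  # (9, 17, 32)
```

Compared with full spatio-temporal attention over all T x J tokens, factoring attention into the two axes reduces the score matrices from (TJ)^2 to T·J^2 + J·T^2 entries, which is the usual motivation for criss-cross designs.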
The proposed method achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets, demonstrating superior capacity for learning body structure and capturing long-range dependencies compared with transformer-based, GCN-based, and hybrid methods.