Parallel Multi-scale Spatio-temporal Graph Convolutional Network for 3D Human Pose Estimation

Authors: Yang Honghong, Liu Hongxi, Zhang Yumei, Wu Xiaojun

CLC number: TP391

Funding: National Natural Science Foundation of China (61907028, 11872036); Shaanxi Province Youth Science and Technology New Star Project (2021KJXX-91); Key Laboratory Project of the Ministry of Culture and Tourism (2023-02, 2022-13); Natural Science Foundation of Shaanxi Province, General Program (2024JC-YBMS-503)
    Abstract:

    To address the problem that human pose estimation (HPE) methods based on graph convolutional networks (GCNs) cannot sufficiently aggregate the spatio-temporal features of skeleton joints, which limits the extraction of discriminative features, a parallel multi-scale spatio-temporal graph convolutional network (PMST-GNet) is built to improve the performance of 3D HPE. First, a diagonally dominant spatio-temporal attention graph convolution (DDA-STGConv) is designed to construct a cross-domain spatio-temporal adjacency matrix and to model skeleton joint information under both self-constraints and attention-mechanism constraints, thereby enhancing the information interaction among nodes. Then, a graph topology aggregation function is devised to construct different graph topologies, and a parallel multi-scale sub-graph network module (PM-SubGNet) is built with DDA-STGConv as its basic unit. Finally, to better extract the contextual information of skeleton joints, a multi-scale feature cross-fusion block (MFEB) is designed, through which multi-scale information is exchanged among the parallel sub-graph networks, improving the feature representation ability of the GCN. Comparative experiments on the mainstream 3D pose estimation datasets Human3.6M and MPI-INF-3DHP show that the proposed PMST-GNet performs well in 3D HPE and outperforms current mainstream GCN-based methods such as Sem-GCN, GraphSH, and UGCN.
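
To make the core mechanism easier to follow, the sketch below gives one plausible PyTorch reading of a DDA-STGConv-style layer: a fixed spatio-temporal (cross-domain) adjacency over a short window of frames is combined with learned attention, and the mixing matrix is row-normalised so that the self-connection keeps the dominant diagonal weight. This is an illustrative assumption only, not the authors' implementation: the class name STAttentionGConv, the chain-shaped stand-in skeleton, the mixing weight alpha, and all tensor shapes are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class STAttentionGConv(nn.Module):
    """Hypothetical DDA-STGConv-style layer: a fixed spatio-temporal adjacency is
    mixed with learned self-attention, and the mixing matrix is kept diagonally dominant."""

    def __init__(self, in_dim, out_dim, num_joints, window=3, alpha=0.5):
        super().__init__()
        self.alpha = alpha                          # weight kept on the diagonal (self) term
        n = num_joints * window                     # joints of `window` frames form one graph
        adj = torch.zeros(n, n)
        # Spatial edges: a simple chain over joints inside each frame (stand-in for the real skeleton).
        for f in range(window):
            for j in range(num_joints - 1):
                a, b = f * num_joints + j, f * num_joints + j + 1
                adj[a, b] = adj[b, a] = 1.0
        # Temporal edges: the same joint in adjacent frames (the cross-domain part of the adjacency).
        for f in range(window - 1):
            for j in range(num_joints):
                a, b = f * num_joints + j, (f + 1) * num_joints + j
                adj[a, b] = adj[b, a] = 1.0
        self.register_buffer("adj", adj)
        self.q = nn.Linear(in_dim, out_dim)
        self.k = nn.Linear(in_dim, out_dim)
        self.v = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # x: (batch, window * num_joints, in_dim) -- joint features of consecutive frames stacked as one graph.
        q, k = self.q(x), self.k(x)
        att = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        n = x.size(1)
        eye = torch.eye(n, device=x.device)
        off = (att + self.adj) * (1.0 - eye)                       # attention + adjacency, diagonal removed
        off = off / off.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # row-normalise the neighbour weights
        mix = self.alpha * eye + (1.0 - self.alpha) * off          # alpha >= 0.5 keeps every row diagonally dominant
        return F.relu(mix @ self.v(x))

# Example with hypothetical shapes: 17 Human3.6M joints over a 3-frame window, 2D inputs lifted to 64-D features.
layer = STAttentionGConv(in_dim=2, out_dim=64, num_joints=17, window=3)
features = layer(torch.randn(8, 3 * 17, 2))                       # -> (8, 51, 64)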

    References
    [1] Zhang Y, Wen GZ, Mi SY, Zhang ML, Geng X. Overview on 2D human pose estimation based on deep learning. Ruan Jian Xue Bao/Journal of Software, 2022, 33(11): 4173–4191 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/6390.htm
    [2] Ding J, Shu XB, Huang P, Yao YZ, Song Y. Multimodal and multi-granularity graph convolutional networks for elderly daily activity recognition. Ruan Jian Xue Bao/Journal of Software, 2023, 34(5): 2350–2364 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/6439.htm
    [3] Yang HH, Liu HX, Zhang YM, Wu XJ. FMR-GNet: Forward mix-hop spatial-temporal residual graph network for 3D pose estimation. Chinese Journal of Electronics, 2024, 33(6): 1–14.
    [4] Moon G, Lee KM. I2L-MeshNet: Image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image. In: Proc. of the 16th European Conf. on Computer Vision (ECCV). Glasgow: Springer, 2020. 752–768. [doi: 10.1007/978-3-030-58571-6_44]
    [5] Zhao L, Peng X, Tian Y, Kapadia M, Metaxas DN. Semantic graph convolutional networks for 3D human pose regression. In: Proc. of the 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). Long Beach: IEEE, 2019. 3420–3430.
    [6] Liu JF, Rojas J, Li YH, Liang ZJ, Guan YS, Xi N, Zhu HF. A graph attention spatio-temporal convolutional network for 3D human pose estimation in video. In: Proc. of the 2021 IEEE Int’l Conf. on Robotics and Automation. Xi’an: IEEE, 2021. 3374–3380.
    [7] Cai YJ, Ge LH, Liu J, Cai JF, Cham TJ, Yuan JS, Thalmann NM. Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In: Proc. of the 2019 IEEE/CVF Int’l Conf. on Computer Vision (ICCV). Seoul: IEEE, 2019. 2272–2281. [doi: 10.1109/ICCV.2019.00236]
    [8] Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. In: Proc. of the 5th Int’l Conf. on Learning Representations. Toulon: OpenReview.net, 2017.
    [9] Wu YP, Kong DH, Wang SF, Li JH, Yin BC. HPGCN: Hierarchical poselet-guided graph convolutional network for 3D pose estimation. Neurocomputing, 2022, 487: 243–256.
    [10] Wang WG, Shen JB, Jia YD. Review of visual attention detection. Ruan Jian Xue Bao/Journal of Software, 2019, 30(2): 416–439 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5636.htm
    [11] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proc. of the 31st Int’l Conf. on Neural Information Processing Systems. Long Beach: Curran Associates Inc., 2017. 6000–6010.
    [12] Pavllo D, Feichtenhofer C, Grangier D, Auli M. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: Proc. of the 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). Long Beach: IEEE, 2019. 7745–7754. [doi: 10.1109/CVPR.2019.00794]
    [13] Sun K, Xiao B, Liu D, Wang JD. Deep high-resolution representation learning for human pose estimation. In: Proc. of the 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019. 5686–5696. [doi: 10.1109/CVPR.2019.00584]
    [14] Ionescu C, Papava D, Olaru V, Sminchisescu C. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2014, 36(7): 1325–1339.
    [15] Mehta D, Rhodin H, Casas D, Fua P, Sotnychenko O, Xu WP, Theobalt C. Monocular 3D human pose estimation in the wild using improved CNN supervision. In: Proc. of the 2017 Int’l Conf. on 3D Vision. Qingdao: IEEE, 2017. 506–516.
    [16] Zou ZM, Liu KK, Wang L, Tang W. High-order graph convolutional networks for 3D human pose estimation. In: Proc. of the 31st British Machine Vision Conf. BMVA Press, 2020.
    [17] Li H, Shi BW, Dai WR, Chen YB, Wang BT, Sun Y, Guo M, Li CL, Zou JN, Xiong HK. Hierarchical graph networks for 3D human pose estimation. In: Proc. of the 32nd British Machine Vision Conf. (BMVC). BMVA Press, 2021. 387.
    [18] Chen XP, Lin KY, Liu WT, Qian C, Lin L. Weakly-supervised discovery of geometry-aware representation for 3D human pose estimation. In: Proc. of the 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). Long Beach: IEEE, 2019. 10887–10896. [doi: 10.1109/CVPR.2019.01115]
    [19] Chen YL, Wang ZC, Peng YX, Zhang ZQ, Yu G, Sun J. Cascaded pyramid network for multi-person pose estimation. In: Proc. of the 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018. 7103–7112.
    [20] Chen TL, Fang C, Shen XH, Zhu YH, Chen ZL, Luo JB. Anatomy-aware 3D human pose estimation with bone-based pose decomposition. IEEE Trans. on Circuits and Systems for Video Technology, 2022, 32(1): 198–209.
    [21] Lin JH, Lee GH. Trajectory space factorization for deep video-based 3D human pose estimation. In: Proc. of the 30th British Machine Vision Conf. Cardiff: BMVA Press, 2019. 101.
    [22] Li SC, Ke L, Pratama K, Tai YW, Tang CK, Cheng KT. Cascaded deep monocular 3D human pose estimation with evolutionary training data. In: Proc. of the 2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). Seattle: IEEE, 2020. 6172–6182.
    [23] Zheng C, Zhu SJ, Mendieta M, Yang TJN, Chen C, Ding ZM. 3D human pose estimation with spatial and temporal Transformers. In: Proc. of the 2021 IEEE/CVF Int’l Conf. on Computer Vision (ICCV). Montreal: IEEE, 2021. 11636–11645.
    [24] Xu TH, Takano W. Graph stacked hourglass networks for 3D human pose estimation. In: Proc. of the 2021 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). Nashville: IEEE, 2021. 16100–16109. [doi: 10.1109/CVPR46437.2021.01584]
    [25] Martinez J, Hossain R, Romero J, Little JJ. A simple yet effective baseline for 3D human pose estimation. In: Proc. of the 2017 IEEE Int’l Conf. on Computer Vision (ICCV). Venice: IEEE, 2017. 2659–2668. [doi: 10.1109/ICCV.2017.288]
    [26] Liu KK, Ding RQ, Zou ZM, Wang L, Tang W. A comprehensive study of weight sharing in graph networks for 3D human pose estimation. In: Proc. of the 16th European Conf. on Computer Vision 2020 (ECCV). Glasgow: Springer, 2020. 318–334. [doi: 10.1007/978-3-030-58607-2_19]
    [27] Zeng AL, Sun X, Yang L, Zhao NX, Liu MH, Xu Q. Learning skeletal graph neural networks for hard 3D pose estimation. In: Proc. of the 2021 IEEE/CVF Int’l Conf. on Computer Vision (ICCV). Montreal: IEEE, 2021. 11416–11425. [doi: 10.1109/ICCV48922.2021.01124]
    [28] Li WH, Liu H, Tang H, Wang PC, van Gool L. MHFormer: Multi-hypothesis transformer for 3D human pose estimation. In: Proc. of the 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). New Orleans: IEEE, 2022. 13137–13146.
    [29] Shan WK, Liu ZH, Zhang XF, Wang SS, Ma SW, Gao W. P-STMO: Pre-trained spatial temporal many-to-one model for 3D human pose estimation. In: Proc. of the 17th European Conf. on Computer Vision (ECCV). Tel Aviv: Springer, 2022. 461–478. [doi: 10.1007/978-3-031-20065-6_27]
    [30] Bai GH, Luo YM, Pan XL, Wang J, Guo JM. Real-time 3D human pose estimation without skeletal a priori structures. Image and Vision Computing, 2023, 132: 104649.
    [31] Li H, Shi BW, Dai WR, Zheng HW, Wang BT, Sun Y, Guo M, Li CL, Zou JN, Xiong HK. Pose-oriented Transformer with uncertainty-guided refinement for 2D-to-3D human pose estimation. In: Proc. of the 37th AAAI Conf. on Artificial Intelligence. Washington: AAAI, 2023. 1296–1304. [doi: 10.1609/aaai.v37i1.25213]
    [32] Han CC, Yu X, Gao CX, Sang N, Yang Y. Single image based 3D human pose estimation via uncertainty learning. Pattern Recognition, 2022, 132: 108934.
    [33] Wang JB, Yan SJ, Xiong YJ, Lin DH. Motion guided 3D pose estimation from videos. In: Proc. of the 16th European Conf. on Computer Vision (ECCV). Glasgow: Springer, 2020. 764–780. [doi: 10.1007/978-3-030-58601-0_45]
    [34] Yang HH, Liu HX, Zhang YM, Wu XJ. Hierarchical parallel multi-scale graph network for 3D human pose estimation. Applied Soft Computing, 2023, 140: 110267.
Cite this article:

Yang HH, Liu HX, Zhang YM, Wu XJ. Parallel multi-scale spatio-temporal graph convolutional network for 3D human pose estimation. Ruan Jian Xue Bao/Journal of Software, 2025, 36(5): 2151–2166 (in Chinese with English abstract).

Article history
  • Received: 2022-11-14
  • Revised: 2023-07-20
  • Published online: 2024-06-20