Multi-person 3D Pose Estimation Using Human-and-scene Contexts

Author: He Jianhang, Sun Junyao, Liu Qiong

Funding: Natural Science Foundation of Guangdong Province (2021A1515011349); National Natural Science Foundation of China (61976094)

Abstract:

Depth ambiguity is a major challenge in multi-person 3D pose estimation from a single image, and extracting image context holds great potential for alleviating it. Most top-down approaches model keypoint relationships on the basis of human detection; because the human bounding box is coarse-grained and contains a large proportion of background noise, keypoints are easily shifted or mismatched, and the reliability of absolute depth estimated from the human scale factor also suffers. Bottom-up approaches directly detect all human keypoints in the image and then recover each 3D pose one by one; although they can capture scene context explicitly, they are at a disadvantage in relative depth estimation. This study proposes a new two-branch network in which the top-down branch extracts human context from keypoint region proposals and the bottom-up branch extracts scene context from 3D space. A human-context extraction method with noise suppression is proposed: keypoint region proposals are modeled to describe human targets, and pose-related dynamic sparse keypoint relationships are modeled to prune weak connections and reduce noise propagation. A method for extracting scene context from a bird's-eye view is also proposed: the image's depth features are modeled and mapped onto the bird's-eye-view plane to obtain the layout of human positions in 3D space. A fusion network of human and scene contexts is then designed to predict absolute human depth. Experiments on the public datasets MuPoTS-3D and Human3.6M show that, compared with state-of-the-art models of the same kind, the proposed HSC-Pose improves the relative and absolute 3D keypoint position accuracy by at least 2.2% and 0.5%, respectively, and reduces the mean root keypoint position error by at least 4.2 mm.
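
The abstract describes the architecture only at a high level. As a rough illustration of how a two-branch human-and-scene context fusion for absolute root-depth prediction could be wired together, a minimal PyTorch sketch follows; every module name, tensor shape, and layer size here is an assumption made for illustration and is not the authors' actual HSC-Pose implementation.

    # Minimal sketch of a two-branch human/scene context fusion for absolute
    # root-depth regression. All shapes and channel sizes are illustrative
    # assumptions, not the HSC-Pose code.
    import torch
    import torch.nn as nn

    class HumanContextBranch(nn.Module):
        """Top-down branch: embeds per-person features pooled from keypoint region proposals."""
        def __init__(self, feat_dim=256, ctx_dim=128):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(feat_dim, ctx_dim), nn.ReLU(inplace=True),
                nn.Linear(ctx_dim, ctx_dim),
            )

        def forward(self, person_feats):              # (N_persons, feat_dim)
            return self.mlp(person_feats)             # (N_persons, ctx_dim)

    class SceneContextBranch(nn.Module):
        """Bottom-up branch: encodes a bird's-eye-view (BEV) layout map built from image depth features."""
        def __init__(self, bev_channels=64, ctx_dim=128):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(bev_channels, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(128, ctx_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1),
            )

        def forward(self, bev_map, n_persons):        # bev_map: (1, C, H, W)
            scene = self.encoder(bev_map).flatten(1)  # (1, ctx_dim)
            return scene.expand(n_persons, -1)        # broadcast one scene vector per person

    class DepthFusionHead(nn.Module):
        """Fuses human and scene context vectors and regresses absolute root depth."""
        def __init__(self, ctx_dim=128):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(2 * ctx_dim, ctx_dim), nn.ReLU(inplace=True),
                nn.Linear(ctx_dim, 1),
            )

        def forward(self, human_ctx, scene_ctx):
            fused = torch.cat([human_ctx, scene_ctx], dim=-1)
            return self.head(fused).squeeze(-1)       # (N_persons,) absolute depths

    if __name__ == "__main__":
        n_persons = 3
        person_feats = torch.randn(n_persons, 256)    # pooled keypoint-region features
        bev_map = torch.randn(1, 64, 64, 64)          # depth features mapped to the BEV plane
        depth = DepthFusionHead()(HumanContextBranch()(person_feats),
                                  SceneContextBranch()(bev_map, n_persons))
        print(depth.shape)                            # torch.Size([3])

In the actual model the scene vector would presumably be sampled at each person's BEV location rather than globally pooled; the sketch only shows the overall fusion pattern described in the abstract.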

Cite this article:

He JH, Sun JY, Liu Q. Multi-person 3D pose estimation using human-and-scene contexts. Journal of Software, 2024, 35(4): 2039-2054 (in Chinese with English abstract).

History:
  • Received: 2022-05-31
  • Revised: 2022-08-16
  • Published online: 2023-07-28
  • Publication date: 2024-04-06