深度歧义是单帧图像多人3D姿态估计面临的重要挑战, 提取图像上下文对缓解深度歧义极具潜力. 自顶向下方法大多基于人体检测建模关键点关系, 人体包围框粒度粗背景噪声占比较大, 极易导致关键点偏移或误匹配, 还将影响基于人体尺度因子估计绝对深度的可靠性. 自底向上的方法直接检出图像中的人体关键点再逐一恢复3D人体姿态. 虽然能够显式获取场景上下文, 但在相对深度估计方面处于劣势. 提出新的双分支网络, 自顶向下分支基于关键点区域提议提取人体上下文, 自底向上分支基于三维空间提取场景上下文. 提出带噪声抑制的人体上下文提取方法, 通过建模“关键点区域提议”描述人体目标, 建模姿态关联的动态稀疏关键点关系剔除弱连接减少噪声传播. 提出从鸟瞰视角提取场景上下文的方法, 通过建模图像深度特征并映射鸟瞰平面获得三维空间人体位置布局; 设计人体和场景上下文融合网络预测人体绝对深度. 在公开数据集MuPoTS-3D和Human3.6M上的实验结果表明: 较同类先进模型, 所提模型HSC-Pose的相对和绝对3D关键点位置精度至少提高2.2%和0.5%; 平均根关键点位置误差至少降低4.2 mm.
Depth ambiguity is an important challenge for multi-person three-dimensional (3D) pose estimation of single-frame images, and extracting contexts from an image has great potential for alleviating depth ambiguity. Current top-down approaches usually model key point relationships based on human detection, which not only easily results in key point shifting or mismatching but also affects the reliability of absolute depth estimation using human scale factor because of a coarse-grained human bounding box with large background noise. Bottom-up approaches directly detect human key points from an image and then restore the 3D human pose one by one. However, the approaches are at a disadvantage in relative depth estimation although the scene context can be obtained explicitly. This study proposes a new two-branch network, in which human context based on key point region proposal and scene context based on 3D space are extracted by top-down and bottom-up branches, respectively. The human context extraction method with noise resistance is proposed to describe the human by modeling key point region proposal. The dynamic sparse key point relationship for pose association is modeled to eliminate weak connections and reduce noise propagation. A scene context extraction method from a bird’s-eye-view is proposed. The human position layout in 3D space is obtained by modeling the image’s depth features and mapping them to a bird’s-eye-view plane. A network fusing human and scene contexts is designed to predict absolute human depth. The experiments are carried out on public datasets, namely MuPoTS-3D and Human3.6M, and results show that compared with those by the state-of-the-art models, the relative and absolute position accuracies of 3D key points by the proposed HSC-Pose are improved by at least 2.2% and 0.5%, respectively, and the position error of mean roots of the key points is reduced by at least 4.2 mm.