[Keywords]
[Abstract]
Visually guided binaural audio generation is an important multimodal learning task with broad application value. Given visual information and a mono audio signal, the goal is to generate binaural audio that is audiovisually consistent. Existing methods produce unsatisfactory results because they underuse audiovisual information in the encoding stage and neglect shallow features in the decoding stage. To address these problems, this study proposes a visually guided binaural audio generation method based on hierarchical feature encoding and decoding, which effectively improves the quality of the generated binaural audio. To narrow the heterogeneity gap that hinders the association and fusion of visual and audio data, an encoder that fuses audiovisual features hierarchically during encoding is proposed, improving the overall utilization of audiovisual data in the encoding stage. To make effective use of shallow structural features during decoding, a decoder with skip connections between feature layers of different depths, from deep to shallow, is constructed, so that both the shallow detail features and the deep features of the audiovisual information are fully exploited. Owing to this efficient use of audiovisual information and the hierarchical combination of deep and shallow structural features, the proposed method can effectively handle binaural audio generation in complex visual scenes. Compared with existing methods, it improves generation performance by more than 6% on measures such as realism.
[CLC number]
[Funding]
National Natural Science Foundation of China (61971383, 62201524); National Key Research and Development Program of China (2021YFF0900504)