分层特征编解码驱动的视觉引导立体声生成方法
CSTR:
作者:
作者单位:

作者简介:

通讯作者:

叶龙,E-mail:yelong@cuc.edu.cn

中图分类号:

基金项目:

国家自然科学基金(61971383,62201524);国家重点研发计划(2021YFF0900504)


Visually Guided Binaural Audio Generation Method Based on Hierarchical Feature Encoding and Decoding
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    视觉引导的立体声生成是多模态学习中具有广泛应用价值的重要任务之一, 其目标是在给定视觉模态信息及单声道音频模态信息的情况下, 生成符合视听一致性的立体声音频. 针对现有视觉引导的立体声生成方法因编码阶段视听信息利用率不足、解码阶段忽视浅层特征导致的立体声生成效果不理想的问题, 提出一种基于分层特征编解码的视觉引导的立体声生成方法, 有效提升立体声生成的质量. 其中, 为了有效地缩小阻碍视听觉模态数据间关联融合的异构鸿沟, 提出一种视听觉特征分层编码融合的编码器结构, 提高视听模态数据在编码阶段的综合利用效率; 为了实现解码过程中浅层结构特征信息的有效利用, 构建一种由深到浅不同深度特征层间跳跃连接的解码器结构, 实现了对视听觉模态信息的浅层细节特征与深度特征的充分利用. 得益于对视听觉信息的高效利用以及对深层浅层结构特征的分层结合, 所提方法可有效处理复杂视觉场景中的立体声合成, 相较于现有方法, 所提方法生成效果在真实感等方面性能提升超过6%.

    Abstract:

    Visually guided binaural audio generation is one of the important tasks with wide application value in multimodal learning. The goal of the task is to generate binaural audio that conforms to audiovisual consistency with the given visual modal information and mono audio modal information. The existing visually guided binaural audio generation methods have unsatisfactory binaural audio generation effects due to insufficient utilization of audiovisual information in the encoding stage and neglect of shallow features in the decoding stage. To solve the above problems, this study proposes a visually guided binaural audio generation method based on hierarchical feature encoding and decoding, which effectively improves the quality of binaural audio generation. In order to effectively narrow the heterogeneous gap that hinders the association and fusion of audiovisual modal data, an encoder structure based on hierarchical coding and fusion of audiovisual features is proposed, which improves the comprehensive utilization efficiency of audiovisual modal data in the encoding stage. In order to realize the effective use of shallow structural feature information in the decoding process, a decoder structure with a skip connection between different depth feature layers from deep to shallow is constructed, which realizes the full use of shallow detail features and depth features of audiovisual modal information. Benefiting from the efficient use of audiovisual information and the hierarchical combination of deep and shallow structural features, the proposed method can effectively deal with binaural audio generation in complex visual scenes. Compared with the existing methods, the generation performance of the proposed method is improved by over 6% in terms of realism.

    参考文献
    相似文献
    引证文献
引用本文

王睿琦,程皓楠,叶龙.分层特征编解码驱动的视觉引导立体声生成方法.软件学报,2024,35(5):2165-2175

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2023-04-10
  • 最后修改日期:2023-06-08
  • 录用日期:
  • 在线发布日期: 2023-09-11
  • 出版日期: 2024-05-06
文章二维码
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号