Visually Guided Binaural Audio Generation Method Based on Hierarchical Feature Encoding and Decoding
Authors:
Corresponding author:

Ye Long, E-mail: yelong@cuc.edu.cn

Foundation items:

National Natural Science Foundation of China (61971383, 62201524); National Key Research and Development Program of China (2021YFF0900504)



    Abstract:

    Visually guided binaural audio generation is an important multimodal learning task with broad application value: given visual information and a mono audio track, it aims to generate binaural audio that is consistent with the visual scene. Existing methods produce unsatisfactory results because they underuse audiovisual information in the encoding stage and neglect shallow features in the decoding stage. To address these problems, this study proposes a visually guided binaural audio generation method based on hierarchical feature encoding and decoding, which effectively improves the quality of the generated binaural audio. To narrow the heterogeneous gap that hinders the association and fusion of audiovisual data, an encoder that hierarchically encodes and fuses audiovisual features is proposed, improving the utilization of both modalities during encoding. To exploit shallow structural features during decoding, a decoder with skip connections between feature layers of different depths, from deep to shallow, is constructed, making full use of both the shallow detail features and the deep features of the audiovisual information. Benefiting from the efficient use of audiovisual information and the hierarchical combination of deep and shallow structural features, the proposed method handles binaural audio generation in complex visual scenes effectively, and improves on existing methods by over 6% in terms of realism.
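    The full-scale skip-connection idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the 3-level feature pyramid, the channel counts, and the 4-dimensional global visual embedding are illustrative assumptions. It shows only the fusion pattern at one decoder level: downsampled shallow encoder features, same-depth features, upsampled deep features, and a spatially tiled visual condition are concatenated channel-wise.

    ```python
    import numpy as np

    def pool2x(x):
        # 2x average pooling over (H, W) of a (C, H, W) feature map
        c, h, w = x.shape
        return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

    def up2x(x):
        # 2x nearest-neighbour upsampling of a (C, H, W) feature map
        return x.repeat(2, axis=1).repeat(2, axis=2)

    # "Encoder": a pyramid of mono-spectrogram features at 3 depths
    spec = np.random.rand(8, 32, 32)        # (C, H, W) shallowest features
    enc = [spec]
    for _ in range(2):
        enc.append(pool2x(enc[-1]))         # enc[k] has spatial size 32 / 2**k

    visual = np.random.rand(4)              # global visual embedding (assumed)

    # Decoder level at depth 1 (16x16): fuse ALL encoder depths plus vision
    h, w = enc[1].shape[1:]
    parts = [
        pool2x(enc[0]),                     # shallow detail features, downsampled
        enc[1],                             # same-depth features
        up2x(enc[2]),                       # deep semantic features, upsampled
        np.tile(visual[:, None, None], (1, h, w)),  # broadcast visual condition
    ]
    fused = np.concatenate(parts, axis=0)   # channel-wise fusion
    print(fused.shape)                      # (8 + 8 + 8 + 4, 16, 16)
    ```

    In a real model the concatenation would be followed by learned convolutions at every decoder level; the sketch only demonstrates how shallow and deep features of different resolutions are brought to a common spatial size and fused.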

Cite this article:

Wang RQ, Cheng HN, Ye L. Visually guided binaural audio generation method based on hierarchical feature encoding and decoding. Ruan Jian Xue Bao/Journal of Software, 2024, 35(5): 2165-2175 (in Chinese).

History
  • Received: 2023-04-10
  • Revised: 2023-06-08
  • Published online: 2023-09-11
  • Published: 2024-05-06