Visually Guided Binaural Audio Generation Method Based on Hierarchical Feature Encoding and Decoding

    Abstract:

    Visually guided binaural audio generation is an important multimodal learning task with broad application value. Given visual information and mono audio, the goal is to generate binaural audio that is consistent with the visual scene. Existing methods produce unsatisfactory binaural audio because they underuse audiovisual information in the encoding stage and neglect shallow features in the decoding stage. To address these problems, this study proposes a visually guided binaural audio generation method based on hierarchical feature encoding and decoding, which effectively improves the quality of the generated binaural audio. To narrow the heterogeneity gap that hinders the association and fusion of audiovisual data, an encoder based on hierarchical encoding and fusion of audiovisual features is proposed, which makes fuller use of both modalities in the encoding stage. To exploit shallow structural features during decoding, a decoder is constructed with skip connections between feature layers of different depths, from deep to shallow, so that both the shallow detail features and the deep features of the audiovisual information are fully used. Benefiting from the efficient use of audiovisual information and the hierarchical combination of deep and shallow structural features, the proposed method handles binaural audio generation in complex visual scenes effectively. Compared with existing methods, the proposed method improves generation performance by more than 6% in terms of realism.
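
    The two structural ideas in the abstract, fusing visual information into the audio features at every encoder level and feeding shallow encoder features back into the decoder through skip connections, can be illustrated with a toy sketch. This is not the authors' network: the function names (`downsample`, `upsample`, `fuse`, `encode`, `decode`), the pair-averaging and value-repeating stand-ins for learned layers, the additive fusion, and the averaging skip merge are all assumptions made for illustration, and the sketch operates on a plain 1-D list rather than a spectrogram.

```python
from typing import List

Vector = List[float]


def downsample(x: Vector) -> Vector:
    """Halve the length by averaging adjacent pairs (stand-in for a strided convolution)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def upsample(x: Vector) -> Vector:
    """Double the length by repeating each value (stand-in for a transposed convolution)."""
    return [v for v in x for _ in range(2)]


def fuse(audio_feat: Vector, visual_cue: float) -> Vector:
    """Toy audiovisual fusion (assumption): add one visual scalar to every audio feature."""
    return [a + visual_cue for a in audio_feat]


def encode(audio: Vector, visual_cues: List[float]) -> List[Vector]:
    """Hierarchical encoding: downsample, then fuse a visual cue at each level.

    Returns the fused features of every level, shallowest first.
    """
    levels = len(visual_cues)
    if levels == 0 or len(audio) % (2 ** levels) != 0:
        raise ValueError("audio length must be a multiple of 2 ** number of levels")
    feats: List[Vector] = []
    x = audio
    for cue in visual_cues:
        x = fuse(downsample(x), cue)
        feats.append(x)
    return feats


def decode(feats: List[Vector]) -> Vector:
    """Decode from deep to shallow, merging each shallower encoder feature via a skip connection."""
    x = feats[-1]
    for skip in reversed(feats[:-1]):
        x = upsample(x)
        # Skip connection (assumption: element-wise average instead of concatenation).
        x = [(u + s) / 2.0 for u, s in zip(x, skip)]
    return upsample(x)  # final upsampling restores the input length


if __name__ == "__main__":
    mono = [1.0] * 8
    features = encode(mono, visual_cues=[0.5, 0.0, 0.0])
    print([len(f) for f in features])
    print(decode(features))
```

    In a real model each stand-in would be a learned layer and the decoder would typically emit a mask or a difference signal for the two channels; the sketch only shows how the features of each depth are produced once in the encoder and reused in the decoder.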

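Citation:

Wang Ruiqi, Cheng Haonan, Ye Long. Visually guided binaural audio generation method based on hierarchical feature encoding and decoding. Journal of Software, 2024, 35(5): 2165-2175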

History
  • Received: April 10, 2023
  • Revised: June 08, 2023
  • Online: September 11, 2023
  • Published: May 06, 2024