Visually Guided Binaural Audio Generation Method Based on Hierarchical Feature Encoding and Decoding

doi:10.13328/j.cnki.jos.007027

Volume 35, Issue 5, 2024, pp. 2165-2175

Abstract:

Visually guided binaural audio generation is an important multimodal learning task with wide application value. Given visual information and a mono audio signal, the goal is to generate binaural audio that is consistent with the visual scene. Existing methods produce unsatisfactory binaural audio for two reasons: they underuse audiovisual information in the encoding stage, and they neglect shallow features in the decoding stage. To address these problems, this study proposes a visually guided binaural audio generation method based on hierarchical feature encoding and decoding, which effectively improves the quality of the generated binaural audio. First, to narrow the heterogeneous gap that hinders the association and fusion of audiovisual data, an encoder based on hierarchical encoding and fusion of audiovisual features is proposed, which makes fuller use of both modalities in the encoding stage. Second, to exploit shallow structural features during decoding, a decoder is constructed with skip connections between feature layers of different depths, from deep to shallow, so that both the shallow detail features and the deep features of the audiovisual information are fully used. Benefiting from the efficient use of audiovisual information and the hierarchical combination of deep and shallow structural features, the proposed method can effectively handle binaural audio generation in complex visual scenes. Compared with existing methods, the proposed method improves generation performance by over 6% in terms of realism.
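
The two structural ideas in the abstract (fusing visual features with audio features at every encoder level, and a decoder that reuses shallow features through deep-to-shallow skip connections) can be illustrated with a toy sketch. This is not the authors' network: the array shapes, the averaging and repetition used in place of learned layers, and every function name (`downsample`, `upsample`, `fuse`, `encode_hierarchical`, `decode_with_skips`) are assumptions chosen only to show the data flow.

```python
import numpy as np

def downsample(x):
    """Halve the time axis by averaging adjacent frames (stand-in for a strided conv)."""
    return 0.5 * (x[:, 0::2] + x[:, 1::2])

def upsample(x):
    """Double the time axis by repeating frames (stand-in for a transposed conv)."""
    return np.repeat(x, 2, axis=1)

def fuse(audio_feat, visual_vec):
    """Tile the visual vector along time and concatenate it with the audio channels."""
    tiled = np.repeat(visual_vec[:, None], audio_feat.shape[1], axis=1)
    return np.concatenate([audio_feat, tiled], axis=0)

def encode_hierarchical(audio, visual_vec, levels=3):
    """Fuse visual information at every depth; return features, shallow first."""
    feats, x = [], audio
    for _ in range(levels):
        x = downsample(x)
        x = fuse(x, visual_vec)
        feats.append(x)
    return feats

def decode_with_skips(feats):
    """Walk from the deepest feature to the shallowest, concatenating each skip feature."""
    x = feats[-1]
    for skip in reversed(feats[:-1]):
        x = upsample(x)
        x = np.concatenate([x, skip], axis=0)  # skip connection
    return upsample(x)

# Hypothetical toy inputs: 1-channel "mono audio" with 64 frames, 4-dim visual vector.
rng = np.random.default_rng(0)
mono = rng.standard_normal((1, 64))
visual = rng.standard_normal(4)

features = encode_hierarchical(mono, visual)
decoded = decode_with_skips(features)
```

With these toy inputs, `decoded` has shape `(27, 64)`: the shallowest encoder feature survives unchanged in its last five channels, which is the point of the skip connection. A real model would replace the fixed pooling and repetition with learned layers and map the decoded features to the two output channels.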

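Citation:

Wang Ruiqi, Cheng Haonan, Ye Long. Visually guided binaural audio generation method based on hierarchical feature encoding and decoding. Journal of Software (软件学报), 2024, 35(5): 2165-2175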

History

Received: April 10, 2023
Revised: June 8, 2023
Online: September 11, 2023
Published: May 6, 2024

Copyright: Institute of Software, Chinese Academy of Sciences
Address: 4# South Fourth Street, Zhong Guan Cun, Beijing 100190
Phone: 010-62562563 Fax: 010-62562533 Email: jos@iscas.ac.cn