Query Aware Dual Contrastive Learning Network for Cross-modal Retrieval

doi:10.13328/j.cnki.jos.007021


Volume 35, Issue 5, 2024, pp. 2120-2132

Author:
YIN Meng-Ran, LIANG Mei-Yu, YU Yang, CAO Xiao-Wen, DU Jun-Ping, XUE Zhe

Affiliation:
School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China; Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia (Beijing University of Posts and Telecommunications), Beijing 100876, China

Abstract:

Cross-modal video corpus moment retrieval (VCMR) is a recently proposed task that aims to retrieve, from an unsegmented video corpus, the short video segment corresponding to a query statement. Existing cross-modal video-text retrieval work focuses on aligning and fusing the features of different modalities. However, cross-modal alignment and fusion alone cannot ensure that semantically similar data from the same modality remain close in the joint feature space, and the semantics of the query statement are not taken into account. To solve these problems, this study proposes a query-aware cross-modal dual contrastive learning network for multi-modal video moment retrieval (QACLN), which achieves a unified semantic representation of data from different modalities by combining cross-modal and intra-modal contrastive learning. First, a query-aware cross-modal semantic fusion strategy is proposed, which obtains a query-aware multi-modal joint representation of the video by adaptively fusing its multi-modal features, such as visual features and caption features, according to the perceived query semantics. Then, a cross-modal and intra-modal dual contrastive learning mechanism for videos and text queries is proposed to enhance the semantic alignment and fusion of different modalities, improving the discriminability and semantic consistency of their data representations. Finally, 1D convolutional boundary regression and cross-modal semantic similarity calculation are employed to perform moment localization and video retrieval. Extensive experiments demonstrate that the proposed QACLN outperforms the benchmark methods.
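
The abstract does not give the exact form of the fusion weights or of the contrastive losses, so the following is only a minimal sketch of the first two steps under stated assumptions, not the QACLN implementation. It assumes that the fusion weights are a softmax over query-modality cosine similarities, that both contrastive terms are InfoNCE losses with in-batch negatives, and that intra-modal positives come from a second, semantically similar sample of the same modality. The names `query_aware_fuse`, `info_nce`, `dual_contrastive_loss`, `temperature`, and `weight` are all hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # L2-normalize a non-zero vector so dot products become cosine similarities.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def query_aware_fuse(query, visual, caption):
    """Fuse visual and caption features, weighting each modality by its
    similarity to the query (assumed form of 'query-aware' fusion)."""
    q, v, c = normalize(query), normalize(visual), normalize(caption)
    w_v, w_c = softmax([dot(q, v), dot(q, c)])
    return [w_v * a + w_c * b for a, b in zip(v, c)]

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: positives[i] is the match for anchors[i];
    every other row of positives serves as an in-batch negative."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(anchors)

def dual_contrastive_loss(queries, videos, queries_alt, videos_alt, weight=1.0):
    """Cross-modal term (query <-> video) plus intra-modal term
    (each sample <-> a semantically similar sample of the same modality)."""
    cross = 0.5 * (info_nce(queries, videos) + info_nce(videos, queries))
    intra = 0.5 * (info_nce(queries, queries_alt) + info_nce(videos, videos_alt))
    return cross + weight * intra

# Toy 2-D features: query i matches video i.
queries = [[1.0, 0.0], [0.0, 1.0]]
visual = [[0.9, 0.1], [0.2, 0.8]]
caption = [[0.8, 0.2], [0.1, 0.9]]
videos = [query_aware_fuse(q, v, c) for q, v, c in zip(queries, visual, caption)]

aligned = info_nce(queries, videos)         # correct pairing
shuffled = info_nce(queries, videos[::-1])  # deliberately wrong pairing
total = dual_contrastive_loss(queries, videos, queries, videos)
```

In this toy setup the correctly paired batch yields a lower loss than the shuffled one, which is the behaviour a contrastive objective is meant to reward. The moment localization step (1D convolutional boundary regression) is not covered by this sketch.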

Key words: cross-modal semantic fusion; cross-modal retrieval; video moment localization; contrastive learning

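Citation:

YIN Meng-Ran, LIANG Mei-Yu, YU Yang, CAO Xiao-Wen, DU Jun-Ping, XUE Zhe. Query Aware Dual Contrastive Learning Network for Cross-modal Retrieval. Journal of Software (软件学报), 2024, 35(5): 2120-2132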


History:

Received: March 26, 2023
Revised: June 08, 2023
Online: September 11, 2023
Published: May 06, 2024
