Abstract: Recently, a new task named cross-modal video corpus moment retrieval (VCMR) has been proposed, which aims to retrieve a small video segment corresponding to a query statement from an unsegmented video corpus. The core of existing cross-modal video-text retrieval work is the alignment and fusion of features from different modalities. However, simply performing cross-modal alignment and fusion cannot ensure that semantically similar data from the same modality remain close in the joint feature space, and the semantics of the query statement are not taken into account. To address these problems, this study proposes a query-aware cross-modal dual contrastive learning network for multi-modal video moment retrieval (QACLN), which achieves a unified semantic representation of data from different modalities by combining cross-modal and intra-modal contrastive learning. First, a query-aware cross-modal semantic fusion strategy is proposed, which obtains a query-aware multi-modal joint representation of the video by adaptively fusing the video's multi-modal features, such as its visual and caption features, according to the semantics of the query. Then, a cross-modal and intra-modal dual contrastive learning mechanism for videos and text queries is proposed to enhance the semantic alignment and fusion of the different modalities, improving the discriminability and semantic consistency of the learned representations. Finally, 1D convolutional boundary regression and cross-modal semantic similarity computation are employed to perform moment localization and video retrieval, respectively. Extensive experiments demonstrate that the proposed QACLN outperforms benchmark methods.
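To make the dual contrastive learning objective concrete, the following is a minimal PyTorch-style sketch that combines a cross-modal contrastive term (video vs. query) with intra-modal terms, assuming a standard symmetric InfoNCE formulation. The function names (`info_nce`, `dual_contrastive_loss`), the augmented-view inputs, the temperature, and the weighting factor `lambda_intra` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    a, b: (batch, dim) tensors; a[i] and b[i] form a positive pair,
    and all other pairings in the batch serve as negatives.
    """
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                 # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def dual_contrastive_loss(video_emb, query_emb,
                          video_emb_aug, query_emb_aug,
                          lambda_intra=0.5):
    """Combine cross-modal (video vs. query) and intra-modal
    (two views of the same modality) contrastive terms.
    The augmented views and the weighting are assumptions for illustration."""
    cross = info_nce(video_emb, query_emb)           # pull a video toward its query
    intra_v = info_nce(video_emb, video_emb_aug)     # keep semantically similar videos close
    intra_q = info_nce(query_emb, query_emb_aug)     # keep semantically similar queries close
    return cross + lambda_intra * (intra_v + intra_q)
```

In this sketch, the cross-modal term drives video-query alignment in the joint space, while the intra-modal terms encourage semantically similar data within each modality to stay close, mirroring the motivation stated in the abstract.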