Cross-modal Person Retrieval Method Based on Relation Alignment

doi:10.13328/j.cnki.jos.006993

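Author biographies: Li Bo (born 1997), male, master's student; main research interest: multimedia analysis. Zhang Feifei (born 1989), female, Ph.D., professor, CCF professional member; main research interests: multimedia analysis, computer vision, pattern recognition, and image processing. Xu Changsheng (born 1969), male, Ph.D., research professor, doctoral supervisor, CCF distinguished member; main research interests: multimedia analysis, computer vision, pattern recognition, and image processing.
Corresponding author: Xu Changsheng, E-mail: csxu@nlpr.ia.ac.cn
CLC number: TP18
Funding: National Key Research and Development Program of China (2018AAA0102200); National Natural Science Foundation of China (62036012, 62002355, 61720106006, 62102415, 62106262, 62072455, 62202331, 62206200); Natural Science Foundation of Tianjin (22JCYBJC00030); Beijing Natural Science Foundation (L201001, 4222039)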

Cross-modal Person Retrieval Method Based on Relation Alignment

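Abstract (Chinese version):

Text-based person retrieval is an emerging cross-modal retrieval subtask derived from conventional person re-identification, and it matters for public safety and person tracking. Compared with person re-identification, which is single-modal image retrieval, text-based person retrieval removes the need for a query image, which is often unavailable in practice. Its main challenge is that the task combines data from two different modalities, visual content and text descriptions, so the model must be able both to understand images and to learn textual semantics. To narrow the semantic gap between pedestrian images and text descriptions, conventional text-based person retrieval methods mostly split the extracted image and text features mechanically and attend only to the semantic alignment of cross-modal information. They ignore the latent relations inside the image and text modalities, which makes fine-grained matching between modalities inaccurate. To address these problems, this paper proposes a person retrieval method promoted by inter-modal relations. First, an attention mechanism is used to build an intra-modal self-attention matrix and a cross-modal attention matrix, and each attention matrix is treated as a distribution of response values between different feature sequences. Then, two different matrix construction methods are used to reconstruct the intra-modal self-attention matrix and the cross-modal attention matrix, respectively. The self-attention matrix is reconstructed element by element within a modality, which mines the latent intra-modal relations well; the cross-modal attention matrix is reconstructed as a whole across modalities, which uses cross-modal information as a bridge to fully mine latent inter-modal information and narrow the semantic gap. Finally, a task-based cross-modal projection matching loss and a KL divergence loss jointly constrain model optimization so that information from the two modalities promotes each other. Quantitative experiments and visualizations of retrieval results on CUHK-PEDES, a public dataset for text-based person retrieval, both show that the proposed method achieves state-of-the-art results.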
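
The idea of treating an attention matrix as a distribution of response values can be illustrated with a small sketch. This is not the paper's implementation: the scaled dot-product scoring, the row-wise softmax, the helper names `softmax` and `attention_matrix`, and the toy feature values are all assumptions made for illustration, since the abstract does not say how the matrices are computed.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(queries, keys):
    """Row i is the response distribution of query i over all keys.

    Assumption: scaled dot-product scores followed by a row-wise softmax.
    """
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(q * k for q, k in zip(qv, kv)) / scale for kv in keys])
        for qv in queries
    ]

# Toy features (made-up values): 3 image regions and 2 words, 2-d each.
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_feats = [[1.0, 0.5], [0.2, 1.0]]

# Intra-modal self-attention: image regions attend to image regions (3 x 3).
self_attn_img = attention_matrix(image_feats, image_feats)
# Cross-modal attention: image regions attend to words (3 x 2).
cross_attn = attention_matrix(image_feats, text_feats)

for row in cross_attn:
    print([round(v, 3) for v in row])
```

Because every row sums to 1, each row can be compared with another row of the same length using a divergence between distributions, which is what makes a KL divergence term applicable to such matrices.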

Abstract:

Text-based person retrieval is an emerging subtask of cross-modal retrieval derived from conventional person re-identification, and it plays a vital role in public safety and person tracking. Unlike person re-identification, which is single-modal image retrieval, text-based person retrieval does not require a query image, which is often unavailable in practice. Its main challenge is that it combines two different modalities, visual content and text descriptions, and therefore requires the model to learn both image content and textual semantics. To narrow the semantic gap between pedestrian images and text descriptions, conventional methods usually split image features and text features mechanically and focus only on cross-modal semantic alignment. This ignores the latent relations within the image modality and within the text modality and leads to inaccurate fine-grained cross-modal matching. To address these issues, this study proposes a relation alignment-based cross-modal person retrieval network. First, the attention mechanism is used to construct an intra-modal self-attention matrix and a cross-modal attention matrix, and each attention matrix is regarded as a distribution of response values between different feature sequences. Then, two different matrix construction methods are used to reconstruct the intra-modal self-attention matrix and the cross-modal attention matrix, respectively. The element-by-element reconstruction of the intra-modal self-attention matrix mines the latent relations within each modality. The holistic reconstruction of the cross-modal attention matrix takes cross-modal information as a bridge, which mines the latent information shared across modalities and narrows the semantic gap. Finally, the model is jointly trained with a task-based cross-modal projection matching loss and a KL divergence loss, so that the two modalities promote each other. Quantitative and qualitative results on CUHK-PEDES, a public text-based person retrieval dataset, demonstrate that the proposed method performs favorably against state-of-the-art text-based person retrieval methods.
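
The joint objective named above, a cross-modal projection matching (CMPM) loss plus a KL divergence loss, can be sketched roughly as follows. This is a hedged illustration rather than the authors' code. It assumes the commonly used CMPM form, a KL divergence from the predicted matching distribution to the identity-based true matching distribution, computed in both directions. The abstract does not state which distributions the paper's KL term compares, so the sketch compares the rows of two hypothetical attention matrices `attn_a` and `attn_b`. The weight `kl_weight`, all function names, and the toy values are likewise assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cmpm_one_direction(src, tgt, ids, eps=1e-8):
    """CMPM-style loss from src to tgt, averaged over the batch.

    Assumption: src[i] and tgt[i] form a matched pair with identity ids[i].
    """
    # L2-normalise the target features before projecting onto them.
    tgt_unit = []
    for t in tgt:
        norm = math.sqrt(sum(v * v for v in t))
        tgt_unit.append([v / norm for v in t])

    total = 0.0
    for i, s in enumerate(src):
        # p: predicted matching distribution of sample i over the batch.
        p = softmax([sum(a * b for a, b in zip(s, t)) for t in tgt_unit])
        # q: true matching distribution, uniform over same-identity samples.
        match = [1.0 if ids[i] == ids[j] else 0.0 for j in range(len(tgt))]
        q = [y / sum(match) for y in match]
        total += kl_divergence(p, q, eps)
    return total / len(src)

def joint_loss(img, txt, ids, attn_a, attn_b, kl_weight=1.0):
    """Bidirectional CMPM term plus a row-wise KL term between two matrices."""
    cmpm = cmpm_one_direction(img, txt, ids) + cmpm_one_direction(txt, img, ids)
    kl = sum(kl_divergence(a, b) for a, b in zip(attn_a, attn_b)) / len(attn_a)
    return cmpm + kl_weight * kl

# Toy batch (made-up values): two image/text pairs with different identities.
img = [[5.0, 0.0], [0.0, 5.0]]
txt = [[1.0, 0.0], [0.0, 1.0]]
ids = [0, 1]
# Two hypothetical row-stochastic attention matrices to be pulled together.
attn_a = [[0.7, 0.3], [0.2, 0.8]]
attn_b = [[0.6, 0.4], [0.3, 0.7]]

loss = joint_loss(img, txt, ids, attn_a, attn_b)
print(loss)
```

In this sketch the CMPM term is small when each image feature projects most strongly onto its own paired text feature, and it grows sharply when probability mass lands on a sample with a different identity.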

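Cite this article

Li Bo, Zhang Feifei, Xu Changsheng. Cross-modal person retrieval method based on relation alignment. Journal of Software, 2024, 35(10): 4766-4780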

History

Received: 2022-11-18
Last revised: 2023-02-28
Published online: 2023-11-15
