Cross-modal Person Retrieval Method Based on Relation Alignment

Authors:

LI Bo (1997-), male, master's student; main research interest: multimedia analysis. ZHANG Feifei (1989-), female, PhD, professor, CCF professional member; main research interests: multimedia analysis, computer vision, pattern recognition, image processing. XU Changsheng (1969-), male, PhD, research fellow, doctoral supervisor, CCF distinguished member; main research interests: multimedia analysis, computer vision, pattern recognition, image processing.

Corresponding author:

XU Changsheng, E-mail: csxu@nlpr.ia.ac.cn

CLC number:

TP18

Funding:

National Key Research and Development Program of China (2018AAA0102200); National Natural Science Foundation of China (62036012, 62002355, 61720106006, 62102415, 62106262, 62072455, 62202331, 62206200); Natural Science Foundation of Tianjin (22JCYBJC00030); Beijing Natural Science Foundation (L201001, 4222039)

Abstract:

Text-based person retrieval is an emerging downstream task of cross-modal retrieval, derived from conventional person re-identification, and plays a vital role in public safety and person search. By accepting textual queries, it addresses the lack of query images in traditional person re-identification; its main challenge is that it combines two different modalities and requires the model to learn both image content and textual semantics. To narrow the semantic gap between pedestrian images and text descriptions, traditional methods usually split image and text features mechanically and focus only on cross-modal alignment, which ignores the potential relations between the person image and the description and leads to inaccurate cross-modal alignment. To address these issues, this study proposes a novel relation alignment-based cross-modal person retrieval network. First, the attention mechanism is used to construct the intra-modal self-attention matrix and the cross-modal attention matrix, where an attention matrix is regarded as the distribution of response values between different feature sequences. Then, two different construction methods are used to reconstruct the intra-modal self-attention matrix and the cross-modal attention matrix, respectively. The element-by-element reconstruction of the intra-modal self-attention matrix effectively mines the potential relationships within each modality, while the holistic reconstruction of the cross-modal attention matrix, with cross-modal information as a bridge, fully exploits the potential information between modalities and narrows the semantic gap. Finally, the model is jointly trained with a cross-modal projection matching (CMPM) loss and a KL divergence loss, which helps the two modalities promote each other. Quantitative and qualitative results on a public text-based person search dataset (CUHK-PEDES) demonstrate that the proposed method performs favorably against state-of-the-art text-based person search methods.
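The first step described above, building an intra-modal self-attention matrix and a cross-modal attention matrix as distributions of response values between feature sequences, can be illustrated with scaled dot-product attention. This is a minimal stdlib-Python sketch on toy features; the paper's actual feature extractors and exact attention formulation are not reproduced here, and the shapes are purely illustrative.

```python
import math
import random

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    z = sum(exps)
    return [e / z for e in exps]

def attention_matrix(queries, keys):
    """Row-stochastic attention matrix between two feature sequences.

    Entry (i, j) is the softmax-normalised scaled dot product of
    queries[i] and keys[j], i.e. the response of query i to key j.
    """
    scale = math.sqrt(len(queries[0]))
    scores = [[sum(q * k for q, k in zip(qi, kj)) / scale for kj in keys]
              for qi in queries]
    return [softmax(row) for row in scores]

# Toy feature sequences: 4 image regions and 5 text tokens, dimension 8.
random.seed(0)
img = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
txt = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]

A_vv = attention_matrix(img, img)  # intra-modal self-attention (4x4)
A_vt = attention_matrix(img, txt)  # cross-modal attention (4x5)
# Each row of either matrix is a probability distribution over key positions,
# matching the "distribution of response values" reading in the abstract.
```

Because every row is softmax-normalised, the matrices can be compared row-wise with a KL divergence, which is what makes the reconstruction-plus-KL training signal in the abstract well defined.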

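For the training objective, the abstract names a cross-modal projection matching (CMPM) loss alongside a KL divergence term. The sketch below follows the general form of the CMPM idea (projecting image features onto normalised text features and pulling the resulting softmax distribution towards the identity-match distribution via KL divergence); `img_feats`, `txt_feats`, and `labels` are toy inputs, and the paper's exact weighting and batch construction are assumptions, not reproduced.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, eps=1e-8):
    # KL(p || q) with eps-smoothing to avoid log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cmpm_loss(img_feats, txt_feats, labels):
    """CMPM-style loss sketch: for each image, the softmax distribution of
    its projections onto the L2-normalised text features is pulled towards
    the normalised identity-match distribution via KL divergence."""
    g_bar = []
    for g in txt_feats:
        n = math.sqrt(sum(x * x for x in g)) or 1.0
        g_bar.append([x / n for x in g])
    total = 0.0
    for i, f in enumerate(img_feats):
        scores = [sum(a * b for a, b in zip(f, g)) for g in g_bar]
        p = softmax(scores)                              # predicted matches
        match = [1.0 if labels[i] == lj else 0.0 for lj in labels]
        q = [m / sum(match) for m in match]              # true matches
        total += kl_divergence(p, q)
    return total / len(img_feats)

# Toy check: aligned image/text pairs should incur a lower loss than
# deliberately swapped pairs.
img = [[1.0, 0.0], [0.0, 1.0]]
txt_aligned = [[1.0, 0.0], [0.0, 1.0]]
txt_swapped = [[0.0, 1.0], [1.0, 0.0]]
labels = [0, 1]
aligned = cmpm_loss(img, txt_aligned, labels)
swapped = cmpm_loss(img, txt_swapped, labels)
```

The same `kl_divergence` helper is the natural tool for the abstract's second term, comparing original and reconstructed attention rows, since both are probability distributions.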
Cite this article:

Li B, Zhang FF, Xu CS. Cross-modal person retrieval method based on relation alignment. Ruan Jian Xue Bao/Journal of Software, 2024, 35(10): 4766-4780 (in Chinese).
History:
  • Received: 2022-11-18
  • Revised: 2023-02-28
  • Online: 2023-11-15
  • Published: 2024-10-06