Cross-modal Person Retrieval Method Based on Relation Alignment

doi:10.13328/j.cnki.jos.006993


Journal of Software, Volume 35, Issue 10, 2024, pp. 4766-4780

Author:
LI Bo
School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China
ZHANG Fei-Fei
School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China
XU Chang-Sheng
State Key Laboratory of Multimodal Artificial Intelligence Systems (Institute of Automation, Chinese Academy of Sciences), Beijing 100190, China

                    
CLC Number: TP18


Abstract:

Text-based person retrieval is an emerging downstream task of cross-modal retrieval. It derives from conventional person re-identification and plays a vital role in public safety and person search. The task addresses the lack of query images in traditional person re-identification; its main challenge is that it combines two different modalities and requires the model to learn both image content and textual semantics. To narrow the semantic gap between pedestrian images and text descriptions, traditional methods usually split image features and text features mechanically and focus only on cross-modal alignment. This ignores the latent relations between the person image and its description and leads to inaccurate cross-modal alignment. To address these issues, this study proposes a novel relation alignment-based cross-modal person retrieval network. First, the attention mechanism is used to construct the self-attention matrix and the cross-modal attention matrix, where each attention matrix is regarded as the distribution of response values between different feature sequences. Then, two different matrix construction methods are used to reconstruct the intra-modal attention matrix and the cross-modal attention matrix, respectively. The element-by-element reconstruction of the intra-modal attention matrix mines the latent relationships within each modality. The holistic reconstruction of the cross-modal attention matrix uses cross-modal information as a bridge to mine latent information across modalities and narrow the semantic gap. Finally, the model is jointly trained with a cross-modal projection matching loss and a KL divergence loss, which helps the two modalities promote each other.
Quantitative and qualitative results on a public text-based person search dataset (CUHK-PEDES) demonstrate that the proposed method performs favorably against state-of-the-art text-based person search methods.
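
To make the idea of treating an attention matrix as a distribution of response values more concrete, the following is a minimal sketch in plain Python. It is not the network or the reconstruction procedure of the paper, which the abstract does not specify in detail. The toy feature values, the function names, and the "bridged" matrix (an intra-modal relation rebuilt by routing image-to-text attention through text-to-image attention) are all assumptions made for illustration. The sketch only shows that each row of a softmax attention matrix is a probability distribution, so two such matrices can be compared with a KL divergence.

```python
import math

def softmax(row):
    """Turn a list of scores into a probability distribution."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(queries, keys):
    """Row-wise softmax of scaled dot products between two feature sequences."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mean_row_kl(a, b):
    """Average KL divergence between corresponding rows of two matrices."""
    return sum(kl_divergence(ra, rb) for ra, rb in zip(a, b)) / len(a)

# Hypothetical toy features: 3 image regions and 2 text tokens, dimension 2.
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_feats = [[1.0, 0.5], [0.2, 1.0]]

self_attn = attention_matrix(image_feats, image_feats)   # 3 x 3, intra-modal
cross_attn = attention_matrix(image_feats, text_feats)   # 3 x 2, image -> text
back_attn = attention_matrix(text_feats, image_feats)    # 2 x 3, text -> image

# Assumed illustration: an image-image relation rebuilt through the text modality.
# The product of two row-stochastic matrices is again row-stochastic.
bridged = matmul(cross_attn, back_attn)                  # 3 x 3

# A KL term that would penalize disagreement between the two relation matrices.
relation_loss = mean_row_kl(self_attn, bridged)
```

In a trainable model, a term like `relation_loss` would be added to the matching loss and minimized by gradient descent; here it is only computed once on fixed toy values.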

Key words: person retrieval; cross-modal task; textual semantic learning; relation alignment; attention mechanism


Citation:

LI Bo, ZHANG Fei-Fei, XU Chang-Sheng. Cross-modal Person Retrieval Method Based on Relation Alignment. Journal of Software, 2024, 35(10): 4766-4780


History

Received: November 18, 2022
Revised: February 28, 2023
Online: November 15, 2023
Published: October 06, 2024

