Text-based Person Search via Virtual Attribute Learning
Author: Wang Chengji, Su Jiawei, Luo Zhiming, Cao Donglin, Lin Yaojin, Li Shaozi
Affiliation:

    Abstract:

    Text-based person search aims to retrieve, from a person database, the image of the target person that matches a given text description, and it has attracted attention from both academia and industry. The task faces two challenges: fine-grained retrieval and the heterogeneous gap between images and texts. Some methods use supervised attribute learning to obtain attribute-related features and build fine-grained associations between texts and images. Attribute annotations, however, are hard to obtain, so these methods perform poorly in practice. The key problem is therefore how to extract attribute-related features without attribute annotations while establishing fine-grained, cross-modal semantic associations. To address this problem, this study incorporates pre-training technology and proposes a text-based person search method via virtual attribute learning, which builds fine-grained cross-modal semantic associations between images and texts through unsupervised attribute learning. Specifically, exploiting the invariance and cross-modal consistency of pedestrian attributes, a semantics-guided attribute decoupling method is proposed, which uses identity labels as the supervision signal to guide the model to decouple attribute-related features. Then, a feature learning module based on semantic reasoning is presented, which uses the relations between attributes to construct a semantic graph and exchanges information among attributes over this graph to enhance the cross-modal identification ability of the features. The proposed approach is compared with existing methods on the public text-based person search dataset CUHK-PEDES and the cross-modal retrieval dataset Flickr30k, and the experimental results verify its effectiveness.
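
    The idea of exchanging information among attributes over a semantic graph and then matching texts to images can be illustrated with a minimal sketch. This is not the model from the paper: the averaging update, the mixing weight `alpha`, the adjacency matrix, and the names `propagate`, `match_score`, and `rank_gallery` are all assumptions made for illustration, and the toy vectors stand in for learned attribute features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propagate(features, adjacency, alpha=0.25):
    """One round of information exchange among attribute features.

    features:  list of K attribute vectors (one per virtual attribute).
    adjacency: K x K non-negative edge weights of a semantic graph
               (hypothetical; the paper derives relations between attributes).
    alpha:     assumed mixing weight given to the neighbours' message.
    """
    updated = []
    for i, own in enumerate(features):
        total = sum(adjacency[i])
        if total == 0:
            # Isolated node: nothing to receive, keep the feature unchanged.
            updated.append(list(own))
            continue
        # Weighted mean of neighbour features, computed per dimension.
        message = [
            sum(adjacency[i][j] * features[j][d] for j in range(len(features))) / total
            for d in range(len(own))
        ]
        updated.append([(1 - alpha) * o + alpha * m for o, m in zip(own, message)])
    return updated


def match_score(text_attrs, image_attrs):
    """Mean cosine similarity over aligned attribute slots."""
    sims = [cosine(t, v) for t, v in zip(text_attrs, image_attrs)]
    return sum(sims) / len(sims)


def rank_gallery(text_attrs, gallery, adjacency):
    """Return gallery person ids sorted from best to worst match."""
    query = propagate(text_attrs, adjacency)
    scored = [
        (match_score(query, propagate(attrs, adjacency)), pid)
        for pid, attrs in gallery.items()
    ]
    return [pid for _, pid in sorted(scored, reverse=True)]


# Toy data: two attribute slots, two-dimensional features.
adjacency = [[0, 1], [1, 0]]
query = [[1.0, 0.0], [0.0, 1.0]]
gallery = {
    "person_a": [[1.0, 0.0], [0.0, 1.0]],
    "person_b": [[0.0, 1.0], [1.0, 0.0]],
}
ranking = rank_gallery(query, gallery, adjacency)
```

    In this toy setting `ranking` places `person_a` first because its attribute slots align with the query. A learned model would replace the fixed averaging step with trained parameters and the hand-written vectors with features extracted from images and texts.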

Citation

Wang CJ, Su JW, Luo ZM, Cao DL, Lin YJ, Li SZ. Text-based person search via virtual attribute learning. Ruan Jian Xue Bao/Journal of Software, 2023, 34(5): 2035-2050.

History
  • Received: April 12, 2022
  • Revised: May 29, 2022
  • Online: September 20, 2022
  • Published: May 6, 2023
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4