Abstract: Text-based person search aims to retrieve, from a person database, the images of a target person that match a given text description, and it has attracted attention from both academia and industry. The task faces two challenges: retrieval must be fine-grained, and there is a heterogeneous gap between images and texts. Some methods use supervised attribute learning to obtain attribute-related features and build fine-grained associations between texts and images. Attribute annotations, however, are hard to obtain, which leads to poor performance of these methods in practice. Extracting attribute-related features without attribute annotations, and establishing fine-grained cross-modal semantic associations from them, therefore becomes a key problem. To address this issue, this study draws on pre-training techniques and proposes a text-based person search method based on virtual attribute learning, which builds fine-grained cross-modal semantic associations between images and texts through unsupervised attribute learning. Specifically, in view of the invariance and cross-modal consistency of pedestrian attributes, a semantics-guided attribute decoupling method is proposed, which uses identity labels as the supervision signal to guide the model to decouple attribute-related features. Then, a feature learning module based on semantic reasoning is presented, which uses the relations between attributes to construct a semantic graph and exchanges information among attributes over this graph to enhance the cross-modal discriminative ability of the features. The proposed approach is compared with existing methods on the public text-based person search dataset CUHK-PEDES and the cross-modal retrieval dataset Flickr30k, and the experimental results verify its effectiveness.
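
To make the idea of exchanging information among attributes over a semantic graph more concrete, the following is a minimal sketch, not the method of this paper. The abstract does not specify how the graph is built or how features are updated, so the choices here are assumptions: edges are weighted by clipped cosine similarity between attribute features, and each feature is updated with a residual, row-normalized average of its neighbours. The names `cosine_affinity` and `propagate` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def cosine_affinity(features: Matrix) -> Matrix:
    """Build a dense graph over attribute features.

    Assumption: the edge weight between two attributes is their cosine
    similarity, clipped at zero. The paper may define relations differently.
    """
    def norm(v: List[float]) -> float:
        # Fall back to 1.0 for an all-zero vector to avoid division by zero.
        return sum(x * x for x in v) ** 0.5 or 1.0

    n = len(features)
    graph = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(features[i], features[j]))
            graph[i][j] = max(0.0, dot / (norm(features[i]) * norm(features[j])))
    return graph


def propagate(features: Matrix, graph: Matrix) -> Matrix:
    """Run one round of message passing over the attribute graph.

    Assumption: each attribute receives the weighted mean of all attribute
    features (weights row-normalized) and adds it to its own feature.
    """
    updated = []
    for i, row in enumerate(graph):
        total = sum(row) or 1.0
        dim = len(features[i])
        message = [
            sum(row[j] / total * features[j][d] for j in range(len(features)))
            for d in range(dim)
        ]
        updated.append([f + m for f, m in zip(features[i], message)])
    return updated


if __name__ == "__main__":
    # Three toy "virtual attribute" features of dimension 2.
    attrs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(propagate(attrs, cosine_affinity(attrs)))
```

In this sketch, attributes whose features are orthogonal do not influence each other, while similar attributes reinforce one another. A learned model would typically replace both the fixed affinity and the fixed update with trainable components.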