Abstract: Text-based person retrieval is an emerging downstream task of cross-modal retrieval that derives from conventional person re-identification and plays a vital role in public safety and person search. The task addresses the lack of query images in traditional person re-identification; its main challenge is that it combines two different modalities and requires the model to learn both image content and textual semantics. To narrow the semantic gap between pedestrian images and text descriptions, traditional methods usually split image features and text features mechanically and focus only on cross-modal alignment, which ignores the latent relations between a person image and its description and leads to inaccurate cross-modal alignment. To address these issues, this study proposes a novel relation alignment-based cross-modal person retrieval network. First, the attention mechanism is used to construct a self-attention matrix and a cross-modal attention matrix, where each attention matrix is regarded as the distribution of response values between different feature sequences. Then, two different matrix construction methods are used to reconstruct the intra-modal attention matrix and the cross-modal attention matrix, respectively. The element-by-element reconstruction of the intra-modal attention matrix mines the latent relations within each modality. Moreover, by taking cross-modal information as a bridge, the holistic reconstruction of the cross-modal attention matrix mines the latent information shared across modalities and narrows the semantic gap. Finally, the model is jointly trained with a cross-modal projection matching loss and a KL divergence loss, which helps the two modalities reinforce each other.
Quantitative and qualitative results on a public text-based person search dataset (CUHK-PEDES) demonstrate that the proposed method performs favorably against state-of-the-art text-based person search methods.
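As a rough illustration of treating attention matrices as distributions and comparing them with a KL divergence, the following minimal NumPy sketch may help. It is not the network described in this abstract: the feature sizes, the scaled dot-product form of attention, the function names (`attention_matrix`, `row_kl`), and especially the "bridged" matrix obtained by chaining image-to-text and text-to-image attention are all assumptions made for illustration, since the abstract does not specify the reconstruction methods.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_matrix(queries, keys):
    # Scaled dot-product attention weights (an assumed form); each row is a
    # distribution of response values of one query over all keys.
    d = queries.shape[-1]
    return softmax(queries @ keys.T / np.sqrt(d), axis=-1)

def row_kl(p, q, eps=1e-12):
    # Mean over rows of KL(p_row || q_row) for two row-stochastic matrices.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(0)
img = rng.normal(size=(6, 16))  # hypothetical: 6 image region features
txt = rng.normal(size=(4, 16))  # hypothetical: 4 word features

self_img = attention_matrix(img, img)  # intra-modal attention, shape (6, 6)
cross_it = attention_matrix(img, txt)  # image-to-text attention, shape (6, 4)
cross_ti = attention_matrix(txt, img)  # text-to-image attention, shape (4, 6)

# Assumption: use the text modality as a bridge by chaining the two
# cross-modal matrices; a product of row-stochastic matrices is itself
# row-stochastic, so it can be compared with self_img row by row.
bridged = cross_it @ cross_ti          # shape (6, 6)

kl = row_kl(self_img, bridged)
print(f"KL(self || bridged) = {kl:.4f}")
```

In a trainable model, a term of this kind would be added to the matching loss so that intra-modal and cross-modal relation distributions are pulled toward each other; how the terms are paired and weighted here is likewise an assumption.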