Abstract: Zero-shot sketch-based image retrieval uses sketches of unseen classes as queries to retrieve images of those classes. The task therefore faces two challenges: the modal gap between sketches and images and the semantic inconsistency between seen and unseen classes. Previous approaches tried to eliminate the modal gap by projecting sketches and images into a common space and to bridge the semantic inconsistency between seen and unseen classes with semantic embeddings (e.g., word vectors and word similarity). This study proposes a cross-modal self-distillation approach that learns generalizable features from the perspective of knowledge distillation without involving semantic embeddings in training. Specifically, the knowledge of a pre-trained image recognition network is first transferred to the student network through traditional knowledge distillation. Then, according to the cross-modal correlation between sketches and images, cross-modal self-distillation indirectly transfers this knowledge to the sketch modality to enhance the discriminability and generalizability of sketch features. To further promote the integration and propagation of knowledge within the sketch modality, this study also proposes sketch self-distillation. By learning discriminative and generalizable features from the data, the student network eliminates the modal gap and the semantic inconsistency. Extensive experiments on three benchmark datasets, namely Sketchy, TU-Berlin, and QuickDraw, demonstrate the superiority of the proposed cross-modal self-distillation approach over state-of-the-art methods.
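As a concrete illustration of the distillation pipeline described above, the following minimal PyTorch-style sketch applies a temperature-scaled knowledge-distillation loss twice: once from a frozen pre-trained image teacher to the student's image branch, and once from the student's image branch to its sketch branch as a stand-in for the cross-modal transfer. The function names, temperature value, and random tensors are illustrative assumptions; the actual network architectures, sketch-image pairing strategy, and sketch self-distillation term used in the paper are not reproduced here.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    # Standard knowledge-distillation loss: soften both distributions with
    # temperature T and match them via KL divergence (scaled by T^2).
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# Illustrative usage with random tensors standing in for network outputs
# on paired sketch-image batches (hypothetical shapes).
batch, num_classes = 8, 100
teacher_img_logits = torch.randn(batch, num_classes)                    # frozen image teacher
student_img_logits = torch.randn(batch, num_classes, requires_grad=True)  # student, image branch
student_skt_logits = torch.randn(batch, num_classes, requires_grad=True)  # student, sketch branch

loss_img = kd_loss(student_img_logits, teacher_img_logits)                # image-side distillation
loss_cross = kd_loss(student_skt_logits, student_img_logits.detach())     # cross-modal transfer to sketches
loss = loss_img + loss_cross
loss.backward()
```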