Abstract: Due to the exponential growth of multimodal data, traditional databases face challenges in both storage and retrieval. Multimodal hashing can effectively reduce storage cost and improve retrieval efficiency by fusing multimodal features and mapping them into binary hash codes. Although many multimodal hashing methods perform well, three important problems remain to be solved: (1) Existing methods tend to assume that all samples are modality-complete, whereas in practical retrieval scenarios it is common for samples to have missing modalities; (2) Most methods are based on shallow learning models, which inevitably limits the models' learning capacity and affects the final retrieval performance; (3) Some methods based on deep learning frameworks have been proposed to address the issue of weak learning capacity, but after extracting features from different modalities they directly apply coarse-grained fusion methods, such as concatenation, which fail to capture deep semantic information, thereby weakening the representation ability of the hash codes and degrading the final retrieval performance. To address these problems, the PMH-F3 model is proposed. This model implements partial multimodal hashing for the case where samples have missing modalities. The model is built on a deep network architecture, and a Transformer encoder is used to capture deep semantics via self-attention, achieving fine-grained multimodal feature fusion. Extensive experiments are conducted on the MIR Flickr and MS COCO datasets, and the best retrieval performance is achieved. The experimental results show that the PMH-F3 model can effectively implement partial multimodal hashing and can be applied to large-scale multimodal data retrieval.
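
To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the authors' implementation: pre-extracted modality features are projected into a shared token space, fused with a Transformer encoder via self-attention, and mapped to relaxed binary codes by a tanh hashing head. All module names, dimensions, and the use of a learnable placeholder token for a missing modality are illustrative assumptions.

```python
# Illustrative sketch only: Transformer-based fine-grained fusion plus a hashing head.
# Dimensions, the placeholder-token strategy for missing modalities, and all names are assumptions.
import torch
import torch.nn as nn


class PartialMultimodalHasher(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=1386, d_model=512, code_len=64, n_layers=2):
        super().__init__()
        # Project each modality's pre-extracted features into a shared token space.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        # Learnable placeholder token substituted when a sample misses a modality (assumption).
        self.missing_token = nn.Parameter(torch.randn(1, 1, d_model))
        # Transformer encoder performs fine-grained fusion via self-attention across modality tokens.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Hashing head: tanh yields relaxed codes in (-1, 1); sign() binarizes at inference time.
        self.hash_head = nn.Sequential(nn.Linear(d_model, code_len), nn.Tanh())

    def forward(self, img_feat=None, txt_feat=None):
        batch = (img_feat if img_feat is not None else txt_feat).size(0)
        img_tok = (self.img_proj(img_feat).unsqueeze(1) if img_feat is not None
                   else self.missing_token.expand(batch, 1, -1))
        txt_tok = (self.txt_proj(txt_feat).unsqueeze(1) if txt_feat is not None
                   else self.missing_token.expand(batch, 1, -1))
        tokens = torch.cat([img_tok, txt_tok], dim=1)   # (batch, 2, d_model)
        fused = self.fusion(tokens).mean(dim=1)         # pool the fused modality tokens
        return self.hash_head(fused)                    # relaxed hash codes


# Usage: a text-only sample still yields a hash code via the placeholder token.
model = PartialMultimodalHasher()
relaxed = model(img_feat=None, txt_feat=torch.randn(4, 1386))
binary_codes = torch.sign(relaxed)                      # {-1, +1} binary hash codes
```

The placeholder token is only one plausible way to handle modality-incomplete samples; it is included here to show how the same fusion and hashing pipeline can operate when a modality is absent.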