Abstract: The performance of image classification algorithms is limited by the diversity of visual information and by background noise. Existing works typically apply cross-modal constraints or heterogeneous feature alignment to learn highly discriminative visual representations. However, differences in feature distribution caused by modal heterogeneity limit the effective learning of such representations. To address this problem, this study proposes CMIF, an image classification framework based on cross-modal semantic information inference and fusion, which introduces semantic descriptions of images and statistical knowledge as privileged information. During training, the privileged-information learning paradigm guides the mapping of image features from the visual space to the semantic space, and a class-aware information selection (CIS) algorithm is proposed to learn cross-modal enhanced representations of images. To handle heterogeneous feature differences in representation learning, a partial heterogeneous alignment (PHA) algorithm aligns visual features with the semantic features extracted from the privileged information. To further suppress the interference caused by visual noise in the semantic space, the graph-fusion-based CIS algorithm reconstructs the key information in the semantic representation, forming an effective complement to the visual predictions. Experiments on the cross-modal classification datasets VireoFood-172 and NUS-WIDE show that CMIF learns robust semantic features of images and, as a general framework, achieves consistent performance gains on both the convolution-based ResNet-50 and the Transformer-based ViT classification models.