Image Classification Method Based on Cross-modal Privileged Information Enhancement
Author: Li XX, Zheng YZ, Ma HK, Qi Z, Yan XS, Meng XX, Meng L
Affiliation:

CLC Number: TP391

Abstract:

The performance of image classification algorithms is limited by the diversity of visual information and by background noise. Existing work typically applies cross-modal constraints or heterogeneous feature alignment to learn highly discriminative visual representations. However, the differences in feature distribution caused by modal heterogeneity hinder effective visual representation learning. To address this problem, this study proposes CMIF, an image classification framework based on cross-modal semantic information inference and fusion, which introduces the semantic descriptions of images and statistical knowledge as privileged information. Following the learning-using-privileged-information paradigm, the privileged information guides the mapping of image features from the visual space to the semantic space during training, and a class-aware information selection (CIS) algorithm is proposed to learn cross-modal enhanced representations of images. To handle the heterogeneous feature differences in representation learning, a partial heterogeneous alignment (PHA) algorithm aligns the visual features with the semantic features extracted from the privileged information. To further suppress the interference caused by visual noise in the semantic space, the graph-fusion-based CIS algorithm reconstructs the key information in the semantic representations, thereby effectively supplementing the visual predictions. Experiments on the cross-modal classification datasets VireoFood-172 and NUS-WIDE show that CMIF learns robust semantic features of images and, as a general framework, achieves consistent performance improvements on both the convolution-based ResNet-50 and the Transformer-based ViT classification models.
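No implementation accompanies this page, so the following is a minimal PyTorch sketch of the training-time idea described above. Everything in it is an assumption made for illustration, not the authors' code: the names VisualToSemantic and pha_loss are hypothetical, CORAL-style mean/covariance matching is a stand-in for the unspecified alignment criterion, and "partial" is read here as aligning only the least-discrepant feature dimensions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualToSemantic(nn.Module):
    # Hypothetical projection of visual backbone features into the
    # semantic (privileged) space.
    def __init__(self, visual_dim=2048, semantic_dim=300):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, semantic_dim),
            nn.ReLU(),
            nn.Linear(semantic_dim, semantic_dim),
        )

    def forward(self, v):
        return self.proj(v)

def pha_loss(f_vis, f_sem, keep_ratio=0.5):
    # Partial heterogeneous alignment, sketched as CORAL-style correlation
    # alignment restricted to the feature dimensions whose means differ
    # least across modalities (our guess at what "partial" selects).
    d = f_vis.size(1)
    k = max(1, int(d * keep_ratio))
    gap = (f_vis.mean(0) - f_sem.mean(0)).abs()
    idx = gap.topk(k, largest=False).indices   # k best-matched dimensions
    v, s = f_vis[:, idx], f_sem[:, idx]
    mean_term = F.mse_loss(v.mean(0), s.mean(0))          # first-order stats
    vc, sc = v - v.mean(0), s - s.mean(0)
    cov_v = vc.T @ vc / (v.size(0) - 1)                   # second-order stats
    cov_s = sc.T @ sc / (s.size(0) - 1)
    cov_term = (cov_v - cov_s).pow(2).sum() / (4 * k * k)
    return mean_term + cov_term

During training, f_vis would be the projected backbone features (VisualToSemantic applied to ResNet-50 or ViT outputs) and f_sem the encoded privileged text, e.g., ingredient descriptions in VireoFood-172; pha_loss is added to the classification loss, and the privileged branch is discarded at test time, as the privileged-information paradigm requires.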

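The graph-fusion CIS step is likewise described here only at the level of the abstract. The sketch below gives one plausible reading, again with hypothetical names and structure: a single graph-convolution step over a class-relation graph yields class-aware semantic prototypes, and a learned gate decides how strongly the resulting semantic logits may supplement the visual logits.

import torch
import torch.nn as nn

class GraphFusionCIS(nn.Module):
    # Hypothetical graph-fusion-based class-aware information selection.
    # The adjacency matrix, gating, and additive fusion are illustrative
    # assumptions, not the paper's exact formulation.
    def __init__(self, semantic_dim=300, num_classes=172, adjacency=None):
        super().__init__()
        # Row-normalized class-relation graph; identity as a placeholder,
        # label co-occurrence statistics in a realistic setting.
        A = adjacency if adjacency is not None else torch.eye(num_classes)
        self.register_buffer("A_hat", A / A.sum(1, keepdim=True))
        self.cls_embed = nn.Parameter(torch.randn(num_classes, semantic_dim))
        self.gcn = nn.Linear(semantic_dim, semantic_dim)
        self.gate = nn.Linear(semantic_dim, 1)

    def forward(self, f_sem, visual_logits):
        # One GCN step propagates information along class relations,
        # producing class-aware semantic prototypes.
        proto = torch.relu(self.gcn(self.A_hat @ self.cls_embed))  # (C, D)
        sem_logits = f_sem @ proto.T                               # (B, C)
        # The gate suppresses semantic predictions contaminated by
        # visual noise, letting the model fall back on visual logits.
        g = torch.sigmoid(self.gate(f_sem))                        # (B, 1)
        return visual_logits + g * sem_logits

Fusing as visual_logits + g * sem_logits keeps the visual branch primary and treats the reconstructed semantic information strictly as a supplement, which matches the abstract's framing of the semantic branch as a complement to visual prediction.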
Get Citation

Li XX, Zheng YZ, Ma HK, Qi Z, Yan XS, Meng XX, Meng L. Image classification method based on cross-modal privileged information enhancement. Ruan Jian Xue Bao/Journal of Software, 2024, 35(12): 5636–5652 (in Chinese with English abstract).

History
  • Received: December 06, 2022
  • Revised: March 21, 2023
  • Online: January 31, 2024
  • Published: December 06, 2024