Dual-view Fusion for Fine-grained Image Recognition with Vision Transformer

Authors: TANG Hao, LI Zechao, JIANG Xin, TANG Jinhui
CLC number: TP391
Fund projects: National Natural Science Foundation of China (62425603, U21B2043); Basic Research Program (Pandeng Project) of Jiangsu Province (BK20240011)


    Abstract:

With the continuous advancement of computer vision technology, fine-grained image recognition plays a crucial role across various application domains. Unlike traditional coarse-grained image recognition, fine-grained image recognition aims to precisely distinguish subcategories with subtle visual differences within the same major category, which makes the task particularly challenging. In recent years, the vision Transformer has gained widespread adoption in image recognition owing to its exceptional performance in modeling global contextual information. However, it exhibits certain limitations when applied to fine-grained image recognition, particularly in processing detailed features and mitigating background noise. To address these issues, this study proposes a dual-view fusion recognition framework based on the vision Transformer, which effectively integrates the global and local views of a fine-grained image to enhance recognition accuracy. Within this framework, an attention-fusion-based redundant information filtering module screens patch features by fusing hierarchical attention weights inside the encoder, thereby optimizing the classification token embedding of the global view. In addition, an attention-threshold-based key region localization module dynamically selects and magnifies key regions of the global view through an adaptive threshold strategy, forming a detailed local view for further analysis. Furthermore, a local region feature adaptive enhancement module strengthens the focus on local details, effectively improving the discriminability of fine-grained features. To optimize this dual-view fusion framework, a contrastive loss function based on dual-view similarity and an adaptive inference strategy based on dual-view confidence are proposed, which enhance the discriminability of the global and local features output by the vision Transformer while conserving computational resources and shortening inference time. Experimental results on four public datasets, CUB-200-2011, Stanford Dogs, NABirds, and iNaturalist2017, demonstrate that the proposed method achieves significant improvements in recognition accuracy over the traditional vision Transformer and validate its effectiveness and superiority in fine-grained image recognition tasks.
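The abstract leaves the redundant information filtering module unspecified; the following is a minimal PyTorch sketch of one plausible reading, in which head-averaged attention maps from all encoder layers are fused by matrix product (in the style of attention rollout) and only the patch tokens most attended by the classification token are kept to refine the final classification embedding. The function name filter_tokens_by_fused_attention and the parameter keep are hypothetical, not from the paper.

import torch

def filter_tokens_by_fused_attention(tokens, attn_per_layer, keep=24):
    # tokens:         (B, 1+N, D) token sequence entering the last encoder layer
    # attn_per_layer: list of (B, 1+N, 1+N) head-averaged attention matrices
    fused = attn_per_layer[0]
    for attn in attn_per_layer[1:]:
        fused = attn @ fused                  # fuse attention weights across layers
    cls_to_patch = fused[:, 0, 1:]            # (B, N): CLS attention over patches
    idx = cls_to_patch.topk(keep, dim=-1).indices
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    picked = torch.gather(tokens[:, 1:], 1, idx)       # keep only salient patches
    return torch.cat([tokens[:, :1], picked], 1)       # [CLS] + selected patches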

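The key region localization step can be pictured in the same spirit. This sketch assumes the CLS token's attention over patches is available and uses mean plus one standard deviation as the adaptive threshold; the paper's actual thresholding rule is not given in the abstract, and localize_and_zoom is a hypothetical name.

import torch
import torch.nn.functional as F

def localize_and_zoom(image, cls_attn, patch_size=16, img_size=448):
    # image: (3, H, W) input tensor; cls_attn: (N,) CLS attention over the N patches
    n_side = img_size // patch_size
    attn = cls_attn.reshape(n_side, n_side)
    thresh = attn.mean() + attn.std()          # adaptive threshold (assumed form)
    ys, xs = torch.nonzero(attn > thresh, as_tuple=True)
    if ys.numel() == 0:                        # nothing selected: keep the global view
        return image.clone()
    y0, y1 = int(ys.min()) * patch_size, (int(ys.max()) + 1) * patch_size
    x0, x1 = int(xs.min()) * patch_size, (int(xs.max()) + 1) * patch_size
    crop = image[:, y0:y1, x0:x1]              # bounding box of the key patches
    local = F.interpolate(crop.unsqueeze(0), size=(img_size, img_size),
                          mode='bilinear', align_corners=False)
    return local.squeeze(0)                    # magnified local view for the second pass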
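For the contrastive loss based on dual-view similarity, one standard instantiation is a symmetric InfoNCE objective that treats the global-view and local-view embeddings of the same image as a positive pair and all other pairings in the batch as negatives; the abstract states only that the loss is similarity-based, so this concrete form is an assumption.

import torch
import torch.nn.functional as F

def dual_view_contrastive_loss(global_feat, local_feat, temperature=0.1):
    # global_feat, local_feat: (B, D) CLS embeddings of the two views
    g = F.normalize(global_feat, dim=-1)
    v = F.normalize(local_feat, dim=-1)
    logits = g @ v.t() / temperature           # (B, B) pairwise cosine similarities
    targets = torch.arange(g.size(0), device=g.device)
    # symmetric cross-entropy: each view's positive is its own counterpart
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))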
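Finally, the confidence-based adaptive inference strategy reads naturally as an early-exit scheme: if the global view alone yields a sufficiently confident prediction, the local-view pass is skipped, which is how the framework saves computation and shortens inference time. The exit threshold tau, the fusion by probability averaging, and the model interface are all assumptions; the sketch reuses the hypothetical localize_and_zoom from above.

import torch

@torch.no_grad()
def adaptive_inference(model, image, tau=0.9):
    # model(x) -> (logits, cls_attn); tau: assumed early-exit confidence threshold
    logits_g, cls_attn = model(image.unsqueeze(0))
    probs_g = logits_g.softmax(dim=-1)
    if probs_g.max() >= tau:                   # global view already confident: stop here
        return probs_g.argmax(dim=-1)
    local = localize_and_zoom(image, cls_attn.squeeze(0))   # zoomed second pass
    logits_l, _ = model(local.unsqueeze(0))
    probs = 0.5 * (probs_g + logits_l.softmax(dim=-1))      # fuse both views
    return probs.argmax(dim=-1)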
Cite this article:

TANG Hao, LI Zechao, JIANG Xin, TANG Jinhui. Dual-view fusion for fine-grained image recognition with vision Transformer. Journal of Software: 1-23 (in Chinese).

History:
  • Received: 2024-10-15
  • Revised: 2025-03-11
  • Published online: 2025-10-29