RGB-D Salient Object Detection Based on Cross-modal Interactive Fusion and Global Awareness
Fund Project: National Natural Science Foundation of China (61976042, 61972068); Liaoning Revitalization Talents Program (XLYC2007023); Support Program for Innovative Talents in Universities of Liaoning Province (LR2019020)



Abstract:

In recent years, RGB-D salient object detection methods have achieved better performance than RGB-only models by exploiting the rich geometric structure and spatial position information in depth maps, and they have therefore received considerable attention from the academic community. However, existing RGB-D detection models still need continued improvement in detection performance. The recently emerged Transformer excels at modeling global information, whereas convolutional neural networks (CNNs) excel at extracting local details. Effectively combining the strengths of CNNs and Transformers to mine both global and local information can therefore improve the accuracy of salient object detection. To this end, this study proposes an RGB-D salient object detection method based on cross-modal interactive fusion and global awareness, which embeds a Transformer network into a U-Net so that the global attention mechanism is combined with local convolution for better feature extraction. First, the U-Net encoder-decoder structure is used to efficiently extract multi-level complementary features, which are decoded level by level into a saliency map. Then, a Transformer module learns the global dependencies among high-level features to enhance the feature representation, and a progressive upsampling fusion strategy is applied to its input to limit the noise introduced. Moreover, to mitigate the negative impact of low-quality depth maps, a cross-modal interactive fusion module is designed to fuse the features of the two modalities. Finally, experimental results on five benchmark datasets show that the proposed algorithm significantly outperforms other state-of-the-art algorithms.
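The abstract names a Transformer embedded in a U-Net and a progressive upsampling fusion strategy without specifying either, so the following PyTorch sketch is only an illustrative reading of those two ideas; every module name, dimension, and hyperparameter here is an assumption, not the authors' implementation.

```python
# Minimal sketch, assuming the Transformer sits as a bottleneck on the
# deepest U-Net encoder features. All sizes below are placeholders.
import torch
import torch.nn as nn

class TransformerBottleneck(nn.Module):
    """Global self-attention over the highest-level CNN feature map."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C): one token per pixel
        tokens = self.encoder(tokens)          # learn global dependencies
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class ProgressiveUpsample(nn.Module):
    """Upsample in repeated 2x steps with a conv after each step,
    instead of one large jump, to limit the noise each step introduces."""
    def __init__(self, dim=256, steps=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear",
                            align_corners=False),
                nn.Conv2d(dim, dim, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ) for _ in range(steps)
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

# Toy usage: refine the deepest encoder features, then hand them upward.
feats = torch.randn(2, 256, 12, 12)        # (B, C, H, W) dummy tensor
refined = TransformerBottleneck()(feats)   # same shape, globally mixed
upsampled = ProgressiveUpsample()(refined) # (2, 256, 48, 48)
```

The toy usage at the end shows the intended placement: the Transformer mixes the deepest encoder features globally, and the progressive module brings them back up in small steps rather than one large interpolation.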
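Likewise, the cross-modal interactive fusion module is only named in the abstract. One plausible reading, sketched below under the assumption that fusion should be able to down-weight unreliable depth, gates the depth features with a confidence map computed from both modalities; the gating design is a guess for illustration, not the paper's module.

```python
# Hypothetical cross-modal fusion: depth features are re-weighted by a
# per-pixel gate computed from both modalities, so unreliable depth
# responses are attenuated before fusion.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * dim, dim, kernel_size=1),
            nn.Sigmoid(),                        # confidence map in [0, 1]
        )
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1)

    def forward(self, rgb_feat, depth_feat):     # both (B, C, H, W)
        g = self.gate(torch.cat([rgb_feat, depth_feat], dim=1))
        gated_depth = g * depth_feat             # suppress low-quality depth cues
        return self.fuse(torch.cat([rgb_feat, gated_depth], dim=1))

# Toy usage: fuse one encoder level of the RGB and depth streams.
rgb = torch.randn(2, 256, 24, 24)
depth = torch.randn(2, 256, 24, 24)
fused = CrossModalFusion()(rgb, depth)           # (2, 256, 24, 24)
```

In a full model, one such module would sit at each encoder level where the RGB and depth streams meet.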

Cite this article:

孙福明, 胡锡航, 武景宇, 孙静, 王法胜. RGB-D Salient Object Detection Based on Cross-modal Interactive Fusion and Global Awareness. 软件学报 (Journal of Software), 2024, 35(4): 1899-1913

History
  • Received: 2022-06-29
  • Revised: 2022-09-01
  • Accepted:
  • Published online: 2023-06-14
  • Published: 2024-04-06