国家自然科学基金(61976042, 61972068); 兴辽英才计划(XLYC2007023); 辽宁省高等学校创新人才支持计划(LR2019020)
近年来, RGB-D显著性检测方法凭借深度图中丰富的几何结构和空间位置信息, 取得了比RGB显著性检测模型更好的性能, 受到学术界高度关注. 然而, 现有的RGB-D检测模型仍面临着持续提升检测性能的需求. 最近兴起的Transformer擅长建模全局信息, 而卷积神经网络(CNN)擅长于提取局部细节. 因此, 如何有效结合CNN和Transformer两者的优势, 挖掘全局和局部信息, 将有助于提升显著性目标检测的精度. 为此, 提出一种基于跨模态交互融合与全局感知的RGB-D显著性目标检测方法, 通过将Transformer网络嵌入U-Net中, 从而将全局注意力机制与局部卷积结合在一起, 能够更好地对特征进行提取. 首先借助U-Net编码-解码结构, 高效地提取多层次互补特征并逐级解码生成显著特征图. 然后, 使用Transformer模块学习高级特征间的全局依赖关系增强特征表示, 并针对输入采用渐进上采样融合策略以减少噪声信息的引入. 其次, 为了减轻低质量深度图带来的负面影响, 设计一个跨模态交互融合模块实现跨模态特征融合. 最后, 5个基准数据集上的实验结果表明, 所提算法与其他最新的算法相比具有显著优势.
In recent years, RGB-D salient detection method has achieved better performance than RGB salient detection model by virtue of its rich geometric structure and spatial position information in depth maps and thus has been highly concerned by the academic community. However, the existing RGB-D detection model still faces the challenge of improving performance continuously. The emerging Transformer is good at modeling global information, while the convolutional neural network (CNN) is good at extracting local details. Therefore, effectively combining the advantages of CNN and Transformer to mine global and local information will help to improve the accuracy of salient object detection. For this purpose, an RGB-D salient object detection method based on cross-modal interactive fusion and global awareness is proposed in this study. The transformer network is embedded into U-Net to better extract features by combining the global attention mechanism with local convolution. First, with the help of the U-Net encoder-decoder structure, this study efficiently extracts multi-level complementary features and decodes them step by step to generate a salient feature map. Then, the Transformer module is used to learn the global dependency between high-level features to enhance the feature representation, and the progressive upsampling fusion strategy is used to process the input and reduce the introduction of noise information. Moreover, to reduce the negative impact of low-quality depth maps, the study also designs a cross-modal interactive fusion module to realize cross-modal feature fusion. Finally, experimental results on five benchmark datasets show that the proposed algorithm has an excellent performance than other latest algorithms.