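Pre-training-driven Multimodal Boundary-aware Vision Transformer

About the authors:

Shi Zenan (石泽男, b. 1993), female, Ph.D., CCF student member; research interests: computer vision, medical image segmentation, and multimedia forensics. Chen Haipeng (陈海鹏, b. 1978), male, Ph.D., professor and doctoral supervisor, CCF professional member; research interests: machine learning and visual reasoning. Zhang Dong (张冬, b. 1989), male, Ph.D.; research interests: object detection, semantic segmentation, video object segmentation, and cross-scene segmentation. Shen Xuanjing (申铉京, b. 1958), male, Ph.D., professor and doctoral supervisor; research interests: medical image segmentation, multimedia forensics, optoelectronic and hybrid systems, intelligent measurement systems, and video understanding.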

Corresponding author:

Chen Haipeng, chenhp@jlu.edu.cn

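Funding:

National Key Research and Development Program of China (2018YFB0804202, 2018YFB0804203); National Natural Science Foundation of China (U19A2057, 61876070); Jilin University 2021 "Interdisciplinary Integration and Innovation" Young Scholars Free Exploration Project (JLUXKJC2021QZ01)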


    Abstract:

    Convolutional neural networks (CNNs) have steadily improved performance on image forgery detection. In realistic scenarios where the tampering method is unknown, however, existing methods still fail to capture the long-range dependencies of the input image well enough to alleviate recognition bias, which degrades detection accuracy. In addition, because annotation is difficult, image forgery detection usually lacks accurate pixel-level labels. To address these problems, this study proposes a pre-training-driven multimodal boundary-aware vision Transformer. First, to capture subtle forgery traces that are invisible in the RGB domain, the method introduces the frequency-domain modality of the image and combines it with the RGB spatial domain as a multimodal embedding. Second, the encoder of the backbone network is pre-trained on ImageNet to mitigate the shortage of training samples. Third, a Transformer module is integrated at the tail of this encoder to capture both low-level spatial details and global context, which improves the overall representation ability of the model. Finally, to alleviate the localization difficulty caused by the blurred boundaries of forged regions, this study builds a boundary-aware module. It uses the noise distribution obtained by a Scharr convolutional layer to attend to noise information rather than semantic content, and uses a boundary residual block to sharpen boundary information, which improves the boundary segmentation performance of the model. Extensive experiments show that the proposed method outperforms existing image forgery detection methods in recognition accuracy, generalizes well, and remains robust across different forgery methods.
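
The abstract says the boundary-aware module relies on a Scharr convolutional layer. As a rough illustration of what a Scharr operator computes, the sketch below applies the standard 3x3 Scharr kernels to a tiny grayscale image in pure Python and returns the gradient magnitude. The function names (`correlate3x3`, `scharr_magnitude`) and the toy images are assumptions introduced here for illustration. The abstract does not describe how the authors embed this layer in the network or how the noise distribution is derived from it, so this should not be read as their implementation.

```python
# Illustrative sketch only: a fixed-kernel Scharr gradient on a 2D list.
# Not the authors' layer; names and toy data are hypothetical.

# Standard 3x3 Scharr kernels for horizontal and vertical gradients.
SCHARR_X = [[-3, 0, 3],
            [-10, 0, 10],
            [-3, 0, 3]]
SCHARR_Y = [[-3, -10, -3],
            [0, 0, 0],
            [3, 10, 3]]


def correlate3x3(image, kernel):
    """Valid-mode 3x3 cross-correlation (no padding) of a 2D list."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    acc += kernel[i][j] * image[r + i][c + j]
            out_row.append(acc)
        out.append(out_row)
    return out


def scharr_magnitude(image):
    """Per-pixel gradient magnitude sqrt(gx^2 + gy^2) from the Scharr kernels."""
    gx = correlate3x3(image, SCHARR_X)
    gy = correlate3x3(image, SCHARR_Y)
    return [[(a * a + b * b) ** 0.5 for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]


# Toy 4x4 image with a vertical step edge (left half 0, right half 1).
step_image = [[0, 0, 1, 1] for _ in range(4)]
edges = scharr_magnitude(step_image)

# Constant image: no intensity change, so the response is zero everywhere.
flat_image = [[5] * 4 for _ in range(4)]
flat = scharr_magnitude(flat_image)
```

On the step image every output position straddles the edge, so the response is uniformly strong, while the constant image yields zero. A response of this kind depends on local intensity changes rather than on what the image depicts, which is in line with the abstract's stated aim of attending to noise information rather than semantic content.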

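Cite this article

Shi Zenan, Chen Haipeng, Zhang Dong, Shen Xuanjing. Pre-training-driven Multimodal Boundary-aware Vision Transformer. Journal of Software (软件学报), 2023, 34(5): 2051-2067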

History
  • Received: 2022-04-15
  • Last revised: 2022-05-29
  • Published online: 2022-09-20
  • Publication date: 2023-05-06