Robustness Evaluation of ChatGPT Against Chinese Adversarial Attacks

Authors: ZHANG Yunting, YE Lin, LI Baisong, ZHANG Hongli

CLC number: TP18

Funding: Heilongjiang Provincial Key R&D Program (2023ZX01A19)

    Abstract:

    Large language models (LLMs) such as ChatGPT have found widespread application across various fields due to their strong natural language understanding and generation capabilities. However, deep learning models exhibit vulnerability when subjected to adversarial example attacks. In natural language processing, current research on adversarial example generation typically employs CNN-based models, RNN-based models, and Transformer-based pre-trained models as target models; few studies explore the robustness of LLMs under adversarial attacks or quantify evaluation criteria for LLM robustness. Taking ChatGPT under Chinese adversarial attacks as an example, this study introduces a novel concept termed offset average difference (OAD) and proposes a quantifiable LLM robustness evaluation metric based on OAD, named the OAD-based robustness score (ORS). In a black-box attack scenario, this study selects nine mainstream Chinese adversarial attack methods based on word importance to generate adversarial texts, which are then used to attack ChatGPT and yield the attack success rate of each method. The proposed ORS assigns LLMs a robustness score for each attack method based on its attack success rate. In addition to ChatGPT, which outputs hard labels, this study also designs an ORS for target models with soft-label outputs, based on the attack success rate and the proportion of adversarial texts misclassified with high confidence. The study further extends the scoring formula to the fluency assessment of adversarial texts, proposing an OAD-based adversarial text fluency scoring method, named the OAD-based fluency score (OFS). Compared with traditional methods requiring human involvement, the proposed OFS greatly reduces evaluation costs. Experiments on real-world Chinese news classification and sentiment classification datasets provide preliminary evidence that, for text classification tasks, the robustness score of ChatGPT under adversarial attacks is nearly 20% higher than that of Chinese BERT. However, ChatGPT still produces erroneous predictions under adversarial attacks, with the highest attack success rate exceeding 40%.
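    The abstract defines robustness in terms of per-method attack success rates. As a minimal illustrative sketch, the following shows how an attack success rate (ASR) is typically computed over originally correct examples, with a simple linear robustness score as a stand-in; the paper's actual OAD-based ORS formula is not reproduced in the abstract, so this scoring function is a hypothetical placeholder, not the authors' method.

```python
# Illustrative sketch only: the robustness_score below is a toy linear
# stand-in, NOT the paper's OAD-based ORS formula.

def attack_success_rate(orig_preds, adv_preds, labels):
    """Fraction of originally correctly classified examples whose
    prediction flips after adversarial perturbation."""
    attacked = [(o, a) for o, a, y in zip(orig_preds, adv_preds, labels)
                if o == y]          # attacks target only correct predictions
    if not attacked:
        return 0.0
    flipped = sum(1 for o, a in attacked if a != o)
    return flipped / len(attacked)

def robustness_score(asr):
    """Toy robustness score: higher means more robust."""
    return 1.0 - asr

labels     = [0, 1, 1, 0, 1]
orig_preds = [0, 1, 1, 0, 0]   # last example already wrong -> excluded
adv_preds  = [0, 0, 1, 1, 0]
asr = attack_success_rate(orig_preds, adv_preds, labels)  # 2 of 4 flip -> 0.5
```

In this sketch a hard-label target model (like ChatGPT queried for a class name) only exposes predictions, so ASR is the sole input; the abstract's soft-label variant would additionally weight adversarial texts misclassified with high confidence.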

Cite this article:

ZHANG Yunting, YE Lin, LI Baisong, ZHANG Hongli. Robustness Evaluation of ChatGPT Against Chinese Adversarial Attacks. Journal of Software, , (): 1-25

History
  • Received: 2024-03-29
  • Revised: 2024-06-18
  • Published online: 2025-02-26