Robustness Evaluation of ChatGPT Against Chinese Adversarial Attacks
Author: Zhang Yunting, Ye Lin, Li Baisong, Zhang Hongli

CLC Number: TP18

Abstract:

    Large language models (LLMs) such as ChatGPT have found widespread application across various fields owing to their strong natural language understanding and generation capabilities. However, deep learning models are vulnerable to adversarial example attacks. In natural language processing, research on adversarial example generation has typically targeted CNN-based models, RNN-based models, and Transformer-based pre-trained models, while few studies have explored the robustness of LLMs under adversarial attacks or quantified evaluation criteria for LLM robustness. Taking ChatGPT under Chinese adversarial attacks as an example, this study introduces a novel concept termed the offset average difference (OAD) and proposes a quantifiable LLM robustness evaluation metric based on it, named the OAD-based robustness score (ORS). In a black-box attack scenario, nine mainstream Chinese adversarial attack methods based on word importance are selected to generate adversarial texts, which are then used to attack ChatGPT and obtain the attack success rate of each method. The proposed ORS assigns a robustness score to the LLM for each attack method based on its attack success rate. In addition to ChatGPT, which outputs hard labels, this study also designs an ORS for target models with soft-label outputs, based on the attack success rate and the proportion of misclassified adversarial texts with high confidence. Furthermore, the scoring formula is extended to the fluency assessment of adversarial texts, yielding an OAD-based fluency score (OFS). Compared with traditional methods requiring human involvement, the proposed OFS greatly reduces evaluation costs.
    Experiments on real-world Chinese news and sentiment classification datasets provide preliminary evidence that, for text classification tasks, the robustness score of ChatGPT against adversarial attacks is nearly 20% higher than that of Chinese BERT. Nevertheless, even the powerful ChatGPT still produces erroneous predictions under adversarial attacks, with the highest attack success rate exceeding 40%.
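The abstract does not give the OAD/ORS formulas themselves, but the attack success rate they build on is standard. A minimal sketch of how it is computed in a black-box, hard-label setting (illustrative only; `attack_success_rate` is a hypothetical helper, not the paper's code): an attack counts as successful when the target model's prediction on the perturbed text differs from its correct prediction on the original.

```python
def attack_success_rate(orig_preds, adv_preds, labels):
    """Fraction of originally correct predictions flipped by the attack.

    orig_preds: model predictions on the clean texts
    adv_preds:  model predictions on the corresponding adversarial texts
    labels:     ground-truth labels
    Only examples the model classified correctly before the attack are
    counted, since an already-wrong prediction cannot be "attacked".
    """
    attacked = [(o, a) for o, a, y in zip(orig_preds, adv_preds, labels)
                if o == y]
    if not attacked:
        return 0.0
    flipped = sum(1 for o, a in attacked if a != o)
    return flipped / len(attacked)

# Example: 3 of 4 predictions are correct before the attack,
# and the attack flips 2 of those 3 -> success rate 2/3.
rate = attack_success_rate([1, 0, 1, 1], [1, 1, 0, 1], [1, 0, 1, 0])
```

A lower attack success rate across the nine attack methods corresponds to a higher robustness score under the proposed ORS.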

Get Citation

Zhang Yunting, Ye Lin, Li Baisong, Zhang Hongli. Robustness Evaluation of ChatGPT Against Chinese Adversarial Attacks. Journal of Software,,():1-25

History
  • Received: March 29, 2024
  • Revised: June 18, 2024
  • Online: February 26, 2025
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
