Empirical Study on Data Leakage Problem in Neural Program Repair

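About the authors:

Li Qingyuan (李卿源, born 2000), male, master's student, CCF student member; main research interests: software engineering and natural language processing. Zhong Wenkang (钟文康, born 1997), male, doctoral student; main research interests: software engineering, natural language processing, and automated program repair. Li Chuanyi (李传艺, born 1991), male, Ph.D., tenure-track assistant professor, doctoral supervisor, CCF professional member; main research interests: software engineering, business process management, and natural language processing. Ge Jidong (葛季栋, born 1978), male, Ph.D., associate professor, doctoral supervisor, CCF senior member; main research interests: natural language processing, intelligent software engineering, distributed computing, edge computing, service computing, and business process management. Luo Bin (骆斌, born 1967), male, Ph.D., professor, doctoral supervisor, CCF distinguished member; main research interests: distributed computing, edge computing, natural language processing, and intelligent software engineering.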

Corresponding author:

Ge Jidong, E-mail: gjd@nju.edu.cn

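Funding:

National Key Research and Development Program of China (2022YFF0711404); Sixth "333 Project" Leading Talent Team Program of Jiangsu Province; Natural Science Foundation of Jiangsu Province (BK20201250)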


    Abstract:

    Repairing software defects is an unavoidable and important problem in software engineering. Automated program repair (APR) techniques aim to fix defective programs automatically, accurately, and efficiently, thereby alleviating the problems that software defects cause. In recent years, with the rapid development of deep learning, an approach has emerged in the APR field that uses deep neural networks to automatically capture the relationship between defective programs and their patches; it is called neural program repair (NPR). Measured by the number of defects correctly repaired on benchmarks, NPR tools have significantly outperformed non-learning APR tools. However, a recent study found that the performance gains of NPR systems may stem from test data being present in the training data, i.e., data leakage. Inspired by this finding, and to further investigate the causes and effects of data leakage in NPR systems and to evaluate existing systems more fairly, this study: (1) systematically categorizes and summarizes existing NPR systems, defines data leakage for NPR systems based on this categorization, and designs a data leakage detection method for each category of system; (2) applies these detection methods to existing models on a large scale, and investigates how data leakage affects the gap between a model's real performance and its evaluated performance, as well as how it affects the model itself; (3) analyzes the data collection and filtering strategies of existing NPR datasets, improves and supplements them, constructs a clean large-scale program repair training dataset from existing popular datasets using the improved strategies, and verifies that this dataset effectively avoids data leakage.

    The experimental results show that all ten NPR systems surveyed exhibit data leakage on the benchmark test sets. Among them, RewardRepair has a relatively severe problem, with 24 leaked instances on the Defects4J (v1.2.0) benchmark, a leakage ratio of 53.33%. Data leakage also affects the robustness of NPR systems: all five NPR systems examined show reduced robustness due to data leakage. Data leakage is therefore a very common problem; it leads to unfair performance evaluation of NPR systems and affects their robustness on benchmark test sets. When training NPR models, researchers should avoid data leakage as far as possible and should account for its impact on performance evaluation, so that systems are evaluated as fairly as possible.
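The core idea of leakage detection, checking whether a benchmark bug also appears in the training data, can be illustrated with a minimal Python sketch. This is not the detection method of the paper, which the abstract says is designed separately for each category of NPR system without giving details. The sketch assumes that each sample is a (buggy code, fixed code) pair of Java snippets and that a leak means an exact match after comments and whitespace are normalized. The names `normalize`, `find_leaks`, and `leakage_ratio` are hypothetical. The abstract also does not state what the 24 leaked instances are divided by; the denominator 45 is used below only because 24/45 reproduces the reported 53.33%.

```python
import re


def normalize(code: str) -> str:
    """Crude normalization: drop Java-style comments and collapse whitespace.

    Assumption for illustration only. It does not tokenize, so "a+b" and
    "a + b" still differ, and "//" inside a string literal is treated as
    the start of a comment.
    """
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)                    # line comments
    return " ".join(code.split())


def find_leaks(train_pairs, test_pairs):
    """Return indices of test (buggy, fixed) pairs that also occur in training."""
    train_keys = {(normalize(bug), normalize(fix)) for bug, fix in train_pairs}
    return [
        i
        for i, (bug, fix) in enumerate(test_pairs)
        if (normalize(bug), normalize(fix)) in train_keys
    ]


def leakage_ratio(n_leaked: int, n_total: int) -> float:
    """Fraction of evaluated instances that are leaked."""
    return n_leaked / n_total


# Toy data, invented for this example.
train = [
    ("if (x > 0) { return x; }", "if (x >= 0) { return x; }"),
    ("int i = 0;", "int i = 1;"),
]
test = [
    ("if (x > 0) {\n    return x; // bug\n}", "if (x >= 0) {\n    return x;\n}"),
    ("return a - b;", "return a + b;"),
]

leaked = find_leaks(train, test)
print(leaked)                                 # indices of leaked test samples
print(f"{leakage_ratio(24, 45):.2%}")         # assumed denominator of 45
```

Exact matching after normalization only catches verbatim duplicates. Near-duplicates, such as the same fix with renamed variables, would need a token-level or similarity-based comparison.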

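Cite this article

Li Qingyuan, Zhong Wenkang, Li Chuanyi, Ge Jidong, Luo Bin. Empirical Study on Data Leakage Problem in Neural Program Repair. Journal of Software (软件学报), 2024, 35(7): 3071-3092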

History
  • Received: 2023-09-11
  • Last revised: 2023-10-30
  • Published online: 2024-01-05
  • Publication date: 2024-07-06