Empirical Study on Data Leakage Problem in Neural Program Repair

doi:10.13328/j.cnki.jos.007110

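Author:

LI Qing-Yuan (李卿源), ZHONG Wen-Kang (钟文康), LI Chuan-Yi (李传艺), GE Ji-Dong (葛季栋), LUO Bin (骆斌)

Affiliation:

National Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 210023, China (all authors)

Author biographies:

LI Qing-Yuan (born 2000), male, master's student, CCF student member; research interests: software engineering, natural language processing.
ZHONG Wen-Kang (born 1997), male, Ph.D. candidate; research interests: software engineering, natural language processing, automated program repair.
LI Chuan-Yi (born 1991), male, Ph.D., tenure-track assistant professor, doctoral supervisor, CCF professional member; research interests: software engineering, business process management, natural language processing.
GE Ji-Dong (born 1978), male, Ph.D., associate professor, doctoral supervisor, CCF senior member; research interests: natural language processing, intelligent software engineering, distributed computing, edge computing, service computing, business process management.
LUO Bin (born 1967), male, Ph.D., professor, doctoral supervisor, CCF distinguished member; research interests: distributed computing, edge computing, natural language processing, intelligent software engineering.

Corresponding author: GE Ji-Dong, E-mail: gjd@nju.edu.cn

Fund Project: National Key Research and Development Program of China (2022YFF0711404); Leading Talent Team Project of the Sixth Phase of the Jiangsu Province "333 Project"; Natural Science Foundation of Jiangsu Province (BK20201250)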

Abstract:

Repairing software defects is an unavoidable and important problem in software engineering. Automated program repair (APR) techniques aim to alleviate it by repairing defective programs automatically, accurately, and efficiently. In recent years, with the rapid development of deep learning, a family of APR methods has emerged that uses deep neural networks to automatically capture the relationship between defective programs and their patches; this approach is called neural program repair (NPR). Measured by the number of defects correctly repaired on benchmarks, NPR tools have significantly outperformed non-learning APR tools. However, a recent study found that the performance gains of NPR systems may stem from test data being present in the training data, i.e., data leakage. Inspired by this finding, and to further investigate the causes and effects of data leakage in NPR systems and evaluate existing systems more fairly, this study: (1) systematically categorizes and summarizes existing NPR systems, defines data leakage for NPR systems based on this categorization, and designs a data leakage detection method for each category of system; (2) applies these detection methods to existing models at large scale, and investigates how data leakage affects the gap between a model's real performance and its evaluated performance, as well as how it affects the model itself; (3) analyzes the data collection and filtering strategies of existing NPR datasets, improves and supplements them, constructs a clean large-scale program repair training dataset from existing popular datasets using the improved strategy, and verifies that this dataset is effective in avoiding data leakage.

The experimental results show that all ten NPR systems studied exhibit data leakage on the benchmark test sets. Among them, RewardRepair has a relatively serious data leakage problem, with 24 leaked instances on the Defects4J (v1.2.0) benchmark, a leakage ratio of 53.33%. In addition, data leakage affects the robustness of NPR systems: all five NPR systems investigated show reduced robustness due to data leakage. Data leakage is therefore a very common problem; it leads to unfair performance evaluation of NPR systems and affects their robustness on the benchmark test sets. When training NPR models, researchers should avoid data leakage as much as possible, and should take its impact on performance evaluation into account so that NPR systems are evaluated as fairly as possible.

Key words: automated program repair (APR); neural program repair; deep learning; data leakage; program repair dataset
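
To make the notion of data leakage concrete, the following is a minimal Python sketch of one way to check whether benchmark samples also occur in a training set, and to filter them out. It is an illustration under stated assumptions, not the detection method designed in the paper: the function names (`normalize`, `find_leaks`, `remove_leaks`, `leakage_ratio`), the representation of a sample as a (buggy code, fixed code) pair, and the comment-and-whitespace normalization are all hypothetical choices made for this example. The denominator 45 used at the end is inferred from the two reported figures (24 leaks, 53.33%); the abstract does not state it.

```python
import re
from typing import Iterable, List, Set, Tuple

# Assumption: one sample is a (buggy code, fixed code) pair of Java snippets.
Pair = Tuple[str, str]


def normalize(code: str) -> str:
    """Drop comments and collapse whitespace so trivially different copies match.

    Limitation of this sketch: the regexes do not parse string literals,
    so a "//" inside a string would also be treated as a comment.
    """
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)                     # line comments
    return " ".join(code.split())


def pair_key(pair: Pair) -> Pair:
    """Normalized form of a sample, used for exact-match comparison."""
    return (normalize(pair[0]), normalize(pair[1]))


def find_leaks(train: Iterable[Pair], benchmark: Iterable[Pair]) -> List[Pair]:
    """Return benchmark samples whose normalized form also occurs in the training set."""
    seen: Set[Pair] = {pair_key(p) for p in train}
    return [p for p in benchmark if pair_key(p) in seen]


def remove_leaks(train: Iterable[Pair], benchmark: Iterable[Pair]) -> List[Pair]:
    """Return the training samples that do not match any benchmark sample."""
    bench_keys: Set[Pair] = {pair_key(p) for p in benchmark}
    return [p for p in train if pair_key(p) not in bench_keys]


def leakage_ratio(leaked: int, total: int) -> float:
    """Leakage ratio in percent: leaked samples over the samples considered."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * leaked / total


# Toy data, invented for illustration only.
train = [
    ("int a = b+1; // off by one", "int a = b;"),
    ("return x;", "return y;"),
]
benchmark = [
    ("int a  =  b+1;", "int a = b;"),
    ("if (p) q();", "if (p != null) q();"),
]

leaks = find_leaks(train, benchmark)
clean_train = remove_leaks(train, benchmark)
print(len(leaks), len(clean_train))

# Inferred denominator: 24 leaks at 53.33% corresponds to 24 out of 45.
print(round(leakage_ratio(24, 45), 2))
```

Exact matching after normalization is the simplest possible criterion; near-duplicate detection (for example, token-level similarity) would catch more cases at the cost of false positives, and which criterion is appropriate depends on the category of NPR system being examined.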

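Cite this article

LI Qing-Yuan, ZHONG Wen-Kang, LI Chuan-Yi, GE Ji-Dong, LUO Bin. Empirical Study on Data Leakage Problem in Neural Program Repair. Journal of Software (软件学报), 2024, 35(7): 3071-3092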

History

Received: 2023-09-11
Last revised: 2023-10-30
Published online: 2024-01-05
Published: 2024-07-06
