Abstract: Repairing software defects is an unavoidable and important problem in software engineering, and automated program repair (APR) techniques aim to alleviate it by repairing defective programs automatically, accurately, and efficiently. In recent years, with the rapid development of deep learning, a class of APR approaches has emerged that uses deep neural networks to automatically capture the relationship between defective programs and their patches, known as neural program repair (NPR). Measured by the number of defects correctly repaired on benchmarks, NPR tools significantly outperform non-deep-learning APR tools. However, a recent study found that the performance improvement of NPR systems may stem from the presence of test data in the training data, i.e., data leakage. Motivated by this finding, and in order to investigate the causes and effects of data leakage in NPR systems and to evaluate existing systems more fairly, this study: (1) systematically categorizes and summarizes existing NPR systems, defines data leakage for NPR systems based on this categorization, and designs a data leakage detection method for each category of system; (2) conducts large-scale testing of existing models with these detection methods and investigates how data leakage affects the models' true evaluation performance as well as the models themselves; (3) analyzes the collection and filtering strategies of existing NPR training datasets, improves and supplements them, constructs a clean, large-scale NPR training dataset from popular existing datasets using the improved strategy, and verifies the effectiveness of this dataset in preventing data leakage. The experimental results show that all ten NPR systems studied exhibit data leakage on the evaluation datasets; among them, RewardRepair has the most serious data leakage problem, with 24 leaked instances on the Defects4J (v1.2.0) benchmark, a leakage ratio as high as 53.33%. In addition, data leakage affects the robustness of NPR systems: all five NPR systems investigated for this property showed reduced robustness due to data leakage. In summary, data leakage is a widespread problem that can lead to unfair performance evaluation results and can degrade the robustness of NPR systems on benchmarks. When training NPR models, researchers should avoid data leakage as far as possible and account for its impact when assessing the performance of NPR systems, so that evaluations are as fair as possible.
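To make the notion of leakage concrete, the following is a minimal sketch of one such check, assuming leakage is defined as a benchmark (buggy, fixed) pair also occurring in the training set. The function names, the whitespace-based normalization, and the choice of denominator for the leakage ratio are illustrative assumptions for this sketch, not the paper's category-specific detection methods.

    def normalize(code: str) -> str:
        """Collapse whitespace so formatting differences do not hide duplicates."""
        return " ".join(code.split())

    def leaked_examples(train_pairs, benchmark_pairs):
        """Return benchmark (buggy, fixed) pairs that also occur in the training set."""
        seen = {(normalize(b), normalize(f)) for b, f in train_pairs}
        return [(b, f) for b, f in benchmark_pairs
                if (normalize(b), normalize(f)) in seen]

    def leakage_ratio(n_leaked: int, n_total: int) -> float:
        """Leaked instances as a fraction of some reference count; if the 24 leaks
        reported for RewardRepair are taken over 45 reference bugs, 24/45 = 53.33%."""
        return n_leaked / n_total

    # Tiny usage example with hypothetical snippets:
    train = [("int x = 0;", "int x = 1;")]
    bench = [("int x = 0;", "int  x =  1;"), ("y = 2;", "y = 3;")]
    print(leaked_examples(train, bench))   # [('int x = 0;', 'int  x =  1;')]
    print(f"{leakage_ratio(24, 45):.2%}")  # 53.33%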