Cross-project Defect Prediction Method Based on Feature Transfer and Instance Transfer

About the authors:

Ni Chao (倪超, 1990-), male, born in Nanjing, Jiangsu, Ph.D. candidate; main research area: software defect prediction. Gu Qing (顾庆, 1972-), male, Ph.D., professor, doctoral supervisor, CCF senior member; main research areas: software quality assurance, distributed computing. Chen Xiang (陈翔, 1980-), male, Ph.D., associate professor, CCF senior member; main research areas: software defect prediction, software fault localization, regression testing, and combinatorial testing. Huang Qiguo (黄启国, 1979-), male, Ph.D. candidate; main research area: software defect prediction. Liu Wangshu (刘望舒, 1987-), male, Ph.D., lecturer, CCF professional member; main research areas: software defect prediction, software fault localization. Li Na (李娜, 1991-), female, Ph.D. candidate; main research area: machine learning.

Corresponding authors:

Chen Xiang, E-mail: xchencs@ntu.edu.cn; Gu Qing, E-mail: guq@nju.edu.cn

Fund Project:

National Natural Science Foundation of China (61373012, 61202006, 91218302, 61321491); Open Fund of State Key Laboratory for Novel Software Technology (Nanjing University) (KFKT2016B18, KFKT2018B17); Natural Science Foundation of Jiangsu Province (BK20180695); State Scholarship Fund of China Scholarship Council (201806190172)

    Abstract:

    In real-world software development, the project that needs defect prediction may be newly started or may have little historical training data. One solution is to build the model from training data already collected for other projects (i.e., source projects) and use it to predict defects in the current project (i.e., target project). However, the datasets of different projects can differ considerably in distribution. To address this problem, a two-phase cross-project defect prediction method, FeCTrA, is proposed, which combines feature transfer and instance transfer. In the feature transfer phase, FeCTrA uses cluster analysis to select the features whose distributions are highly similar between the source project and the target project. In the instance transfer phase, FeCTrA builds on TrAdaBoost: given a small number of labeled instances from the target project, it selects the source-project instances whose distribution is close to that of these labeled instances. To verify the effectiveness of FeCTrA, the Relink and AEEEM datasets are chosen as experimental subjects and F1 as the performance measure. First, FeCTrA outperforms the single-phase methods that consider only feature transfer or only instance transfer. Second, compared with the classical cross-project defect prediction methods TCA+, Peters filter, Burak filter, and DCPDP, FeCTrA improves prediction performance by 23%, 7.2%, 9.8%, and 38.2%, respectively, on the Relink dataset and by 96.5%, 108.5%, 103.6%, and 107.9%, respectively, on the AEEEM dataset. Finally, the factors within FeCTrA that influence prediction performance are analyzed, which yields guidelines for using the method effectively.
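The instance transfer phase and the F1 measure named in the abstract can be illustrated with a minimal sketch. It follows the weight-update rule of the original TrAdaBoost algorithm (Dai et al., "Boosting for transfer learning") and is not the FeCTrA implementation itself. The function names `source_beta`, `tradaboost_round`, and `f1_score` are hypothetical, and the weak learner that produces the misclassification flags is assumed to exist outside the sketch.

```python
import math


def source_beta(n_source, n_rounds):
    """Fixed decay factor applied to misclassified source-project instances."""
    return 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_source) / n_rounds))


def tradaboost_round(src_w, tgt_w, src_miss, tgt_miss, n_rounds):
    """One TrAdaBoost-style weight update (a sketch, not FeCTrA itself).

    src_w, tgt_w: current weights of source / labeled target instances.
    src_miss, tgt_miss: 1 if the current weak learner misclassified the
    instance, else 0 (training the learner is outside this sketch).
    """
    # The weighted error is measured on the labeled target instances only.
    eps = sum(w * m for w, m in zip(tgt_w, tgt_miss)) / sum(tgt_w)
    if not 0.0 < eps < 0.5:
        raise ValueError("target error must lie strictly between 0 and 0.5")
    beta_t = eps / (1.0 - eps)
    beta = source_beta(len(src_w), n_rounds)
    # Misclassified source instances lose weight: they look unlike the target.
    new_src = [w * beta ** m for w, m in zip(src_w, src_miss)]
    # Misclassified target instances gain weight, as in AdaBoost.
    new_tgt = [w * beta_t ** (-m) for w, m in zip(tgt_w, tgt_miss)]
    return new_src, new_tgt


def f1_score(y_true, y_pred):
    """F1 for binary labels, with 1 meaning defective."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Four source and four target instances, one misclassified in each group.
    src, tgt = tradaboost_round([1.0] * 4, [1.0] * 4,
                                [1, 0, 0, 0], [1, 0, 0, 0], n_rounds=10)
    print(src, tgt)
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

In the toy call, the target error is 0.25, so the misclassified target instance has its weight multiplied by 3 while the misclassified source instance is scaled down by the fixed factor. Over many rounds this shifts weight toward source instances that behave like the labeled target data, which is the selection effect the abstract describes.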

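Cite this article

Ni C, Chen X, Liu WS, Gu Q, Huang QG, Li N. Cross-project defect prediction method based on feature transfer and instance transfer. Ruan Jian Xue Bao/Journal of Software, 2019,30(5):1308-1329 (in Chinese with English abstract).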

History
  • Received: 2018-08-28
  • Last revised: 2018-10-31
  • Published online: 2019-05-08