Semi-Supervised Ensemble Learning Approach for Cross-Project Defect Prediction

Fund Project:

National Natural Science Foundation of China (61202030, 61373012, 61202006, 71502125)

    Abstract:

    Software defect prediction helps developers allocate test resources more effectively by predicting whether a software module is defect-prone. Most defect prediction research focuses on within-project defect prediction, which requires sufficient training data from the same project. In practice, however, the project that needs defect prediction is often new or lacks historical data. Cross-project defect prediction, which trains on data from several other projects and predicts defects in a different one, has therefore become an active research topic. Its main challenges are the difference in data distribution between the source and target projects and the class imbalance problem in the datasets. Inspired by search-based software engineering, this paper proposes S3EL, a search-based semi-supervised ensemble learning approach. S3EL builds several Naïve Bayes classifiers as base learners by adjusting the distribution ratio of the training dataset, then uses a small number of labeled target instances and a genetic algorithm to combine these base classifiers into a final prediction model. S3EL is compared with recent and classical cross-project defect prediction approaches (such as Burak filter, Peters filter, TCA+, CODEP, and HYDRA) on the AEEEM and Promise datasets. The results show that S3EL achieves the best prediction performance in most cases in terms of the F1 measure.
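
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the details, so the sketch assumes that "adjusting the ratio" means resampling the source data to different defective/clean ratios, that the ensemble is a weighted average of base-learner probabilities, and that the genetic algorithm evolves those weights to maximize F1 on the few labeled target instances. The synthetic data, the helper names (`make_project`, `resample`, `evolve_weights`), and all parameter values are hypothetical.

```python
import math
import random

def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

class GaussianNB:
    """Tiny Gaussian Naive Bayes for binary labels (0 = clean, 1 = defective)."""
    def fit(self, X, y):
        self.stats = {}
        for c in (0, 1):
            rows = [x for x, label in zip(X, y) if label == c]
            params = []
            for col in zip(*rows):
                mean = sum(col) / len(col)
                var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-6
                params.append((mean, var))
            self.stats[c] = (len(rows) / len(X), params)
        return self

    def prob_defective(self, x):
        log_post = {}
        for c, (prior, params) in self.stats.items():
            lp = math.log(prior)
            for v, (mean, var) in zip(x, params):
                lp += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
            log_post[c] = lp
        m = max(log_post.values())  # subtract the max for numerical stability
        e0, e1 = math.exp(log_post[0] - m), math.exp(log_post[1] - m)
        return e1 / (e0 + e1)

def resample(X, y, defect_ratio, size, rng):
    """Assumed ratio adjustment: draw a training set with a given share of defective rows."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    n_pos = max(1, min(size - 1, round(size * defect_ratio)))
    Xs = [rng.choice(pos) for _ in range(n_pos)] + [rng.choice(neg) for _ in range(size - n_pos)]
    return Xs, [1] * n_pos + [0] * (size - n_pos)

def ensemble_predict(models, weights, X, threshold=0.5):
    total = sum(weights) or 1.0
    return [int(sum(w * m.prob_defective(x) for w, m in zip(weights, models)) / total >= threshold)
            for x in X]

def evolve_weights(models, X_lab, y_lab, rng, pop_size=20, generations=30):
    """Simple genetic algorithm: fitness is F1 on the labeled target instances."""
    def fitness(w):
        return f1_score(y_lab, ensemble_predict(models, w, X_lab))
    pop = [[rng.random() for _ in models] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(models))  # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(len(child))        # mutate one weight, clamped to [0, 1]
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.2)))
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

def make_project(n, shift, defect_rate, rng):
    """Hypothetical synthetic project with two metrics; `shift` mimics distribution difference."""
    X, y = [], []
    for _ in range(n):
        label = int(rng.random() < defect_rate)
        X.append([rng.gauss(shift + 2.0 * label, 1.0), rng.gauss(shift + 1.0 * label, 1.0)])
        y.append(label)
    return X, y

rng = random.Random(0)
X_src, y_src = make_project(300, 0.0, 0.2, rng)   # labeled source project
X_tgt, y_tgt = make_project(200, 0.5, 0.3, rng)   # target project
X_lab, y_lab = X_tgt[:30], y_tgt[:30]             # small labeled part of the target
X_test, y_test = X_tgt[30:], y_tgt[30:]
ratios = [0.1, 0.3, 0.5, 0.7, 0.9]
models = [GaussianNB().fit(*resample(X_src, y_src, r, 200, rng)) for r in ratios]
weights = evolve_weights(models, X_lab, y_lab, rng)
pred = ensemble_predict(models, weights, X_test)
score = f1_score(y_test, pred)
print("weights:", [round(w, 2) for w in weights], "F1 on remaining target data:", round(score, 3))
```

Each base learner sees a differently balanced view of the source data, and the labeled target instances are used only to choose the combination weights, which is what makes the scheme semi-supervised in this sketch.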

Get Citation

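He JY, Meng ZP, Chen X, Wang Z, Fan XY. Semi-supervised ensemble learning approach for cross-project defect prediction. Ruan Jian Xue Bao/Journal of Software, 2017,28(6):1455-1473 (in Chinese).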

History
  • Received: July 28, 2016
  • Revised: October 11, 2016
  • Online: February 21, 2017