Deep Learning Test Optimization Method Using Multi-objective Optimization
Authors:
Author biographies:

MU Yanzhou (1996-), male, M.S., CCF student member. His research interests include machine learning, concurrent program analysis, and deep learning testing.
CHEN Junjie (1992-), male, Ph.D., associate professor, doctoral supervisor, CCF professional member. His research interests include software analysis and testing.
WANG Zan (1979-), male, Ph.D., professor, doctoral supervisor, CCF professional member. His research interests include software testing and machine learning.
ZHAO Jingke (1997-), male, master's student. His research interests include security and quality assurance of deep learning.
CHEN Xiang (1980-), male, Ph.D., associate professor, CCF senior member. His research interests include software defect prediction, software fault localization, regression testing, and combinatorial testing.
WANG Jianmin (1986-), male, Ph.D., assistant researcher. His research interests include intelligent software testing and system simulation.

Corresponding author:

WANG Zan, E-mail: wangzan@tju.edu.cn

CLC number:

TP311

Funding:

National Natural Science Foundation of China (61872263); Technical Field Fund of the Foundation Strengthening Program (2020-JCJQ-JJ-490); 2020 Tianjin Municipal Special Fund for Intelligent Manufacturing


Abstract:

With the rapid development of deep learning technology, research on its quality assurance is attracting increasing attention. Meanwhile, mature sensor technology has made it easy to collect test data, but labeling the collected data remains expensive. To reduce labeling costs, existing work attempts to select a test subset from the original test set. Such subsets guarantee that their overall accuracy (the accuracy of the deep learning model under test on all test inputs in the set) is close to that of the original test set, but they do not preserve other properties of the original set; for example, they may not adequately cover every category of test input. This study proposes DMOS (deep multi-objective selection), a test input selection method for deep learning based on multi-objective optimization. DMOS first performs a preliminary analysis of the data distribution of the original test set with the HDBSCAN (hierarchical density-based spatial clustering of applications with noise) clustering method, then designs multiple optimization objectives based on the characteristics of the clustering results, and finally applies multi-objective optimization to find an appropriate selection solution. Extensive experiments on 8 pairs of classical deep learning test sets and models show that the best test subset selected by DMOS (the one corresponding to the best-performing Pareto optimal solution) not only covers more of the test input categories in the original test set, but also estimates the accuracy of each category very closely to that of the original test set. It also keeps the estimates of overall accuracy and test adequacy close to those of the original test set: the average error of the overall accuracy estimate is only 1.081%, which is 0.845% lower than the state-of-the-art method PACE (practical accuracy estimation), an improvement of 43.87%; the average error of the per-category accuracy estimates is only 5.547%, which is 2.926% lower than PACE, an improvement of 34.53%; and the average estimation error over 5 test adequacy measures is only 8.739%, which is 7.328% lower than PACE, an improvement of 45.61%.
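The abstract describes DMOS as a pipeline: cluster the original test set, derive optimization objectives from the clustering result, then search for Pareto-optimal selections. The sketch below is not the paper's implementation; it only illustrates the multi-objective flavor of the selection step, with two hypothetical minimization objectives (number of uncovered clusters, and deviation of per-cluster proportions from the full set) and a brute-force Pareto filter over candidate subsets:

```python
from collections import Counter

def objectives(subset, labels, n_clusters):
    """Two minimization objectives for a candidate test subset:
    (1) number of clusters it leaves uncovered,
    (2) total deviation of its per-cluster proportions from the full set."""
    counts = Counter(labels[i] for i in subset)
    full = Counter(labels)
    uncovered = n_clusters - len(counts)
    deviation = sum(abs(counts.get(c, 0) / len(subset) - full[c] / len(labels))
                    for c in range(n_clusters))
    return (uncovered, deviation)

def dominates(a, b):
    """Pareto dominance: a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates, labels, n_clusters):
    """Keep only the non-dominated candidate subsets."""
    scored = [(objectives(s, labels, n_clusters), s) for s in candidates]
    return [s for obj, s in scored
            if not any(dominates(other, obj) for other, _ in scored)]

# Toy example: 6 test inputs in 3 clusters; a subset covering all clusters in the
# original proportions dominates a subset drawn from a single cluster.
labels = [0, 0, 1, 1, 2, 2]
print(pareto_front([[0, 2, 4], [0, 1]], labels, 3))  # [[0, 2, 4]]
```

In the approach itself, the cluster assignments would come from HDBSCAN and the search would be carried out by an evolutionary multi-objective optimizer rather than brute-force enumeration, yielding a set of Pareto optimal selection solutions from which the best subset is taken.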

Cite this article:

Mu YZ, Wang Z, Chen X, Chen JJ, Zhao JK, Wang JM. Deep learning test optimization method using multi-objective optimization. Ruan Jian Xue Bao/Journal of Software, 2022, 33(7): 2499-2524 (in Chinese).
History:
  • Received: 2021-09-05
  • Revised: 2021-10-14
  • Online: 2022-01-28
  • Published: 2022-07-06