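Software Traceability Recovery Framework Based on Active Learning and Semi-supervised Learning
Authors:

Dong Liming, Zhang He, Meng Qinglong, Kuang Hongyu

CLC number:

TP311

Funding:

National Natural Science Foundation of China (62072227, 62202219); National Key Research and Development Program of China (2019YFE0105500); Key Research and Development Program of Jiangsu Province (BE2021002-2); Innovation Project of the State Key Laboratory for Novel Software Technology, Nanjing University (ZZKT2022A25); Overseas Open Project (KFKT2022A09)
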
    Abstract:

    Software traceability is considered an important factor in the trustworthiness of the software development process: it makes the process visible and fully trackable, which improves the trustworthiness and reliability of the software. Automated traceability recovery methods have advanced significantly in recent years, but applying them to enterprise projects remains challenging. Through a survey and an experimental case analysis of learning-based traceability recovery classifiers in commercial projects, this study identifies three key challenges that cause traceability models to underperform in industrial settings: low-quality raw data, sample sparsity, and class imbalance. In response, it proposes STRACE(AL+SSL), a software traceability recovery framework that integrates active learning and semi-supervised learning. By selecting valuable samples for annotation and generating high-quality pseudo-labeled samples, the framework makes effective use of unlabeled data to overcome the data quality and sparsity challenges. Multiple groups of comparative experiments are conducted on 10 enterprise projects whose sample sizes range from tens of thousands to nearly one million issue-commit trace pairs, and the results show that the framework is effective for traceability recovery in real enterprise projects. The ablation results show that the unlabeled samples selected by the active learning module play the more important role in the traceability recovery task. The best combination of sample selection strategies for the modules is also verified: CBST-Adjust, an adjusted class-balanced self-training strategy for semi-supervised learning, and SMI_Flqmi, a low-cost, high-efficiency submodular mutual information strategy for active learning.
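
As a rough illustration of the two sample-selection ideas named in the abstract, the sketch below shows (1) class-balanced pseudo-label selection for self-training and (2) greedy selection under a facility-location submodular mutual information objective. It is not the STRACE(AL+SSL) implementation. The function names, `keep_fraction`, the similarity matrix `sim`, and `eta` are all hypothetical. The objective is the commonly used FLQMI form, which is only assumed to correspond to SMI_Flqmi, and the adjustment that distinguishes CBST-Adjust is not reproduced.

```python
from typing import Dict, List, Sequence, Tuple


def class_balanced_pseudo_labels(
    pos_probs: Sequence[float], keep_fraction: float
) -> Dict[int, int]:
    """Return {sample index: pseudo-label}, keeping the same fraction per class.

    Each unlabeled issue-commit pair takes the label its score favors, and
    only the most confident `keep_fraction` of each predicted class is kept,
    so the majority (non-link) class cannot crowd out the minority class.
    Simplified class-balanced self-training; NOT the paper's CBST-Adjust.
    """
    by_class: Dict[int, List[Tuple[float, int]]] = {0: [], 1: []}
    for idx, p in enumerate(pos_probs):
        label = 1 if p >= 0.5 else 0
        confidence = p if label == 1 else 1.0 - p
        by_class[label].append((confidence, idx))

    selected: Dict[int, int] = {}
    for label, scored in by_class.items():
        if not scored:
            continue
        scored.sort(reverse=True)  # most confident first
        n_keep = max(1, int(len(scored) * keep_fraction))
        for _, idx in scored[:n_keep]:
            selected[idx] = label
    return selected


def flqmi_greedy(
    sim: Sequence[Sequence[float]], budget: int, eta: float = 1.0
) -> List[int]:
    """Greedily pick up to `budget` unlabeled items to send for annotation.

    sim[u][q] is a non-negative similarity between unlabeled item u and
    query item q (for example, known true links). Assumed objective:
        sum_q max_{u in A} sim[u][q] + eta * sum_{u in A} max_q sim[u][q]
    """
    n_unlabeled = len(sim)
    n_query = len(sim[0]) if n_unlabeled else 0
    best_cover = [0.0] * n_query  # best similarity to each query item so far
    chosen: List[int] = []
    for _ in range(min(budget, n_unlabeled)):
        best_gain, best_u = -1.0, -1
        for u in range(n_unlabeled):
            if u in chosen:
                continue
            # Gain in how well the chosen set covers the query items.
            coverage_gain = sum(
                max(sim[u][q] - best_cover[q], 0.0) for q in range(n_query)
            )
            # Relevance of the candidate itself to the query set.
            relevance_gain = eta * max(sim[u]) if n_query else 0.0
            gain = coverage_gain + relevance_gain
            if gain > best_gain:
                best_gain, best_u = gain, u
        chosen.append(best_u)
        best_cover = [max(best_cover[q], sim[best_u][q]) for q in range(n_query)]
    return chosen
```

In a training loop of this kind, the items returned by `flqmi_greedy` would go to human annotators, while the pairs returned by `class_balanced_pseudo_labels` would join the training set with their pseudo-labels before the classifier is retrained.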

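Cite this article

Dong Liming, Zhang He, Meng Qinglong, Kuang Hongyu. Software Traceability Recovery Framework Based on Active Learning and Semi-supervised Learning. Journal of Software (软件学报), 2025, 36(5): 1924-1948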

History
  • Received: 2023-06-01
  • Last revised: 2023-08-13
  • Published online: 2024-09-04