CLC number: TP311
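National Natural Science Foundation of China (62072227, 62202219); National Key Research and Development Program of China (2019YFE0105500); Key Research and Development Program of Jiangsu Province (BE2021002-2); Innovation Project of the State Key Laboratory for Novel Software Technology, Nanjing University (ZZKT2022A25); Overseas Open Project (KFKT2022A09)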
Software traceability is regarded as an important factor in the trustworthiness of the software development process: it makes the process visible and fully trackable, which improves software trustworthiness and reliability. Automated software traceability recovery methods have made significant progress in recent years, but applying them to enterprise projects remains challenging. Through an investigation of learning-based traceability recovery classifiers in commercial projects, together with experimental case analysis, this work identifies three key challenges that cause traceability models to underperform in industrial settings: low-quality raw data, sample sparsity, and class imbalance. To address them, it proposes STRACE(AL+SSL), a software traceability recovery framework that combines active learning and semi-supervised learning. By selecting valuable samples for annotation and generating high-quality pseudo-labeled samples, the framework makes effective use of unlabeled samples and overcomes the low-quality and sparsity challenges. Multiple groups of comparative experiments are conducted on 10 enterprise projects, each with between tens of thousands and nearly one million issue-commit trace pair instances. The results show that the framework is effective for software traceability recovery on real enterprise projects. The ablation results show that, within STRACE(AL+SSL), the unlabeled samples selected by the active learning module play a more important role in the traceability recovery task. The best combination of sample selection strategies for the individual modules is also verified: CBST-Adjust, an adjusted class-balanced self-training sample selection strategy for the semi-supervised module, and SMI_Flqmi, a low-cost, high-efficiency submodular mutual information sample selection strategy for the active learning module.
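The two sample selection steps named in the abstract can be illustrated with a minimal sketch. This is not the STRACE(AL+SSL) implementation: plain uncertainty sampling stands in for SMI_Flqmi, a fixed per-class quota stands in for CBST-Adjust, and the function names `select_for_annotation` and `class_balanced_pseudo_labels` are hypothetical. The input is assumed to be a list of predicted link probabilities for unlabeled issue-commit pairs, produced by some classifier.

```python
from typing import Dict, Iterable, List


def select_for_annotation(pos_probs: List[float], budget: int) -> List[int]:
    """Active learning step (uncertainty sampling, an assumed stand-in).

    Returns the indices of the `budget` unlabeled pairs whose predicted
    link probability is closest to 0.5, i.e. the least certain ones.
    """
    ranked = sorted(range(len(pos_probs)), key=lambda i: abs(pos_probs[i] - 0.5))
    return ranked[:budget]


def class_balanced_pseudo_labels(
    pos_probs: List[float],
    quota_per_class: int,
    exclude: Iterable[int] = (),
) -> Dict[int, int]:
    """Semi-supervised step (simplified class-balanced self-training).

    For each class, keeps at most `quota_per_class` of the most confident
    predictions, so the majority class cannot crowd out the minority one.
    Indices in `exclude` (e.g. those sent to annotators) are skipped.
    """
    skip = set(exclude)
    candidates = [i for i in range(len(pos_probs)) if i not in skip]
    labels: Dict[int, int] = {}

    # Class 1 (linked): highest predicted probabilities first.
    by_linked = sorted(candidates, key=lambda i: pos_probs[i], reverse=True)
    for i in by_linked[:quota_per_class]:
        if pos_probs[i] > 0.5:
            labels[i] = 1

    # Class 0 (not linked): lowest predicted probabilities first.
    by_unlinked = sorted(candidates, key=lambda i: pos_probs[i])
    for i in by_unlinked[:quota_per_class]:
        if pos_probs[i] < 0.5:
            labels[i] = 0

    return labels


if __name__ == "__main__":
    # Made-up probabilities for seven unlabeled issue-commit pairs.
    probs = [0.95, 0.55, 0.10, 0.48, 0.03, 0.70, 0.20]
    to_annotate = select_for_annotation(probs, budget=2)
    pseudo = class_balanced_pseudo_labels(probs, quota_per_class=2, exclude=to_annotate)
    print("send to annotators:", to_annotate)
    print("pseudo-labels:", pseudo)
```

In a full loop, the annotated samples and the pseudo-labeled samples would both be added to the training set before retraining the classifier. The per-class quota is one simple way to counter the class imbalance the abstract describes.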
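Dong Liming, Zhang He, Meng Qinglong, Kuang Hongyu. Software traceability recovery framework combining active learning and semi-supervised learning. Journal of Software (软件学报), 2025, 36(5): 1924-1948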