Software Traceability Recovery Framework Based on Active Learning and Semi-supervised Learning

CLC number: TP311

Fund Project:

National Natural Science Foundation of China (62072227, 62202219); National Key Research and Development Program of China (2019YFE0105500); Key Research and Development Program of Jiangsu Province (BE2021002-2); Innovation Project of the State Key Laboratory for Novel Software Technology, Nanjing University (ZZKT2022A25); Overseas Open Project (KFKT2022A09)

    Abstract:

    Software traceability is widely regarded as a key factor in trustworthy software development: it keeps the development process visible and fully trackable, which in turn improves software trustworthiness and reliability. Automated traceability recovery has advanced considerably in recent years, yet its application in enterprise projects still falls short of expectations. An investigation of learning-based traceability recovery classifiers in commercial projects, together with an experimental case analysis, identifies three key challenges that cause traceability models to underperform in industrial settings: low-quality raw data, sample sparsity, and class imbalance. To address them, STRACE(AL+SSL) is proposed, a software traceability recovery framework that combines active learning and semi-supervised learning. The framework selects valuable samples for annotation and generates high-quality pseudo-labeled samples, making effective use of unlabeled data to overcome low data quality and sparsity. Multiple groups of comparative experiments are conducted on 10 enterprise projects whose issue-commit trace-pair instances number from tens of thousands to nearly one million per project. The results show that the framework is effective for traceability recovery in real enterprise projects. The ablation study shows that, of the two modules, the unlabeled samples selected by the active learning module contribute more to the traceability recovery task. The best combination of sample selection strategies is also identified: CBST-Adjust, an adjusted class-balanced self-training strategy for the semi-supervised module, and SMI_Flqmi, a low-cost, high-efficiency submodular mutual information strategy for the active learning module.
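
    The two selection steps named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not define CBST-Adjust or SMI_Flqmi, so the active learning step below substitutes plain uncertainty sampling, and the pseudo-labeling step keeps a fixed fraction of the most confident predictions per class, in the spirit of class-balanced self-training. The function names, the `fraction` and `budget` parameters, and the probabilities are all hypothetical.

```python
from math import ceil


def select_for_annotation(probs, budget):
    """Active learning stand-in (uncertainty sampling, NOT SMI_Flqmi):
    pick the `budget` pairs whose link probability is closest to 0.5."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return sorted(ranked[:budget])


def class_balanced_pseudo_labels(probs, fraction, exclude=()):
    """Pseudo-labeling sketch (NOT CBST-Adjust): for each predicted class,
    keep only the most confident `fraction` of its samples, so the rare
    "link" class is not crowded out by the frequent "no link" class."""
    excluded = set(exclude)
    by_class = {0: [], 1: []}
    for i, p in enumerate(probs):
        if i in excluded:
            continue
        label = 1 if p >= 0.5 else 0
        confidence = p if label == 1 else 1.0 - p
        by_class[label].append((confidence, i))

    pseudo = {}
    for label, items in by_class.items():
        # Most confident first; ties broken by sample index.
        items.sort(key=lambda t: (-t[0], t[1]))
        keep = ceil(fraction * len(items))
        for _, i in items[:keep]:
            pseudo[i] = label
    return pseudo


# Hypothetical link probabilities for eight unlabeled issue-commit pairs.
# True links are rare, so most scores are low.
probs = [0.02, 0.05, 0.48, 0.10, 0.93, 0.55, 0.01, 0.70]

# Samples sent to a human annotator are excluded from pseudo-labeling.
to_annotate = select_for_annotation(probs, budget=2)
pseudo = class_balanced_pseudo_labels(probs, fraction=0.5, exclude=to_annotate)

print("annotate:", to_annotate)
print("pseudo-labels:", pseudo)
```

    In this toy run a single global confidence threshold of 0.95 would admit only "no link" pseudo-labels, whereas the per-class cut still admits the most confident "link" prediction. That is the imbalance problem the per-class selection is meant to ease.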

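Cite this article

Dong Liming, Zhang He, Meng Qinglong, Kuang Hongyu. Software Traceability Recovery Framework Based on Active Learning and Semi-supervised Learning. Journal of Software,,():1-25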

History
  • Received: 2023-06-01
  • Last revised: 2023-08-13
  • Published online: 2024-09-04