Abstract:Software traceability is considered critical to trustworthy software engineering, ensuring software reliability through the tracking of the software development process. Despite significant progress in automatic software traceability recovery techniques in recent years, their application in real-world commercial software projects does not meet expectations. An investigation into the application of learning-based software traceability recovery classifier models in commercial software projects is conducted. It uncovers three critical challenges faced in industrial settings. These challenges contribute to underperforming traceability models: low-quality raw data, data sparsity, and class imbalance. In response to these challenges, STRACE(AL+SSL) is proposed. It is a software traceability recovery framework that integrates active learning and semi-supervised learning. By strategically selecting valuable annotated samples and generating high-quality pseudo-labeled samples, STRACE(AL+SSL) effectively harnesses unlabeled data to address data-related challenges. Multiple comparative experiments are conducted with nearly one million issue-commit trace pair samples from 10 different enterprise projects. The results of these experiments validate the effectiveness of the proposed framework for real-world software traceability recovery tasks. The ablation results show that the unlabeled samples selected by the active learning in STRACE(AL+SSL) play a crucial role in the traceability recovery task. Additionally, the optimal combination of sample selection strategies in STRACE(AL+SSL) is confirmed. This includes CBST-Adjust for the semi-supervised sample rebalancing strategy and SMI_Flqmi, which is recognized for its cost-effectiveness and efficiency in active learning.