Empirical Study of Code Smell Detection Based on Active Learning
Corresponding author:

Hu Wenhua (胡文华), E-mail: whu10@whut.edu.cn

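Funding:

National Natural Science Foundation of China (61977021), "Research on Knowledge-Graph-Based Intelligent Recommendation of College Entrance Examination Applications under Major-Based Admission"; National Natural Science Foundation of China, Young Scientists Fund (62202350), "Development of a Cognitive-Psychology-Based Model and Tool for Automated Detection of Software Requirements Errors"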


    Abstract:

    Code smell detection methods based on machine learning and deep learning depend on large labeled datasets. In the code smell domain, however, labeled datasets are scarce, while large amounts of unlabeled data are available. Active learning is therefore a natural fit for code smell detection. Previous research in software engineering has shown that active learning can yield better-performing models at lower labeling and training cost. Nonetheless, its specific impact on the performance of code smell detection models remains unclear, and blindly applying active learning strategies that work well in other domains may be counterproductive. This paper evaluates the impact of active learning on the performance of code smell detection models. To this end, an extensive analysis was conducted on the code smell dataset MLCQ, covering 11 implementations of 5 query strategies, 8 classifiers, and 10 different query ratios. The results indicate: (1) Among the 11 query strategy implementations studied, uncertainty-based and committee-based strategies outperformed the others, with margin querying (uncertainty-based) and vote entropy querying (committee-based) standing out in particular. (2) Among the 8 classifiers studied, the Random Forest classifier achieved the best overall performance. (3) Regarding the query ratio, model performance improved markedly as the ratio increased from 0% to 25%; as the ratio increased further from 25% to 50%, the improvement slowed, and performance could even decline.
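    The two query strategies highlighted in result (1) can be illustrated with a short sketch. The code below is not the authors' implementation; it assumes the common textbook definitions, where margin querying picks the samples with the smallest gap between the two highest predicted class probabilities, and vote entropy querying picks the samples on which a committee of classifiers disagrees most. The pool data, the budget, and all function names are hypothetical.

```python
import math
from collections import Counter


def margin_scores(probabilities):
    """Margin per sample: highest minus second-highest class probability."""
    scores = []
    for probs in probabilities:
        top_two = sorted(probs, reverse=True)[:2]
        scores.append(top_two[0] - top_two[1])
    return scores


def vote_entropy_scores(committee_votes):
    """Entropy of the labels voted by the committee members for each sample."""
    scores = []
    for votes in committee_votes:
        total = len(votes)
        entropy = 0.0
        for count in Counter(votes).values():
            share = count / total
            entropy -= share * math.log(share)
        scores.append(entropy)
    return scores


def select_queries(scores, budget, pick_smallest):
    """Return the indices of the `budget` most informative samples."""
    order = sorted(
        range(len(scores)),
        key=lambda i: scores[i],
        reverse=not pick_smallest,
    )
    return order[:budget]


# Hypothetical pool of four unlabeled code samples (class 1 = smelly, 0 = clean).
# Each row of pool_probabilities is one classifier's predicted class distribution.
pool_probabilities = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]]
# Each row of pool_votes holds the labels predicted by a three-member committee.
pool_votes = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 0, 0]]

budget = 2  # hypothetical number of samples sent to the annotator in one round

# A small margin means an uncertain prediction, so the smallest margins are queried.
margin_queries = select_queries(margin_scores(pool_probabilities), budget, pick_smallest=True)
# A high vote entropy means strong disagreement, so the largest entropies are queried.
entropy_queries = select_queries(vote_entropy_scores(pool_votes), budget, pick_smallest=False)

print("margin querying selects:", margin_queries)
print("vote entropy querying selects:", entropy_queries)
```

    In a full active learning loop, the selected samples would be labeled, moved into the training set, and the classifier (for example a Random Forest) retrained before the next round; the query ratio then determines how many such samples are requested in total.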

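Cite this article

Chen Haoxuan, Liu Lei, Huang Ruoxuan, Zhang Yizhuo, Hu Wenhua, Ma Chuanxiang. Empirical Study of Code Smell Detection Based on Active Learning. Journal of Software (软件学报), 2025, 36(7):0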

History
  • Received: 2024-08-24
  • Last revised: 2024-10-15
  • Published online: 2024-12-10