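Cross-project Issue Recommendation Method for Open-source Software Defects
Authors: Liu Baochuan, Zhang Li, Liu Zhenwei, Jiang Jing
Funding: Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2021ZD0112901); National Natural Science Foundation of China (62177003)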
    Abstract:

    GitHub is a well-known open-source software development community in which developers use each project's issue tracking system to handle issues. While discussing an issue that reports a defect, developers may point to issues in other projects that are related to the defect, called cross-project related issues, which provide reference information for fixing it. However, GitHub hosts more than 200 million open-source projects and 1.2 billion issues, so identifying and retrieving cross-project related issues by hand is extremely time-consuming. This study proposes CPIRecom, a method that automatically recommends cross-project related issues for defect issues. To build a pre-selection set, CPIRecom filters issues by the number of historical related issue pairs between projects and by the time interval between issue reports. Then, for accurate recommendation, it extracts textual features with the pre-trained BERT model and analyzes project features. A random forest classifier then computes the probability that each pre-selected issue is related to the defect issue, and the recommendation list is obtained by ranking issues by this probability. The study simulates the use of CPIRecom on the GitHub platform. On the simulated test set, CPIRecom achieves a mean reciprocal rank of 0.603 and a Recall@5 of 0.715.
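The pre-selection, ranking, and evaluation steps named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the dictionary fields, the thresholds `min_pairs` and `max_gap`, the symmetric treatment of the time gap, and the toy scores standing in for the random forest probabilities are all assumptions made for illustration. Recall@k is assumed here to be the fraction of a defect's related issues that appear in the top k positions.

```python
from datetime import datetime, timedelta

def preselect(defect, candidates, pair_counts,
              min_pairs=1, max_gap=timedelta(days=365)):
    """Keep cross-project candidates whose project has enough historical
    related-issue pairs with the defect's project and whose report time is
    close to the defect's. Both thresholds are hypothetical values."""
    kept = []
    for issue in candidates:
        if issue["project"] == defect["project"]:
            continue  # only issues from other projects qualify
        pairs = pair_counts.get((defect["project"], issue["project"]), 0)
        gap = abs(defect["created"] - issue["created"])
        if pairs >= min_pairs and gap <= max_gap:
            kept.append(issue)
    return kept

def rank(candidates, score):
    """Sort candidates by descending relatedness score (a stand-in for the
    probability produced by a trained classifier)."""
    return sorted(candidates, key=score, reverse=True)

def reciprocal_rank(ranked_ids, relevant):
    """1 / position of the first relevant item, or 0.0 if none is ranked."""
    for pos, issue_id in enumerate(ranked_ids, start=1):
        if issue_id in relevant:
            return 1.0 / pos
    return 0.0

def recall_at_k(ranked_ids, relevant, k=5):
    """Fraction of the relevant items found in the top k positions."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant)) / len(relevant)

# Toy data: every identifier, date, count, and score below is made up.
defect = {"id": "a/app#1", "project": "a/app", "created": datetime(2022, 5, 1)}
candidates = [
    {"id": "b/lib#7", "project": "b/lib", "created": datetime(2022, 4, 20)},
    {"id": "b/lib#2", "project": "b/lib", "created": datetime(2019, 1, 1)},    # too old
    {"id": "c/tool#3", "project": "c/tool", "created": datetime(2022, 4, 25)},  # no history
    {"id": "d/core#9", "project": "d/core", "created": datetime(2022, 4, 30)},
]
pair_counts = {("a/app", "b/lib"): 4, ("a/app", "d/core"): 2}
toy_scores = {"b/lib#7": 0.35, "d/core#9": 0.80}

pool = preselect(defect, candidates, pair_counts)
ranked_ids = [c["id"] for c in rank(pool, lambda c: toy_scores[c["id"]])]
relevant = {"b/lib#7"}
print(ranked_ids, reciprocal_rank(ranked_ids, relevant), recall_at_k(ranked_ids, relevant))
```

The mean reciprocal rank reported in the abstract would correspond to averaging `reciprocal_rank` over all defect issues in a test set; in a fuller pipeline the `score` callable would wrap a classifier fed with text and project features.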

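Cite this article

Liu BC, Zhang L, Liu ZW, Jiang J. Cross-project issue recommendation method for open-source software defects. Ruan Jian Xue Bao/Journal of Software, 2024, 35(5):2340-2358.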

History
  • Received: 2022-11-03
  • Last revised: 2023-01-07
  • Published online: 2023-10-25
  • Publication date: 2024-05-06