Function Template Mining and Retrieval Based on Code Clone Difference Analysis

doi:10.13328/j.cnki.jos.007228

CSTR: 32375.14.jos.007228
Authors: XIAO Quan-Bin, CHEN Yuan, WU Yi-Jian, PENG Xin
Affiliation (all authors): School of Computer Science, Fudan University, Shanghai 200438, China; Shanghai Key Laboratory of Data Science (Fudan University), Shanghai 201203, China
CLC number: TP311
Funding: National Natural Science Foundation of China (62172099)

Abstract:

In software engineering, code repositories hold a wealth of knowledge and give developers concrete examples of programming practice. If the repetitive patterns that occur frequently in source code can be distilled into code templates, programming efficiency can improve significantly. In current practice, developers often reuse existing solutions by searching source code. However, such searches typically return a large number of similar, redundant results, which increases the burden of subsequent filtering. Moreover, template mining techniques based on cloned code often fail to cover the broader patterns formed by scattered small clone fragments, which limits the practicality of the resulting templates. This study proposes a method for extracting and retrieving code templates based on code clone detection. By stitching together multiple fragment-level clones, and by extracting and aggregating the shared parts of method-level clones, the method extracts function-level code templates more efficiently and addresses the issue of template quality. Based on the mined templates, the study proposes a triplet representation of code structural features that supplements plain-text features and yields an efficient, concise structural representation. In addition, the study presents a template retrieval method that combines structural and textual search, so that templates can be retrieved by matching features of the programming context. The method is implemented in a tool, CodeSculptor, which was evaluated on a codebase of 45 high-quality open-source Java projects. The results show that the mined templates reduce code volume by 60.87% on average, and that 92.09% of them are produced by stitching fragment-level clones; traditional methods cannot identify templates of this kind. In the template retrieval and recommendation experiment, the precision of the top-5 results reaches 96.87%. A preliminary case study on 9600 randomly selected templates finds that most sampled templates are semantically complete, which supports their practicality, while a small number of meaningless templates indicates room to refine the extraction strategy. A user study further shows that developers complete code development tasks more efficiently with CodeSculptor.
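To make the idea of "extracting the shared parts of clones" concrete, the following is a minimal sketch and not CodeSculptor's actual algorithm. It assumes a simple regex tokenizer, compares only two cloned fragments, keeps their token-level longest common subsequence, and collapses every differing stretch into a placeholder. The names `tokenize`, `lcs_template`, and the `<HOLE>` marker are hypothetical and chosen only for illustration.

```python
import re

HOLE = "<HOLE>"  # hypothetical marker for a variable part of the template


def tokenize(code: str) -> list[str]:
    """Split code into identifiers, integers, and single non-space characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def lcs_template(a: list[str], b: list[str]) -> list[str]:
    """Keep the longest common subsequence of a and b; mark each gap with HOLE."""
    n, m = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    template: list[str] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            # Shared token: it becomes part of the template.
            template.append(a[i])
            i += 1
            j += 1
        else:
            # Differing stretch: emit one HOLE per consecutive gap.
            if not template or template[-1] != HOLE:
                template.append(HOLE)
            if dp[i + 1][j] >= dp[i][j + 1]:
                i += 1
            else:
                j += 1
    # Leftover tokens on either side also form a gap.
    if (i < n or j < m) and (not template or template[-1] != HOLE):
        template.append(HOLE)
    return template


if __name__ == "__main__":
    clone_a = "int total = 0; for (Order o : orders) { total += o.getPrice(); } return total;"
    clone_b = "int total = 0; for (Item i : items) { total += i.getCost(); } return total;"
    template = lcs_template(tokenize(clone_a), tokenize(clone_b))
    print(" ".join(template))
    # int total = 0 ; for ( <HOLE> : <HOLE> ) { total + = <HOLE> . <HOLE> ( ) ; } return total ;
```

In this toy case the two Java loops differ only in the element type, the collection, and the accessor, so the template keeps the loop skeleton and exposes four holes. A real pipeline would work from clone-detector output rather than raw strings and would aggregate more than two clone instances.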

Key words: clone detection; code search; feature representation; software development; code reuse

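Cite this article:

XIAO Quan-Bin, CHEN Yuan, WU Yi-Jian, PENG Xin. Function Template Mining and Retrieval Based on Code Clone Difference Analysis. Ruan Jian Xue Bao/Journal of Software, 2025, 36(6): 2774-2793.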

History

Received: 2024-01-26
Last revised: 2024-04-07
Published online: 2024-07-03
