面向代码搜索的函数功能多重图嵌入
作者:
作者简介:

徐杨(1970-), 男, 博士, 讲师, CCF专业会员, 主要研究领域为智能化软件工程, 机器学习, 分布式计算;陈晓杰(1995-), 男, 硕士, 主要研究领域为深度学习, 智能化软件工程;汤德佑(1976-), 男, 博士, 副教授, CCF专业会员, 主要研究领域为数据库, 高性能计算, 软件优化;黄翰(1980-), 男, 博士, 教授, 博士生导师, CCF杰出会员, 主要研究领域为微计算方法的理论基础及应用, 智能化软件工程.

通讯作者:

黄翰, E-mail: hhan@scut.edu.cn

中图分类号:

TP311

基金项目:

广东省自然科学基金面上项目(2020A1515010696, 2022A1515011491); 国家自然科学基金面上项目(61876207, 62276103); 中央高校面上项目(2020ZYGXZR014); 广东省财税大数据重点实验室开放基金(2019B121203012)


Code-search-oriented Function Multigraph Embedding
Author:
  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [44]
  • |
  • 相似文献 [20]
  • | | |
  • 文章评论
    摘要:

    如何提高异构的自然语言查询输入和高度结构化程序语言源代码的匹配准确度, 是代码搜索的一个基本问题. 代码特征的准确提取是提高匹配准确度的关键之一. 代码语句表达的语义不仅与其本身有关, 还与其所处的上下文相关. 代码的结构模型为理解代码功能提供了丰富的上下文信息. 提出一个基于函数功能多重图嵌入的代码搜索方法. 在所提方法中, 使用早期融合的策略, 将代码语句的数据依赖关系融合到控制流图中, 构建函数功能多重图来表示代码. 该多重图通过数据依赖关系显式表达控制流图中缺乏的非直接前驱后继节点的依赖关系, 增强语句节点的上下文信息. 同时, 针对多重图的边的异质性, 采用关系图卷积网络方法从函数多重图中提取代码的特征. 在公开数据集的实验表明, 相比现有基于代码文本和结构模型的方法, 所提方法的MRR提高5%以上. 通过消融实验也表明控制流图较数据依赖图在搜索准确度上贡献较大.

    Abstract:

    How to improve the accuracy of matching between natural language query input and highly structured programming language source code is a fundamental concern in code search. Accurate extraction of code features is one of the key challenges to improving matching accuracy. The semantics expressed by statements in codes is not only relevant to themselves but also to their contexts. The structural model of the code provides rich contextual information for understanding code functions. This study proposes a code search method based on function multigraph embedding. By using an early fusion strategy, the study fuses the data dependencies of code statements into a control flow graph and constructs a function multigraph to represent the code. The multigraph explicitly expresses the dependency relationships of indirect predecessor and successor nodes that are lacking in the control flow graph through data dependencies and enhances the contextual information of statement nodes. At the same time, in view of the edge heterogeneity of the multigraph, this study uses the relational graph convolutional network to extract the features of the code from the function multigraph. Experiments on a public dataset show that the proposed method can improve the MRR by more than 5% compared with the existing methods based on code text and structural models. The ablation experiments also show that the control flow graph contributes more to the search accuracy than the data dependence graph.

    参考文献
    [1] Liu C, Xia X, Lo D, Gao CY, Yang XH, Grundy J. Opportunities and challenges in code search tools. ACM Computing Surveys, 2021, 54(9): 196. [doi: 10.1145/3480027]
    [2] 魏敏, 张丽萍. 代码搜索方法研究进展. 计算机应用研究, 2021, 38(11): 3215–3221, 3230. [doi: 10.19734/j.issn.1001-3695.2021.04.0096]
    Wei M, Zhang LP. Research progress of code search methods. Application Research of Computers, 2021, 38(11): 3215–3221, 3230 (in Chinese with English abstract). [doi: 10.19734/j.issn.1001-3695.2021.04.0096]
    [3] 李佳静, 梁知音, 韦韬, 毛剑. 一种基于语义的恶意行为分析方法. 北京大学学报(自然科学版), 2008, 44(4): 537–542. [doi: 10.13209/j.0479-8023.2008.084]
    Li JJ, Liang ZY, Wei T, Mao J. A malicious behavior analysis method based on program semantic. Acta Scientiarum Naturalium Universitatis Pekinensis, 2008, 44(4): 537–542 (in Chinese with English abstract). [doi: 10.13209/j.0479-8023.2008.084]
    [4] Linstead E, Bajracharya S, Ngo T, Rigor P, Lopes C, Baldi P. Sourcerer: Mining and searching internet-scale software repositories. Data Mining and Knowledge Discovery, 2009, 18(2): 300–336. [doi: 10.1007/s10618-008-0118-x]
    [5] Martie L, LaToza TD, van der Hoek A. CodeExchange: Supporting reformulation of internet-scale code queries in context (T). In: Proc. of the 30th IEEE/ACM Int’l Conf. on Automated Software Engineering (ASE). Lincoln: IEEE, 2015. 24–35.
    [6] Zhang F, Niu HR, Keivanloo I, Zou Y. Expanding queries for code search using semantically related API class-names. IEEE Transactions on Software Engineering, 2018, 44(11): 1070–1082. [doi: 10.1109/TSE.2017.2750682]
    [7] Gu XD, Zhang HY, Kim S. Deep code search. In: Proc. of the 40th Int’l Conf. on Software Engineering. Gothenburg: ACM, 2018. 933–944.
    [8] Cambronero J, Li HY, Kim S, Sen K, Chandra S. When deep learning met code search. In: Proc. of the 27th ACM Joint Meeting on European Software Engineering Conf. and Symp. on the Foundations of Software Engineering. Tallinn: ACM, 2019. 964–974.
    [9] Shuai JH, Xu L, Liu C, Yan M, Xia X, Lei Y. Improving code search with co-attentive representation learning. In: Proc. of the 28th Int’l Conf. on Program Comprehension. Seoul: ACM, 2020. 196–207.
    [10] Wan Y, Shu JD, Sui YL, Xu GD, Zhao Z, Wu J, Yu P. Multi-modal attention network learning for semantic source code retrieval. In: Proc. of the 34th IEEE/ACM Int’l Conf. on Automated Software Engineering. San Diego: IEEE, 2019. 13–25.
    [11] Gu WC, Li ZJ, Gao CY, Wang CZ, Zhang HY, Xu ZL, Lyu MR. CRaDLe: Deep code retrieval based on semantic dependency learning. Neural Networks, 2021, 141: 385–394. [doi: 10.1016/j.neunet.2021.04.019]
    [12] Gu J, Chen ZM, Martin M. Multimodal representation for neural code search. In: Proc. of the 2021 IEEE Int’l Conf. on Software Maintenance and Evolution (ICSME). Luxembourg: IEEE, 2021. 483–494.
    [13] Allen FE. Control flow analysis. ACM SIGPLAN Notices, 1970, 5(7): 1–19. [doi: 10.1145/390013.808479]
    [14] Ferrante J, Ottenstein KJ, Warren JD. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, 1987, 9(3): 319–349. [doi: 10.1145/24039.24041]
    [15] Huo X, Li M, Zhou ZH. Control flow graph embedding based on multi-instance decomposition for bug localization. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(4): 4223–4230. [doi: 10.1609/aaai.v34i04.5844]
    [16] Ramachandram D, Taylor GW. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine, 2017, 34(6): 96–108. [doi: 10.1109/MSP.2017.2738401]
    [17] Schlichtkrull M, Kipf TN, Bloem P, van den Berg R, Titov I, Welling M. Modeling relational data with graph convolutional networks. In: Proc. of the 15th Int’l Conf. on the Semantic Web. Heraklion: Springer, 2018. 593–607.
    [18] Wilde N, Huitt R, Huitt S. Dependency analysis tools: Reusable components for software maintenance. In: Proc. of the 1989 Conf. on Software Maintenance. Miami: IEEE, 1989. 126–131.
    [19] Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: Bringing order to the Web. Technical Report, Stanford: Stanford University, 1998.
    [20] Lv F, Zhang HY, Lou JG, Wang SW, Zhang DM, Zhao JJ. CodeHow: Effective code search based on api understanding and extended boolean model (E). In: Proc. of the 30th IEEE/ACM Int’l Conf. on Automated Software 1725.
    ng (ASE). Lincoln: IEEE, 2015. 260–270.
    [21] Rahman MM, Roy C. Effective reformulation of query for code search using crowdsourced knowledge and extra-large data analytics. In: Proc. of the 2018 IEEE Int’l Conf. on Software Maintenance and Evolution (ICSME). Madrid: IEEE, 2018. 473–484.
    [22] Nie LM, Jiang H, Ren ZL, Sun ZY, Li XC. Query expansion based on crowd knowledge for code search. IEEE Transactions on Services Computing, 2016, 9(5): 771–783. [doi: 10.1109/TSC.2016.2560165]
    [23] Huang Q, Wu HG. QE-integrating framework based on Github knowledge and SVM ranking. Science China Information Sciences, 2019, 62(5): 52102. [doi: 10.1007/s11432-017-9465-9]
    [24] 黎宣, 王千祥, 金芝. 基于增强描述的代码搜索方法. 软件学报, 2017, 28(6): 1405–1417. http://www.jos.org.cn/1000-9825/5226.htm
    Li X, Wang QX, Jin Z. Description reinforcement based code search. Ruan Jian Xue Bao/Journal of Software, 2017, 28(6): 1405–1417 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5226.htm
    [25] Sachdev S, Li HY, Luan SF, Kim S, Sen K, Chandra S. Retrieval on source code: A neural code search. In: Proc. of the 2nd ACM SIGPLAN Int’l Workshop on Machine Learning and Programming Languages. Philadelphia: ACM, 2018. 31–41.
    [26] Fang S, Tan YS, Zhang T, Liu YP. Self-attention networks for code search. Information and Software Technology, 2021, 134: 106542. [doi: 10.1016/j.infsof.2021.106542]
    [27] Feng ZY, Guo DY, Tang DY, Duan N, Feng XC, Gong M, Shou LJ, Qin B, Liu T, Jiang DX, Zhou M. CodeBERT: A pre-trained model for programming and natural languages. arXiv:2002.08155v4, 2020.
    [28] Guo DY, Ren S, Lu S, Feng ZY, Tang DY, Liu SJ, Zhou L, Duan N, Svyatkovskiy A, Fu SY, Tufano M, Deng SK, Clement C, Drain D, Sundaresan N, Yin J, Jiang DX, Zhou M. GraphCodeBERT: Pre-training code representations with data flow. arXiv:2009.08366, 2021.
    [29] Tai KS, Socher R, Manning CD. Improved semantic representations from tree-structured long short-term memory networks. In: Proc. of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th Int’l Joint Conf. on Natural Language Processing (Vol. 1: Long Papers). Beijing: ACL, 2015. 1556–1566.
    [30] Li YJ, Tarlow D, Brockschmidt M, Zemel R. Gated graph sequence neural networks. arXiv:1511.05493v4, 2017.
    [31] Ling X, Wu LF, Wang SZ, Pan GN, Ma TF, Xu FL, Liu AX, Wu CM, Ji SL. Deep graph matching and searching for semantic code retrieval. ACM Transactions on Knowledge Discovery from Data, 2021, 15(5): 88. [doi: 10.1145/3447571]
    [32] Allamanis M, Brockschmidt M, Khademi M. Learning to represent programs with graphs. arXiv:1711.00740v3, 2018.
    [33] 凌春阳, 邹艳珍, 林泽琦, 谢冰, 赵俊峰. 基于图嵌入的软件项目源代码检索方法. 软件学报, 2019, 30(5): 1481-1497. http://www.jos.org.cn/1000-9825/5721.htm
    Ling CY, Zou YZ, Lin ZQ, Xie B, Zhao JF. Approach to searching software source code with graph embedding. Ruan Jian Xue Bao/Journal of Software, 2019, 30(5): 1481-1497 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5721.htm
    [34] 黄思远, 赵宇海, 梁燚铭. 融合图嵌入和注意力机制的代码搜索. 计算机科学与探索, 2022, 16(4): 844–854. [doi: 10.3778/j.issn.1673-9418.2010087]
    Huang SY, Zhao YH, Liang YM. Code search combining graph embedding and attention mechanism. Journal of Frontiers of Computer Science and Technology, 2022, 16(4): 844–854 (in Chinese with English abstract). [doi: 10.3778/j.issn.1673-9418.2010087]
    [35] Tang J, Qu M, Wang MZ, Zhang M, Yan J, Mei QZ. Line: Large-scale information network embedding. In: Proc. of the 24th Int’l Conf. on World Wide Web. Florence: International World Wide Web Conferences Steering Committee, 2015. 1067–1077.
    [36] Husain H, Wu HH, Gazit T, Allamanis M, Brockschmidt M. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv:1909.09436, 2020.
    [37] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv:1412.6980, 2017.
    [38] Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units. In: Proc. of the 54th Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers). Berlin: ACL, 2016. 1715–
    引证文献
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

徐杨,陈晓杰,汤德佑,黄翰.面向代码搜索的函数功能多重图嵌入.软件学报,2024,35(8):3809-3823

复制
分享
文章指标
  • 点击次数:607
  • 下载次数: 1694
  • HTML阅读次数: 659
  • 引用次数: 0
历史
  • 收稿日期:2022-05-09
  • 最后修改日期:2022-10-04
  • 在线发布日期: 2023-07-26
  • 出版日期: 2024-08-06
文章二维码
您是第19936840位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号