Revisiting Retrieval Augmentation Strategies in Code Completion
Authors:
Affiliations:

1. School of Computer Science, Fudan University; 2. Alibaba Group

Abstract:

When writing code, software developers often refer to code in the same project that implements similar functionality. Code generation models behave in a similar way: they use the code context provided in the input as a reference when generating code. Retrieval-augmented code completion builds on this idea: external code retrieved from a retrieval corpus serves as additional information that prompts the generation model to produce the target code. Existing retrieval-augmented code completion methods simply concatenate the input code and the retrieval results as the input of the generation model. Although this provides more context, it carries a risk: the retrieved code fragments may not help the model at all and may instead mislead it, yielding inaccurate or irrelevant completions. Furthermore, because the retrieved code is concatenated with the input regardless of whether it is actually relevant, the effectiveness of the approach depends heavily on the accuracy of the retrieval stage; if retrieval fails to return usable code fragments, the quality of the subsequent completion can suffer as well. This paper first presents an empirical study of the retrieval augmentation strategies used in existing code completion methods. Through qualitative and quantitative experiments, it analyzes how each stage of retrieval augmentation affects completion quality and identifies three factors that influence its effectiveness: code granularity, the code retrieval method, and the code post-processing method. Building on these findings, the paper proposes MAGIC (Multi-stAGe optImization for retrieval-augmented Code completion), a code completion method that improves retrieval augmentation by optimizing the code retrieval strategy in stages. MAGIC introduces code segmentation, second-stage retrieval with reranking, and template prompt generation; together these strategies strengthen the assistance that retrieval provides to the completion model, reduce the noise the model is exposed to during generation, and improve the quality of the generated code. Experiments on a Java dataset show that, compared with existing retrieval-augmented code completion methods, MAGIC improves edit similarity and exact match by 6.76 and 7.81 percentage points, respectively. Compared with a 6B-parameter code model, MAGIC saves 94.5% of GPU memory and 73.8% of inference time while still improving edit similarity and exact match by 5.62 and 4.66 percentage points, respectively.
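The staged pipeline summarized in the abstract (code segmentation, two-stage retrieval with reranking, and template prompt generation) can be illustrated with the minimal Python sketch below. This is not the MAGIC implementation: every function name, the toy lexical and character-level similarities that stand in for a BM25 index and a dense embedding encoder, and the prompt template are assumptions made purely for illustration.

```python
# Minimal sketch (not the authors' implementation) of a staged
# retrieval-augmented code completion pipeline as described in the abstract:
# (1) segment repository code into chunks, (2) coarse first-stage retrieval,
# (3) rerank the top candidates, (4) build a template prompt for the
# completion model. All names and similarity functions are illustrative.

from typing import List


def segment_code(source: str, window: int = 8, stride: int = 4) -> List[str]:
    """Split a file into overlapping line-level chunks (code segmentation stage)."""
    lines = source.splitlines()
    chunks = []
    for start in range(0, max(len(lines) - window + 1, 1), stride):
        chunks.append("\n".join(lines[start:start + window]))
    return chunks


def token_overlap(a: str, b: str) -> float:
    """Cheap lexical similarity; a BM25 index would be used in a real system."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def retrieve(query: str, corpus: List[str], k: int = 10) -> List[str]:
    """First-stage retrieval: return the k chunks with the highest lexical overlap."""
    return sorted(corpus, key=lambda c: token_overlap(query, c), reverse=True)[:k]


def rerank(query: str, candidates: List[str], k: int = 2) -> List[str]:
    """Second-stage reranking; a dense embedding encoder would replace this
    character-level similarity in practice."""
    def char_sim(c: str) -> float:
        ta, tb = set(query), set(c)
        return len(ta & tb) / len(ta | tb) if ta and tb else 0.0
    return sorted(candidates, key=char_sim, reverse=True)[:k]


def build_prompt(context: str, references: List[str]) -> str:
    """Template prompt generation: mark retrieved code as reference material
    instead of silently concatenating it with the unfinished code."""
    parts = []
    for i, ref in enumerate(references, 1):
        parts.append(f"// Reference snippet {i} (retrieved, may be irrelevant):\n{ref}")
    parts.append("// Unfinished code to complete:\n" + context)
    return "\n\n".join(parts)


if __name__ == "__main__":
    repo_file = """
public int sumList(List<Integer> xs) {
    int total = 0;
    for (int x : xs) { total += x; }
    return total;
}
""".strip()
    corpus = segment_code(repo_file, window=3, stride=1)
    unfinished = "public int sumArray(int[] xs) {\n    int total = 0;"
    candidates = retrieve(unfinished, corpus, k=5)
    references = rerank(unfinished, candidates, k=1)
    print(build_prompt(unfinished, references))  # fed to the completion model
```

The design point the sketch mirrors is that retrieved snippets are filtered through a second reranking pass and then explicitly marked as reference material in a template, rather than being blindly concatenated with the unfinished code.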

History
  • Received: 2023-12-21
  • Revised: 2024-05-10
  • Accepted: 2024-05-15