Revisiting Retrieval-augmentation Strategy in Code Completion

CLC number: TP311

Fund project: National Natural Science Foundation of China (62302099)

    Abstract:

    When writing code, software developers often refer to code in the same project that implements similar functionality. Code generation models behave similarly, using the code context supplied in the input as a reference. Retrieval-augmented code completion follows the same idea: external code retrieved from a retrieval corpus is supplied as additional context that prompts the generation model to complete the unfinished code. Existing retrieval-augmented code completion methods directly concatenate the input code and the retrieval results to form the model input. This carries a risk that the retrieved fragments fail to prompt the model and instead mislead it, producing inaccurate code. Moreover, because the retrieved code is concatenated with the input regardless of how relevant it is, the effectiveness of these methods depends heavily on the accuracy of the retrieval stage. If retrieval returns no usable fragments, the subsequent completion may suffer as well. First, an empirical study is conducted on the retrieval-augmentation strategies used in existing code completion methods. Qualitative and quantitative experiments analyze how each stage of retrieval augmentation affects completion quality, and the study identifies three factors that influence the effect of retrieval augmentation: code granularity, code retrieval method, and code post-processing method. Next, based on the findings of the empirical study, MAGIC (Multi-stAGe optImization for retrieval augmented Code completion) is proposed, a code completion method that improves retrieval augmentation by optimizing the code retrieval strategy stage by stage. It incorporates code segmentation, second-stage retrieval with reranking, and template-based prompt generation. These strategies strengthen the support that retrieval provides to the code completion model, reduce the noise the model is exposed to during generation, and improve the quality of the generated code. Finally, experiments on a Java code dataset show that, compared with existing retrieval-augmented code completion methods, MAGIC improves edit similarity and exact match by 6.76 and 7.81 percentage points, respectively. Compared with a large code model with 6B parameters, it saves 94.5% of GPU memory and 73.8% of inference time while improving edit similarity and exact match by 5.62 and 4.66 percentage points, respectively.
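
    The staged idea described in the abstract (segment the corpus, retrieve coarsely, rerank, then build a templated prompt) and the two reported metrics can be illustrated with a small Python sketch. This is not the authors' MAGIC implementation. Every function name, the fixed line-window segmentation, the token-overlap and edit-similarity scorers, the `min_score` threshold, and the comment-based prompt template are assumptions chosen only for illustration. Edit similarity is computed here as 1 minus the normalized Levenshtein distance, which is one common definition and may differ from the one used in the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance computed with a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(pred: str, target: str) -> float:
    """1 - normalized edit distance, in [0, 1] (assumed definition)."""
    longest = max(len(pred), len(target))
    return 1.0 if longest == 0 else 1.0 - levenshtein(pred, target) / longest


def exact_match(pred: str, target: str) -> bool:
    """True when prediction and target agree after trimming whitespace."""
    return pred.strip() == target.strip()


def split_into_chunks(code: str, max_lines: int = 6) -> list[str]:
    """Stage 1 (assumed): cut a file into windows of non-blank lines."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    return ["\n".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]


def jaccard(a: str, b: str) -> float:
    """Token-set overlap used for the cheap first retrieval pass."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve(query: str, corpus: list[str], top_k: int = 5,
             keep: int = 1, min_score: float = 0.3) -> list[str]:
    """Stage 2 (assumed): coarse retrieval, rerank, drop weak matches."""
    chunks = [c for doc in corpus for c in split_into_chunks(doc)]
    coarse = sorted(chunks, key=lambda c: jaccard(query, c),
                    reverse=True)[:top_k]
    reranked = sorted(coarse, key=lambda c: edit_similarity(query, c),
                      reverse=True)
    # Filtering keeps unrelated snippets out of the model input.
    return [c for c in reranked[:keep]
            if edit_similarity(query, c) >= min_score]


def build_prompt(query: str, retrieved: list[str]) -> str:
    """Stage 3 (assumed template): label references, then the query."""
    if not retrieved:
        return query  # nothing relevant: leave the input untouched
    refs = "\n".join(f"// Reference snippet:\n{r}" for r in retrieved)
    return f"{refs}\n// Code to complete:\n{query}"


if __name__ == "__main__":
    corpus = ["int add(int a, int b) {\n    return a + b;\n}",
              'String greet(String n) {\n    return "hi " + n;\n}']
    query = "int add(int x, int y) {"
    print(build_prompt(query, retrieve(query, corpus)))
```

    The threshold step reflects the concern raised in the abstract: when no retrieved fragment is similar enough, the prompt falls back to the unmodified input instead of concatenating noise.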

Cite this article

邹佰翰, 汪莹, 彭鑫, 娄一翎, 刘力华, 张昕东, 林帆, 刘名威. Revisiting Retrieval-augmentation Strategy in Code Completion. Journal of Software (软件学报): 1-28

History
  • Received: 2023-12-21
  • Last revised: 2024-03-08
  • Published online: 2024-07-03