Exploration and Improvement of Capabilities of LLMs in Code Refinement Task

CSTR:

Author:

Affiliation:

Author Bio:

Corresponding author: He Tieke, E-mail: hetieke@nju.edu.cn

CLC Number: TP311

Fund Project: National Natural Science Foundation of China (62306137)

摘要 (Abstract):

As a key step in automated code review, the code refinement task helps improve development efficiency and code quality. With large language models (LLMs) demonstrating performance in software engineering far beyond that of traditional small pretrained models, this study explores the performance of both types of models on the automated code refinement task to assess the overall advantages of LLMs. Using traditional code quality metrics (e.g., BLEU, CodeBLEU, Edit Progress), the performance of four mainstream LLMs and four representative small pretrained models on the code refinement task is evaluated, and it is found that the refinement quality of LLMs on the code-refinement-before-review subtask is inferior to that of small pretrained models. Since existing code quality metrics can hardly explain this phenomenon, this study proposes Unidiff-based code refinement metrics that quantify the change operations performed during refinement, in order to explain the causes of the disadvantage and reveal the models' tendencies in executing change operations: (1) the before-review refinement task is difficult, models execute correct change operations with extremely low accuracy, and LLMs behave more "aggressively" than small pretrained models, that is, they tend to execute more code change operations, which leads to their poor performance; (2) compared with small pretrained models, LLMs tend to perform more insertion (ADD) and modification (MODIFY) change operations on the code refinement task, and their ADD operations insert more lines of code on average, further demonstrating their "aggressive" nature. To mitigate the disadvantage of LLMs on the before-review refinement task, this study proposes the LLM-Voter method based on LLMs and ensemble learning, with two sub-schemes, Inference-based and Confidence-based, aiming to integrate the strengths of different base models to improve refinement quality. On this basis, a refinement determination mechanism is further introduced to enhance the stability and reliability of the model's decisions. Experiments show that the Confidence-based LLM-Voter method substantially improves the EM (Exact Match) score while achieving refinement quality superior to all base models, thus effectively mitigating the disadvantage of LLMs.
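The paper's Unidiff-based metrics are not specified on this page; the sketch below illustrates how line-level change operations could be counted with Python's standard difflib, where an insert-only hunk counts as ADD, a delete-only hunk as DELETE, and a replace hunk as MODIFY. The ADD/DELETE/MODIFY names follow the abstract, but this hunk-classification rule and the average-insertion statistic are assumptions for illustration, not the paper's exact definitions.

```python
import difflib


def classify_operations(before: str, after: str):
    """Count ADD / DELETE / MODIFY operations between two code versions,
    in the spirit of a unified diff, and report the average number of
    lines inserted per ADD operation."""
    sm = difflib.SequenceMatcher(a=before.splitlines(), b=after.splitlines())
    ops = {"ADD": 0, "DELETE": 0, "MODIFY": 0}
    added_lines = 0
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "insert":        # lines only added -> ADD
            ops["ADD"] += 1
            added_lines += j2 - j1
        elif tag == "delete":      # lines only removed -> DELETE
            ops["DELETE"] += 1
        elif tag == "replace":     # lines rewritten in place -> MODIFY
            ops["MODIFY"] += 1
    avg_add = added_lines / ops["ADD"] if ops["ADD"] else 0.0
    return ops, avg_add
```

Comparing a model's refined code against its input this way makes the "aggressiveness" described in the abstract measurable: more total operations, and more lines per ADD, indicate a more aggressive model.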

Abstract:

As a crucial subtask of automated code review, code refinement plays a significant role in improving development efficiency and code quality. With large language models (LLMs) demonstrating superior performance over small pretrained models in software engineering, this study explores the performance of these two types of models on the automated code refinement task to evaluate the comprehensive advantages of LLMs. Traditional code quality metrics (e.g., BLEU, CodeBLEU, Edit Progress) are used to evaluate the performance of four mainstream LLMs and four representative small pretrained models in automated code review. The findings indicate that LLMs underperform small pretrained models on the code-refinement-before-review (CRB) subtask. Given the limitations of existing code quality metrics in explaining this phenomenon, this study proposes Unidiff-based code refinement metrics to quantify the changes made during refinement. These new metrics elucidate the reasons for the observed disadvantage and reveal the models' tendencies in executing changes: (1) the CRB task is highly challenging, with models exhibiting extremely low accuracy in executing correct changes; compared with small pretrained models, LLMs exhibit more "aggressive" behavior, tending to execute more code changes, which results in poor performance; (2) compared with small pretrained models, LLMs are inclined to perform more ADD and MODIFY change operations, with ADD operations inserting more lines of code on average, further demonstrating their "aggressive" nature. To mitigate the disadvantages of LLMs on the CRB task, this study introduces LLM-Voter, a method based on large language models and ensemble learning. It includes two sub-schemes, Inference-based and Confidence-based, aimed at integrating the strengths of different base models to enhance refinement quality. Furthermore, a refinement determination mechanism is introduced to improve the decision stability and reliability of the model. Experimental results demonstrate that the Confidence-based LLM-Voter method significantly increases the EM (Exact Match) score while achieving refinement quality superior to all base models, thereby effectively alleviating the disadvantages of LLMs.
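The internals of LLM-Voter are not detailed in the abstract; the sketch below shows one plausible reading of the Confidence-based scheme combined with a refinement determination mechanism, assuming each base model returns a candidate refinement together with a scalar confidence. The confidence-pooling rule and the `threshold` guard are illustrative assumptions, not the paper's actual method.

```python
from collections import defaultdict


def confidence_vote(candidates, original, threshold=0.5):
    """Select one refinement from several base-model outputs.

    `candidates` is a list of (refined_code, confidence) pairs, one per
    base model. Identical outputs pool their confidence, so agreement
    between models acts as an additional vote. The refinement
    determination guard keeps the original code when no candidate is
    convincingly supported, curbing over-aggressive edits.
    """
    pooled = defaultdict(float)
    for code, conf in candidates:
        pooled[code] += conf
    best, score = max(pooled.items(), key=lambda kv: kv[1])
    return best if score >= threshold else original
```

Under this reading, the guard directly targets the "aggressiveness" problem: low-confidence, disagreeing candidates fall back to the unmodified input, which also tends to raise the EM score.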

Cite this article:

Wang Zhipeng, He Tieke, Zhao Ruoyu, Zheng Tao. Exploration and Improvement of Capabilities of LLMs in Code Refinement Task. Journal of Software (软件学报), 2025, 36(6): 0

History
  • Received: 2024-08-25
  • Revised: 2024-10-14
  • Accepted:
  • Online: 2024-12-10
  • Published: