Exploration and Improvement of Capabilities of LLMs in Code Refinement Task
CLC Number: TP311

Abstract:

As a crucial subtask of automated code review, code refinement plays a significant role in improving efficiency and code quality. With large language models (LLMs) demonstrating superior performance over small pretrained models in software engineering, this study explores the performance of these two types of models on the automated code refinement task to assess the overall advantages of LLMs. Traditional code quality metrics (e.g., BLEU, CodeBLEU, and Edit Progress) are used to evaluate four mainstream LLMs and four representative small pretrained models on automated code review. The findings indicate that LLMs underperform small pretrained models on the code refinement before review (CRB) subtask. Given the limitations of existing code quality metrics in explaining this phenomenon, this study proposes Unidiff-based code refinement metrics to quantify the changes made during refinement. These new metrics explain the observed disadvantage and reveal the models' tendencies when making changes: (1) the CRB task is highly challenging, and all models exhibit extremely low accuracy in making correct changes; compared with small pretrained models, LLMs behave more "aggressively", tending to make more code changes, which results in poor performance; (2) compared with small pretrained models, LLMs are inclined to perform more ADD and MODIFY operations, and their ADD operations insert more lines on average, further demonstrating their "aggressive" nature. To mitigate the disadvantage of LLMs on the CRB task, this study introduces LLM-Voter, a method based on large language models and ensemble learning. It includes two sub-schemes, Inference-based and Confidence-based, which aim to integrate the strengths of different base models to improve code quality. Furthermore, a refinement determination mechanism is introduced to improve the stability and reliability of the model's decisions. Experimental results show that the Confidence-based LLM-Voter significantly increases the EM (Exact Match) score while achieving refinement quality superior to that of all base models, thereby effectively alleviating the disadvantages of LLMs.
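The Unidiff-based metrics are only summarized above; the following is a minimal sketch, assuming they are obtained by classifying each hunk of the unified diff between the code before and after refinement as an ADD, DELETE, or MODIFY operation and averaging the number of lines inserted per ADD hunk. It relies only on Python's standard difflib; the function name change_operations and the hunk-level classification rule are illustrative assumptions rather than the paper's exact definitions.

```python
import difflib
from collections import Counter

def change_operations(before: str, after: str):
    """Classify each hunk of the unified diff between the pre- and
    post-refinement code as an ADD, DELETE, or MODIFY operation, and
    track how many lines each pure-ADD hunk inserts."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(),
                                lineterm="", n=0)
    ops = Counter()
    add_sizes = []            # lines inserted by each pure-ADD hunk
    plus = minus = 0
    in_hunk = False

    def close_hunk():
        nonlocal plus, minus
        if plus and minus:
            ops["MODIFY"] += 1
        elif plus:
            ops["ADD"] += 1
            add_sizes.append(plus)
        elif minus:
            ops["DELETE"] += 1
        plus = minus = 0

    for line in diff:
        if line.startswith("@@"):      # hunk header: finish the previous hunk
            close_hunk()
            in_hunk = True
        elif in_hunk and line.startswith("+"):
            plus += 1
        elif in_hunk and line.startswith("-"):
            minus += 1
    close_hunk()

    avg_add_lines = sum(add_sizes) / len(add_sizes) if add_sizes else 0.0
    return ops, avg_add_lines
```

Similarly, the Confidence-based LLM-Voter is described only at a high level. The sketch below assumes each base model contributes a candidate refinement together with a confidence score (e.g., a normalized mean token log-probability) and models the refinement determination mechanism as a threshold gate that keeps the unrefined code when no candidate is confident enough; the function name llm_voter and the 0.5 threshold are hypothetical.

```python
def llm_voter(original: str, candidates: list[tuple[str, float]],
              threshold: float = 0.5) -> str:
    """Pick the candidate refinement with the highest confidence score;
    the refinement determination gate keeps the original code when no
    candidate is confident enough to justify a change."""
    best_code, best_conf = max(candidates, key=lambda c: c[1])
    return best_code if best_conf >= threshold else original
```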

Get Citation

Wang Zhipeng, He Tieke, Zhao Ruoyu, Zheng Tao. Exploration and Improvement of Capabilities of LLMs in Code Refinement Task. Journal of Software, 2025, 36(6).

History
  • Received: August 25, 2024
  • Revised: October 14, 2024
  • Online: December 10, 2024