Abstract: As a crucial subtask of automated code review, code refinement plays a significant role in improving development efficiency and code quality. With large language models (LLMs) demonstrating performance superior to that of small pretrained models in software engineering, this study explores the performance of these two types of models on the automated code refinement task to assess the comprehensive advantages of LLMs. Traditional code quality metrics (e.g., BLEU, CodeBLEU, Edit Progress) are used to evaluate four mainstream LLMs and four representative small pretrained models on automated code review. The findings indicate that LLMs underperform small pretrained models on the code refinement before review (CRB) subtask. Given the limitations of existing code quality metrics in explaining this phenomenon, this study proposes Unidiff-based code refinement metrics to quantify the changes made during refinement. These new metrics elucidate the reasons for the observed disadvantage and reveal the models' tendencies when executing changes: (1) the CRB task is highly challenging, and all models exhibit extremely low accuracy in executing correct changes; compared with small pretrained models, LLMs behave more "aggressively," tending to execute more code changes, which results in poorer performance; (2) compared with small pretrained models, LLMs are inclined to perform more ADD and MODIFY change operations, and their ADD operations typically insert more lines on average, further demonstrating their "aggressive" nature. To mitigate the disadvantages of LLMs on the CRB task, this study introduces LLM-Voter, an approach based on large language models and ensemble learning. It comprises two sub-schemes, Inference-based and Confidence-based, which integrate the strengths of different base models to improve code quality. Furthermore, a refinement determination mechanism is introduced to improve the decision stability and reliability of the model.
Experimental results demonstrate that the Confidence-based LLM-Voter significantly increases the Exact Match (EM) score while achieving refinement quality superior to that of all base models, thereby effectively alleviating the disadvantages of LLMs.