Exploration and Improvement of Capabilities of LLMs in Code Refinement Task
Authors: 王志鹏, 何铁科, 赵若愚, 郑滔

Corresponding author: 何铁科, E-mail: hetieke@nju.edu.cn

CLC number: TP311

Funding: National Natural Science Foundation of China (62306137)



Abstract:

As a crucial part of automated code review, the code refinement task helps improve development efficiency and code quality. Since large language models (LLMs) have shown far better performance than traditional small-scale pre-trained models in the field of software engineering, this study explores the performance of these two types of models in automatic code refinement to evaluate the comprehensive advantages of LLMs. Traditional code quality evaluation metrics (e.g., BLEU, CodeBLEU, and edit progress) are used to evaluate four mainstream LLMs and four representative small-scale pre-trained models on the code refinement task. The results show that the refinement quality of LLMs on the pre-review code refinement subtask is inferior to that of small-scale pre-trained models. Since existing code quality evaluation metrics can hardly explain this phenomenon, this study proposes Unidiff-based code refinement evaluation metrics that quantify the change operations performed during refinement, in order to explain the causes of this inferiority and reveal the models' tendencies when performing change operations: (1) The pre-review code refinement task is rather difficult; the accuracy of the models in performing correct change operations is extremely low, and LLMs are more “aggressive” than small-scale pre-trained models, that is, they tend to perform more code change operations, which leads to their poor performance. (2) Compared with small-scale pre-trained models, LLMs tend to perform more insertion (ADD) and modification (MODIFY) change operations in the code refinement task, and their ADD operations insert more code lines on average, further demonstrating their “aggressive” nature. To alleviate the disadvantages of LLMs in the pre-review refinement task, this study proposes the LLM-Voter method based on LLMs and ensemble learning, which includes two sub-schemes, Inference-based (selection via model inference) and Confidence-based (selection via confidence scores), aiming to integrate the strengths of different base models to improve code refinement quality. On this basis, a refinement determination mechanism is further introduced to enhance the decision stability and reliability of the method. Experimental results demonstrate that the Confidence-based LLM-Voter method significantly increases the exact match (EM) value while achieving refinement quality superior to that of all base models, thus effectively alleviating the disadvantages of LLMs.
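
The Unidiff-based metrics judge refinement quality at the level of individual change operations rather than surface n-gram similarity. As an illustration only (the abstract does not give the paper's exact operation definitions), the following minimal Python sketch classifies line-level edits between the pre- and post-refinement code into ADD, DELETE, and MODIFY operations, using difflib opcodes in place of a full Unidiff hunk parser; the function name and the ADD_LINES bookkeeping are assumptions made for this sketch.

```python
# Hedged sketch: count line-level change operations between two code versions.
# difflib's opcodes stand in for Unidiff hunks; the ADD/DELETE/MODIFY names
# follow the abstract, but the paper's precise definitions may differ.
from collections import Counter
from difflib import SequenceMatcher

def count_change_operations(before: str, after: str) -> Counter:
    """Classify edits and track how many lines ADD operations insert."""
    src, dst = before.splitlines(), after.splitlines()
    ops = Counter()
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, src, dst).get_opcodes():
        if tag == "insert":        # lines present only in the refined code
            ops["ADD"] += 1
            ops["ADD_LINES"] += j2 - j1   # enables average-lines-per-ADD
        elif tag == "delete":      # lines removed from the original code
            ops["DELETE"] += 1
        elif tag == "replace":     # lines rewritten in place
            ops["MODIFY"] += 1
    return ops

before = "import math\ndef area(r):\n    return 3.14 * r * r\n"
after = 'import math\n\ndef area(r):\n    """Circle area."""\n    return math.pi * r * r\n'
print(count_change_operations(before, after))
# e.g. Counter({'ADD': 1, 'ADD_LINES': 1, 'MODIFY': 1})
```

Counting operations this way makes the “aggressiveness” claim measurable: a model that touches more hunks, or inserts more lines per ADD, scores higher on these counters even when BLEU-style metrics barely move.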
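
The Confidence-based sub-scheme can be pictured as selecting, among the candidate refinements produced by the base models, the one whose producer is most certain, and keeping the original code when no candidate is convincing (the refinement determination mechanism). The sketch below is a plausible reading under stated assumptions, not the paper's implementation: the Candidate fields, the mean-log-probability confidence, and the threshold value are all illustrative.

```python
# Hedged sketch of confidence-based ensemble selection in the spirit of
# LLM-Voter. Every name and number here is an assumption for illustration.
from dataclasses import dataclass

@dataclass
class Candidate:
    model_name: str    # which base model proposed this refinement
    refined_code: str  # the proposed refined version of the code
    confidence: float  # e.g., mean token log-probability of the generation

def select_refinement(original: str, candidates: list[Candidate],
                      threshold: float = -0.5) -> str:
    """Return the most confident candidate, or keep the original code."""
    best = max(candidates, key=lambda c: c.confidence, default=None)
    if best is None or best.confidence < threshold:
        # Determination mechanism: no candidate is trusted enough to apply.
        return original
    return best.refined_code

candidates = [
    Candidate("model-A", "c = a + b", confidence=-0.2),
    Candidate("model-B", "c = sum((a, b))", confidence=-0.9),
]
print(select_refinement("c = a+b", candidates))  # -> c = a + b
```

Compared with the Inference-based variant, which presumably relies on a model's own reasoning to pick among candidates, this selection rule is deterministic, which is one plausible source of the decision stability the abstract emphasizes.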

Cite this article:

Wang ZP, He TK, Zhao RY, Zheng T. Exploration and improvement of capabilities of LLMs in code refinement task. Ruan Jian Xue Bao/Journal of Software, 2025, 36(6): 2477–2500 (in Chinese with English abstract).
History:
  • Received: 2024-08-25
  • Revised: 2024-10-14
  • Published online: 2024-12-10