ZHAO Xian-Lin, PAN Xing-Lu, ZOU Yan-Zhen, LIU Chen-Xiao, XIE Bing
Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, Beijing 100871, China; School of Computer Science, Peking University, Beijing 100871, China
TP311
Code comment generation is an important research task in software engineering. Mainstream approaches train deep learning models to generate comments and rely on metrics such as BLEU to evaluate comment quality on open code comment datasets. These evaluations mainly reflect the similarity between generated comments and the manual reference comments in the datasets. However, the quality of the manual reference comments in open comment datasets varies widely, which has led to growing doubts about the effectiveness of such metrics. For code comment generation tasks, there is therefore an urgent need for direct and effective methods of evaluating code comment quality; such methods can both improve the quality of open comment datasets and strengthen the evaluation of generated comments. This study surveys and analyzes existing quantifiable methods for code comment quality evaluation and applies a set of multi-dimensional metrics to directly evaluate the quality of code comments in mainstream open datasets, comments generated by traditional methods, and comments generated by ChatGPT. The study reveals the following findings. 1) The quality of code comments in mainstream open datasets needs improvement, with issues such as inaccuracy, poor readability, excessive simplicity, and a lack of useful information. 2) Comments generated by traditional methods are more lexically and semantically similar to the code but lack information that is more useful to developers, such as the high-level intent of the code. 3) One important reason for the low BLEU scores of generated comments is the large number of poor-quality reference comments in the datasets, which lack relevance to the code or exhibit poor naturalness; such reference comments should be filtered out or improved. 4) Comments generated by LLMs such as ChatGPT are rich in content but tend to be lengthy, and their quality evaluation needs to be tailored to developer intent and specific usage scenarios. Based on these findings, this study offers several suggestions for future research on code comment generation and comment quality evaluation.
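As background for the BLEU-based evaluation discussed above, the following is a minimal sketch of how sentence-level BLEU between a generated comment and a manual reference comment is typically computed using NLTK. This is an illustrative example only, not the evaluation pipeline used in the study; the tokenization and smoothing choices here are assumptions. Note that a low score merely indicates low n-gram overlap with the reference comment and, as the findings above suggest, says nothing about whether the reference itself is of good quality.

```python
# Illustrative sketch: reference-based comment evaluation with sentence-level BLEU.
# Not the evaluation pipeline used in this paper; tokenization is a simple
# whitespace split and smoothing method1 is an assumed, common choice.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def comment_bleu(reference_comment: str, generated_comment: str) -> float:
    """Compute sentence-level BLEU between one reference and one generated comment.

    Smoothing avoids zero scores when short comments share no higher-order n-grams.
    """
    references = [reference_comment.lower().split()]  # BLEU expects a list of reference token lists
    hypothesis = generated_comment.lower().split()
    smoothing = SmoothingFunction().method1
    return sentence_bleu(references, hypothesis, smoothing_function=smoothing)


if __name__ == "__main__":
    ref = "returns the index of the first element matching the given key"
    gen = "return index of first matching element"
    print(f"BLEU = {comment_bleu(ref, gen):.4f}")
```

Because the score is computed purely against the reference, filtering or improving low-quality reference comments (finding 3 above) directly changes what such a metric rewards.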
ZHAO Xian-Lin, PAN Xing-Lu, ZOU Yan-Zhen, LIU Chen-Xiao, XIE Bing. Research on comment quality evaluation for code comment generation tasks. Ruan Jian Xue Bao/Journal of Software, 1-25.