Research on Comment Quality Evaluation for Code Comment Generation Tasks

CLC number: TP311

Fund project: Science and Technology Innovation 2030 "New Generation of Artificial Intelligence" Major Project (2021ZD0110303)


    Abstract:

    Code comment generation is an important research task in software engineering. Mainstream comment generation methods train deep learning models to generate comments and rely on metrics such as BLEU, computed on open code comment datasets, to evaluate comment quality. These evaluations mainly reflect the similarity between generated comments and the human-written reference comments in the datasets. However, the quality of the reference comments in open datasets is hard to guarantee, which has led to growing doubts about the effectiveness of such metrics. Therefore, for code comment generation tasks, there is an urgent need for a direct and effective method of evaluating code comment quality, both to improve the quality of open comment datasets and to enhance the evaluation of generated comments. To this end, this study surveys and analyzes existing quantitative methods for comment quality evaluation and applies a set of multi-dimensional quality metrics to code comments in mainstream open datasets, comments generated by typical generation methods, and comments generated by ChatGPT. The study yields the following findings. 1) The quality of code comments in mainstream open datasets needs improvement; the comments suffer, to varying degrees, from inaccuracy, poor readability, excessive brevity, and a lack of useful information. 2) Comments generated by existing methods are generally closer to the code lexically and semantically, but lack information that is more useful to developers, such as the high-level intent of the code. 3) An important reason for the low BLEU scores of generated comments is the large number of poor-quality reference comments in the datasets, e.g., references that are unrelated to the code or poorly written in natural language; such reference comments should be filtered out or improved. 4) Comments generated by large language models such as ChatGPT are rich in content but tend to be lengthy; their quality evaluation needs to be tailored to developer intent and the specific scenario. Based on these findings, this study also offers several suggestions for future research on code comment generation and comment quality evaluation.
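The BLEU-based evaluation the abstract critiques can be illustrated with a minimal smoothed sentence-level BLEU in pure Python. This is an illustrative sketch only: real studies use library implementations, and the exact smoothing and tokenization (both assumed here) vary across comment generation papers.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with add-one smoothing on each n-gram precision.

    Both arguments are whitespace-tokenized strings; the smoothing choice
    is an assumption for illustration, not the variant used in any
    specific paper.
    """
    ref, hyp = reference.split(), hypothesis.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        hyp_counts = Counter(ngrams(hyp, n))
        # Clipped matches: an n-gram counts at most as often as it
        # appears in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        # Add-one smoothing so one empty n-gram order does not zero the score.
        prec = (clipped + 1) / (total + 1)
        log_prec_sum += math.log(prec) / max_n
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec_sum)
```

Smoothing matters here because code comments are short: without it, a single missing 4-gram drives an otherwise close comment's score to zero, one of the reasons sentence-level BLEU correlates poorly with human judgments of comment quality.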

Cite this article:

赵衔麟, 潘兴禄, 邹艳珍, 刘陈晓, 谢冰. Research on Comment Quality Evaluation for Code Comment Generation Tasks. Journal of Software (软件学报), ():1-25

History
  • Received: 2023-11-07
  • Revised: 2024-04-01
  • Online: 2024-12-04
Copyright: Institute of Software, Chinese Academy of Sciences