基于指针生成网络的代码注释自动生成模型 (Automatic Generation of Source Code Comments Model Based on Pointer-generator Network)
Authors:
Author biographies:

牛长安 (1997-), male, bachelor's degree, CCF student member; research interests: software engineering and natural language processing.
李传艺 (1991-), male, Ph.D., assistant researcher, CCF professional member; research interests: software engineering, business process management, and natural language processing.
葛季栋 (1978-), male, Ph.D., associate professor, CCF senior member; research interests: software engineering, distributed and edge computing, business process management, and natural language processing.
周宇 (1981-), male, Ph.D., professor, doctoral supervisor, CCF senior member; research interests: intelligent software technology, cloud computing, and big data.
唐泽 (1994-), male, master's degree; research interests: code summarization and API completion.
骆斌 (1967-), male, Ph.D., professor, doctoral supervisor, CCF distinguished member; research interests: software engineering and artificial intelligence.

Corresponding author:

李传艺, E-mail: lcy@nju.edu.cn

Fund Project:

National Natural Science Foundation of China (61802167, 61972197, 61802095); Natural Science Foundation of Jiangsu Province (BK20201250); sub-project of the cooperation agreement of the Huawei-Nanjing University Next-Generation Programming Innovation Laboratory


Automatic Generation of Source Code Comments Model Based on Pointer-generator Network
Fund Project:

National Natural Science Foundation of China (61802167, 61972197, 61802095); Natural Science Foundation of Jiangsu Province, China (BK20201250); Cooperation Fund of Huawei-NJU Creative Laboratory for the Next Programming

    Abstract:

    Code comments play an important role in software quality assurance: they improve the readability of code, making it easier to understand, reuse, and maintain. For a variety of reasons, however, developers sometimes fail to add the necessary comments, so that during software maintenance a great deal of time must often be spent understanding the code, which greatly reduces maintenance efficiency. In recent years, a number of studies have applied machine learning to generate code comments automatically: after extracting semantic and structural information from the code, they feed it into a sequence-to-sequence neural network model to generate the corresponding comment, and all have achieved good results. Nevertheless, Hybrid-DeepCom, currently the best code comment generation model, still has two shortcomings. First, its preprocessing may break the code structure, so the input information of different instances becomes inconsistent and the model learns poorly. Second, because of the limitations of the sequence-to-sequence model, it cannot generate words outside the vocabulary (out-of-vocabulary words, or OOV words) in comments. For example, identifiers such as variable names and method names that appear only rarely in the source code are usually OOV words, and without them comments are hard to understand. To address these problems, a new code comment generation model, CodePtr, is proposed. On the one hand, an encoder for the complete source code is added to resolve the problem of broken code structure; on the other hand, a pointer-generator network module is introduced to switch automatically between the word-generation and word-copying modes at each decoding step. In particular, when the model encounters an identifier that appears only rarely in the input, it can copy it directly into the output, thereby overcoming the inability to generate OOV words. Finally, CodePtr is compared with Hybrid-DeepCom in experiments on a large dataset. The results show that with a vocabulary size of 30 000, CodePtr improves the translation metrics by 6% on average and improves the handling of OOV words by nearly 50%, fully demonstrating the effectiveness of the CodePtr model.

    Abstract:

    Code comments play an important role in software quality assurance: they improve the readability of source code and make it easier to understand, reuse, and maintain. For various reasons, however, developers sometimes omit the necessary comments, forcing maintainers to spend a great deal of time understanding the source code and greatly reducing the efficiency of software maintenance. In recent years, much work has used machine learning to generate comments for source code automatically. These methods extract information such as the code's token sequence and structure, feed it into a sequence-to-sequence (seq2seq) neural model to generate the corresponding comment, and have achieved good results. However, Hybrid-DeepCom, the state-of-the-art code comment generation model, is still deficient in two respects. First, it may break the code structure during preprocessing, making the input information of different instances inconsistent and degrading the model's learning. Second, owing to the limitations of the seq2seq model, it cannot generate out-of-vocabulary (OOV) words in comments. For example, variable names, method names, and other identifiers that appear very infrequently in the source code are usually OOV words; without them, comments are difficult to understand. To solve these problems, this study proposes an automatic comment generation model named CodePtr. On the one hand, an encoder over the complete source code is added to solve the problem of the code structure being broken; on the other hand, a pointer-generator network module is introduced that switches automatically between generating a word and copying a word at each decoding step. In particular, when the model encounters an identifier that appears only rarely in the input, it can copy it directly into the output, thereby solving the problem of being unable to generate OOV words. Finally, this study compares the CodePtr and Hybrid-DeepCom models through experiments on a large dataset. The results show that when the size of the vocabulary is 30 000, CodePtr improves the translation performance metrics by 6% on average and handles OOV words nearly 50% better, which fully demonstrates the effectiveness of the CodePtr model.
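The generate/copy switching described in the abstract follows the standard pointer-generator formulation (See et al.): at each decoding step a scalar p_gen mixes a softmax distribution over the fixed vocabulary with the attention distribution over the source tokens, scattered into an extended vocabulary so that OOV source tokens (e.g. rare identifiers) can be emitted verbatim. The sketch below is a minimal, hypothetical NumPy illustration of that mixing step only; it is not CodePtr's actual implementation, and all names, shapes, and numbers are illustrative.

```python
import numpy as np

def pointer_generator_step(p_vocab, attention, src_ids, p_gen, vocab_size, n_oov):
    """One decoding step of a pointer-generator network.

    p_vocab   : (vocab_size,) softmax over the fixed vocabulary
    attention : (src_len,)    attention weights over the source tokens
    src_ids   : (src_len,)    source token ids in the *extended* vocabulary,
                              where OOV tokens get ids >= vocab_size
    p_gen     : scalar in [0, 1], probability of generating vs. copying
    Returns the final distribution over the extended vocabulary of size
    vocab_size + n_oov, so OOV source tokens can be emitted directly.
    """
    final = np.zeros(vocab_size + n_oov)
    final[:vocab_size] = p_gen * p_vocab            # "generate" mode
    for pos, tok in enumerate(src_ids):             # "copy" mode: scatter-add
        final[tok] += (1.0 - p_gen) * attention[pos]
    return final

# Toy example: vocabulary of 5 words plus 1 OOV identifier (id 5) in the input.
dist = pointer_generator_step(
    p_vocab=np.full(5, 0.2),                # uniform generation distribution
    attention=np.array([0.2, 0.5, 0.3]),    # decoder attends mostly to token 2
    src_ids=[2, 5, 1],                      # id 5 is the OOV identifier
    p_gen=0.6, vocab_size=5, n_oov=1,
)
# dist sums to 1; the OOV token (id 5) gets 0.4 * 0.5 = 0.2 purely from copying.
```

In the full model, p_gen is itself learned, typically as a sigmoid over the decoder state, the context vector, and the current input embedding, so the network decides at every step whether to generate from the vocabulary or copy a source token.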

Cite this article:

牛长安, 葛季栋, 唐泽, 李传艺, 周宇, 骆斌. 基于指针生成网络的代码注释自动生成模型 (Automatic generation of source code comments model based on pointer-generator network). Ruan Jian Xue Bao/Journal of Software, 2021, 32(7): 2142-2165.
History:
  • Received: 2020-09-15
  • Revised: 2020-10-26
  • Published online: 2021-01-22
  • Published: 2021-07-06