Research Progress of Code Naturalness and Its Application
Author biographies:

CHEN Zhezhe (1997-), female, bachelor's degree. Her research interests include intelligent software engineering and mining software repositories.
LIU Zhongxin (1994-), male, CCF professional member. His research interests include intelligent software engineering and automatic generation of software documentation.
YAN Meng (1989-), male, Ph.D., research fellow, doctoral supervisor, CCF professional member. His research interests include intelligent software engineering, mining software repositories, and software maintenance and evolution.
XU Zhou (1990-), male, Ph.D., assistant research fellow, CCF professional member. His research interests include mining software repositories and software defect prediction.
XIA Xin (1986-), male, Ph.D., lecturer, doctoral supervisor, CCF professional member. His research interests include mining software repositories and empirical software engineering.
LEI Yan (1985-), male, Ph.D., associate professor, CCF professional member. His research interests include software fault localization and automated program repair.

Corresponding author:

YAN Meng, E-mail: mengy@cqu.edu.cn

Funding:

National Natural Science Foundation of China (62002034); Fundamental Research Funds for the Central Universities (2020CDCGRJ072, 2020CDJQYA021, 2021CDJKYJH032); National Defense Basic Scientific Research Program (WDZC20205500308); China Postdoctoral Science Foundation (2020M673137); Natural Science Foundation of Chongqing (cstc2020jcyj-bshX0114)



Abstract:

    The study of code naturalness is a common research hotspot in the fields of natural language processing and software engineering. It aims to solve various software engineering tasks by building code naturalness models based on natural language processing techniques. In recent years, as the amount of source code and data in open-source software communities has continued to grow, more and more researchers have focused on the information contained in source code, and a series of research results have been achieved. At the same time, however, code naturalness research faces many challenges in code corpus construction, model building, and task application. In view of this, this paper reviews and summarizes recent progress in code naturalness research and its applications in terms of code corpus construction, model construction, and task application. The main contents include: (1) introducing the basic concept of code naturalness and an overview of its research; (2) summarizing the corpora currently used in code naturalness research, and classifying and summarizing the modeling methods for code naturalness; (3) summarizing the experimental validation methods and evaluation metrics for code naturalness models; (4) summarizing and categorizing the current applications of code naturalness; (5) identifying the key open issues of code naturalness techniques; (6) discussing prospects for the future development of code naturalness techniques.
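The naturalness hypothesis is typically operationalized with an n-gram language model: a model trained on a corpus of code assigns low cross-entropy to token sequences it finds predictable, and that score is what downstream tasks (defect detection, completion, syntax-error localization) consume. The sketch below is a minimal illustrative bigram example, not the survey's own method; the toy token corpus and add-one smoothing are assumptions chosen purely for demonstration.

```python
from collections import Counter
from math import log2

def train_bigram(tokens):
    # Count unigram and bigram occurrences over a training token stream.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def cross_entropy(tokens, unigrams, bigrams, vocab_size, alpha=1.0):
    # Average -log2 P(token | previous token) with add-alpha smoothing.
    # Lower values mean the sequence looks more "natural" to the model.
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        total += -log2(p)
    return total / (len(tokens) - 1)

# Toy "corpus": the token stream of one tiny snippet (illustrative only).
corpus = "if ( x > 0 ) { return x ; } else { return 0 ; }".split()
uni, bi = train_bigram(corpus)
vocab = len(uni)

# A token order the model has seen scores lower (more natural)
# than the same tokens scrambled.
print(cross_entropy("( x > 0 )".split(), uni, bi, vocab))
print(cross_entropy(") 0 > x (".split(), uni, bi, vocab))
```

In practice the surveyed work trains far larger n-gram or neural models on millions of files, but the scoring principle is the same: per-token cross-entropy against a corpus-trained model.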

Cite this article:

Chen ZZ, Yan M, Xia X, Liu ZX, Xu Z, Lei Y. Research progress of code naturalness and its application. Ruan Jian Xue Bao/Journal of Software, 2022, 33(8): 3015-3034 (in Chinese with English abstract).

History
  • Received: 2021-01-29
  • Revised: 2021-04-14
  • Published online: 2021-05-21
  • Issue date: 2022-08-06