Survey of Intelligent Code Completion

Authors:

YANG Bo (1997-), male, born in Shuyang, Jiangsu, Ph.D. candidate, CCF student member. His research interests include intelligent software engineering and mining software repositories. ZHANG Neng (1990-), male, Ph.D., assistant researcher, CCF professional member. His research interests include software engineering and services computing. LI Shanping (1963-), male, Ph.D., professor, doctoral supervisor, CCF senior member. His research interests include distributed computing, software engineering, and operating system kernels. XIA Xin (1986-), male, Ph.D., lecturer, CCF professional member. His research interests include mining software repositories and empirical software engineering.

Corresponding author:

XIA Xin, E-mail: xin.xia@monash.edu
Abstract:

Code completion is one of the key features of automated software development and an essential component of most modern integrated development environments and source code editors. By providing instant predictions of class names, method names, keywords, and other program elements, code completion assists developers in writing programs and directly improves the efficiency of software development. In recent years, the growing scale of source code and data in open-source software communities, together with remarkable progress in artificial intelligence, has greatly advanced automated software development techniques. Intelligent code completion builds a language model for source code, learns features from an existing code corpus, and, given the contextual code features around the position to be completed, retrieves the most similar matches in the corpus for recommendation and prediction. Compared with traditional code completion, intelligent code completion has become one of the hot topics in software engineering owing to its high accuracy, its support for multiple completion forms, and its ability to learn and iterate. Researchers have carried out a series of studies on intelligent code completion. According to how these methods represent and exploit source code information, they can be divided into two research directions: programming-language-based representations and statistical-language-based representations. Programming-language-based representations are further divided into three categories: token sequences, abstract syntax trees, and control/data flow graphs; statistical-language-based representations are further divided into two categories: N-gram models and neural network models. Starting from the perspective of code representation, this survey reviews and summarizes recent research progress on code completion methods. The main contents include: (1) describing and classifying existing intelligent code completion methods according to their code representations; (2) summarizing the general process of code completion as well as the model validation methods and performance evaluation metrics used in model evaluation; (3) identifying the main challenges of intelligent code completion; (4) outlining future research directions for intelligent code completion.
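
As an illustration of the statistical-language, N-gram direction mentioned in the abstract, the following is a minimal sketch (not code from any system covered by the survey; the toy corpus, whitespace tokenization, and function names are placeholders) of a trigram model over code tokens that ranks candidate next tokens by how often they followed the same two-token context in the training corpus:

# Minimal illustrative sketch: a trigram language model over code tokens.
# Real systems use a language-aware lexer and smoothed probabilities.
from collections import Counter, defaultdict

def train_trigram(token_lists):
    """Count next-token frequencies for every two-token context."""
    model = defaultdict(Counter)
    for tokens in token_lists:
        padded = ["<s>", "<s>"] + tokens
        for i in range(len(padded) - 2):
            context = (padded[i], padded[i + 1])
            model[context][padded[i + 2]] += 1
    return model

def complete(model, context, k=3):
    """Return the k most frequent completions for a two-token context."""
    return [tok for tok, _ in model[tuple(context)].most_common(k)]

if __name__ == "__main__":
    # Toy corpus of already-tokenized statements (whitespace splitting is a
    # stand-in for a real lexer).
    corpus = [
        "reader = open ( path )".split(),
        "data = reader . read ( )".split(),
        "reader . close ( )".split(),
    ]
    model = train_trigram(corpus)
    print(complete(model, ["reader", "."]))  # ['read', 'close'] on this toy corpus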

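Ranked completion lists are commonly scored with measures such as top-k accuracy and mean reciprocal rank (MRR). The abstract does not enumerate the specific metrics the survey covers, so the sketch below is only an assumed, self-contained illustration of these two widely used measures:

# Illustrative sketch of two common ranking metrics for code completion.
def top_k_accuracy(ranked_lists, ground_truth, k):
    """Fraction of cases where the true token appears in the top k candidates."""
    hits = sum(1 for cands, truth in zip(ranked_lists, ground_truth)
               if truth in cands[:k])
    return hits / len(ground_truth)

def mean_reciprocal_rank(ranked_lists, ground_truth):
    """Average of 1/rank of the true token (0 when it is missing)."""
    total = 0.0
    for cands, truth in zip(ranked_lists, ground_truth):
        if truth in cands:
            total += 1.0 / (cands.index(truth) + 1)
    return total / len(ground_truth)

if __name__ == "__main__":
    predictions = [["read", "close", "write"], ["append", "add", "insert"]]
    truths = ["close", "insert"]
    print(top_k_accuracy(predictions, truths, k=1))   # 0.0
    print(mean_reciprocal_rank(predictions, truths))  # (1/2 + 1/3) / 2 ≈ 0.417
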
Cite this article:

Yang B, Zhang N, Li SP, Xia X. Survey of intelligent code completion. Ruan Jian Xue Bao/Journal of Software, 2020,31(5):1435-1453 (in Chinese with English abstract).

Article metrics
  • Views: 4724
  • Downloads: 9765
  • HTML views: 5968
  • Citations: 0
History
  • Received: 2019-08-19
  • Last revised: 2019-10-28
  • Published online: 2020-04-09
  • Publication date: 2020-05-06