Research Progress of Pre-trained Models in Software Engineering
Authors: 宫丽娜, 周易人, 乔羽, 姜淑娟, 魏明强, 黄志球
Funding: National Natural Science Foundation of China (62202223); Natural Science Foundation of Jiangsu Province (BK20220881); Open Fund of the Key Laboratory of Safety-Critical Software Development and Verification Technology (Nanjing University of Aeronautics and Astronautics), Ministry of Industry and Information Technology (NJ2022027)
    Abstract:

    In recent years, deep learning has achieved excellent performance on software engineering (SE) tasks. As is well known, such performance depends on large-scale training sets, and collecting and labeling large-scale training sets consume substantial resources and costs, which limits the wide application of deep learning techniques in practical tasks. With the release of pre-trained models (PTMs) in the field of deep learning, introducing PTMs into SE tasks has attracted wide attention from SE researchers worldwide and has brought a qualitative leap in performance, ushering intelligent software engineering into a new era. However, no existing study has distilled the successes, failures, and opportunities of pre-trained models in SE. To clarify the work in this cross-disciplinary field (pre-trained models for software engineering, PTM4SE), this study systematically reviews the current research on intelligent software engineering based on pre-trained models. Specifically, it first presents a framework for PTM-based intelligent software engineering methods, then analyzes and discusses the pre-trained model techniques commonly used in SE, introduces in detail the downstream SE tasks that employ pre-trained models, and compares and analyzes the performance of pre-trained model techniques on these tasks. It then describes the SE datasets commonly used for training and fine-tuning PTMs. Finally, it discusses the challenges and opportunities of applying PTMs in SE. The collated PTMs and commonly used SE datasets are published at https://github.com/OpenSELab/PTM4SE.
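    To make the pre-train-then-fine-tune framework referred to in the abstract concrete, the following is a minimal sketch, assuming the Hugging Face Transformers library and the public microsoft/codebert-base checkpoint (CodeBERT): the pre-trained model is reused as-is and only fine-tuned on a small labeled downstream SE task (here a toy binary code-classification task). The two-example dataset, label meanings, and hyperparameters are illustrative assumptions, not anything prescribed by the surveyed work.

```python
# Minimal pre-train/fine-tune sketch: load a pre-trained code model (CodeBERT)
# and fine-tune it on a toy labeled SE dataset (1 = potentially defective).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)   # new, randomly initialized task head

# Two illustrative snippets standing in for a real downstream SE dataset.
snippets = ["int div(int a, int b) { return a / b; }",
            "int add(int a, int b) { return a + b; }"]
labels = torch.tensor([1, 0])

enc = tokenizer(snippets, padding=True, truncation=True,
                max_length=256, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                              # a few epochs typically suffice for fine-tuning
    optimizer.zero_grad()
    out = model(**enc, labels=labels)           # cross-entropy loss computed internally
    out.loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1)
print(pred.tolist())                            # predicted labels for the two snippets
```

    The same recipe transfers to other downstream SE tasks mentioned in the abstract (e.g., defect prediction or issue classification) by swapping in the corresponding labeled dataset and task head, which is what makes PTMs attractive when large-scale labeled training sets are costly to collect.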

Cite this article:

宫丽娜, 周易人, 乔羽, 姜淑娟, 魏明强, 黄志球. Research progress of pre-trained models in software engineering. Ruan Jian Xue Bao/Journal of Software, 2025, 36(1): 1–26 (in Chinese with English abstract).

History
  • Received: 2023-02-06
  • Revised: 2023-06-21
  • Published online: 2024-06-18
  • Published in print: 2025-01-06