Pre-training Method for Enhanced Code Representation Based on Multimodal Contrastive Learning
Author: Yang Hongyu, Ma Jianhui, Hou Min, Shen Shuanghong, Chen Enhong
Affiliation:

    Abstract:

    Code representation aims to extract the characteristics of source code and obtain its semantic embedding, and it plays a crucial role in deep learning-based code intelligence. Traditional handcrafted code representation methods rely heavily on annotations by domain experts, which are time-consuming and labor-intensive. Moreover, the resulting representations are task-specific and cannot easily be reused across downstream tasks, which contradicts the concept of green and sustainable development. To this end, many large-scale pre-trained models for source code representation have achieved remarkable success in recent years. These methods perform self-supervised learning on massive source code to obtain universal code representations, which can then be easily fine-tuned for various downstream tasks. Based on the abstraction levels of programming languages, code representations involve features at four levels: the text level, the semantic level, the functional level, and the structural level. Nevertheless, current models for code representation treat programming languages merely as ordinary text sequences resembling natural language and overlook the functional-level and structural-level features, which leads to inferior performance. To overcome this drawback, this study proposes a representation-enhanced contrastive multimodal pre-training (REcomp) framework for code representation pre-training. REcomp develops a novel fusion algorithm for semantic-level and structural-level features, which is employed to serialize abstract syntax trees. Through multi-modal contrastive learning, the resulting composite feature is integrated with the textual and functional features of programming languages, enabling more precise semantic modeling. Extensive experiments are conducted on three real-world public datasets, and the results clearly validate the superiority of REcomp.
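
    To make the two ideas in the abstract concrete, the following is a minimal Python/PyTorch sketch, not the paper's actual REcomp implementation: serialize_ast flattens an abstract syntax tree into a token sequence that interleaves structural tokens (node types) with semantic tokens (identifiers and literals), and multimodal_info_nce is a standard symmetric InfoNCE-style contrastive loss over paired embeddings of the code text, the serialized AST, and the natural-language (functional) description. All function names, the use of Python's ast module, and the InfoNCE formulation are illustrative assumptions; REcomp's exact fusion algorithm and objective are described in the full paper.

    # Illustrative sketch only; names and objective are assumptions, not the authors' code.
    import ast
    import torch
    import torch.nn.functional as F

    def serialize_ast(source: str) -> list[str]:
        """Depth-first serialization of a Python AST.

        Node types supply structural tokens; identifiers and constants supply
        semantic tokens, so the sequence mixes both feature levels.
        """
        tokens = []

        def visit(node: ast.AST) -> None:
            tokens.append(node.__class__.__name__)      # structural token (node type)
            if isinstance(node, ast.Name):
                tokens.append(node.id)                   # semantic token (identifier)
            elif isinstance(node, ast.Constant):
                tokens.append(repr(node.value))          # semantic token (literal)
            for child in ast.iter_child_nodes(node):
                visit(child)

        visit(ast.parse(source))
        return tokens

    def multimodal_info_nce(z_code, z_ast, z_text, temperature: float = 0.07):
        """Symmetric InfoNCE over three modality embeddings of shape (batch, dim).

        Row i of each tensor is assumed to come from the same code snippet, so the
        diagonal entries of every similarity matrix are the positive pairs.
        """
        loss = 0.0
        pairs = [(z_code, z_ast), (z_code, z_text), (z_ast, z_text)]
        for a, b in pairs:
            a = F.normalize(a, dim=-1)
            b = F.normalize(b, dim=-1)
            logits = a @ b.t() / temperature             # cosine similarities
            labels = torch.arange(a.size(0), device=a.device)
            # cross-entropy in both directions makes the objective symmetric
            loss = loss + F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)
        return loss / (2 * len(pairs))

    if __name__ == "__main__":
        print(serialize_ast("def add(a, b):\n    return a + b"))
        z = [torch.randn(4, 128) for _ in range(3)]
        print(multimodal_info_nce(*z).item())

    In this sketch, pulling the three modality views of the same snippet together while pushing apart views of different snippets is what lets the composite structural/semantic feature be aligned with the textual and functional features during pre-training.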

Get Citation

Yang HY, Ma JH, Hou M, Shen SH, Chen EH. Pre-training method for enhanced code representation based on multimodal contrastive learning. Ruan Jian Xue Bao/Journal of Software, 2024, 35(4): 1601-1617 (in Chinese with English abstract).

History
  • Received: May 15, 2023
  • Revised: July 07, 2023
  • Online: September 11, 2023
  • Published: April 06, 2024