Pre-training Method for Enhanced Code Representation Based on Multimodal Contrastive Learning

Corresponding author: MA Jian-Hui, E-mail: jianhui@ustc.edu.cn

Authors:
  • YANG Hong-Yu
    Anhui Province Key Laboratory of Big Data Analysis and Application (University of Science and Technology of China), Hefei 230027, China; School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; State Key Laboratory of Cognitive Intelligence, Hefei 230088, China
  • MA Jian-Hui
    Anhui Province Key Laboratory of Big Data Analysis and Application (University of Science and Technology of China), Hefei 230027, China; School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; State Key Laboratory of Cognitive Intelligence, Hefei 230088, China
  • HOU Min
    Anhui Province Key Laboratory of Big Data Analysis and Application (University of Science and Technology of China), Hefei 230027, China; School of Data Science, University of Science and Technology of China, Hefei 230027, China; State Key Laboratory of Cognitive Intelligence, Hefei 230088, China
  • SHEN Shuang-Hong
    Anhui Province Key Laboratory of Big Data Analysis and Application (University of Science and Technology of China), Hefei 230027, China; School of Data Science, University of Science and Technology of China, Hefei 230027, China; State Key Laboratory of Cognitive Intelligence, Hefei 230088, China
  • CHEN En-Hong
    Anhui Province Key Laboratory of Big Data Analysis and Application (University of Science and Technology of China), Hefei 230027, China; School of Data Science, University of Science and Technology of China, Hefei 230027, China; State Key Laboratory of Cognitive Intelligence, Hefei 230088, China
Abstract:

Code representation aims to fuse the features of source code to obtain its semantic embedding, and it plays a crucial role in deep learning-based code intelligence. Traditional handcrafted code representation methods rely on annotations by domain experts, which are time-consuming and labor-intensive; moreover, the resulting representations cannot be flexibly reused for specific downstream tasks, which contradicts the concept of green and low-carbon development. To this end, many large-scale pre-trained models for programming languages based on self-supervised learning (e.g., CodeBERT) have emerged in recent years, providing an effective way to obtain universal code representations: such models are pre-trained on massive source code and then fine-tuned on specific downstream tasks, achieving remarkable results. However, accurately representing the semantics of code requires fusing features at all abstraction levels of programming languages: the text level, semantic level, functional level, and structural level. Existing models treat programming languages merely as ordinary text sequences resembling natural language and overlook their functional-level and structural-level features, which leads to inferior performance. To further improve the accuracy of code representation, this study proposes REcomp (representation enhanced contrastive multimodal pretraining), a pre-training model for code representation based on multimodal contrastive learning. REcomp designs a novel semantic-level and structural-level feature fusion algorithm for serializing abstract syntax trees and, through multimodal contrastive learning, fuses this composite feature with the text-level and functional-level features of programming languages, enabling more precise semantic modeling. Extensive experiments on three real-world public datasets validate the effectiveness of REcomp in improving the accuracy of code representation.
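The approach combines two generic ingredients named in the abstract: serializing an abstract syntax tree into a token sequence and aligning different views (modalities) of the same code with a contrastive objective. The following minimal Python sketch illustrates only these generic ingredients, not REcomp's actual fusion algorithm or training objective; the helpers `serialize_ast` and `info_nce`, the use of Python's built-in `ast` module, and the random stand-in embeddings are assumptions made for the example.

```python
# Illustrative sketch (assumptions, not the paper's method):
# (a) pre-order serialization of an AST into structural + identifier tokens;
# (b) a symmetric InfoNCE-style loss that pulls paired modality embeddings
#     of the same sample together and pushes in-batch negatives apart.
import ast

import torch
import torch.nn.functional as F


def serialize_ast(source: str) -> list[str]:
    """Serialize the Python AST of `source` by pre-order traversal."""
    tokens: list[str] = []

    def visit(node: ast.AST) -> None:
        tokens.append(type(node).__name__)               # structural token, e.g. FunctionDef
        name = getattr(node, "name", None) or getattr(node, "id", None)
        if isinstance(name, str):
            tokens.append(name)                          # identifier token, e.g. add
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings with shape (B, d).

    Row i of z_a (one view of sample i) should match row i of z_b (another
    view of the same sample); all other rows in the batch act as negatives.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                 # (B, B) scaled cosine similarities
    targets = torch.arange(z_a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    print(serialize_ast("def add(a, b):\n    return a + b")[:8])
    # Stand-ins for encoder outputs of two modalities of the same code batch.
    z_text, z_ast = torch.randn(4, 128), torch.randn(4, 128)
    print(info_nce(z_text, z_ast).item())
```

In this style of objective, the paired views of the same snippet (for example, its text sequence and its serialized AST) serve as positives, while the other snippets in the batch provide in-batch negatives.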

Cite this article

YANG Hong-Yu, MA Jian-Hui, HOU Min, SHEN Shuang-Hong, CHEN En-Hong. Pre-training method for enhanced code representation based on multimodal contrastive learning. Ruan Jian Xue Bao/Journal of Software, 2024, 35(4): 1601-1617 (in Chinese).
History
  • Received: 2023-05-15
  • Revised: 2023-07-07
  • Published online: 2023-09-11
  • Published: 2024-04-06