Pre-training Method for Enhanced Code Representation Based on Multimodal Contrastive Learning

Corresponding author: MA Jian-Hui, E-mail: jianhui@ustc.edu.cn

Authors:
  • YANG Hong-Yu
    Anhui Province Key Laboratory of Big Data Analysis and Application (University of Science and Technology of China), Hefei 230027, China; School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; State Key Laboratory of Cognitive Intelligence, Hefei 230088, China
  • MA Jian-Hui
    Anhui Province Key Laboratory of Big Data Analysis and Application (University of Science and Technology of China), Hefei 230027, China; School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; State Key Laboratory of Cognitive Intelligence, Hefei 230088, China
  • HOU Min
    Anhui Province Key Laboratory of Big Data Analysis and Application (University of Science and Technology of China), Hefei 230027, China; School of Data Science, University of Science and Technology of China, Hefei 230027, China; State Key Laboratory of Cognitive Intelligence, Hefei 230088, China
  • SHEN Shuang-Hong
    Anhui Province Key Laboratory of Big Data Analysis and Application (University of Science and Technology of China), Hefei 230027, China; School of Data Science, University of Science and Technology of China, Hefei 230027, China; State Key Laboratory of Cognitive Intelligence, Hefei 230088, China
  • CHEN En-Hong
    Anhui Province Key Laboratory of Big Data Analysis and Application (University of Science and Technology of China), Hefei 230027, China; School of Data Science, University of Science and Technology of China, Hefei 230027, China; State Key Laboratory of Cognitive Intelligence, Hefei 230088, China
Abstract:

Code representation aims to fuse the features of source code to obtain its semantic embedding, and it plays a crucial role in deep learning-based code intelligence. Traditional handcrafted code representation methods rely on annotations by domain experts, which are time-consuming and labor-intensive; moreover, the resulting representations cannot be flexibly reused for specific downstream tasks, which contradicts the concept of green and low-carbon development. To this end, many large-scale pre-trained models for programming languages based on self-supervised learning (e.g., CodeBERT) have emerged in recent years, providing an effective way to obtain universal code representations: such models are pre-trained on massive source code and then fine-tuned on specific downstream tasks, achieving remarkable results. However, accurately representing the semantics of code requires fusing features at all abstraction levels of programming languages: the text level, semantic level, functional level, and structural level. Existing models treat programming languages merely as ordinary text sequences resembling natural language and overlook their functional-level and structural-level features, which leads to inferior performance. To further improve the accuracy of code representation, this study proposes REcomp (representation enhanced contrastive multimodal pretraining), a pre-training model for code representation based on multimodal contrastive learning. REcomp designs a novel semantic-level and structural-level feature fusion algorithm for serializing abstract syntax trees and, through multimodal contrastive learning, fuses this composite feature with the text-level and functional-level features of programming languages, enabling more precise semantic modeling. Extensive experiments on three real-world public datasets validate the effectiveness of REcomp in improving the accuracy of code representation.
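The approach combines two generic ingredients named in the abstract: serializing an abstract syntax tree into a token sequence and aligning different views (modalities) of the same code with a contrastive objective. The following minimal Python sketch illustrates only these generic ingredients, not REcomp's actual fusion algorithm or training objective; the helpers `serialize_ast` and `info_nce`, the use of Python's built-in `ast` module, and the random stand-in embeddings are assumptions made for the example.

```python
# Illustrative sketch (assumptions, not the paper's method):
# (a) pre-order serialization of an AST into structural + identifier tokens;
# (b) a symmetric InfoNCE-style loss that pulls paired modality embeddings
#     of the same sample together and pushes in-batch negatives apart.
import ast

import torch
import torch.nn.functional as F


def serialize_ast(source: str) -> list[str]:
    """Serialize the Python AST of `source` by pre-order traversal."""
    tokens: list[str] = []

    def visit(node: ast.AST) -> None:
        tokens.append(type(node).__name__)               # structural token, e.g. FunctionDef
        name = getattr(node, "name", None) or getattr(node, "id", None)
        if isinstance(name, str):
            tokens.append(name)                          # identifier token, e.g. add
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings with shape (B, d).

    Row i of z_a (one view of sample i) should match row i of z_b (another
    view of the same sample); all other rows in the batch act as negatives.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                 # (B, B) scaled cosine similarities
    targets = torch.arange(z_a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    print(serialize_ast("def add(a, b):\n    return a + b")[:8])
    # Stand-ins for encoder outputs of two modalities of the same code batch.
    z_text, z_ast = torch.randn(4, 128), torch.randn(4, 128)
    print(info_nce(z_text, z_ast).item())
```

In this style of objective, the paired views of the same snippet (for example, its text sequence and its serialized AST) serve as positives, while the other snippets in the batch provide in-batch negatives.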

Cite this article

YANG Hong-Yu, MA Jian-Hui, HOU Min, SHEN Shuang-Hong, CHEN En-Hong. Pre-training method for enhanced code representation based on multimodal contrastive learning. Ruan Jian Xue Bao/Journal of Software, 2024, 35(4): 1601-1617 (in Chinese).
History
  • Received: 2023-05-15
  • Revised: 2023-07-07
  • Published online: 2023-09-11
  • Published: 2024-04-06