Pre-training Method for Enhanced Code Representation Based on Multimodal Contrastive Learning
Author: Yang Hongyu, Ma Jianhui, Hou Min, Shen Shuanghong, Chen Enhong
Affiliation:

    Abstract:

    Code representation aims to extract the characteristics of source code and obtain its semantic embedding, and it plays a crucial role in deep learning-based code intelligence. Traditional handcrafted code representation methods rely heavily on annotations by domain experts, which are time-consuming and labor-intensive. Moreover, the resulting representations are task-specific and cannot easily be reused across downstream tasks, which contradicts the concept of green and sustainable development. To this end, many large-scale pre-trained models for source code representation have achieved remarkable success in recent years. These methods perform self-supervised learning on massive source code to obtain universal code representations, which can then be easily fine-tuned for various downstream tasks. Based on the abstraction levels of programming languages, code representations involve features at four levels: the text level, the semantic level, the functional level, and the structural level. Nevertheless, current models for code representation treat programming languages merely as ordinary text sequences resembling natural language and overlook the functional-level and structural-level features, which leads to inferior performance. To overcome this drawback, this study proposes a representation-enhanced contrastive multimodal pre-training (REcomp) framework for code representation pre-training. REcomp develops a novel fusion algorithm for semantic-level and structural-level features, which is employed to serialize abstract syntax trees. Through multi-modal contrastive learning, the resulting composite feature is integrated with the textual and functional features of programming languages, enabling more precise semantic modeling. Extensive experiments are conducted on three real-world public datasets, and the results clearly validate the superiority of REcomp.
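
    To make the two ideas in the abstract concrete, the following is a minimal Python/PyTorch sketch, not the paper's actual REcomp implementation: serialize_ast flattens an abstract syntax tree into a token sequence that interleaves structural tokens (node types) with semantic tokens (identifiers and literals), and multimodal_info_nce is a standard symmetric InfoNCE-style contrastive loss over paired embeddings of the code text, the serialized AST, and the natural-language (functional) description. All function names, the use of Python's ast module, and the InfoNCE formulation are illustrative assumptions; REcomp's exact fusion algorithm and objective are described in the full paper.

    # Illustrative sketch only; names and objective are assumptions, not the authors' code.
    import ast
    import torch
    import torch.nn.functional as F

    def serialize_ast(source: str) -> list[str]:
        """Depth-first serialization of a Python AST.

        Node types supply structural tokens; identifiers and constants supply
        semantic tokens, so the sequence mixes both feature levels.
        """
        tokens = []

        def visit(node: ast.AST) -> None:
            tokens.append(node.__class__.__name__)      # structural token (node type)
            if isinstance(node, ast.Name):
                tokens.append(node.id)                   # semantic token (identifier)
            elif isinstance(node, ast.Constant):
                tokens.append(repr(node.value))          # semantic token (literal)
            for child in ast.iter_child_nodes(node):
                visit(child)

        visit(ast.parse(source))
        return tokens

    def multimodal_info_nce(z_code, z_ast, z_text, temperature: float = 0.07):
        """Symmetric InfoNCE over three modality embeddings of shape (batch, dim).

        Row i of each tensor is assumed to come from the same code snippet, so the
        diagonal entries of every similarity matrix are the positive pairs.
        """
        loss = 0.0
        pairs = [(z_code, z_ast), (z_code, z_text), (z_ast, z_text)]
        for a, b in pairs:
            a = F.normalize(a, dim=-1)
            b = F.normalize(b, dim=-1)
            logits = a @ b.t() / temperature             # cosine similarities
            labels = torch.arange(a.size(0), device=a.device)
            # cross-entropy in both directions makes the objective symmetric
            loss = loss + F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)
        return loss / (2 * len(pairs))

    if __name__ == "__main__":
        print(serialize_ast("def add(a, b):\n    return a + b"))
        z = [torch.randn(4, 128) for _ in range(3)]
        print(multimodal_info_nce(*z).item())

    In this sketch, pulling the three modality views of the same snippet together while pushing apart views of different snippets is what lets the composite structural/semantic feature be aligned with the textual and functional features during pre-training.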

Get Citation

Yang HY, Ma JH, Hou M, Shen SH, Chen EH. Pre-training method for enhanced code representation based on multimodal contrastive learning. Ruan Jian Xue Bao/Journal of Software, 2024, 35(4): 1601-1617 (in Chinese with English abstract).

History
  • Received: May 15, 2023
  • Revised: July 07, 2023
  • Online: September 11, 2023
  • Published: April 06, 2024