Clone Detection with Pre-training Enhanced Code Representation
Author:
Affiliation:

Clc Number:

TP311

Fund Project:

  • Article
  • |
  • Figures
  • |
  • Metrics
  • |
  • Reference
  • |
  • Related
  • |
  • Cited by
  • |
  • Materials
  • |
  • Comments
    Abstract:

    Code clone detection is an important task in the software engineering community, it is particularly difficult to detect type-IV code clone, which have similar semantics but large syntax gap. Deep learning-based approaches have achieved promising performances on the detection of type-IV code clone, yet at the high-cost of using manually-annotated code clone pairs for supervision. This study proposes two simple and effective pretraining strategies to enhance the representation learning of code clone detection model based on deep learning, aiming to alleviate the requirement of the large-scale training dataset in supervised learning models. First, token embeddings models are pretrained with ngram subword enhancement, which helps the clone detection model to better represent out-of-vocabulary (OOV) tokens. Second, the function name prediction is adopted as an auxiliary task to pretrain clone detection model parameters from token to code fragments. With the two enhancement strategies, a model with more accurate code representation capability can be achieved, which is then used as the code representation model in clone detection and trained on the clone detection task with supervised learning. The experiments on the standard benchmark dataset BigCloneBench (BCB) and OJClone are conducedt, finding that the final model with only a very small number of training instances (i.e., 100 clones and 100 non-clones for BCB, 200 clones and 200 non-clones for OJClone) can give comparable performance than existing methods with over six million training instances.

    Reference
    Related
    Cited by
Get Citation

冷林珊,刘爽,田承霖,窦淑洁,王赞,张梅山.预训练增强的代码克隆检测技术.软件学报,2022,33(5):1758-1773

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:August 12,2021
  • Revised:October 09,2021
  • Adopted:
  • Online: January 28,2022
  • Published: May 06,2022
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063