Abstract: Code clone detection is an important task in the software engineering community. It is particularly difficult to detect type-IV code clones, which share similar semantics but differ widely in syntax. Deep learning-based approaches have achieved promising performance on type-IV clone detection, yet at the high cost of relying on manually annotated code clone pairs for supervision. This study proposes two simple and effective pretraining strategies to enhance the representation learning of deep learning-based code clone detection models, aiming to reduce their dependence on large-scale labeled training data. First, token embedding models are pretrained with n-gram subword enhancement, which helps the clone detection model better represent out-of-vocabulary (OOV) tokens. Second, function name prediction is adopted as an auxiliary task to pretrain the model's parameters from the token level up to the code fragment level. With these two strategies, a model with more accurate code representation capability is obtained, which is then used as the code representation model for clone detection and trained on the clone detection task with supervised learning. Experiments are conducted on the standard benchmark datasets BigCloneBench (BCB) and OJClone, showing that the final model, trained with only a very small number of instances (i.e., 100 clone and 100 non-clone pairs for BCB, and 200 clone and 200 non-clone pairs for OJClone), achieves performance comparable to existing methods trained on over six million instances.
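
The n-gram subword enhancement described above resembles fastText-style subword embeddings. The following is a minimal sketch under that assumption; the helper names (char_ngrams, embed_token), the n-gram range, and the embedding dimension are illustrative choices, not the paper's actual implementation:

    # Minimal sketch of n-gram subword enhancement for OOV tokens
    # (fastText-style; names and sizes are illustrative assumptions,
    # not the paper's actual implementation).
    import numpy as np

    def char_ngrams(token, n_min=3, n_max=5):
        """Enumerate character n-grams of a token, with boundary markers."""
        marked = f"<{token}>"
        return [marked[i:i + n]
                for n in range(n_min, n_max + 1)
                for i in range(len(marked) - n + 1)]

    def embed_token(token, token_vecs, ngram_vecs, dim=128):
        """Return a token vector; for OOV tokens, average the vectors of
        its subword n-grams that were seen during pretraining."""
        if token in token_vecs:
            return token_vecs[token]
        grams = [g for g in char_ngrams(token) if g in ngram_vecs]
        if not grams:
            return np.zeros(dim)  # nothing known about this token
        return np.mean([ngram_vecs[g] for g in grams], axis=0)

Under this scheme, an OOV identifier such as getUserName still receives a vector averaged from pretrained n-grams like "get", "Use", and "Name", so lexically similar identifiers map to nearby points in the embedding space even when the full token never occurred in the pretraining corpus.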