Enhanced Simhash Algorithm for Code Similarity Detection
Author:
Affiliation:

Clc Number:

Fund Project:

2019 Industrial Internet Innovation Development Project-industrial Software Source Code Security Detection Tool Project; "Advanced Industrial Internet Security Platform" Project of Zhijiang Laborator (2018FD0ZX01)

  • Article
  • |
  • Figures
  • |
  • Metrics
  • |
  • Reference
  • |
  • Related
  • |
  • Cited by
  • |
  • Materials
  • |
  • Comments
    Abstract:

    Code similarity detection is one of the basic tasks in software engineering. It plays an effective and fundamental role in plagiarism, software licensing violation, software reuse analysis, and vulnerability discovery. With the popularization of open source software, open source code has been frequently applied to multiple areas, bringing new challenges to traditional code similarity detection methods.Some existing detection methods based on lexical, grammar, and semantics have problems such as high computational complexity, dependence on analytical tools, high resource consumption, poor portability, having a large number of comparison candidates, and so on. Simhash-based code similarity detection algorithm reduces the dimension of the code to a fingerprint, which can realize fast near-duplicate file retrieval on a large dataset. It controls the similarity of matched results through the Hamming distance threshold. This study verifies existed simhash algorithm with line granularity through experiments, and discovers the line coverage problem in large-scale datasets. Inspired by the idea of TF-IDF algorithm, a language-based line-filtering optimization method is proposed to deal with it. Line sequences of code files is filtered through line filters in various languages to eliminate the impact of lines that appear frequently but contain less semantic information on the results. After a series of comparative experiments, this study verifies that the enhanced method always achieves high precision with Hamming distance threshold set from 0 to 8. Compared to the method before enhancement, the proposed method improves the precision by 98.6% and 52.2% on two different datasets with threshold set to 8. Based on the large-scale code database built from 386 486 112 files in 1.3 million open source projects, it is verified that the proposed method can, keeping the high precision of 97%, efficiently detect similar files with an average speed of 0.43s per file.

    Reference
    Related
    Cited by
Get Citation

李玫,高庆,马森,张世琨,胡文蕙,张兴明.面向代码相似性检测的相似哈希改进方法.软件学报,2021,32(7):2242-2259

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:September 16,2020
  • Revised:October 26,2020
  • Adopted:
  • Online: January 22,2021
  • Published: July 06,2021
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063