Enhanced Simhash Algorithm for Code Similarity Detection

doi:10.13328/j.cnki.jos.006271

Volume 32, Issue 7, 2021, pp. 2242-2259

Fund Project: 2019 Industrial Internet Innovation Development Project (Industrial Software Source Code Security Detection Tool Project); "Advanced Industrial Internet Security Platform" Project of Zhijiang Laboratory (2018FD0ZX01)

Abstract:

Code similarity detection is one of the basic tasks in software engineering. It plays a fundamental role in plagiarism detection, software license violation detection, software reuse analysis, and vulnerability discovery. With the popularization of open source software, open source code is now widely used in many areas, which brings new challenges to traditional code similarity detection methods. Some existing detection methods based on lexical, syntactic, and semantic analysis suffer from problems such as high computational complexity, dependence on analysis tools, high resource consumption, poor portability, and a large number of comparison candidates. A simhash-based code similarity detection algorithm reduces a code file to a fingerprint, which enables fast near-duplicate file retrieval on a large dataset, and it controls the similarity of matched results through a Hamming distance threshold. This study evaluates an existing simhash algorithm with line granularity through experiments and identifies a line coverage problem on large-scale datasets. Inspired by the TF-IDF algorithm, a language-based line-filtering optimization is proposed to address it: the line sequence of each code file is passed through language-specific line filters to eliminate the influence of lines that appear frequently but carry little semantic information. A series of comparative experiments verifies that the enhanced method consistently achieves high precision with the Hamming distance threshold set from 0 to 8. Compared with the method before enhancement, the proposed method improves precision by 98.6% and 52.2% on two different datasets with the threshold set to 8. On a large-scale code database built from 386,486,112 files in 1.3 million open source projects, the proposed method maintains a high precision of 97% while detecting similar files at an average speed of 0.43 s per file.
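To make the pipeline described in the abstract concrete, the following is a minimal Python sketch of simhash with line granularity, preceded by a line filter and followed by a Hamming distance check. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the hash function, the fingerprint width, or the actual language-specific filter rules, so the 64-bit width, the use of MD5, the `LOW_INFO_LINE` pattern, and all function names here are hypothetical. The large-scale retrieval index is not shown.

```python
from __future__ import annotations

import hashlib
import re

FINGERPRINT_BITS = 64  # assumed width; the abstract does not state one

# Hypothetical filter for C-like languages: blank lines and lines made up
# only of brackets, semicolons, and a bare keyword (e.g. "}", "} else {",
# "return;"). The paper's real filters are language-specific and not
# described in the abstract.
LOW_INFO_LINE = re.compile(
    r"[\s{}()\[\];]*(?:else|break|continue|return)?[\s{}()\[\];]*"
)


def significant_lines(source: str) -> list[str]:
    """Strip each line and drop those the illustrative filter matches."""
    stripped = (line.strip() for line in source.splitlines())
    return [line for line in stripped if not LOW_INFO_LINE.fullmatch(line)]


def line_hash(line: str) -> int:
    """Map one line to a 64-bit integer (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(line.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(lines: list[str]) -> int:
    """Standard simhash: every line votes +1/-1 on each fingerprint bit."""
    votes = [0] * FINGERPRINT_BITS
    for line in lines:
        h = line_hash(line)
        for bit in range(FINGERPRINT_BITS):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_similar(source_a: str, source_b: str, threshold: int = 8) -> bool:
    """Report a match when the fingerprints differ in at most `threshold` bits."""
    fp_a = simhash(significant_lines(source_a))
    fp_b = simhash(significant_lines(source_b))
    return hamming_distance(fp_a, fp_b) <= threshold
```

In this sketch, two files that differ only in filtered lines (blank lines, lone braces) produce identical fingerprints, so they match even at threshold 0. Without the filter, such frequent lines would all vote on the fingerprint bits, which is one plausible reading of the line coverage problem the abstract mentions.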

Citation:

李玫, 高庆, 马森, 张世琨, 胡文蕙, 张兴明. Enhanced Simhash Algorithm for Code Similarity Detection. Journal of Software (软件学报), 2021, 32(7): 2242-2259

History

Received: September 16, 2020
Revised: October 26, 2020
Online: January 22, 2021
Published: July 06, 2021
