Abstract: Code similarity detection is one of the basic tasks in software engineering. It plays a fundamental role in plagiarism detection, software license violation detection, software reuse analysis, and vulnerability discovery. With the popularization of open source software, open source code is frequently reused across many areas, which brings new challenges to traditional code similarity detection methods. Some existing detection methods based on lexical, syntactic, and semantic analysis suffer from problems such as high computational complexity, dependence on analysis tools, high resource consumption, poor portability, and a large number of comparison candidates. The simhash-based code similarity detection algorithm reduces a code file to a compact fingerprint, which enables fast near-duplicate file retrieval on a large dataset. It controls the similarity of matched results through a Hamming distance threshold. This study evaluates the existing simhash algorithm with line granularity through experiments and identifies the line coverage problem that arises in large-scale datasets. Inspired by the idea of the TF-IDF algorithm, a language-based line-filtering optimization method is proposed to address this problem. The line sequences of code files are passed through language-specific line filters to eliminate the impact on the results of lines that appear frequently but carry little semantic information. A series of comparative experiments verifies that the enhanced method consistently achieves high precision with the Hamming distance threshold set from 0 to 8. Compared with the method before enhancement, the proposed method improves the precision by 98.6% and 52.2% on two different datasets with the threshold set to 8. On a large-scale code database built from 386 486 112 files in 1.3 million open source projects, it is verified that the proposed method efficiently detects similar files at an average speed of 0.43 s per file while maintaining a high precision of 97%.
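
The pipeline summarized in the abstract (line-granularity simhash, a line filter, and a Hamming distance threshold) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The 64-bit fingerprint width, the MD5-based line hash, the whitespace normalization, and the `LOW_INFO_LINES` filter are all assumptions chosen for illustration; the abstract does not specify the actual language-specific filters.

```python
import hashlib
import re

# Hypothetical filter (an assumption, not the paper's filter): lines made
# only of brackets/punctuation, plus a few frequent low-information lines.
PUNCT_ONLY = re.compile(r"^[\s{}()\[\];,]*$")
LOW_INFO_LINES = {"else", "else {", "} else {", "break;", "continue;", "return;"}


def normalize(line):
    """Collapse runs of whitespace so indentation does not change the hash."""
    return " ".join(line.split())


def keep_line(line):
    """Return False for lines the illustrative filter treats as low-information."""
    return not PUNCT_ONLY.match(line) and line not in LOW_INFO_LINES


def line_hash(line, bits=64):
    """Hash one line to a `bits`-wide integer (MD5 is an assumed choice)."""
    digest = hashlib.md5(line.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(text, bits=64, use_filter=True):
    """Line-granularity simhash: each kept line votes +1/-1 on every bit."""
    counts = [0] * bits
    for raw in text.splitlines():
        line = normalize(raw)
        if not line or (use_filter and not keep_line(line)):
            continue
        h = line_hash(line, bits)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_similar(text_a, text_b, threshold=8):
    """Two files match when their fingerprints differ in at most `threshold` bits."""
    return hamming(simhash(text_a), simhash(text_b)) <= threshold
```

In this sketch, two files that differ only in brace-only lines produce the same fingerprint once the filter is applied, which is the kind of effect the line filter is meant to achieve. Calling `simhash(text, use_filter=False)` gives the unfiltered variant for comparison.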