最长公共子序列嵌入支持下的代码相似性检测
CSTR:
作者:
作者单位:

作者简介:

通讯作者:

中图分类号:

TP311

基金项目:

国家自然科学基金(62272465, 62441206, 62002361, 62272464); CCF-华为胡杨林基金(CCF-HuaweiSE202310)


Code Similarity Detection Supported by Longest Common Subsequence Embedding
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    最长公共子序列(longest common subsequence, LCS)是一种衡量代码相似度的可行指标. 然而, 经典LCS算法的时间复杂度较高, 难以应对大型数据集, 并且, 由于代码文本序列中的词(token)本质为一种基于离散表示的编码, 直接使用LCS算法无法有效识别文本不同但语义相似的代码片段中的关键语义. 针对这两方面的不足, 提出一种面向LCS的嵌入方法, 将代码间的LCS计算转换为代码低维稠密嵌入向量间的数值运算, 并可以利用近似最近邻算法进一步加速其计算. 为此, 设计了一个可嵌入的基于LCS的距离度量方法, 实验证明这种代码度量在提取函数关键语义的表现上优于对比嵌入工具使用的基于文本的距离或基于树的距离. 同时, 为了在嵌入过程中有重点地保留代码的关键语义, 构建了两种损失函数和相应的训练集, 识别文本上不同但语义上相似的代码元素, 使模型在检测复杂代码克隆时有更好的表现. 实验证明了该方法拥有很强的可扩展性, 且其对复杂克隆的检测能力也保持在很高水平. 将该技术应用于相似缺陷的识别, 上报了23个未知缺陷, 这些缺陷已被开发人员在实际项目中确认, 其中有些复杂缺陷是难以被基于文本的LCS算法检出的.

    Abstract:

    The longest common subsequence (LCS) is a practical metric for assessing code similarity. However, traditional LCS-based methods face challenges in scalability and in effectively capturing critical semantics for identifying code fragments that are textually different but semantically similar, due to their reliance on discrete representation-based token encoding. To address these limitations, this study proposes an LCS-oriented embedding method that encodes code into low-dimensional dense vectors, effectively capturing semantic information. This transformation enables the computationally expensive LCS calculation to be replaced with efficient vector arithmetic, further accelerated using an approximate nearest neighbor algorithm. To support this approach, an embeddable LCS-based distance metric is developed, as the original LCS metric is non-embeddable. Experimental results demonstrate that the proposed metric outperforms tree-based and literal similarity metrics in detecting complex code clones. In addition, two targeted loss functions and corresponding training datasets are designed to prioritize retaining critical semantics in the embedding process, allowing the model to identify textually different but semantically similar code elements. This improves performance in detecting complex code similarities. The proposed method demonstrates strong scalability and high accuracy in detecting complex clones. When applied to similar bug identification, it has reported 23 previously unknown bugs, all of which are confirmed by developers in real-world projects. Notably, several of these bugs are complex and challenging to detect using traditional LCS-based techniques.

    参考文献
    相似文献
    引证文献
引用本文

弓媛君,黄建军,游伟,石文昌,梁彬,边攀,张健.最长公共子序列嵌入支持下的代码相似性检测.软件学报,,():1-15

复制
相关视频

分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2023-11-06
  • 最后修改日期:2024-08-07
  • 录用日期:
  • 在线发布日期: 2025-05-22
  • 出版日期:
文章二维码
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号