Insights and Analysis of Open Source License Violation Risks in Large Language Models Generated Code

Corresponding author:

Wang Ying (王莹), E-mail: wangying@swc.neu.edu.cn

CLC number:

TP311

Fund project:

National Natural Science Foundation of China (61932021, 62332005, 62141210, 61902056, 61802164, 61977014); 111 Project (B16009).

    Abstract:

    The rapid development of large language models (LLMs) has significantly impacted software engineering. Pre-trained on large volumes of code from open-source repositories, these models can efficiently perform tasks such as code generation and code completion. However, much of the code in open-source repositories is governed by open-source licenses, which exposes LLMs to potential license violation risks. This paper focuses on license violation risks between LLM-generated code and open-source repositories. Based on code clone detection, we developed a framework that traces LLM-generated code back to its open-source origins and detects copyright violations. Using this framework, we traced 135,000 Python code samples generated by 9 mainstream code LLMs to the open-source community and checked their open-source license compatibility. We investigate three research questions to explore the impact of LLM code generation on the open-source software ecosystem: "To what extent is LLM-generated code cloned from open-source software repositories?", "Does LLM-generated code carry open-source license violation risks?", and "Does LLM-generated code included in real open-source software carry open-source license violation risks?" The experimental results show that, of the 43,130 and 65,900 Python code samples longer than six lines generated from functional descriptions and method signatures, respectively, 68.5% and 60.9% were traced to cloned open-source code snippets. The CodeParrot and CodeGen model series had the highest clone rates, while GPT-3.5-Turbo had the lowest. In addition, 92.7% of the code generated from functional descriptions carried no open-source license declaration. Comparison with the licenses of the traced source code showed that 81.8% of the code had open-source license violation risks. Furthermore, among 229 LLM-generated code samples that developers had used on GitHub, 136 were traced to open-source code snippets, of which 38 were Type1 or Type2 clones and 30 had open-source license violation risks. We reported these findings to the developers as issue reports and have so far received feedback from eight of them.
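
    To make the Type1 and Type2 clone categories and the license comparison mentioned in the abstract more concrete, the following is a minimal Python sketch. It is not the framework described in the paper: the function names (token_stream, clone_type, has_license_risk) are hypothetical, the clone check relies only on the standard tokenize module, and the license rule is a deliberately crude assumption (a licensed source whose license is not reproduced on the generated code counts as a risk). A real compatibility analysis would need a proper license compatibility matrix.

```python
import io
import keyword
import tokenize

# Tokens ignored entirely: comments, blank-line breaks, end marker.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
# Layout tokens kept only by kind, so indentation width does not matter.
_LAYOUT = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}


def token_stream(source, abstract_names=False):
    """Return a normalized token list for a Python snippet.

    With abstract_names=True, identifiers and literals are replaced by
    placeholders, which is the usual basis for a Type-2 comparison.
    """
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type in _LAYOUT:
            out.append(tokenize.tok_name[tok.type])
        elif (abstract_names and tok.type == tokenize.NAME
              and not keyword.iskeyword(tok.string)):
            out.append("ID")
        elif abstract_names and tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        else:
            out.append(tok.string)
    return out


def clone_type(a, b):
    """Classify two snippets as 'Type-1', 'Type-2', or None."""
    if token_stream(a) == token_stream(b):
        return "Type-1"  # identical up to whitespace and comments
    if token_stream(a, True) == token_stream(b, True):
        return "Type-2"  # identical up to identifiers and literals
    return None


def has_license_risk(generated_license, source_license):
    """Assumed rule, for illustration only: flag a risk when the traced
    source is licensed and the generated code does not carry that license."""
    if source_license is None:
        return False
    return generated_license != source_license


generated = "def add(a, b):\n    # sum two values\n    return a + b\n"
source = "def add(a, b):\n    return a + b\n"
renamed = "def plus(x, y):\n    return x + y\n"

print(clone_type(generated, source))      # Type-1
print(clone_type(renamed, source))        # Type-2
print(has_license_risk(None, "GPL-3.0"))  # True
```

    Token-level comparison is chosen here only because it is easy to follow. It handles exact and renamed clones, but not the looser clone types that a production clone detector would also need to cover.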

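Cite this article

Wang Yibo, Wang Ying, Yu Yue, Xu Chang, Yu Hai, Zhu Zhiliang. Insights and Analysis of Open Source License Violation Risks in Large Language Models Generated Code. Journal of Software (软件学报), 2025, 36(6): 0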

History
  • Received: 2024-08-24
  • Last revised: 2024-10-14
  • Published online: 2024-12-10