Insights and Analysis of Open-source License Violation Risks in LLMs Generated Code
Authors:

Wang Yibo, Wang Ying, Yu Yue, Xu Chang, Yu Hai, Zhu Zhiliang

Corresponding author:

Wang Ying, E-mail: wangying@swc.neu.edu.cn

CLC number:

TP311

Funding:

National Natural Science Foundation of China (61932021, 62141210); 111 Project (B16009)

Abstract:

The rapid development of large language models (LLMs) has significantly influenced software engineering. Pre-trained on vast amounts of code from open-source repositories, these models can efficiently perform tasks such as code generation and code completion. However, much of the code in open-source repositories is governed by open-source licenses, which exposes LLMs to potential license violation risks. This study focuses on the license violation risks between LLM-generated code and open-source repositories. Based on code clone detection, it develops a framework that traces LLM-generated code back to its sources and detects copyright violations. The framework is applied to 135 000 Python code snippets generated by 9 mainstream code LLMs, tracing their sources in the open-source community and checking open-source license compatibility. Three research questions are investigated empirically to explore the impact of LLM code generation on the open-source software ecosystem: (1) To what extent is LLM-generated code cloned from open-source software repositories? (2) Does LLM-generated code carry open-source license violation risks? (3) Does LLM-generated code included in real open-source software carry open-source license violation risks? The experimental results show that, of the 43 130 and 65 900 Python code snippets longer than six lines generated from functional descriptions and from method signatures, 68.5% and 60.9% respectively are traced to cloned open-source code fragments. The CodeParrot and CodeGen series models have the highest clone ratios, and GPT-3.5-Turbo has the lowest. In addition, 92.7% of the code generated from functional descriptions contains no open-source license declaration. Comparison with the licenses of the traced source code shows that 81.8% of the code carries open-source license violation risks. Furthermore, of 229 LLM-generated code snippets collected from developers on GitHub, 136 are traced to open-source code fragments; 38 of these are Type-1 and Type-2 clones, and 30 carry open-source license violation risks. These cases were submitted to the developers as issue reports, and 8 developers have responded so far.
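
To make the Type-1 and Type-2 clone categories mentioned in the abstract concrete, the following is a minimal Python sketch and not the framework described in the paper. It assumes the standard definitions: a Type-1 clone is identical apart from whitespace, layout and comments, and a Type-2 clone additionally allows renamed identifiers and changed literals. The helper names (`tokens`, `clone_type`, `needs_license_review`), the sample snippets and the review rule are all hypothetical. In particular, the license check is a deliberately conservative assumption (any difference between the declared license and the source license is flagged for manual review) and is not a real compatibility analysis.

```python
import io
import keyword
import tokenize


def tokens(code, normalize=False):
    """Return a token sequence that ignores comments, blank lines and layout.

    With normalize=True, identifiers and literals are replaced by
    placeholders (blind renaming), which is enough to illustrate Type-2.
    """
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines never affect the comparison
        if tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            # Keep block structure, but not the exact indentation width.
            out.append(tokenize.tok_name[tok.type])
        elif (normalize and tok.type == tokenize.NAME
              and not keyword.iskeyword(tok.string)):
            out.append("ID")
        elif normalize and tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        else:
            out.append(tok.string)
    return out


def clone_type(generated, candidate):
    """Classify a pair of snippets as 'Type-1', 'Type-2', or None."""
    if tokens(generated) == tokens(candidate):
        return "Type-1"
    if tokens(generated, True) == tokens(candidate, True):
        return "Type-2"
    return None


def needs_license_review(declared, source):
    """Hypothetical rule: flag when the traced source is licensed and the
    generated code declares no license or a different one."""
    return source is not None and declared != source


original = '''
def total(values):
    # sum the values
    result = 0
    for v in values:
        result += v
    return result
'''

reformatted = '''
def total(values):
    result = 0

    for v in values:
        result += v   # accumulate
    return result
'''

renamed = '''
def add_all(items):
    acc = 0
    for x in items:
        acc += x
    return acc
'''

different = '''
def total(values):
    return sum(values)
'''

print(clone_type(original, reformatted))
print(clone_type(original, renamed))
print(clone_type(original, different))
print(needs_license_review(None, "GPL-3.0-only"))
```

Blind normalization maps every identifier to the same placeholder, so it can over-report Type-2 matches; a production detector would typically use consistent renaming and an index over a large corpus instead of pairwise comparison.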

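Cite this article

Wang Yibo, Wang Ying, Yu Yue, Xu Chang, Yu Hai, Zhu Zhiliang. Insights and Analysis of Open-source License Violation Risks in LLMs Generated Code. Journal of Software, 2025, 36(6): 2536-2558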

History
  • Received: 2024-08-24
  • Last revised: 2024-10-14
  • Published online: 2024-12-10