Insights and Analysis of Open-source License Violation Risks in LLMs Generated Code

doi:10.13328/j.cnki.jos.007324

Journal of Software, 2025, Vol. 36, No. 6, pp. 2536-2558. DOI:10.13328/j.cnki.jos.007324

Authors:
WANG Yi-Bo (王毅博), Software College, Northeastern University, Shenyang 110169, China
WANG Ying (王莹), Software College, Northeastern University, Shenyang 110169, China
YU Yue (余跃), College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
XU Chang (许畅), Department of Computer Science and Technology, Nanjing University, Nanjing 210046, China; State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 210046, China
YU Hai (于海), Software College, Northeastern University, Shenyang 110169, China
ZHU Zhi-Liang (朱志良), Software College, Northeastern University, Shenyang 110169, China

Corresponding author: WANG Ying, E-mail: wangying@swc.neu.edu.cn
CLC number: TP311
Fund project: National Natural Science Foundation of China (61932021, 62141210); 111 Project (B16009)

Abstract:

The rapid development of large language models (LLMs) has significantly influenced software engineering. Pre-trained on vast amounts of code from open-source repositories, these models efficiently perform tasks such as code generation and code completion. However, much of the code in open-source repositories is governed by open-source licenses, which exposes LLMs to potential license violation risks. This study focuses on the license violation risks between LLM-generated code and open-source repositories. Based on code clone techniques, a detection framework is developed that traces the provenance of LLM-generated code and identifies copyright violation issues. The framework is applied to 135,000 Python code snippets generated by 9 mainstream code LLMs, tracing their sources in the open-source community and checking open-source license compatibility. Three research questions are investigated empirically to explore the impact of LLM code generation on the open-source software ecosystem: (1) To what extent is LLM-generated code cloned from open-source software repositories? (2) Does LLM-generated code carry open-source license violation risks? (3) Does the LLM-generated code included in real-world open-source software carry open-source license violation risks? The experimental results show that, among the 43,130 and 65,900 Python code snippets longer than six lines generated from functional descriptions and method signatures, 68.5% and 60.9%, respectively, are traced to cloned open-source code segments. The CodeParrot and CodeGen series models have the highest clone ratios, while GPT-3.5-Turbo has the lowest. In addition, 92.7% of the code generated from functional descriptions contains no open-source license declaration. Compared against the licenses of the traced source code, 81.8% of the code carries open-source license violation risks. Furthermore, among 229 LLM-generated code snippets collected from developers on GitHub, 136 are traced to open-source code segments, of which 38 are Type-1 or Type-2 clones and 30 carry open-source license violation risks. These cases were submitted to the developers as issue reports, and feedback has been received from eight developers so far.

Key words: large language model (LLM); open-source license; open-source license violation; code clone; code search
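
The abstract refers to Type-1 and Type-2 clones without defining them. Under the usual definitions, a Type-1 clone is identical up to layout and comments, and a Type-2 clone is additionally identical up to renamed identifiers and changed literals. The following is a minimal sketch of that distinction for Python snippets, built on the standard-library `tokenize` module. It is not the paper's detection framework: the function names `normalize` and `clone_type` are hypothetical, and the blind replacement of every identifier with a single placeholder is a simplifying assumption.

```python
import io
import keyword
import tokenize

# Tokens that only reflect comments, blank lines, or end of input.
_SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
# Structural tokens whose text varies with layout (tabs vs. spaces),
# so only their type is kept.
_STRUCTURAL = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}


def normalize(source, abstract_names=False):
    """Return a layout-independent token list for a Python snippet.

    With abstract_names=True, identifiers become "ID" and number/string
    literals become "LIT" (blind renaming; an illustrative assumption).
    """
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type in _STRUCTURAL:
            result.append(tokenize.tok_name[tok.type])
        elif (abstract_names and tok.type == tokenize.NAME
              and not keyword.iskeyword(tok.string)):
            result.append("ID")
        elif abstract_names and tok.type in (tokenize.NUMBER, tokenize.STRING):
            result.append("LIT")
        else:
            result.append(tok.string)
    return result


def clone_type(snippet_a, snippet_b):
    """Classify a pair as "Type-1", "Type-2", or None (neither)."""
    if normalize(snippet_a) == normalize(snippet_b):
        return "Type-1"
    if normalize(snippet_a, True) == normalize(snippet_b, True):
        return "Type-2"
    return None


generated = "def add(x, y):\n    # sum two numbers\n    return x + y\n"
reformatted = "def add(x, y):\n\treturn x + y  # sum\n"
renamed = "def plus(a, b):\n    return a + b\n"

print(clone_type(generated, reformatted))
print(clone_type(generated, renamed))
```

Because every identifier maps to the same placeholder, this sketch does not check that renaming is consistent, so it can over-report Type-2 matches. A practical tool would also need an index over a large code corpus rather than pairwise comparison.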

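Cite this article

WANG Yi-Bo, WANG Ying, YU Yue, XU Chang, YU Hai, ZHU Zhi-Liang. Insights and Analysis of Open-source License Violation Risks in LLMs Generated Code. Journal of Software, 2025, 36(6): 2536-2558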

History

Received: 2024-08-24
Last revised: 2024-10-14
Published online: 2024-12-10
