Insights and Analysis of Open Source License Violation Risks in Large Language Models Generated Code

doi:10.13328/j.cnki.jos.007324

Volume 36, Issue 6, 2025

CLC Number: TP311


Abstract:

The rapid development of large language models (LLMs) has significantly impacted the field of software engineering. These LLMs, pre-trained on extensive open-source code datasets, can efficiently perform tasks such as code generation and completion. However, these datasets contain a large amount of licensed code, which exposes the LLMs to a risk of license violation. This paper focuses on the risk of license violations between code generated by LLMs and open-source repositories. Based on code clone detection technology, we developed a detection framework that supports tracing the source of code generated by LLMs and identifying copyright infringement issues. Using this framework, we traced the origins of 135,000 Python code samples generated by nine mainstream code LLMs within the open-source community and checked their open-source license compatibility. We investigate three research questions: "To what extent is the code generated by large models cloned from open-source software repositories?", "Is there a risk of open-source license violations in the code generated by large models?", and "Is there a risk of open-source license violations in the large model-generated code included in real open-source software?" Through these questions, we explore the impact of large model code generation on the open-source software ecosystem. The experimental results show that among the 43,130 and 65,900 Python code samples longer than six lines that the nine LLMs generated from function descriptions and method signatures, respectively, 68.5% and 60.9% had code clones in open-source code. CodeParrot and CodeGen had the highest code clone rates, while GPT-3.5-Turbo had the lowest. In addition, 92.7% of the code generated from function descriptions lacked a license declaration. When compared against the licenses of the matched open-source code, 81.8% of the code had potential license violation risks. Furthermore, among 229 LLM-generated code samples collected from GitHub, 136 were cloned from open-source code, of which 38 were classified as Type-1 or Type-2 clones and 30 had potential license violation risks. We reported these issues to the developers and have so far received feedback from eight of them.
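The abstract does not describe the internals of the detection framework, so the following is only a minimal sketch of how a Type-1/Type-2 clone check between a generated snippet and a candidate open-source snippet might look. It relies on the standard clone taxonomy (Type-1: identical apart from whitespace, layout, and comments; Type-2: additionally allowing renamed identifiers and changed literals). The function names `normalize` and `clone_type`, the way lines are counted, and the use of Python's `tokenize` module are all assumptions made for illustration, not the authors' implementation.

```python
import io
import keyword
import tokenize

# Tokens that carry no program meaning for clone comparison.
_IGNORED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
# Structural tokens compared by kind, so indentation width does not matter.
_STRUCTURAL = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}


def normalize(source: str, abstract_names: bool) -> list[str]:
    """Tokenize Python source, dropping comments and blank lines.

    With abstract_names=True, identifiers and literals are replaced by
    placeholders so that renamed variables still compare equal (Type-2).
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _IGNORED:
            continue
        if tok.type in _STRUCTURAL:
            tokens.append(tokenize.tok_name[tok.type])
        elif abstract_names and tok.type == tokenize.NAME \
                and not keyword.iskeyword(tok.string):
            tokens.append("ID")
        elif abstract_names and tok.type in (tokenize.NUMBER, tokenize.STRING):
            tokens.append("LIT")
        else:
            tokens.append(tok.string)
    return tokens


def clone_type(generated: str, candidate: str, min_lines: int = 6):
    """Return "Type-1", "Type-2", or None for a pair of snippets.

    Snippets with min_lines or fewer non-blank lines are skipped, mirroring
    the "longer than six lines" filter in the abstract (counting non-blank
    lines is an assumption of this sketch).
    """
    line_count = sum(1 for line in generated.splitlines() if line.strip())
    if line_count <= min_lines:
        return None
    if normalize(generated, False) == normalize(candidate, False):
        return "Type-1"
    if normalize(generated, True) == normalize(candidate, True):
        return "Type-2"
    return None
```

In this sketch, a snippet that differs from the candidate only in comments, blank lines, or indentation width is reported as Type-1, and one that additionally renames identifiers or changes literal values is reported as Type-2. Near-miss clones with added or removed statements would need a similarity-based comparison, which is outside the scope of this example.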

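Citation:

Wang Yibo, Wang Ying, Yu Yue, Xu Chang, Yu Hai, Zhu Zhiliang. Insights and Analysis of Open Source License Violation Risks in Large Language Models Generated Code. Journal of Software, 2025, 36(6): 0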

History

Received: August 24, 2024
Revised: October 14, 2024
Online: December 10, 2024

Copyright: Institute of Software, Chinese Academy of Sciences
Address: 4# South Fourth Street, Zhong Guan Cun, Beijing; Postal Code: 100190
Phone: 010-62562563 Fax: 010-62562533 Email: jos@iscas.ac.cn