Insights and Analysis of Open Source License Violation Risks in Large Language Models Generated Code
CLC Number: TP311

    Abstract:

    The rapid development of large language models (LLMs) has significantly impacted software engineering. Pre-trained on extensive open-source code datasets, these models can efficiently perform tasks such as code generation and completion. However, these datasets contain a large amount of licensed code, which exposes LLMs to the risk of license violations. This paper focuses on the risk of license violations between LLM-generated code and open-source repositories. Building on code clone detection, we developed a framework that traces the source of LLM-generated code and identifies copyright infringement issues. Using this framework, we traced 135,000 Python code samples generated by nine mainstream code LLMs back to the open-source community and checked their open-source license compatibility. We investigate three research questions to explore the impact of LLM code generation on the open-source software ecosystem: "To what extent is LLM-generated code cloned from open-source software repositories?", "Does LLM-generated code carry a risk of open-source license violations?", and "Does LLM-generated code included in real open-source software carry a risk of open-source license violations?". The experimental results show that, of the 43,130 and 65,900 Python code snippets longer than six lines that the nine LLMs generated from functional descriptions and method signatures, 68.5% and 60.9%, respectively, had code clones in open-source code. CodeParrot and CodeGen had the highest code clone rates, while GPT-3.5-Turbo had the lowest. In addition, 92.7% of the code generated from functional descriptions lacked a license declaration. When compared against the licenses of the matched open-source code, 81.8% of the code had potential license violation risks. Furthermore, of 229 LLM-generated code snippets collected from GitHub, 136 were cloned from open-source code, of which 38 were classified as Type-1 and Type-2 clones and 30 had potential license violation risks. We reported these issues to the developers and have so far received feedback from eight of them.
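    The Type-1 and Type-2 clone categories mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's detection framework, whose internals the abstract does not describe. It assumes the usual definitions: Type-1 clones are identical apart from whitespace, layout, and comments, and Type-2 clones additionally allow renamed identifiers and changed literals. The names `normalize` and `classify_clone` are hypothetical and chosen only for illustration.

```python
import io
import keyword
import tokenize


def normalize(source: str, abstract_names: bool) -> list:
    """Return the token stream of `source` without comments or blank lines.

    Assumption for illustration: with abstract_names=True, every identifier
    becomes "ID" and every number or string literal becomes "LIT", so that
    consistently renamed copies compare equal (Type-2).
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines never affect the comparison
        if tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            # Keep block structure, but not the exact whitespace used.
            tokens.append(tokenize.tok_name[tok.type])
        elif abstract_names and tok.type == tokenize.NAME \
                and not keyword.iskeyword(tok.string):
            tokens.append("ID")
        elif abstract_names and tok.type in (tokenize.NUMBER, tokenize.STRING):
            tokens.append("LIT")
        else:
            tokens.append(tok.string)
    return tokens


def classify_clone(generated: str, candidate: str):
    """Return "Type-1", "Type-2", or None for a pair of code snippets."""
    if normalize(generated, False) == normalize(candidate, False):
        return "Type-1"
    if normalize(generated, True) == normalize(candidate, True):
        return "Type-2"
    return None


original = "def add(x, y):\n    # sum two values\n    return x + y\n"
reformatted = "def add(x, y):\n\n    return x+y  # same logic\n"
renamed = "def plus(a, b):\n    return a + b\n"

print(classify_clone(original, reformatted))
print(classify_clone(original, renamed))
```

    A whole-snippet comparison like this only catches exact Type-1 and Type-2 matches. Tracing generated code across large repositories would need indexing and partial matching, which this sketch does not attempt.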

Get Citation

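Wang Yibo, Wang Ying, Yu Yue, Xu Chang, Yu Hai, Zhu Zhiliang. Insights and Analysis of Open Source License Violation Risks in Large Language Models Generated Code. Journal of Software (软件学报), 2025, 36(6):0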

History
  • Received: August 24, 2024
  • Revised: October 14, 2024
  • Online: December 10, 2024
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address: 4# South Fourth Street, Zhong Guan Cun, Beijing 100190
Phone: 010-62562563 Fax: 010-62562533 Email: jos@iscas.ac.cn