Abstract: The rapid development of large language models (LLMs) has significantly impacted the field of software engineering. Pre-trained on extensive open-source code datasets, these LLMs can efficiently perform tasks such as code generation and completion. However, these datasets contain a large amount of licensed code, which exposes the LLMs to a risk of license violations. This paper focuses on the risk of license violations between LLM-generated code and open-source repositories. Building on code clone detection, we developed a framework that traces the source of LLM-generated code and identifies copyright infringement issues. Using this framework, we traced the origins and checked the open-source license compatibility of 135,000 Python code samples generated by nine mainstream code LLMs. We explore the impact of LLM code generation on the open-source software ecosystem through three research questions: "To what extent is the code generated by large models cloned from open-source software repositories?", "Is there a risk of open-source license violations in the code generated by large models?", and "Is there a risk of open-source license violations in the large model-generated code included in real open-source software?". The experimental results show that, of the 43,130 and 65,900 Python code samples longer than six lines that the nine LLMs generated from function descriptions and method signatures, respectively, 68.5% and 60.9% had code clones in open-source code. CodeParrot and CodeGen had the highest code clone rates, while GPT-3.5-Turbo had the lowest. In addition, 92.7% of the code generated from function descriptions lacked a license declaration. When these samples were compared against the licenses of the corresponding open-source code, 81.8% posed a potential license violation risk.
Furthermore, among 229 LLM-generated code samples collected from GitHub, 136 were cloned from open-source code, with 38 classified as Type-1 or Type-2 clones and 30 carrying potential license violation risks. We reported these issues to the developers and have so far received feedback from eight of them.