• Volume 36, Issue 6, 2025 Table of Contents
    • Survey on Fuzzing Based on Large Language Model

      2025, 36(6):2404-2431. DOI: 10.13328/j.cnki.jos.007323


      Abstract: Fuzzing, as an automated software testing method, aims to detect potential security vulnerabilities, software defects, or abnormal behaviors by feeding a large quantity of automatically generated test data into the target software system. However, traditional fuzzing techniques are restricted by factors such as low automation, low testing efficiency, and low code coverage, and therefore struggle to handle modern large-scale software systems. In recent years, the rapid development of large language models (LLMs) has not only brought significant breakthroughs to natural language processing but also introduced new automation solutions to fuzzing. To better enhance the effectiveness of fuzzing, existing works have proposed various fuzzing methods that incorporate LLMs, covering modules such as test input generation, defect detection, and post-fuzzing. Nevertheless, these works lack a systematic investigation and discussion of LLM-based fuzzing techniques. To fill this gap, this study comprehensively analyzes and summarizes the current research and development status of LLM-based fuzzing. The main contents include: (1) summarizing the overall fuzzing process and the LLM-related technologies commonly used in fuzzing research; (2) discussing the limitations of deep-learning-based fuzzing methods before the LLM era; (3) analyzing how LLMs are applied at different stages of fuzzing; (4) exploring the main challenges and possible future development directions of LLM technology in fuzzing.
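      For illustration only: a minimal Python sketch of how an LLM can be slotted into the test input generation and defect detection stages of a fuzzing loop. The `llm.complete` interface and the `generate_inputs` helper are hypothetical placeholders, not tools from the surveyed literature.

      ```python
      import subprocess

      def generate_inputs(llm, grammar_hint, n=10):
          """Ask the LLM for n candidate inputs; llm.complete() is a hypothetical API."""
          prompt = f"Generate {n} diverse test inputs for a parser of: {grammar_hint}"
          return llm.complete(prompt).splitlines()[:n]

      def fuzz_once(target_cmd, test_input):
          """Run the target on one input and report whether it crashed."""
          proc = subprocess.run(target_cmd, input=test_input.encode(),
                                capture_output=True, timeout=5)
          return proc.returncode < 0  # negative return code: killed by a signal (crash)

      def llm_fuzz(llm, target_cmd, grammar_hint, rounds=100):
          crashes = []
          for _ in range(rounds):
              for candidate in generate_inputs(llm, grammar_hint):
                  try:
                      if fuzz_once(target_cmd, candidate):
                          crashes.append(candidate)  # keep crashing inputs for post-fuzzing triage
                  except subprocess.TimeoutExpired:
                      crashes.append(candidate)      # hangs are also worth triaging
          return crashes
      ```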

    • Detection of Resource Leaks in Java Programs: Effectiveness Analysis of Traditional Models and Language Models

      2025, 36(6):2432-2452. DOI: 10.13328/j.cnki.jos.007327


      Abstract: Resource leaks are defects caused by the failure to close limited system resources promptly and properly; they are widespread in programs written in various languages and are often hard to detect. Traditional defect detection methods usually predict resource leaks based on rules and heuristic search. In recent years, deep-learning-based defect detection methods have captured the semantic information in code through different code representations and techniques such as recurrent neural networks and graph neural networks. Recent studies show that language models perform outstandingly in tasks such as code understanding and generation. However, the advantages and limitations of large language models (LLMs) in the specific task of resource leak detection have not been fully evaluated. This study evaluates the effectiveness of detection methods based on traditional models, small models, and LLMs on resource leak detection, and explores improvement methods such as few-shot learning, fine-tuning, and the combination of static analysis with LLMs. Specifically, taking the JLeaks and DroidLeaks datasets as experimental subjects, the performance of different models is analyzed along multiple dimensions, including the root causes of resource leaks, resource types, and code complexity. The experimental results show that fine-tuning can significantly improve the detection performance of LLMs on resource leak detection. However, most models still need improvement in identifying resource leaks caused by third-party libraries. In addition, code complexity has a greater influence on detection methods based on traditional models.
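      As a rough illustration of the rule-based end of the spectrum discussed above, the following Python sketch flags Java source lines that allocate a closable resource outside a try-with-resources block and never call close() afterwards. The resource list and text-matching rules are illustrative assumptions; real detectors work on program graphs rather than raw text.

      ```python
      import re

      # Illustrative subset of closable JDK resource types.
      OPEN_PATTERN = re.compile(r"new\s+(FileInputStream|FileOutputStream|Socket|BufferedReader)\b")

      def flag_possible_leaks(java_source: str):
          """Very rough heuristic: report lines that allocate a closable resource
          outside a try-with-resources block when no close() call follows."""
          findings = []
          lines = java_source.splitlines()
          for i, line in enumerate(lines, start=1):
              if OPEN_PATTERN.search(line) and "try (" not in line:
                  rest = "\n".join(lines[i:])
                  if ".close(" not in rest:
                      findings.append((i, line.strip()))
          return findings
      ```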

    • Survey on Testing of Intelligent Chip Design Program

      2025, 36(6):2453-2476. DOI: 10.13328/j.cnki.jos.007328


      Abstract: In the current intelligent era, chips, as the core components of intelligent electronic devices, play a critical role in fields such as artificial intelligence, the Internet of Things, and 5G communication, and ensuring their correctness, security, and reliability is of great significance. During chip development, developers first implement the chip design in software form (i.e., chip design programs) using hardware description languages, and then conduct physical design and finally tape-out (i.e., production and manufacturing). As the basis of chip design and manufacturing, the quality of chip design programs directly impacts the quality of the resulting chips, so testing chip design programs is of important research significance. Early testing methods mainly depended on test cases manually designed by developers, often requiring a large amount of manual effort and time. As chip design programs have grown more complex, various simulation-based automated testing methods have been proposed, improving the efficiency and effectiveness of chip design program testing. In recent years, more and more researchers have been committed to applying intelligent methods such as machine learning, deep learning, and large language models (LLMs) to chip design program testing. This study surveys 88 academic papers on intelligent chip design program testing and organizes and summarizes the existing achievements from three perspectives: test input generation, test oracle construction, and test execution optimization. It focuses on the evolution of chip design program testing methods from the machine learning stage to the deep learning stage and then to the large language model stage, exploring the potential of methods at each stage to improve testing efficiency and coverage and to reduce testing costs. Additionally, it introduces research datasets and tools in this field and envisions future development directions and challenges.
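      A minimal sketch of the simulation-based testing idea mentioned above: random stimuli are generated and the design under test is compared against a golden reference model serving as the test oracle. The `simulate` and `reference_model` callables are hypothetical stand-ins for an HDL simulator invocation and a specification model.

      ```python
      import random

      def random_stimulus(width: int) -> int:
          """Draw a random input vector for a bus of the given bit width."""
          return random.getrandbits(width)

      def run_random_tests(simulate, reference_model, width=32, trials=1000):
          """Compare the simulated design against a golden reference model
          over randomly generated stimuli; mismatches indicate potential defects."""
          mismatches = []
          for _ in range(trials):
              stimulus = random_stimulus(width)
              if simulate(stimulus) != reference_model(stimulus):
                  mismatches.append(stimulus)
          return mismatches
      ```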

    • Exploration and Improvement of Capabilities of LLMs in Code Refinement Task

      2025, 36(6):2477-2500. DOI: 10.13328/j.cnki.jos.007325


      Abstract: As a crucial part of automated code review, the code refinement task is of great significance for improving development efficiency and code quality. Since large language models (LLMs) have shown far better performance than traditional small-scale pre-trained models in software engineering, this study explores the performance of these two types of models in automatic code refinement to evaluate the comprehensive advantages of LLMs. Traditional code quality evaluation metrics (e.g., BLEU, CodeBLEU, edit progress) are used to evaluate four mainstream LLMs and four representative small-scale pre-trained models on the code refinement task. The findings indicate that the refinement quality of LLMs in the pre-review code refinement subtask is inferior to that of small-scale pre-trained models. Since the existing code quality evaluation metrics have difficulty explaining this phenomenon, this study proposes Unidiff-based code refinement evaluation metrics that quantify the change operations performed during refinement, in order to explain the inferior performance and reveal the models’ tendencies when performing change operations: (1) the pre-review code refinement task is rather difficult; the accuracy of the models in performing correct change operations is extremely low, and LLMs are more “aggressive” than small-scale pre-trained models, that is, they tend to perform more code change operations, which leads to their poor performance; (2) compared with small-scale pre-trained models, LLMs tend to perform more ADD and MODIFY change operations, and the average number of inserted code lines in ADD operations is larger, further confirming their “aggressive” nature. To alleviate the disadvantages of LLMs in the pre-review refinement task, this study introduces the LLM-Voter method based on LLMs and ensemble learning, which includes two sub-schemes, Inference-based and Confidence-based, aiming to integrate the advantages of different base models to improve refinement quality. On this basis, a refinement determination mechanism is further introduced to enhance the decision stability and reliability of the model. Experimental results demonstrate that the Confidence-based LLM-Voter method significantly increases the exact match (EM) value and achieves refinement quality better than all base models, thus effectively alleviating the disadvantages of LLMs.
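      A minimal sketch of counting change operations from a unified diff with Python's standard difflib. The ADD/DELETE/MODIFY tallies below are a simplified approximation for illustration, not the exact metric definitions proposed in the paper.

      ```python
      import difflib

      def count_change_operations(before: str, after: str):
          """Tally changed lines of a unified diff as ADD / DELETE / MODIFY operations."""
          diff = list(difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm=""))
          added = sum(1 for l in diff if l.startswith("+") and not l.startswith("+++"))
          removed = sum(1 for l in diff if l.startswith("-") and not l.startswith("---"))
          modified = min(added, removed)  # treat paired add/remove lines as in-place edits
          return {"ADD": added - modified, "DELETE": removed - modified, "MODIFY": modified}

      # Example: comparing a model's refined code against the original submission.
      # count_change_operations(original_code, refined_code)
      ```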

    • Large Language Model-Based Decomposition of Long Methods

      2025, 36(6):2501-2514. DOI: 10.13328/j.cnki.jos.007329


      Abstract: Long methods, along with other types of code smells, prevent software applications from reaching optimal readability, reusability, and maintainability. Consequently, automated detection and decomposition of long methods have been widely studied. Although existing approaches have significantly facilitated decomposition, their solutions often differ considerably from the optimal ones. To address this, the automatable portion of a publicly available dataset of real-world long methods is investigated. Based on the findings of this investigation, a new method called Lsplitter, based on large language models (LLMs), is proposed for automatically decomposing long methods. For a given long method, Lsplitter decomposes it into a series of shorter methods according to heuristic rules and LLMs. However, LLMs often split out highly similar methods, so Lsplitter applies a location-based algorithm to merge physically contiguous and highly similar methods into a longer method. Finally, the candidate results are ranked. Experiments are conducted on 2 849 long methods from real Java projects. The results show that, compared with traditional methods combined with a modularity matrix, the hit rate of Lsplitter is improved by 142%, and compared with methods purely based on LLMs, the hit rate is improved by 7.6%.
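      A rough sketch of the location-based merging step described above, assuming each candidate method is represented as a (start_line, end_line, body) tuple and using a simple textual similarity from difflib; both the data model and the threshold are illustrative assumptions rather than the paper's implementation.

      ```python
      import difflib

      def similar(a: str, b: str, threshold: float = 0.8) -> bool:
          """Plain textual similarity between two method bodies."""
          return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

      def merge_contiguous_similar(methods):
          """Merge physically contiguous, highly similar methods into one longer method.
          Each method is a (start_line, end_line, body) tuple."""
          merged = []
          for start, end, body in sorted(methods):
              if merged:
                  prev_start, prev_end, prev_body = merged[-1]
                  if start == prev_end + 1 and similar(prev_body, body):
                      merged[-1] = (prev_start, end, prev_body + "\n" + body)
                      continue
              merged.append((start, end, body))
          return merged
      ```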

    • LLM-powered Datalog Code Translation and Incremental Program Analysis Framework

      2025, 36(6):2515-2535. DOI: 10.13328/j.cnki.jos.007330


      Abstract: Datalog, a declarative logic programming language, is widely applied in various fields. In recent years, growing interest in Datalog from both academia and industry has led to the design and development of multiple Datalog engines and corresponding dialects. However, code implemented in one Datalog dialect generally cannot be executed on the engine of another dialect. Therefore, when a new Datalog engine is adopted, existing Datalog code needs to be translated into the new dialect. Current Datalog code translation techniques fall into two categories, manually rewriting the code and manually designing translation rules, both of which are time-consuming, involve a large amount of repetitive work, and lack flexibility and scalability. In this study, a Datalog code translation technique empowered by large language models (LLMs) is proposed. By leveraging the powerful code understanding and generation capabilities of LLMs, together with a divide-and-conquer translation strategy, prompt engineering based on few-shot and chain-of-thought prompts, and an iterative error-correction mechanism based on check-feedback-repair, high-precision code translation between different Datalog dialects can be achieved, reducing the workload of developers in repeatedly developing translation rules. Based on this translation technique, a general declarative incremental program analysis framework based on Datalog is designed and implemented. The performance of the proposed LLM-powered Datalog code translation technique is evaluated on different Datalog dialect pairs, and the evaluation results verify its effectiveness. This study also experimentally evaluates the general declarative incremental program analysis framework, verifying the speedup achieved by incremental program analysis built on the proposed code translation technique.
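      A minimal sketch of a check-feedback-repair loop of the kind described above, written in Python. The `llm.complete` and `target_engine.check` interfaces are hypothetical placeholders for an LLM API and a Datalog engine's validation step.

      ```python
      def translate_with_repair(llm, target_engine, source_rule, max_rounds=3):
          """Translate one Datalog rule, validate it on the target engine,
          and feed any error message back to the LLM for repair."""
          prompt = f"Translate this Datalog rule into the target dialect:\n{source_rule}"
          candidate = llm.complete(prompt)
          for _ in range(max_rounds):
              ok, error_message = target_engine.check(candidate)  # e.g. parse / type check
              if ok:
                  return candidate
              prompt = (f"The translation below fails with: {error_message}\n"
                        f"Please repair it:\n{candidate}")
              candidate = llm.complete(prompt)
          return candidate  # best effort after max_rounds repair attempts
      ```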

    • Insights and Analysis of Open-source License Violation Risks in LLMs Generated Code

      2025, 36(6):2536-2558. DOI: 10.13328/j.cnki.jos.007324


      Abstract: The field of software engineering has been significantly influenced by the rapid development of large language models (LLMs). These models, pre-trained on vast amounts of code from open-source repositories, can efficiently accomplish tasks such as code generation and code completion. However, much of the code in open-source software repositories is constrained by open-source licenses, bringing potential license violation risks to large models. This study focuses on the license violation risks between code generated by LLMs and open-source repositories. A detection framework that supports tracing the source of code generated by large models and identifying copyright infringement issues is developed based on code clone detection technology. For 135 000 Python code samples generated by 9 mainstream code LLMs, this framework traces their sources and checks open-source license compatibility against the open-source community. Through a practical investigation of three research questions, the impact of large model code generation on the open-source software ecosystem is explored: (1) To what extent is the code generated by large models cloned from open-source software repositories? (2) Is there a risk of open-source license violations in the code generated by large models? (3) Is there a risk of open-source license violations in large model-generated code included in real open-source software? The experimental results indicate that among the 43 130 and 65 900 Python code samples longer than six lines generated from functional descriptions and method signatures, respectively, 68.5% and 60.9% are traced to cloned open-source code segments. The CodeParrot and CodeGen series models have the highest clone ratios, while GPT-3.5-Turbo has the lowest. In addition, 92.7% of the code generated from functional descriptions lacks a license declaration, and by comparison with the licenses of the traced code, 81.8% carries open-source license violation risks. Furthermore, among 229 LLM-generated code samples collected from GitHub, 136 are traced to open-source code segments, of which 38 are Type-1 or Type-2 clones and 30 carry open-source license violation risks. These issues have been reported to the developers as issue reports, and feedback has so far been received from eight developers.
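      A toy sketch of the license-compatibility check such a framework needs once a generated snippet has been traced to its open-source origin. The compatibility table is deliberately simplified and is an assumption for illustration; real license compatibility rules are far more nuanced.

      ```python
      from typing import Optional

      # Toy compatibility table: which license a derived work may carry, given the
      # license of the cloned source. Illustrative only, not legal guidance.
      COMPATIBLE = {
          "MIT":        {"MIT", "Apache-2.0", "GPL-3.0"},
          "Apache-2.0": {"Apache-2.0", "GPL-3.0"},
          "GPL-3.0":    {"GPL-3.0"},
      }

      def violates(source_license: str, generated_license: Optional[str]) -> bool:
          """Flag a potential violation when the generated code's declared license
          is missing or not permitted by the license of the cloned source."""
          if generated_license is None:   # no license declaration at all
              return True                 # treated as a risk in this simplified model
          return generated_license not in COMPATIBLE.get(source_license, set())
      ```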

    • Multi-agent Collaborative Code Reviewer Recommendation Based on Large Language Model

      2025, 36(6):2559-2576. DOI: 10.13328/j.cnki.jos.007326


      Abstract: The pull request (PR)-based software development mechanism is of great significance in open-source software practice. Appropriate code reviewers can help contributors detect potential errors in PRs through code review, thus providing quality assurance for the continuous development and integration process. However, the complexity of code changes and the inherent diversity of review behaviors make reviewer recommendation difficult. Existing methods mainly concentrate on mining the semantic information of changed code from PRs or constructing reviewer profiles based on review history, and then make recommendations through various static strategy combinations. These studies are restricted by the richness of model training corpora and the complexity of interaction types, leading to unsatisfactory recommendation performance. This study therefore proposes a novel code reviewer recommendation method based on inter-agent collaboration. The method utilizes advanced large language models to accurately capture the rich textual semantic information of PRs and reviewers, and the planning, collaboration, and decision-making capabilities of AI agents enable the integration of information from different interaction types, offering high flexibility and adaptability. Experimental analysis on real datasets shows that, compared with baseline reviewer recommendation methods, the performance of the proposed method is improved by 4.45% to 26.04%. In addition, a case study demonstrates that the proposed method performs well in terms of interpretability, further verifying its effectiveness and reliability in practical applications.
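      A toy Python sketch of a two-agent pipeline in the spirit of the approach described above: one agent summarizes the PR, another scores each candidate reviewer against that summary. The `llm.complete` interface and the scoring prompt are hypothetical, not the paper's actual agent design.

      ```python
      def recommend_reviewers(llm, pr_text, candidate_profiles, top_k=3):
          """Summarize the PR, score each candidate reviewer against the summary,
          and return the top_k names; llm.complete() is a hypothetical API."""
          summary = llm.complete(f"Summarize the key changes in this pull request:\n{pr_text}")
          scored = []
          for name, profile in candidate_profiles.items():
              verdict = llm.complete(
                  f"PR summary:\n{summary}\n\nReviewer history:\n{profile}\n"
                  "Rate this reviewer's fit from 0 to 10. Answer with a number only.")
              try:
                  score = float(verdict.strip())
              except ValueError:
                  score = 0.0               # unparsable answers rank last
              scored.append((score, name))
          return [name for _, name in sorted(scored, reverse=True)[:top_k]]
      ```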



Contact Information
  • Journal of Software
  • Sponsored by: Institute of Software, CAS, China
  • Postal Code: 100190
  • Phone: 010-62562563
  • Email: jos@iscas.ac.cn
  • Website: https://www.jos.org.cn
  • Serial Numbers: ISSN 1000-9825, CN 11-2560/TP
  • Domestic Price: CNY 70