TAN Chao, ZHANG Jing-Xuan, WANG Tie-Xin, YUE Tao
2021, 32(7):1926-1956. DOI: 10.13328/j.cnki.jos.006267
Abstract:Complex software systems (e.g., cyber-physical systems, the Internet of Things, and adaptive software systems) encounter various types of uncertainty in different phases of their development and operation. To handle these uncertainties, researchers have carried out a large body of work, proposed a series of methods, and achieved considerable results. However, there is still a lack of systematic understanding of the current state-of-the-art approaches. Motivated by this observation, this paper reports a systematic mapping study of 142 primary studies collected by following a rigorous literature review methodology. The scope of the study is to investigate how the literature deals with uncertainties appearing in the various phases of, or artifacts produced during, the development lifecycle of cyber-physical systems and the Internet of Things. Results show that uncertainties mainly appear in the phases of design definition, system analysis, and operation. Based on the 142 primary studies, uncertainties are first defined and classified into external uncertainty, internal uncertainty, and sensor uncertainty, and descriptive statistics are reported in terms of this classification. To explore uncertainty in more depth, external uncertainty is subdivided into environmental uncertainty, infrastructure uncertainty, user behavior uncertainty, and economic attribute uncertainty, while internal uncertainty is subdivided into uncertainty in system structure, internal interaction uncertainty, uncertainty in the technology supporting system operation, and uncertainty in the technology dealing with system operation. Furthermore, a second classification, with descriptive statistics, is presented for the primary studies that discuss uncertainties in eight different types of artifacts, including model uncertainty, data uncertainty, and parametric uncertainty. Results also show that researchers have mainly focused on decision-making under uncertainty, uncertainty reasoning, and uncertainty specification/modeling when dealing with uncertainties. Based on these results, future research trends in this area are discussed.
LI Nian-Yu, CHEN Zheng-Yin, LIU Kun, JIAO Wen-Pin
2021, 32(7):1957-1977. DOI: 10.13328/j.cnki.jos.006259
Abstract:The development of self-adaptive systems has attracted much attention as they can adapt themselves autonomously to environmental dynamics and maintain user satisfaction. However, tremendous challenges remain. One major challenge is to guarantee the reusability of the system and extend its adaptability to changing deployment environments, or to open and complex environments with unknowns. To address these problems, a conceptual self-adaptive model is introduced that decouples the environment from the system. This model is a two-layer structure based on the internal and external causes of attribution theory. The first layer, which determines how internal causes affect adaptation behaviors, is independently designed and reusable, while the second layer, which maps the relationship between external causes and internal causes, is replaceable and dynamically bound to different deployment environments. The proposed approach is evaluated in two case studies, a widely used benchmark e-commerce Web application and a destination-oriented robot system with obstacle and turnover avoidance, to demonstrate its applicability and reusability.
WANG Lu, LI Qing-Shan, Lü Wen-Qi, ZHANG He, LI Hao
2021, 32(7):1978-1998. DOI: 10.13328/j.cnki.jos.006268
Abstract:At present, self-adaptive software provides the ability to adapt to the operating environment for many systems in different fields. How to establish a self-adaptation analysis method that can quickly recognize abnormal events at runtime while assuring recognition quality is one of the research issues that must be addressed to ensure the long-term stable operation of self-adaptive software. The uncertainty of the runtime environment brings two challenges to this problem. On the one hand, analysis methods usually recognize events through mapping relationships between environment states and events that are established in advance. However, due to the complexity of the operating environment and its unknown changes, it is impossible to establish comprehensive and correct mapping relationships based on experience before the system runs, which affects the accuracy of event recognition. On the other hand, the changing operating environment makes it impossible to accurately predict when and which event will occur; if, as is currently done, the environmental status is obtained with a constant sensing period to recognize events, the recognition efficiency cannot be guaranteed. These urgent challenges remain largely unaddressed. Therefore, this study proposes a self-adaptation analysis method that uses event relationships for recognition quality assurance (SAFER). SAFER uses a sequential pattern mining algorithm, fuzzy fault trees (FFT), and a Bayesian network (BN) to extract and model the causalities between events. It recognizes events using both event causal relationships and mapping relationships through BN forward reasoning, which ensures recognition accuracy compared with traditional analysis methods that rely only on mapping relationships. Moreover, it establishes an elitist set of monitoring objects through BN backward reasoning and then dynamically adjusts the sensing period of the monitoring objects in the elitist set so as to obtain the environmental status as soon as possible after abnormal events occur, thereby ensuring recognition efficiency. The experimental results show that SAFER can effectively improve the accuracy and efficiency of the analysis process and support the long-term stable operation of self-adaptive software.
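The following is a minimal, self-contained sketch of the kind of forward/backward Bayesian-network reasoning described above: forward inference estimates whether an abnormal event is occurring from sensed environment states, and backward inference ranks candidate causes to form the elitist set of monitoring objects. The network structure, variable names, and probabilities are hypothetical illustrations, not SAFER's actual model.

```python
# Toy Bayesian network: two hypothetical monitored objects and one abnormal event.
from itertools import product

priors = {"cpu_overload": 0.1, "network_delay": 0.2}  # P(cause), hypothetical

# P(response_timeout | cpu_overload, network_delay), hypothetical CPT.
cpt_event = {
    (True, True): 0.95,
    (True, False): 0.7,
    (False, True): 0.6,
    (False, False): 0.05,
}

def joint(cpu, net, event):
    """Joint probability of one full assignment of the three variables."""
    p = priors["cpu_overload"] if cpu else 1 - priors["cpu_overload"]
    p *= priors["network_delay"] if net else 1 - priors["network_delay"]
    p_event = cpt_event[(cpu, net)]
    return p * (p_event if event else 1 - p_event)

# Forward reasoning: probability that the abnormal event is occurring, given the
# currently sensed state (CPU overload observed, network state unknown).
p_event = sum(joint(True, net, True) for net in (True, False)) / \
          sum(joint(True, net, e) for net, e in product((True, False), repeat=2))
print(f"P(response_timeout | cpu_overload observed) = {p_event:.3f}")

# Backward reasoning: given the event was recognized, rank candidate causes; the
# highest-posterior monitoring objects form the elitist set whose sensing period
# would be shortened.
p_evidence = sum(joint(c, n, True) for c, n in product((True, False), repeat=2))
posteriors = {
    "cpu_overload": sum(joint(True, n, True) for n in (True, False)) / p_evidence,
    "network_delay": sum(joint(c, True, True) for c in (True, False)) / p_evidence,
}
elitist = sorted(posteriors, key=posteriors.get, reverse=True)
print("posterior per cause:", posteriors, "-> elitist order:", elitist)
```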
AN Dong-Dong, LIU Jing, CHEN Xiao-Hong, SUN Hai-Ying
2021, 32(7):1999-2015. DOI: 10.13328/j.cnki.jos.006272
Abstract:With the development of technology, new complex systems such as human cyber-physical systems (hCPS) have become inseparable from social life. The cyberspace where the software system is located is increasingly integrated with the physical space of people's daily life. Uncertain factors such as the dynamic environment of the physical space, the explosive growth of spatio-temporal data, and unpredictable human behavior all compromise the security of the system. As security requirements increase, the scale and complexity of the system also increase, leading to a series of unresolved problems. Therefore, developing intelligent and safe human cyber-physical systems under uncertain environments is becoming an inevitable challenge for the software industry. It is difficult for human cyber-physical systems to perceive the runtime environment accurately under uncertain surroundings, and uncertain perception leads to misinterpretation by the system, thus affecting its security. It is also difficult for system designers to construct formal specifications for human cyber-physical systems under uncertain environments; for safety-critical systems, formal specifications are a prerequisite for ensuring system security. To cope with the uncertainty of the specifications, a combination of data-driven and model-driven modeling methodology is proposed; that is, machine learning-based algorithms are used to model the environment based on spatio-temporal data. An approach is introduced that integrates machine learning and runtime verification into a unified framework to ensure the safety of human cyber-physical systems. The proposed approach is illustrated by modeling and analyzing a scenario involving the interaction of an autonomous vehicle and a human-driven motorbike.
SHI Jian-Jun, JI Wei-Xing, SHI Feng
2021, 32(7):2016-2038. DOI: 10.13328/j.cnki.jos.006265
Abstract:Concurrency bug detection is a hot research topic in the areas of programming languages and software engineering. In recent years, researchers have made great progress in detecting concurrency bugs in applications. However, since operating system (OS) kernels feature high concurrency, complex synchronization mechanisms, and a large source code base, research on concurrency bug detection for OS kernels is more challenging than for applications. To address this issue, researchers have proposed various approaches to detect concurrency bugs in OS kernels. This study first introduces the basic bug types, detection techniques, and evaluation indicators of concurrency bug detection, and discusses the limitations of existing concurrency bug detection tools for OS kernels. Then, research on concurrency bug detection in OS kernels is described from four aspects: formal verification, static analysis, dynamic analysis, and combined static-dynamic analysis, and some typical approaches are comprehensively compared. Finally, the challenges of concurrency bug detection in OS kernels are discussed, and future research trends in this field are outlined.
GAO Feng-Juan, WANG Yu, ZHOU Jin-Guo, XU An-Zi, WANG Lin-Zhang, WU Rong-Xin, ZHANG Charles, SU Zhen-Dong
2021, 32(7):2039-2055. DOI: 10.13328/j.cnki.jos.006260
Abstract:With the development of techniques, the uncertainty in software systems is continuously increasing. Data races are a typical class of bugs in concurrent programs and a classic manifestation of such uncertainty. Despite significant progress in recent years, the important problem of practical static race detection remains open. Previous static techniques either suffer from a high false positive rate due to compromised precision, or from scalability issues caused by highly precise analysis. This paper presents GUARD, a staged approach to resolve this paradox. First, it performs a lightweight context-sensitive data access analysis, based on the value flow of a program, to identify candidate data race subpaths instead of whole-program paths. Second, may-happen-in-parallel (MHP) analysis is employed to identify whether two data accesses in a program may execute concurrently. This stage is scalable due to the design of the thread flow graph (TFG), which encodes thread information to query the MHP relationship of the subpaths. Finally, for each subpath whose two data accesses may happen in parallel, a heavyweight path-sensitive analysis is applied to verify the feasibility of the data race. The evaluation demonstrates that GUARD can finish checking industrial-sized projects, up to 1.3 MLoC, in 1870 seconds with an average false positive rate of 16.0%. Moreover, GUARD is faster than state-of-the-art techniques, with an average speedup of 6.08X and significantly fewer false positives. Besides, GUARD has found 12 new race bugs in real-world programs; all of them were reported to the developers and 8 have been confirmed.
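Below is a schematic sketch of the staged filtering idea described above: cheap analyses prune candidate access pairs before an expensive path-sensitive check is run on the survivors. The data structures, the crude MHP approximation, and the placeholder feasibility check are hypothetical simplifications, not GUARD's actual analyses.

```python
# Staged race-detection pipeline sketch: stage 1 pairs accesses cheaply, stage 2
# filters by a may-happen-in-parallel check, stage 3 stands in for the heavyweight
# path-sensitive verification.
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Access:
    thread: str      # owning thread (stand-in for thread-flow-graph information)
    var: str         # accessed shared variable
    is_write: bool
    path_id: int     # identifier of the value-flow subpath containing the access

def stage1_candidates(accesses):
    """Lightweight: pair accesses to the same variable where at least one writes."""
    for a, b in combinations(accesses, 2):
        if a.var == b.var and (a.is_write or b.is_write):
            yield a, b

def stage2_mhp(a, b):
    """MHP filter, crudely approximated here as 'different threads'."""
    return a.thread != b.thread

def stage3_path_feasible(a, b):
    """Placeholder for the heavyweight path-sensitive feasibility check
    (a real implementation would consult a constraint solver here)."""
    return True

accesses = [
    Access("t1", "counter", True, 1),
    Access("t2", "counter", True, 2),
    Access("t1", "flag", False, 1),
]
races = [(a, b) for a, b in stage1_candidates(accesses)
         if stage2_mhp(a, b) and stage3_path_feasible(a, b)]
print(f"{len(races)} candidate race(s) survive all three stages")
```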
ZHU Xiang-Lei, WANG Hai-Chi, YOU Han-Mo, ZHANG Wei-Heng, ZHANG Ying-Yi, LIU Shuang, CHEN Jun-Jie, WANG Zan, LI Ke-Qiu
2021, 32(7):2056-2077. DOI: 10.13328/j.cnki.jos.006266
Abstract:With the development of artificial intelligence, autonomous vehicles have become a typical application in the field of artificial intelligence, and in the past 10 years they have made considerable progress. As uncertain systems, their quality and safety have attracted much attention. Autonomous vehicle testing, especially testing of the intelligent systems in autonomous vehicles (such as the perception module, decision module, synthetical functional module, and the whole vehicle), has gained extensive attention from both industry and academia. This survey offers a systematic review of 56 papers related to autonomous vehicle testing. It analyzes the testing techniques with respect to the perception module, decision module, synthetical functional module, and the whole vehicle, including test case generation approaches and testing coverage metrics, as well as datasets and tools widely used in autonomous vehicle testing. Finally, this survey highlights future perspectives on autonomous vehicle testing and provides a reference for researchers in this field.
ZHANG Cheng-Bo, LI Ying, JIA Tong
2021, 32(7):2078-2102. DOI: 10.13328/j.cnki.jos.006269
Abstract:With the growth of graph data scale and the complexity of graph processing, the trend toward distributed graph processing is inevitable. However, graph processing jobs suffer from severe reliability problems caused by uncertainty originating both inside and outside the distributed graph processing system. This study first analyzes the uncertainty factors of distributed graph processing frameworks and the robustness of different types of graph processing jobs. It then proposes an evaluation framework for fault tolerance in distributed graph processing based on the cost, efficiency, and quality of fault tolerance. Combining related research, the study also analyzes, evaluates, and compares four fault-tolerance mechanisms for distributed graph processing: checkpointing-based, logging-based, replication-based, and algorithm-compensation-based fault tolerance. Finally, directions for future research are discussed.
2021, 32(7):2103-2117. DOI: 10.13328/j.cnki.jos.006264
Abstract:For data-driven intelligent systems, the data processing algorithms are very important and need to be tested adequately. Because of the high safety requirements, the cost of testing becomes very high and needs to be reduced. Regression test selection is an effective means of controlling the scale of testing. For data-driven intelligent systems, coincidental correctness happens frequently because of weak dynamic information flows, which leads to regression test sets containing many redundant tests. Therefore, a regression test selection technique based on the coincidental correctness probability is proposed. This method considers the probability of coincidental correctness in addition to code coverage. The selected tests not only cover the modified code, but also have a higher probability of transferring the intermediate results produced by the modified code to the program output. Such selection reduces the impact of coincidental correctness. The empirical results show that the proposed selection method can improve the precision of selection and reduce the size of the regression test sets.
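As a loose illustration of the selection criterion described above, the sketch below keeps only tests that cover the modified code and whose estimated probability of propagating the change's effects to the output (i.e., a low likelihood of coincidental correctness) exceeds a threshold. The test data, probabilities, and threshold are hypothetical; the paper's actual estimation procedure is not reproduced here.

```python
# Hypothetical regression-test-selection sketch combining coverage of modified code
# with an estimated propagation probability.
tests = {
    # test id: (covers modified code?, estimated propagation probability)
    "t1": (True, 0.9),
    "t2": (True, 0.2),   # covers the change but is likely coincidentally correct
    "t3": (False, 0.8),  # does not exercise the change at all
    "t4": (True, 0.7),
}

PROPAGATION_THRESHOLD = 0.5  # hypothetical cut-off

selected = [t for t, (covers, p_prop) in tests.items()
            if covers and p_prop >= PROPAGATION_THRESHOLD]
print("selected regression tests:", selected)   # -> ['t1', 't4']
```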
CHEN Xiang, YANG Guang, CUI Zhan-Qi, MENG Guo-Zhu, WANG Zan
2021, 32(7):2118-2141. DOI: 10.13328/j.cnki.jos.006258
Abstract:During software development and maintenance, code comments often suffer from problems such as being missing, insufficient, or mismatched with the code content. Writing high-quality code comments takes time and effort for developers, and the quality cannot be guaranteed; therefore, it is urgent for researchers to design effective automatic code comment generation methods. Automatic code comment generation is an active research topic in the program comprehension domain. This study conducts a systematic review of this research topic. Existing methods are divided into three categories: template-based generation methods, information retrieval-based methods, and deep learning-based methods. Related studies are analyzed and summarized for each category. Then, the corpora and comment quality evaluation methods often used in previous studies are analyzed, which can facilitate experimental studies in future work. Finally, potential future research directions are summarized and discussed.
NIU Chang-An, GE Ji-Dong, TANG Ze, LI Chuan-Yi, ZHOU Yu, LUO Bin
2021, 32(7):2142-2165. DOI: 10.13328/j.cnki.jos.006270
Abstract:Code comments play an important role in software quality assurance; they improve the readability of source code and make it easier to understand, reuse, and maintain. However, for various reasons, developers sometimes do not add the necessary comments, which forces developers to waste a lot of time understanding the source code and greatly reduces the efficiency of software maintenance. In recent years, much work has used machine learning to automatically generate corresponding comments for source code. These methods extract information such as code sequences and structure, and then utilize a sequence-to-sequence (seq2seq) neural model to generate the corresponding comments, achieving sound results. However, Hybrid-DeepCom, the state-of-the-art code comment generation model, is still deficient in two aspects. First, it may break the code structure during preprocessing, resulting in inconsistent input information across instances and poor model learning. Second, due to the limitations of the seq2seq model, it cannot generate out-of-vocabulary (OOV) words in comments. For example, variable names, method names, and other identifiers that appear very infrequently in source code are usually OOV words, yet without them comments are difficult to understand. To solve these problems, an automatic comment generation model named CodePtr is proposed in this study. On the one hand, a complete source code encoder is added to solve the problem of broken code structure; on the other hand, a pointer-generator network module is introduced to switch automatically between a word-generation mode and a word-copy mode at each decoding step; in particular, when encountering an identifier that appears only a few times in the input, the model can directly copy it to the output, thus solving the problem of being unable to generate OOV words. Finally, this study compares the CodePtr and Hybrid-DeepCom models through experiments on large datasets. The results show that, with a vocabulary size of 30 000, CodePtr improves translation performance metrics by 6% on average and the handling of OOV words by nearly 50%, which fully demonstrates the effectiveness of the CodePtr model.
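The sketch below illustrates the pointer-generator mixing step that enables copying OOV identifiers: the final word distribution combines the decoder's vocabulary distribution with the attention distribution over source tokens, gated by a generation probability. The vocabulary, attention weights, and gate value are hypothetical stand-ins; this is not the CodePtr implementation.

```python
# One decoding step of a pointer-generator mixture (illustrative values only).
import numpy as np

vocab = ["<unk>", "return", "the", "value", "of"]
source_tokens = ["return", "maxRetryCount", "value"]   # "maxRetryCount" is OOV

p_vocab = np.array([0.05, 0.40, 0.20, 0.25, 0.10])      # decoder softmax (hypothetical)
attention = np.array([0.10, 0.80, 0.10])                # attention over source tokens
p_gen = 0.3                                              # generate-vs-copy gate

# Extend the vocabulary with OOV source tokens for this step.
extended = vocab + [t for t in source_tokens if t not in vocab]
final = np.zeros(len(extended))
final[:len(vocab)] = p_gen * p_vocab
for tok, att in zip(source_tokens, attention):           # scatter copy probabilities
    final[extended.index(tok)] += (1 - p_gen) * att

best = extended[int(np.argmax(final))]
print("next word:", best)   # the OOV identifier "maxRetryCount" can be copied
```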
JIANG Shu-Juan, ZHANG Xu, WANG Rong-Cun, HUANG Ying, ZHANG Yan-Mei, XUE Meng
2021, 32(7):2166-2182. DOI: 10.13328/j.cnki.jos.006262
Abstract:Software fault localization is a time-consuming and laborious task, so improving the automation of software fault localization has always been a hot topic in the field of software engineering. Existing spectrum-based fault localization (SBFL) methods rarely use the context information of the program, which is very important for fault localization. To solve this problem, this study proposes a fault localization approach based on path analysis and information entropy (FLPI). Building on spectrum-based techniques, this approach introduces execution context information by analyzing the data dependencies in all execution paths, and introduces test event information into the suspiciousness formula using information entropy theory, so as to improve the accuracy and efficiency of fault localization. To evaluate the effectiveness of the proposed approach, experiments are conducted on a set of benchmark programs and open source programs. Experimental results show that the proposed FLPI approach can effectively improve the accuracy and efficiency of fault localization.
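For background, the sketch below computes a classic spectrum-based suspiciousness score (the Ochiai formula) from per-statement coverage counts. FLPI's actual formula additionally incorporates path data dependencies and information entropy over test events, which are only summarized in the abstract and are not reproduced here.

```python
# Classic SBFL ranking sketch with the Ochiai metric; the coverage data is made up.
import math

# coverage[stmt] = (failed tests covering stmt, passed tests covering stmt)
coverage = {"s1": (4, 1), "s2": (1, 6), "s3": (4, 4)}
total_failed = 4

def ochiai(ef, ep):
    """Suspiciousness of a statement covered by ef failing and ep passing tests."""
    return ef / math.sqrt(total_failed * (ef + ep)) if ef else 0.0

ranking = sorted(coverage, key=lambda s: ochiai(*coverage[s]), reverse=True)
print([(s, round(ochiai(*coverage[s]), 3)) for s in ranking])
# A path/entropy-enhanced approach would further weight such scores using
# data-dependence context along execution paths and test event entropy.
```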
2021, 32(7):2183-2203. DOI: 10.13328/j.cnki.jos.006263
Abstract:As developer communities and code-hosting platforms have become the primary means for programmers to access code, the number of user comments on code has increased dramatically. User comments contain a variety of static and dynamic code quality attributes. However, since most user comments are complex sentences, it is difficult to identify the code quality attributes in them. Judging the code quality attributes of complex user comments helps analyze the code quality information in user comments, and helps developers improve code quality by learning how users employ their code and which quality attributes they mention. In this study, a method is proposed to judge code quality attributes in complex user comments. First, a complex user comment is divided into clauses and a directed graph of the dependency syntactic relations of the clauses is constructed. The topic of each clause is then extracted based on a topic judgment rule over its dependency syntactic relations. Next, according to an initial feature thesaurus of code quality attributes, the code quality attributes corresponding to each topic are identified, and the code quality attribute representation and representation result for each topic are acquired. Finally, the code quality attribute representations and representation results in the complex user comment are analyzed based on topic processing rules, the code quality attribute result for the comment is produced, and the initial code quality attribute feature thesaurus is continuously expanded. The experimental results show that the proposed method can effectively judge the code quality attributes of complex user comments.
JIA Xiu-Yi, ZHANG Wen-Zhou, LI Wei-Wei, HUANG Zhi-Qiu
2021, 32(7):2204-2218. DOI: 10.13328/j.cnki.jos.006257
Abstract:Cross-project defect prediction technology can use existing labeled defect data to predict new unlabeled data, but it requires the two projects to have the same metric features, which is difficult to satisfy in actual development. Heterogeneous defect prediction can perform prediction without requiring the source and target projects to have the same set of metrics and has thus attracted great interest. Existing heterogeneous defect prediction models use naive or traditional machine learning methods to learn feature representations between the source and target projects and perform prediction based on them. The feature representations learned in previous studies are weak, causing poor performance in predicting defect-prone instances. In view of the powerful feature extraction and representation capabilities of deep neural networks, this study proposes a feature representation method for heterogeneous defect prediction based on variational autoencoders. By combining a variational autoencoder with maximum mean discrepancy, this method can effectively learn a common feature representation of the source and target projects, on which an effective defect prediction model can then be trained. The validity of the proposed method is verified by comparing it with traditional cross-project defect prediction methods and heterogeneous defect prediction methods on various datasets.
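The sketch below shows a generic maximum mean discrepancy (MMD) estimate with an RBF kernel, the distribution-alignment term that this kind of approach combines with a variational autoencoder so that source- and target-project instances share a feature space. The kernel bandwidth and the random latent codes are placeholders; this is not the authors' model.

```python
# Generic biased MMD^2 estimate between two sets of latent codes.
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    d2 = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x (source) and y (target)."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
src_latent = rng.normal(0.0, 1.0, size=(100, 8))   # hypothetical source-project codes
tgt_latent = rng.normal(0.5, 1.0, size=(100, 8))   # hypothetical target-project codes
print(f"MMD^2 = {mmd2(src_latent, tgt_latent):.4f}")
# During training, a term like this would be added to the VAE loss so the encoder
# maps source and target projects into a common feature space.
```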
ZHANG Xian, BEN Ke-Rong, ZENG Jie
2021, 32(7):2219-2241. DOI: 10.13328/j.cnki.jos.006261
Abstract:Software defect prediction is an active research topic in the domain of software quality assurance. It can help developers find potential defects and make better use of resources. How to design more discriminative metrics for the prediction system while taking both performance and interpretability into account has long been a research direction to which much effort has been devoted. Aiming at this challenge, a code naturalness feature based defect prediction method (CNDePor) is proposed. This method improves the language model by measuring code sequences bidirectionally and weighting the samples with quality information, so as to increase the defect discrimination of the cross-entropy (CE) type metrics obtained from the model. To address the shortcomings of coarse-grained defect prediction (e.g., difficulty in focusing on defect areas and the high cost of code reviews), a new fine-grained defect prediction problem, statement-oriented slice-level defect prediction, is studied. Four metrics are designed for this problem, and the effectiveness of these metrics and of CNDePor is verified on two types of security defect datasets. The experimental results show that CE-type metrics are learnable and contain relevant knowledge learned from the corpus by the language model; the improved CE metrics are significantly better than the original metrics and traditional size metrics; and CNDePor has significant advantages over traditional defect prediction methods and an existing method based on code naturalness, with comparable performance to and stronger interpretability than a state-of-the-art method based on deep learning.
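A minimal sketch of a cross-entropy style naturalness metric is given below: the average negative log-probability of a token sequence under a language model, averaged over forward and backward directions in the spirit of the bidirectional measurement mentioned above. The per-token probabilities are made-up stand-ins for a trained model's outputs, and CNDePor's quality-based sample weighting is not shown.

```python
# Cross-entropy of a token sequence given per-token language-model probabilities.
import math

def cross_entropy(token_probs):
    """token_probs[i] = P(token_i | context) as predicted by a language model."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for one code fragment, forward and backward.
forward_probs = [0.60, 0.30, 0.05, 0.40]
backward_probs = [0.55, 0.35, 0.10, 0.45]

ce_bidirectional = 0.5 * (cross_entropy(forward_probs) + cross_entropy(backward_probs))
print(f"bidirectional CE = {ce_bidirectional:.3f} bits/token")
# Higher CE means the code looks less "natural" to the model, which naturalness-based
# defect predictors use as a defect-discriminating signal.
```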
LI Mei, GAO Qing, MA Sen, ZHANG Shi-Kun, HU Wen-Hui, ZHANG Xing-Ming
2021, 32(7):2242-2259. DOI: 10.13328/j.cnki.jos.006271
Abstract:Code similarity detection is one of the basic tasks in software engineering. It plays an effective and fundamental role in plagiarism detection, software license violation detection, software reuse analysis, and vulnerability discovery. With the popularization of open source software, open source code has been applied frequently in many areas, bringing new challenges to traditional code similarity detection methods. Existing detection methods based on lexical, grammatical, and semantic analysis suffer from problems such as high computational complexity, dependence on analysis tools, high resource consumption, poor portability, and a large number of comparison candidates. The simhash-based code similarity detection algorithm reduces a code file to a fingerprint, enabling fast near-duplicate file retrieval on large datasets and controlling the similarity of matched results through a Hamming distance threshold. This study verifies the existing line-granularity simhash algorithm through experiments and discovers a line coverage problem on large-scale datasets. Inspired by the idea of the TF-IDF algorithm, a language-specific line-filtering optimization method is proposed to deal with it: the line sequences of code files are filtered through per-language line filters to eliminate the impact of lines that appear frequently but carry little semantic information. Through a series of comparative experiments, this study verifies that the enhanced method consistently achieves high precision with the Hamming distance threshold set from 0 to 8. Compared to the method before enhancement, the proposed method improves precision by 98.6% and 52.2% on two different datasets with the threshold set to 8. On a large-scale code database built from 386 486 112 files in 1.3 million open source projects, it is verified that the proposed method can, while maintaining a high precision of 97%, efficiently detect similar files at an average speed of 0.43s per file.
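The sketch below shows a generic line-granularity simhash with a Hamming-distance check; the tiny noise-line set stands in for the paper's language-specific, TF-IDF-inspired filters, and the hash width, threshold usage, and example files are illustrative assumptions rather than the paper's optimized system.

```python
# Generic simhash over code lines with a Hamming-distance similarity check.
import hashlib

BITS = 64

def line_hash(line: str) -> int:
    """64-bit hash of one line (md5 truncated, for illustration only)."""
    return int.from_bytes(hashlib.md5(line.encode()).digest()[:8], "big")

def simhash(lines):
    """Aggregate per-line hashes bit-wise into a single 64-bit fingerprint."""
    v = [0] * BITS
    for line in lines:
        h = line_hash(line)
        for i in range(BITS):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Toy filter for high-frequency, low-information lines (stand-in for per-language filters).
NOISE = {"", "{", "}", "return 0;"}
def filtered(lines):
    return [l.strip() for l in lines if l.strip() not in NOISE]

file_a = ["int add(int a, int b) {", "return a + b;", "}"]
file_b = ["int add(int x, int y) {", "return x + y;", "}"]
dist = hamming(simhash(filtered(file_a)), simhash(filtered(file_b)))
print(f"Hamming distance = {dist}; similar = {dist <= 8}")  # threshold 8 as in the abstract
```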
2021, 32(7):2260-2286. DOI: 10.13328/j.cnki.jos.006309
Abstract:Blockchain technology is a new distributed infrastructure and computation paradigm that generates, stores, manipulates, and validates data through chain structures, consensus algorithms, and smart contracts. This new trust mechanism promotes the transformation of the Internet from an Internet of information to an Internet of value. Since data in a blockchain is stored and verified by means of public transaction records and multi-peer consensus confirmation, transaction privacy protection in such systems faces great challenges. This study first analyzes the characteristics of the blockchain transaction model and its differences from traditional centralized systems in identity authentication, data storage, and transaction confirmation, and describes the main contents, key issues, and security challenges of identity management in blockchains. Secondly, the different implementation technologies for identity management and privacy protection in current mainstream blockchain platforms are analyzed from three aspects, namely identity identification, identity authentication, and identity hiding. Finally, the shortcomings of existing blockchain identity management technology are summarized and future research directions are proposed.