LU Shen-Ming , ZUO Zhi-Qiang , WANG Lin-Zhang
2020, 31(5):1243-1254. DOI: 10.13328/j.cnki.jos.005950
Abstract:After years of research, static program analysis has made great progress in many aspects. However, performing sophisticated program analysis over large-scale software systems (such as the Linux kernel and Hadoop) is still challenging due to its high complexity. To address the poor scalability of static analysis, and with the rise of multi-core computing architectures, researchers have proposed various parallel static analysis techniques. This paper first introduces the basic concepts, key techniques, and challenges of static analysis. Then the traditional optimization approaches are discussed, followed by studies on parallelizing static analysis in three categories—CPU parallelism, distributed implementation, and GPU implementation—as well as representative parallel static analysis tools. Finally, potential research trends in parallelizing static analysis are described.
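The function-level parallelism such techniques exploit can be illustrated with a toy sketch (not drawn from any surveyed tool; `analyze_function` and its assignment-collecting "analysis" are hypothetical stand-ins for a real intraprocedural pass):

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_function(name, statements):
    # Toy intraprocedural pass: collect variables that are assigned.
    assigned = set()
    for stmt in statements:
        if "=" in stmt:
            assigned.add(stmt.split("=")[0].strip())
    return name, assigned

def parallel_analyze(program, workers=4):
    # Function-level parallelism: each function's summary is computed
    # independently, then merged into a whole-program result.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda item: analyze_function(*item),
                           program.items())
    return dict(results)

program = {
    "main": ["x = 1", "y = x", "print(y)"],
    "helper": ["z = 0", "z = z + 1"],
}
summaries = parallel_analyze(program)
```

Each function summary is independent here, which is the embarrassingly parallel case; real parallel static analyses must additionally coordinate on shared facts such as call graphs and points-to sets, which is where much of the difficulty lies.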
WANG Zan , YAN Ming , LIU Shuang , CHEN Jun-Jie , ZHANG Dong-Di , WU Zhuo , CHEN Xiang
2020, 31(5):1255-1275. DOI: 10.13328/j.cnki.jos.005951
Abstract:With the rapid development of deep neural networks, the emergence of big data, and the advancement of computational power, deep neural networks (DNN) have been widely applied in various safety-critical domains such as autonomous driving, automatic face recognition, and aircraft collision avoidance systems. Traditional software systems are implemented by developers with carefully designed programming logic and tested with test cases designed according to specific coverage criteria. Unlike traditional software development, DNN defines a data-driven programming paradigm, i.e., developers only design the structure of networks, and the inner logic is reflected by weights learned during training. Traditional software testing methods therefore cannot be applied to DNN directly. Driven by this emerging demand, more and more research works have focused on DNN testing, including proposing new testing evaluation criteria, test case generation, etc. This study provides a thorough survey on DNN testing, summarizing 92 works from related fields. These works are systematically reviewed from three perspectives, i.e., DNN testing metrics, test input generation, and test oracles. Existing achievements are introduced in terms of image processing, speech processing, and natural language processing. The datasets and tools used in DNN testing are surveyed, and finally, thoughts on potential future research directions for DNN testing are summarized, which, hopefully, will provide a reference for researchers interested in related directions.
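One widely studied DNN testing metric covered by such surveys, neuron coverage, is simple enough to sketch. The threshold value and list-of-lists activation format below are illustrative assumptions, not the definition used by any specific surveyed work:

```python
def neuron_coverage(activations, threshold=0.5):
    """Fraction of neurons whose activation exceeds `threshold`
    on at least one test input. `activations` holds one activation
    vector per input (one value per neuron)."""
    if not activations:
        return 0.0
    n = len(activations[0])
    covered = [False] * n
    for vec in activations:
        for i, a in enumerate(vec):
            if a > threshold:
                covered[i] = True
    return sum(covered) / n

# Two test inputs over a 4-neuron layer: neurons 0, 2, 3 fire at least once.
cov = neuron_coverage([[0.9, 0.1, 0.6, 0.2], [0.1, 0.2, 0.3, 0.8]])
```

Test input generation methods then search for inputs that raise such a coverage measure, analogously to how traditional test generation targets code coverage.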
WANG Shu-Dong , YIN Wen-Jing , DONG Yu-Kun , ZHANG Li , LIU Hao
2020, 31(5):1276-1293. DOI: 10.13328/j.cnki.jos.005949
Abstract:Sequential storage structures, such as arrays and contiguous memory blocks dynamically allocated by malloc, are widely used in C programs, but traditional data flow analysis fails to adequately describe their structure and operations. When pointers are used to access sequential storage structures in C programs, existing data flow analysis methods basically pay attention only to points-to information and do not consider the numerical property of offsets, nor the safety problems caused by out-of-bounds offsets, which leads to inaccurate analysis of sequential storage structures. To improve the precision of analyzing sequential storage structures, an abstract memory model SeqMM is proposed, which can effectively describe both points-to relationships and offsets. Three kinds of operations are then summarized: pointer-related transfer operations, predicate operations, and loop operations traversing sequential storage structures; when analyzing these operations, whether the index is out of bounds is checked to ensure the safety of operation execution. After that, mapping rules are introduced for parameters referencing sequential storage structures and the corresponding arguments. Finally, a memory leak detection algorithm is proposed and applied to detect memory leaks in five open-source projects. The experimental results indicate that SeqMM can effectively describe sequential storage structures and the various operations on them in C programs, and that the results of its data flow analysis can be used to detect memory leaks while striking a reasonable balance between accuracy and efficiency.
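A drastically simplified illustration of what an abstract memory model of this kind tracks — a region, a length, and a numeric offset, with an out-of-bounds check before any dereference — might look as follows (the class and its fields are hypothetical and far simpler than SeqMM itself):

```python
class SeqPtr:
    """Toy abstraction of a pointer into a sequential storage
    structure: a region identifier, the region's length, and a
    numeric offset into it."""
    def __init__(self, region, length, offset=0):
        self.region, self.length, self.offset = region, length, offset

    def shift(self, k):
        # Pointer-arithmetic transfer operation: p + k.
        return SeqPtr(self.region, self.length, self.offset + k)

    def deref_safe(self):
        # Bounds check performed before any dereference is analyzed.
        return 0 <= self.offset < self.length

p = SeqPtr("buf", length=4)   # e.g., int buf[4]; int *p = buf;
q = p.shift(3)                # q = p + 3: last valid element
r = p.shift(4)                # r = p + 4: one past the end
```

Tracking the offset alongside the points-to target is what lets the analysis distinguish the safe access through `q` from the out-of-bounds access through `r`, which pure points-to analysis cannot do.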
ZHENG Wei , CHEN Jun-Zheng , WU Xiao-Xue , CHEN Xiang , XIA Xin
2020, 31(5):1294-1313. DOI: 10.13328/j.cnki.jos.005954
Abstract:The occurrence of software security issues can cause serious consequences in most cases, and early detection of security issues is one of the key measures to prevent security incidents. Security bug report (SBR) prediction can help developers identify hidden security issues in the bug tracking system and fix them as early as possible. However, since the number of security bug reports in real software projects is small and their features are complex (i.e., there are many types of security vulnerabilities, each with different features), manually extracting security features is relatively difficult and leads to low prediction accuracy with traditional machine learning classification algorithms. To solve this problem, a deep-learning-based security bug report prediction method is proposed. The deep text mining models TextCNN and TextRNN are used to construct security bug report prediction models, and the Skip-Gram method is used to construct a word embedding matrix for extracting the textual features of security bug reports. The constructed models have been empirically evaluated on five classical security bug report datasets of different scales. The results show that the deep learning models are superior to traditional machine learning classification algorithms in 80% of the experimental cases, and that the constructed models improve the F1-score by 0.258 on average and by 0.535 at most. Furthermore, different re-sampling strategies are applied to deal with the class imbalance problem in the gathered SBR prediction datasets, and the experimental results are discussed.
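The Skip-Gram method mentioned above trains embeddings from (target, context) word pairs; the pair-enumeration step can be sketched in a few lines (a generic illustration of the objective's training pairs, not the paper's implementation):

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate the (target, context) training pairs the Skip-Gram
    objective uses: every token within `window` positions of the
    target becomes one of its contexts."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Tokens from a hypothetical bug-report title.
pairs = skipgram_pairs("buffer overflow in parser".split(), window=1)
```

The embedding matrix is then learned by training a classifier to predict each context word from its target; the resulting word vectors feed the TextCNN/TextRNN models.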
WANG Wei-Wei , LI Yi-Chao , ZHAO Rui-Lian , LI Zheng
2020, 31(5):1314-1331. DOI: 10.13328/j.cnki.jos.005955
Abstract:Parallelizing test case generation for Web applications is an effective way to improve the efficiency of test generation. Due to the front-end/back-end separation and event-driven characteristics of Web applications, traditional parallelization technology is difficult to apply to their automatic test case generation, so parallelizing test case generation for Web applications becomes a challenging task. In this study, parallel computing is introduced into GA-based test case generation for Web applications. By means of a thread pool and scheduling logic, the management of multiple browser processes, and the acquisition of the path coverage of back-end code, the parallel execution of individuals on multiple browsers and the parallel computation of fitness values based on back-end path coverage are realized, making test case generation more efficient. The experimental results show that, compared with serial GA-based test case generation, the proposed parallelization method can make full use of system resources and greatly improve the efficiency of test case generation for Web applications.
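The parallel fitness computation described above can be sketched with a thread pool; the toy `fitness` below just sums a bit vector, standing in for driving a browser and measuring back-end path coverage:

```python
from concurrent.futures import ThreadPoolExecutor

def fitness(individual):
    # Placeholder: in the paper's setting this would execute the
    # individual's test input in a browser and score the back-end
    # paths it covers; here it is a toy sum over covered paths.
    return sum(individual)

def evaluate_population(population, workers=4):
    # Each individual's fitness is computed concurrently, mirroring
    # the parallel execution of test cases on multiple browsers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness, population))

scores = evaluate_population([[1, 0, 1], [0, 0, 0], [1, 1, 1]])
```

Because fitness evaluation (browser execution) dominates GA runtime, parallelizing this step is where most of the speed-up comes from.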
JI Ye , WEI Heng-Feng , HUANG Yu , Lü Jian
2020, 31(5):1332-1352. DOI: 10.13328/j.cnki.jos.005956
Abstract:Conflict-free replicated data types (CRDT) are replicated data types that encapsulate mechanisms for resolving concurrent conflicts. They guarantee strong eventual consistency among replicas in distributed systems, which requires that replicas that have executed the same set of updates be in the same state. However, CRDT protocols are subtle, and it is difficult to ensure their correctness. This study leverages model checking to verify the correctness of CRDT protocols. Specifically, a reusable framework is proposed for modelling and verifying CRDT protocols. The framework consists of four layers, i.e., the communication layer, the interface layer, the protocol layer, and the specification layer. The communication layer models the communication among replicas and implements a variety of communication networks. The interface layer provides a uniform interface for existing CRDT protocols, including both operation-based and state-based protocols. In the protocol layer, users can choose the appropriate underlying communication network required by a specific protocol. The specification layer specifies strong eventual consistency and the eventual visibility property (i.e., all updates are eventually delivered to all replicas) that every CRDT protocol should satisfy. The framework is implemented in the formal specification language TLA+. Taking the Add-Wins Set as an example, it is also demonstrated how to model CRDT protocols in this framework and how to verify their correctness with the model checking tool TLC.
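The Add-Wins Set used as the running example resolves concurrent add/remove conflicts in favor of the add. A toy state-based version conveys the idea (the explicit tombstone set is a simplification of real tag-based OR-Set designs, and this Python sketch is unrelated to the paper's TLA+ models):

```python
import uuid

class AWSet:
    """State-based add-wins set: each add is tagged with a unique id;
    a remove deletes only the tags it has observed, so a concurrent
    add (with a fresh tag) wins. Merge is a join of states."""
    def __init__(self):
        self.entries = set()     # (element, tag) pairs
        self.tombstones = set()  # tags removed so far

    def add(self, x):
        self.entries.add((x, uuid.uuid4().hex))

    def remove(self, x):
        self.tombstones |= {t for (e, t) in self.entries if e == x}

    def merge(self, other):
        self.entries |= other.entries
        self.tombstones |= other.tombstones

    def value(self):
        return {e for (e, t) in self.entries if t not in self.tombstones}

a, b = AWSet(), AWSet()
a.add("x")
b.merge(a)     # replicate the add to b
b.remove("x")  # b removes the tag it observed...
a.add("x")     # ...while a concurrently re-adds "x" with a fresh tag
a.merge(b)
b.merge(a)
```

After the merges, both replicas report {"x"}: the concurrent re-add wins over the remove, and replicas that have seen the same updates agree — the strong eventual consistency property the specification layer checks.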
GUO Jian , DING Ji-Zheng , ZHU Xiao-Ran
2020, 31(5):1353-1373. DOI: 10.13328/j.cnki.jos.005957
Abstract:How to construct a trustworthy software system has become an important research area in academia and industry. As a basic component of a software system, the operating system kernel is key to constructing trustworthy software systems. To ensure the safety and reliability of operating system kernels, this study introduces formal methods into OS kernel verification and proposes an automatic verification framework with the following features. (1) Pure C programs and mixed-language programs (e.g., programs written in both C and assembly) are verified separately. (2) For mixed-language programs, an abstract model is established for the assembly code and then glued to the C program to form a verification model accepted by a C verification tool. (3) Properties are extracted from the OS specification and verified automatically with a verification tool. (4) The framework is not limited to a specific hardware architecture. The framework is successfully applied to verify the commercial real-time operating system kernel μC/OS-II on two different hardware platforms. The results show that the verification framework is highly reusable across the two platforms, up to 83.8%, although the abstract model needs to be reconstructed for different hardware. During verification of the kernel on the two platforms, 10 to 12 defects are found respectively, including two hardware-related defects discovered on the ARM platform. The method is thus fairly general for analyzing and verifying the same operating system on different hardware architectures.
LIU Xiao-Shan , YUAN Zheng-Heng , CHEN Xiao-Hong , CHEN Ming-Song , LIU Jing , ZHOU Ting-Liang
2020, 31(5):1374-1391. DOI: 10.13328/j.cnki.jos.005952
Abstract:The rail transit zone controller (ZC) is a core sub-system of the communication-based train control system, which is the mainstream choice of China's rail transit systems. Its stringent safety requirements make formal verification of safety requirements very important. However, the complexity of the ZC itself and the required rail transit domain knowledge make it difficult to apply formal methods to the verification of safety requirements. To solve these problems, this study proposes an automated verification approach for safety requirements. It models and decomposes safety requirements using the semi-formal Problem Frames approach, automatically generates the verification model and verification properties, implements the verification problem in Scade automatically, and performs formal verification with the model checker Design Verifier. Finally, CAL_EOA, a sub-problem of the zone controller taken from a real case, is studied. The experimental results show the feasibility and effectiveness of the proposed approach: the safety requirements are automatically decomposed and compositionally verified, greatly improving verification efficiency.
DONG Pan , DING Yan , JIANG Zhe , HUANG Chen-Lin , FAN Guan-Nan
2020, 31(5):1392-1405. DOI: 10.13328/j.cnki.jos.005953
Abstract:Trusted computing is developing towards next-generation active protection and monitoring, which requires that the TPM/TCM have the ability to actively measure and intervene in the host system. Unfortunately, the architecture and runtime mechanisms of traditional TPM/TCM cannot satisfy these requirements. Trusted execution environment (TEE) technology provides a trusted execution environment and the ability to access and control host resources at run time, which lays a foundation for the next-generation TPM/TCM. However, three main challenges remain: software architecture, secure storage, and secure communication. This study proposes the design and implementation of TZTCM (TrustZone-based trusted cryptography module), a TPM/TCM scheme based on ARM TrustZone. TZTCM adopts several key mechanisms to overcome the three challenges. Firstly, the non-uniform core assigned and asynchronous (NUCAA) system architecture is designed to enable the independent and active operation of TZTCM. Secondly, a secure storage mechanism based on physical unclonable functions (PUF) is designed to guarantee the privacy of data in TZTCM. Thirdly, a secure communication mechanism based on universally unique identifiers (UUID) is designed to protect the channel between the host and TZTCM from malicious activities. TZTCM thus provides a prototype system for the next-generation TPM/TCM. Theoretical analysis shows that TZTCM provides the same security as a hardware TPM/TCM chip. An instance of TZTCM is implemented on an ARM development board (Hikey-board 620), and runtime tests show that TZTCM achieves higher performance in cryptographic computation than traditional TPMs. Compared with current TPMs/TCMs, TZTCM has obvious advantages in many respects: active safeguard capability, requiring only software/firmware, easy update, and low power consumption.
2020, 31(5):1406-1434. DOI: 10.13328/j.cnki.jos.005967
Abstract:Blockchain-based distributed ledgers aim to provide consistent and tamper-resistant transaction records by integrating security technologies such as asymmetric cryptosystems, P2P networks, consensus algorithms, and smart contracts. However, as each node in a blockchain system shares a copy of the public ledger, this data sharing mechanism also introduces vulnerabilities that hackers can exploit to attack private information. Privacy protection in blockchain systems has thus gained wide attention from researchers, and various techniques have been proposed to protect users' identity, address, and transaction information from security threats. This study investigates blockchain privacy threats and makes a comprehensive survey of state-of-the-art privacy protection technologies, categorized into three mechanisms: address confusion, information hiding, and channel isolation. The paper introduces the principles, models, and various implementations of each mechanism, and finally discusses the challenges of performance and scalability in practice as well as directions for future technology advancement.
YANG Bo , ZHANG Neng , LI Shan-Ping , XIA Xin
2020, 31(5):1435-1453. DOI: 10.13328/j.cnki.jos.005966
Abstract:Code completion is one of the crucial functions of automated software development and an essential component of most modern integrated development environments and source code editors. Code completion provides instant predictions such as class names, method names, and keywords, and assists developers in coding, which intuitively improves the efficiency of software development. In recent years, with the expanding scale of source code and data in the open-source software community and outstanding progress in artificial intelligence technology, automated software development technology has advanced considerably. Intelligent code completion builds a language model for source code, learns features from an existing code corpus, and, based on the context features around the position to be completed, retrieves the most similar matches in the corpus for recommendation and prediction. Compared with traditional code completion, intelligent code completion has become one of the hot topics in software engineering owing to its high accuracy, multiple completion forms, and iterative learning ability. Researchers have conducted a series of studies on intelligent code completion. According to how these methods represent and utilize source code information, they can be divided into two research directions: programming-language representations and statistical-language representations. Programming-language representations fall into three types: token sequences, abstract syntax trees, and control/data flow graphs; statistical-language representations fall into two types: n-gram models and neural network models. This paper surveys the research progress of code completion methods in recent years from the perspective of code representation. The main contents include: (1) expounding and classifying existing intelligent code completion methods according to code representation; (2) summarizing the experimental verification methods and performance evaluation indicators used in model evaluation; (3) summarizing the critical issues of intelligent code completion; (4) looking forward to the future development of intelligent code completion.
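The n-gram direction mentioned above predicts the next token from the preceding n-1 tokens; a minimal sketch over a token sequence (illustrative only, not any surveyed system):

```python
from collections import Counter, defaultdict

def train_ngram(token_stream, n=3):
    """Count n-gram statistics over a source-token sequence —
    a minimal stand-in for an n-gram completion model."""
    model = defaultdict(Counter)
    for i in range(len(token_stream) - n + 1):
        ctx = tuple(token_stream[i:i + n - 1])
        model[ctx][token_stream[i + n - 1]] += 1
    return model

def complete(model, context):
    # Suggest the most frequent token observed after `context`.
    ctx = tuple(context)
    if ctx not in model:
        return None
    return model[ctx].most_common(1)[0][0]

tokens = "for i in range ( n ) : for j in range ( m ) :".split()
model = train_ngram(tokens, n=3)
```

For example, after the context `in range` the model suggests `(`, the token that always followed it in the corpus; neural models generalize this idea by replacing counts with learned representations.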
LU Xi-Dong , DUAN Zhe-Min , QIAN Ye-Kui , ZHOU Wei
2020, 31(5):1454-1464. DOI: 10.13328/j.cnki.jos.005660
Abstract:Aiming at the insufficient accuracy of current static malicious code classification methods, this study maps malicious code into an uncompressed gray-scale image. The image is then transformed into an image of constant size, and the histogram of oriented gradients is used to extract its features. Finally, a malicious code classification method based on deep forest is proposed. Experiments on malicious code samples from different families verify the effectiveness of the proposed method, and the results are superior to the recently proposed SPAM-GIST method.
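The byte-to-pixel mapping behind the gray-scale image representation can be sketched as follows (the row width and zero-padding policy are illustrative choices, not the paper's exact parameters):

```python
def bytes_to_grayscale(data, width=16):
    """Map a raw byte string to a 2D gray-scale image: each byte
    becomes one pixel intensity (0-255), laid out in rows of fixed
    width, zero-padding the last row if needed."""
    pixels = list(data)
    if len(pixels) % width:
        pixels += [0] * (width - len(pixels) % width)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]

# A hypothetical 36-byte sample starting with the PE magic "MZ".
img = bytes_to_grayscale(b"\x4d\x5a\x90\x00" * 9, width=6)
```

Because code from the same family shares byte-level patterns, these images exhibit visually similar textures, which gradient-histogram features can then capture for classification.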
WANG Jian-Xin , WANG Zi-Ya , TIAN Xuan
2020, 31(5):1465-1496. DOI: 10.13328/j.cnki.jos.005988
Abstract:Natural scene text detection and recognition is important for obtaining information from scenes, and it can be improved with the help of deep learning. In this study, deep-learning-based methods for text detection and recognition in natural scenes are classified, analyzed, and summarized. Firstly, the research background of natural scene text detection and recognition and the main technical research routes are discussed. Then, according to the different phases of natural scene text information processing, text detection models, text recognition models, and end-to-end text recognition models are introduced, and the basic ideas, advantages, and disadvantages of each method are discussed and analyzed. Furthermore, the common standard datasets, performance evaluation metrics, and functions are enumerated, and the experimental results of different models are compared and analyzed. Finally, the challenges and development trends of deep-learning-based text detection and recognition in natural scenes are summarized.
2020, 31(5):1497-1510. DOI: 10.13328/j.cnki.jos.005673
Abstract:Previous multi-label learning requires that all, or at least a subset, of the ground-truth labels are given for each training example. This study investigates how to utilize the wisdom of crowds for multi-label tasks, where imperfect annotations from crowds, rather than high-cost ground-truth labels, are collected for learning. The target is to infer the instances' ground-truth labels, and the key challenge lies in how to aggregate the noisy annotations. Previous crowdsourcing works on single-label problems ignore the correlation between labels, while multi-label works consider local label correlations whose estimation heavily depends on the annotations' quality and quantity. In contrast, this study proposes an approach that exploits the global low-rank structure of the crowds' annotations. Regarding the crowds' annotations for a multi-label task as a three-way tensor (instance, label, worker), the annotations are first preconditioned using low-rank tensor completion, which simultaneously corrects the observed noisy annotations and predicts the unobserved ones. After that, the preconditioned annotations are aggregated through heuristic methods; three aggregation methods that account for the crowds' annotation confidence from different perspectives are tested. Experimental results on real-world multi-label crowdsourcing datasets demonstrate the superiority of the proposed approach.
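The aggregation step after tensor completion can be as simple as a per-label majority vote over the worker dimension; the sketch below uses ±1 votes and is only a generic illustration of one heuristic, not the paper's three confidence-aware methods:

```python
def majority_aggregate(tensor):
    """Aggregate an (instance, label, worker) annotation tensor by
    per-label majority vote. Entries are +1 (label relevant) or
    -1 (label irrelevant); ties default to +1."""
    result = []
    for instance in tensor:            # instance -> list of labels
        labels = []
        for worker_votes in instance:  # label -> list of worker votes
            labels.append(1 if sum(worker_votes) >= 0 else -1)
        result.append(labels)
    return result

# Two instances, two labels, three workers each.
votes = [
    [[1, 1, -1], [-1, -1, -1]],
    [[-1, 1, 1], [1, -1, 1]],
]
agg = majority_aggregate(votes)
```

Preconditioning with low-rank completion matters precisely because such simple votes are only reliable once the noisy entries have been corrected and the missing ones filled in.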
LIU Zhao-Geng , LI Zhan-Shan , WANG Li , WANG Tao , YU Hai-Hong
2020, 31(5):1511-1524. DOI: 10.13328/j.cnki.jos.005654
Abstract:As an important data preprocessing method, feature selection can not only alleviate the curse of dimensionality but also improve the generalization ability of algorithms. A variety of methods have been applied to feature selection problems, among which evolutionary computation techniques have recently gained much attention and shown some success. Recent studies have shown that feature selection using the forest optimization algorithm (FSFOA) achieves good classification performance and dimensionality reduction. However, the randomness of its initialization phase and the manually set parameter of its global seeding phase limit the accuracy and dimensionality-reduction ability of the algorithm, which also suffers from an essential weakness in processing high-dimensional data. In this study, an initialization strategy is designed from the perspective of information gain ratio, the global seeding parameter is generated automatically using a simulated-annealing temperature-control function, a fitness function incorporating the dimensionality-reduction rate is given, a greedy algorithm is used to select the best tree from the high-quality forest obtained, and an enhanced feature selection algorithm using the forest optimization algorithm (EFSFOA) is proposed. In addition, to handle high-dimensional data, an ensemble feature selection framework suitable for EFSFOA is formed so that it can effectively deal with high-dimensional feature selection problems. Contrast experiments verify that EFSFOA significantly improves classification accuracy and dimensionality-reduction rate compared with FSFOA, and raises the high-dimensional data processing capability to 100 000 dimensions. Compared with other efficient evolutionary-computation feature selection approaches proposed in recent years, EFSFOA remains strongly competitive.
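The information gain ratio driving the proposed initialization strategy is a standard quantity: the information gain of a feature normalized by its split information. A minimal sketch for discrete features (illustrative, not the EFSFOA implementation):

```python
import math

def entropy(labels):
    # Shannon entropy of a label sequence, in bits.
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def gain_ratio(feature, labels):
    """Information gain of `feature` divided by its split
    information; higher values mark more informative features."""
    total = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    cond = sum(len(g) / total * entropy(g) for g in groups.values())
    split = entropy(feature)
    if split == 0:
        return 0.0
    return (entropy(labels) - cond) / split
```

Initializing trees toward features with high gain ratio replaces FSFOA's purely random start with one biased toward informative features, which is what the proposed strategy exploits.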
HU Cong , WU Xiao-Jun , SHU Zhen-Qiu , CHEN Su-Gen
2020, 31(5):1525-1535. DOI: 10.13328/j.cnki.jos.005680
Abstract:Ladder networks are not only an effective deep-learning-based feature extractor but can also be applied to semi-supervised learning. Deep learning has the advantage of approximating complicated functions and alleviating the optimization difficulty associated with deep models, but autoencoders and restricted Boltzmann machines ignore the manifold information of high-dimensional data and usually yield features that are not meaningful and are very difficult to use in subsequent tasks such as prediction and recognition. From the perspective of manifold learning, a novel deep representation method, Laplacian ladder networks (LLN), is proposed based on ladder networks (LN). During training, LLN reconstructs the noisy input and encoder layers, and adds graph Laplacian constraints to learn hierarchical representations, improving the robustness and discrimination of the system. Under the condition of limited labeled data, LLN fuses supervised and unsupervised learning to train in a semi-supervised manner. Experiments on the MNIST and CIFAR-10 datasets show that the proposed LLN achieves superior performance compared with LN and other semi-supervised methods, and that it is an effective semi-supervised method.
GENG Hai-Jun , SHI Xin-Gang , WANG Zhi-Liang , YIN Xia , HU Zhi-Guo
2020, 31(5):1536-1548. DOI: 10.13328/j.cnki.jos.005675
Abstract:Existing routing protection schemes face two problems: (1) the default path and the backup path contain a large number of common edges, as in ECMP and LFA; (2) to compute two paths with few common edges, the shortest path cannot be used in the network, as in the red-green tree method. To solve these two problems, this study first formulates the computation of default and backup paths as an integer programming model, and then proposes a heuristic method to solve it. Next, the forwarding algorithm is introduced. Finally, the proposed algorithm is evaluated through simulation and real experiments. The experiments show that the proposed algorithm not only has lower computational complexity, but also reduces the number of common edges between the default (shortest) paths and the backup paths, greatly improving network availability.
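The idea of keeping the default path shortest while steering the backup path away from its edges can be illustrated with a simple penalize-and-recompute heuristic (a generic sketch, not the paper's integer-programming formulation or its heuristic):

```python
import heapq

def dijkstra(adj, src, dst, weight):
    # Standard Dijkstra returning one shortest path as a node list.
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v in adj[u]:
            nd = d + weight(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1]

def backup_path(adj, src, dst, penalty=100):
    """Compute the default (shortest) path, then heavily penalize its
    edges and re-run Dijkstra so the backup path avoids them."""
    default = dijkstra(adj, src, dst, lambda u, v: 1)
    used = {frozenset(e) for e in zip(default, default[1:])}
    backup = dijkstra(adj, src, dst,
                      lambda u, v: penalty if frozenset((u, v)) in used else 1)
    return default, backup

# A four-node ring: a-b-d and a-c-d are edge-disjoint routes.
adj = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
default, backup = backup_path(adj, "a", "d")
```

On this ring the default path a-b-d and the backup a-c-d share no edges, so a single link failure on the default path leaves the backup usable.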
WU Zhi-Jun , LI Hong-Jun , LIU Liang , ZHANG Jing-An , YUE Meng , LEI Jin
2020, 31(5):1549-1562. DOI: 10.13328/j.cnki.jos.005658
Abstract:A low-rate denial of service (LDoS) attack can cause packet loss for legitimate users and reduce the performance of the transport system by periodically sending short bursts of packets. LDoS attack flows mix with legitimate traffic and are therefore hard to detect. This study designs a network-model-based LDoS attack classifier that uses a hidden semi-Markov model (HSMM) and deploys a decision indicator to detect LDoS attacks. In this method, the wavelet transform is exploited to compute the wavelet energy spectrum entropy of the network traffic, which is used as the input of the HSMM. The proposed detection method has been evaluated in NS-2 and on a test-bed, and the experimental results show that it achieves good performance with a detection rate of 96.81%.
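The wavelet energy spectrum entropy fed to the HSMM can be sketched with a Haar decomposition (the level count and the averaging-based Haar filters are illustrative simplifications, not the paper's exact transform):

```python
import math

def haar_level(signal):
    # One Haar decomposition level: pairwise averages (approximation)
    # and pairwise differences (detail).
    approx = [(signal[i] + signal[i + 1]) / 2
              for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / 2
              for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def wavelet_energy_entropy(signal, levels=3):
    """Shannon entropy of the per-level wavelet energy distribution
    of a traffic time series; periodic bursts concentrate energy in
    particular scales and change this entropy."""
    energies = []
    for _ in range(levels):
        signal, detail = haar_level(signal)
        energies.append(sum(d * d for d in detail))
    total = sum(energies) or 1.0
    probs = [e / total for e in energies if e > 0]
    return -sum(p * math.log2(p) for p in probs)
```

A constant traffic series has zero detail energy at every level (entropy 0), while bursty traffic spreads energy across scales, which is the signature the HSMM learns to separate from legitimate flows.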
SU Mang , WU Bin , FU An-Min , YU Yan , ZHANG Gong-Xuan
2020, 31(5):1563-1572. DOI: 10.13328/j.cnki.jos.005676
Abstract:More and more people select the cloud as an important tool for data storing, processing, and sharing. As a result, the data in the cloud grows rapidly and includes sensitive and private information, so managing the authorization of hosted data in the cloud is vital for the confidentiality and effectiveness of access control. This study proposes a proxy re-encryption based assured update scheme of authorization for cloud data (PAUA) to solve the above-mentioned problem. Firstly, the aims and assumptions of PAUA are given. Secondly, the system model and algorithms are presented. Finally, PAUA is compared with existing schemes. PAUA reduces the encryption and decryption work of individual users, while ensuring permission updates by dividing the parameters of re-encryption key generation.
HE Ke-Lei , SHI Ying-Huan , GAO Yang
2020, 31(5):1573-1584. DOI: 10.13328/j.cnki.jos.005991
Abstract:Conventional multi-task deep networks typically share most of the layers (i.e., the layers for feature representation) across all tasks, which may limit their data-fitting ability, as the specificities of different tasks are inevitably ignored. This study proposes a hierarchically-fused multi-task fully-convolutional network, called HFFCN, to tackle the challenging task of prostate segmentation in CT images. Specifically, prostate segmentation is formulated as a multi-task learning framework that includes a main task segmenting the prostate and a supplementary task regressing the prostate boundary. The second task serves to accurately delineate the boundary of the prostate, which is very unclear in CT images. Accordingly, the HFFCN uses a two-branch structure consisting of a shared encoding path and two complementary decoding paths. In contrast to conventional multi-task networks, an information sharing (IS) module is also proposed to communicate between the two decoding branches at each level, endowing the HFFCN with the ability to hierarchically learn complementary feature representations for different tasks while preserving the specificity of the features learned for each task. The HFFCN is comprehensively evaluated on a large CT image dataset of 313 images acquired from 313 patients. The experimental results demonstrate that the proposed HFFCN outperforms both state-of-the-art segmentation methods and conventional multi-task learning methods.