LIU Bin-Bin, DONG Wei, WANG Ji
2018, 29(8):2180-2197. DOI: 10.13328/j.cnki.jos.005529
Abstract: The rapid development of the Internet, machine learning, and artificial intelligence, together with the emergence of a large number of open-source software projects and communities, has brought new opportunities and challenges to software engineering. Billions of lines of code are available on the Internet; this code, especially the high-quality and widely used portion, embodies all kinds of knowledge and has given rise to the idea of intelligent software development, which seeks to make full use of the code resources, knowledge, and collective intelligence on the Internet to effectively improve the efficiency and quality of software development. Its key technologies are program search and construction, which are of great theoretical and practical value. Current research in this area focuses mainly on code search, program synthesis, code recommendation and completion, defect detection, code style improvement, and automatic program repair. This paper surveys the main research efforts from these aspects, reviews the specific theories and techniques in detail, summarizes the challenges faced by current research, and proposes several directions for future work.
WU Jun-Wei, SHEN Li-Wei, GUO Wu-Nan, WANG Chao, ZHAO Wen-Yun
2018, 29(8):2198-2209. DOI: 10.13328/j.cnki.jos.005527
Abstract: Android developers must accumulate experience to become proficient at designing Android interfaces and behaviors. Code recommendation has long been one of the focuses of data-driven software development. In this context, this paper proposes a method for extracting and retrieving UI interaction patterns of Android applications. The method can retrieve and recommend UI-related code, reducing developers' effort in selecting, using, and learning from existing Android applications. The UI interaction pattern of an activity represents the interface composition and the interaction behavior of that activity. Taking the pattern as the target, the method extracts the UI interaction pattern of each activity from a set of open-source Android applications, and then allows users to retrieve the related design details of activities by constructing queries. The method is implemented as a tool chain that automates extraction and retrieval, and its accuracy and effectiveness are verified through two working examples.
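As a rough illustration of what a UI interaction pattern might capture, the sketch below parses an Android layout XML file and collects the widget classes (interface composition) plus declared android:onClick handlers (interaction behavior) into a simple signature. This is an illustrative toy, not the paper's tool chain; the layout path and the reliance on the android:onClick attribute are assumptions.

```python
# Toy extraction of a UI "pattern signature" from an Android layout XML:
# the set of widget classes plus declared event handlers.
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def extract_ui_pattern(layout_path):
    tree = ET.parse(layout_path)
    widgets, handlers = [], []
    for elem in tree.iter():
        widgets.append(elem.tag)                     # interface composition
        on_click = elem.get(ANDROID_NS + "onClick")  # interaction behavior
        if on_click:
            handlers.append((elem.tag, on_click))
    return {"widgets": sorted(set(widgets)), "handlers": handlers}

# Hypothetical layout path for demonstration only.
print(extract_ui_pattern("res/layout/activity_main.xml"))
```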
ZHU Zi-Xiao, ZOU Yan-Zhen, HUA Chen-Yan, SHEN Qi, ZHAO Jun-Feng
2018, 29(8):2210-2225. DOI: 10.13328/j.cnki.jos.005533
Abstract: Functional specification documents are very important for developers who want to understand and reuse unfamiliar software libraries. Because of the high cost in human effort and time, much software does not provide official functional documentation. However, the communication records produced during software development contain valuable information about software functions and usage. This paper proposes an approach that automatically mines and organizes functional features of open-source software from StackOverflow data. Describing functional features as verb phrases, the approach generates a hierarchical list of software functional features to supplement the official documentation. In an experimental evaluation on several real-world subjects, the automatically generated documents covered 97.6% of the frequently used functional features in the official documents. The approach can also be adapted to other types of software communication records and applied to software in different domains.
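The core extraction idea, describing functional features as verb phrases, can be sketched with an off-the-shelf dependency parser. The snippet below pulls verb plus direct-object phrases from a sentence using spaCy; the sample sentence is invented, and the paper's hierarchy construction and StackOverflow-specific processing are omitted.

```python
# Minimal verb-phrase mining via spaCy's dependency parse:
# collect (verb lemma, direct-object subtree) pairs as candidate features.
import spacy

nlp = spacy.load("en_core_web_sm")

def verb_phrases(text):
    doc = nlp(text)
    phrases = []
    for token in doc:
        if token.pos_ == "VERB":
            for obj in (c for c in token.children if c.dep_ == "dobj"):
                span = doc[obj.left_edge.i : obj.right_edge.i + 1]
                phrases.append(f"{token.lemma_} {span.text}")
    return phrases

print(verb_phrases("You can use this library to parse JSON files and send HTTP requests."))
# -> e.g. ['use this library', 'parse JSON files', 'send HTTP requests'] (model-dependent)
```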
HUANG Yuan, JIA Nan, ZHOU Qiang, CHEN Xiang-Ping, XIONG Ying-Fei, LUO Xiao-Nan
2018, 29(8):2226-2242. DOI: 10.13328/j.cnki.jos.005528
Abstract: Code comments are important for helping developers review and comprehend source code. A good commenting strategy should cover the core code snippets of a software system without introducing unintended trivial comments. In current practice, however, there are no rigorous specifications to guide developers' commenting decisions: commenting has become an important yet difficult decision that depends mostly on a developer's personal experience. To reduce the effort of making commenting decisions, this paper investigates a unified commenting rule from a large number of commenting instances and proposes a method, CommentAdviser, to guide developers in placing comments in source code. Since commenting is closely related to the context of the source code itself, the method identifies this key factor for deciding where to comment and extracts it as structural context features and semantic context features. Machine learning techniques are then applied to identify possible commenting locations in source code. CommentAdviser is evaluated on 10 data sets from GitHub, and the experimental results, together with a user study, demonstrate its feasibility and effectiveness.
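A minimal sketch of the learning step: represent each candidate location by a few structural-context features and train a classifier on locations that developers did or did not comment. The feature set and data below are toy assumptions, not CommentAdviser's actual features.

```python
# Toy "where to comment" classifier over structural-context features.
from sklearn.ensemble import RandomForestClassifier

# Features per location: [nesting_depth, is_loop_header, is_method_start,
# statement_count_in_block]; label 1 = developers commented here.
X = [[1, 0, 1, 12], [3, 1, 0, 6], [1, 0, 0, 2], [2, 1, 0, 9], [1, 0, 0, 1]]
y = [1, 1, 0, 1, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[2, 1, 0, 8]]))  # suggest whether this location needs a comment
```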
2018, 29(8):2243-2257. DOI: 10.13328/j.cnki.jos.005531
Abstract: To detect method calls with incorrect arguments in software systems, a static anomaly detection approach (ANiaD) based on association analysis and N-gram models is proposed. Based on massive open-source code, an association analysis model is built to mine strong association rules among arguments, and an N-gram model is then trained for method calls whose arguments exhibit strong association rules. Using the trained N-gram model, the probability of a given method call statement is calculated, and low-probability calls are reported as potential bugs. The approach is evaluated on 10 open-source Java projects. The results show that its accuracy is about 43.40%, significantly higher than that of the similarity-based approach (25%).
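The N-gram scoring step can be sketched as follows: estimate smoothed bigram probabilities over the argument sequences observed for one method across a corpus, then flag calls whose sequence scores low. The corpus below is invented, and the association-rule mining stage is omitted.

```python
# Bigram model over argument token sequences for a single method;
# low-probability call sites are reported as potential bugs.
from collections import defaultdict

calls = [["buf", "0", "len"], ["buf", "0", "n"], ["data", "0", "len"],
         ["buf", "0", "len"], ["buf", "len", "0"]]  # last one looks swapped

bigram = defaultdict(lambda: defaultdict(int))
for seq in calls:
    for a, b in zip(["<s>"] + seq, seq + ["</s>"]):
        bigram[a][b] += 1

def prob(seq):
    p = 1.0
    for a, b in zip(["<s>"] + seq, seq + ["</s>"]):
        total = sum(bigram[a].values())
        p *= (bigram[a][b] + 1) / (total + 20)  # add-one smoothing, |V| ~ 20 assumed
    return p

print(prob(["buf", "0", "len"]), prob(["buf", "len", "0"]))  # low score => report
```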
YIN Gang, WANG Tao, LIU Bing-Xun, ZHOU Ming-Hui, YU Yue, LI Zhi-Xing, OUYANG Jian-Quan, WANG Huai-Min
2018, 29(8):2258-2271. DOI: 10.13328/j.cnki.jos.005524
Abstract: The crowd-based software production model of the global open-source software ecosystem is rapidly becoming a new paradigm for improving software productivity, and it affects many stages of software development and application. Crowd-based software production generates large amounts of software data, continuously expands its collaboration scope, and greatly simplifies project management. These globalized features pose many challenges to crowd-based software production in software reuse, collaborative development, and knowledge management, which urgently require new theories and supporting tools. This paper first characterizes the distribution, basic process, and data forms of crowd-based software production activities. It then analyzes data mining research on software communities from three core aspects: software reuse, collaborative development, and knowledge management. Finally, it summarizes the open problems and future trends in this field.
WU Zhe-Fu, ZHU Tian-Tong, XUAN Qi, YU Yue
2018, 29(8):2272-2282. DOI: 10.13328/j.cnki.jos.005521
Abstract: How to objectively evaluate developers' contributions and distinguish core developers from peripheral developers in open-source software is an important research question. Based on an analysis of 9 Apache projects, a contribution allocation algorithm over project files is designed to quantify each developer's contribution to a project, which also helps to distinguish core developers from peripheral developers effectively. The feasibility and accuracy of the proposed algorithm are verified by checking it against the official developer lists and by comparing it with traditional evaluation schemes in terms of similarity to the real lists. Finally, a support vector machine classification model is built, and the accuracy of developer classification is further improved by incorporating the key factors that affect developer roles.
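A minimal sketch of the final classification step, assuming contribution-related features such as the allocated contribution score, commit count, and files touched (illustrative choices, not necessarily the paper's exact factors):

```python
# Toy SVM separating core from peripheral developers.
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# [allocated_contribution, commits, files_touched]; label 1 = core developer
X = [[0.42, 310, 120], [0.30, 190, 95], [0.02, 12, 6], [0.01, 8, 3], [0.15, 90, 40]]
y = [1, 1, 0, 0, 1]

model = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)
print(model.predict([[0.05, 25, 10]]))  # classify an unseen developer
```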
TAN Xin, LIN Ze-Yan, ZHANG Yu-Xia, ZHOU Ming-Hui
2018, 29(8):2283-2293. DOI: 10.13328/j.cnki.jos.005522
Abstract: In software development, a code file is often developed and maintained by more than one developer, with each developer contributing a different amount of code, which forms a unique contribution composition. Whether the contribution composition of a code file is reasonable directly affects task allocation, which in turn affects software quality and development efficiency. For different types of code files, how to measure and determine their contribution composition is an urgent problem. Thanks to mature tool support for collaborative development, developer activities can be recorded effectively, and the huge amount of data generated by developers lays the foundation for data-driven intelligent software development. This paper first establishes, based on code ownership, a set of metrics that describe the contribution composition of code files along three dimensions: concentration, complexity, and stability. Second, taking Nova (one of OpenStack's core projects) as a case study, the metrics are applied to its version control data to characterize 12 common file types, yielding 3 contribution composition patterns. Finally, the validity of the metrics and the rationality of the contribution composition patterns are verified through email and in-person interviews, and some instructive suggestions for the software development process are presented.
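One plausible way to instantiate the concentration dimension is the Herfindahl index over per-author ownership shares of a file (1.0 means a single author; values near 1/n mean n equal contributors). The index choice is an illustrative assumption, not necessarily the paper's exact metric.

```python
# Concentration of a file's contribution composition as a Herfindahl index
# over per-author line-ownership shares.
def concentration(lines_by_author):
    total = sum(lines_by_author.values())
    shares = [n / total for n in lines_by_author.values()]
    return sum(s * s for s in shares)

print(concentration({"alice": 800, "bob": 150, "carol": 50}))   # ~0.67, concentrated
print(concentration({"alice": 340, "bob": 330, "carol": 330}))  # ~0.33, dispersed
```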
SUN Xiao-Bing, ZHOU Cheng, YANG Hui, LI Bin
2018, 29(8):2294-2305. DOI: 10.13328/j.cnki.jos.005523
Abstract: Security bugs commonly emerge during software development and maintenance and cause security risks in deployed software. They need to be fixed with high quality and faster than other types of bugs, so recommending developers to fix security bugs is an important task in the security bug fixing process. Several developer recommendation techniques have been proposed for bug fixing, but most do not consider developers' security experience or the quality of their fixes. This paper proposes SecDR (security developer recommendation), an approach that recommends developers based on historical data about the quality and complexity of their security bug fixes. In addition, SecDR recommends junior developers for simple bugs and senior developers for complex ones. An empirical study on three open-source subjects (Mozilla, Libgdx, and ElasticSearch) is conducted to evaluate the effectiveness of SecDR, in which SecDR is also compared with a state-of-the-art developer recommendation technique, DR_PSF. The results show that SecDR improves accuracy over DR_PSF by 19% to 42%. Moreover, compared against the actual developer allocation, SecDR recommends developers effectively, sometimes even better than the allocation in the real bug assignment environment.
XIE Xin-Qiang, YANG Xiao-Chun, WANG Bin, ZHANG Xia, JI Yong, HUANG Zhi-Gang
2018, 29(8):2306-2321. DOI: 10.13328/j.cnki.jos.005525
Abstract: Evaluating developers' capabilities and recommending collaborative relationships among them is a hot topic in intelligent software development in the big data environment. By analyzing Internet developer communities and enterprise-internal development environments, this paper designs a developer capability model based on fuzzy comprehensive evaluation. Features are then extracted along three dimensions (dynamic interaction behavior, static matching, and developer capability) by mining the dynamic interactions between developers and tasks. Combining these features with matrix factorization techniques, a multi-feature-fusion collaborative filtering method for developer recommendation, based on capability and behavior, is proposed. The method alleviates the rating-matrix sparsity and cold-start problems of developer recommendation and improves the precision of personalized recommendation. At the system level, a prototype of the multi-feature fusion recommendation system for the big data environment is presented, with optimizations over existing open-source technology frameworks. Experiments are conducted on the Internet Q&A community StackOverflow and an institution-internal GitLab environment. Finally, open issues and ideas for future research are discussed.
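The collaborative-filtering backbone can be sketched as plain matrix factorization: factor a sparse developer-task score matrix into latent factors by SGD. Sizes, rates, and data below are toy assumptions; the paper's capability and behavior features are fused on top of such a model.

```python
# Toy matrix factorization of a sparse developer-task score matrix.
import numpy as np

R = np.array([[5, 0, 3], [4, 0, 0], [0, 2, 5]], dtype=float)  # 0 = unknown
k, lr, reg = 2, 0.01, 0.1
rng = np.random.default_rng(0)
P, Q = rng.normal(size=(3, k)), rng.normal(size=(3, k))

for _ in range(2000):
    for u, i in zip(*R.nonzero()):          # iterate observed entries only
        err = R[u, i] - P[u] @ Q[i]
        pu = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])

print(np.round(P @ Q.T, 2))  # predicted developer-task affinities
```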
XI Sheng-Qu, YAO Yuan, XU Feng, LÜ Jian
2018, 29(8):2322-2335. DOI: 10.13328/j.cnki.jos.005532
Abstract: With the increasing size of open-source software projects, assigning suitable developers to bug reports (i.e., bug triaging) is becoming more and more difficult, and the efficiency of bug fixing drops when bugs are assigned to inappropriate developers. It is therefore necessary to provide project managers with an automatic bug triaging technique for assigning bug reports. Existing work on this task mainly analyzes the text and metadata of bug reports to characterize the relationships between developers and bug reports, while largely ignoring developers' activeness; consequently, these methods may perform poorly when developers with different levels of activeness have similar characteristics. This paper proposes a learning model named DeepTriage based on recurrent neural networks. On the one hand, the ordered natural language text of a bug report is mapped into high-level features by a bidirectional RNN; on the other hand, the developer's activeness is extracted and transformed into high-level features by a unidirectional RNN. The text features and activeness features are then combined and learned from bug reports with known fixers. Experimental results on four open-source data sets (e.g., Eclipse) show that DeepTriage significantly improves bug triaging accuracy compared with existing work.
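A sketch of the text branch under stated assumptions (arbitrary sizes, mean pooling, no activeness branch or training loop): a bidirectional LSTM encodes the bug report's token sequence, and a linear layer scores candidate developers.

```python
# Bidirectional-LSTM text encoder for bug reports (PyTorch).
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab=10000, emb=128, hid=64, n_devs=50):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hid, n_devs)

    def forward(self, token_ids):               # (batch, seq_len)
        h, _ = self.rnn(self.emb(token_ids))    # (batch, seq_len, 2*hid)
        return self.out(h.mean(dim=1))          # mean-pool, then developer scores

logits = TextEncoder()(torch.randint(0, 10000, (4, 120)))
print(logits.shape)  # torch.Size([4, 50])
```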
ZHANG Yi-Fan, TANG En-Yi, SU Yan-Zi, YANG Kai-Mao, KUANG Hong-Yu, CHEN Xin
2018, 29(8):2336-2349. DOI: 10.13328/j.cnki.jos.005526
Abstract: Software safety is a key property that determines whether software is vulnerable to malicious attacks. With Internet attacks now ubiquitous, it is important to evaluate the number and categories of defects in software. Users need to evaluate not only software that is unreleased or newly released, but also software that has been available for some time; for example, when users want to assess the safety of several competing software systems before deciding on a purchase, they need a low-cost, objective evaluation approach. This paper proposes a natural-language, data-driven approach for evaluating the safety of released software. The approach crawls natural language data adaptively and applies dual training to evaluate software safety. The self-adaptive Web crawler adjusts feature words based on feedback and acquires heterogeneous data from search engines, so the evaluation automatically draws on extensive data sources. Furthermore, by customizing a machine translation model, natural language can be efficiently converted into a semantic encoding, and a machine learning model is then built to intelligently evaluate software safety based on the semantic characteristics of natural language. Experiments are conducted on the Common Vulnerabilities and Exposures (CVE) and National Vulnerability Database (NVD) data. The results show that the presented approach can precisely evaluate the amount, impact, and categories of defects in software.
WANG Fei, YANG Zhi-Bin, HUANG Zhi-Qiu, ZHOU Yong, LIU Cheng-Wei, ZHANG Wen-Bing, XUE Lei, XU Jin-Miao
2018, 29(8):2350-2370. DOI: 10.13328/j.cnki.jos.005530
Abstract: As embedded software systems are widely used in crucial areas such as automotive, energy, and aerospace, their failures can cause environmental pollution, property losses, and even casualties, so safety analysis is critical for developing these systems. Traditional safety analysis is mainly used in the requirement analysis and design stages, but the gap between requirements and design remains a challenge in software engineering: it is difficult to transmit and reflect the analysis results of the requirement stage into the software design. The primary reason is that software requirements are mostly described in natural language, whose ambiguity and vagueness make automated processing difficult. To address this problem, this paper focuses on component-based embedded software and first proposes a set of requirement templates based on a restricted natural language to reduce the ambiguity and vagueness of natural language requirements. Then, to reduce the complexity of automated processing, a requirement abstract syntax graph is used as an intermediate model to translate requirements written in the restricted natural language templates into AADL models, with the traceability relations between them recorded automatically. Finally, a tool for the proposed method is developed on top of the open-source AADL environment OSATE, and the method is validated on a spacecraft guidance, navigation, and control (GNC) system.
ZHAO Jie, LI Ying-Ying, ZHAO Rong-Cai
2018, 29(8):2371-2396. DOI: 10.13328/j.cnki.jos.005563
Abstract: Polyhedral compilation has evolved for nearly three decades and has been implemented as a building block or optional extension of numerous open-source and commercial compilers. On the one hand, compared with the traditional models adopted by parallelizing compilers, the polyhedral model offers a wider range of applications, more powerful expressiveness, and a larger optimization space; it represents the state of the art in almost every domain of parallelizing compilation and has become a hot topic for many international compiler research teams. On the other hand, the polyhedral model is theoretically difficult, complex to manipulate, and full of challenges, which hampers its adoption in underdeveloped countries and areas and has drawn few researchers to this topic from the domestic compiler community. Aiming to open the "black box" of the polyhedral model, this paper surveys the "black magic" of polyhedral compilation. First, the underlying theory behind the polyhedral model is introduced, along with a description of the general compilation process and an overview of research directions. Next, the research progress of polyhedral compilation targeting parallelism, data locality, and extensions to various application domains is presented. Last but not least, open challenges faced by the polyhedral community and potential research directions are discussed. The purpose of this work is to provide a useful reference for the domestic compiler community by reviewing and summarizing current trends in polyhedral compilation, and to help Chinese compiler researchers make progress on this topic.
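To make the "black box" concrete, here is a standard textbook illustration (not specific to any one compiler) of the three core polyhedral objects for the stencil loop nest: for (i = 1; i < N; i++) for (j = 1; j < N; j++) S: A[i][j] = A[i-1][j] + A[i][j-1];

```latex
\begin{align*}
\mathcal{D}_S &= \{\,(i,j) \in \mathbb{Z}^2 \mid 1 \le i < N,\ 1 \le j < N\,\}
  && \text{iteration domain}\\
f_A(i,j) &= (i-1,\ j) \ \text{and}\ (i,\ j-1)
  && \text{read access functions}\\
\theta_S(i,j) &= (i+j,\ j)
  && \text{affine schedule (loop skewing)}
\end{align*}
```

Under this schedule, the dependence distances (1,0) and (0,1) both become strictly positive in the first time dimension, so all iterations on a wavefront i+j = const can run in parallel.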
RAO Yuan, WU Lian-Wei, WANG Yi-Ming, FENG Cong
2018, 29(8):2397-2426. DOI: 10.13328/j.cnki.jos.005564
Abstract: With the development of machine learning and the application of big data, semantics-based emotion computing and analysis technology plays a significant role in research on human perception, attention, memory, decision-making, and social communication. It affects not only the development of artificial intelligence technology, but also human-machine interaction and intelligent robotics, and has therefore drawn widespread interest from academia and industry. Based on a definition of affect and an analysis of more than 90 emotion models, this paper summarizes six vital problems and challenges in emotion computing: where emotion stems from and how to represent its essential features; how to analyze and compute emotion in multi-modal environments; how to measure the influence of external factors on the process of emotional evolution; how to measure individual emotion through various personalized characteristics; how to measure crowd psychology and emotion and analyze the mechanisms of their propagation dynamics; and how to express subtle emotions and optimize the algorithms. For each challenge, theoretical research, technical analyses, and practical applications are presented to introduce the current progress and trends, in order to provide new research clues and directions for further study of semantics-based emotion computing.
QIAN Zhong, LI Pei-Feng, ZHOU Guo-Dong, ZHU Qiao-Ming
2018, 29(8):2427-2447. DOI: 10.13328/j.cnki.jos.005485
Abstract: Speculation and negation information extraction is an important task and research focus in natural language processing (NLP). This paper proposes a two-layer bidirectional long short-term memory (LSTM) neural network model for speculation and negation scope detection. First, a bidirectional LSTM network in the first layer learns useful feature representations from the syntactic path from the cue to each token. Lexical features and syntactic path features are then concatenated into the token's feature representation. Finally, treating scope detection as a sequence labeling task, another bidirectional LSTM network in the second layer identifies the scope of the current cue. Experimental results show that the presented model outperforms other neural network models and achieves excellent performance on the BioScope corpus; in particular, it attains accuracies (percentage of correct scopes) of 86.20% and 80.28% for speculation and negation scope detection, respectively, on the Abstracts subcorpus.
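A simplified sketch of the two-layer design (per-token path features are assumed here to be precomputed vectors rather than full per-token path sequences, and all sizes are arbitrary): the first BiLSTM encodes path information, its output is concatenated with lexical features, and the second BiLSTM labels each token as inside or outside the cue's scope.

```python
# Two-layer BiLSTM scope detector as sequence labeling (PyTorch).
import torch
import torch.nn as nn

class ScopeDetector(nn.Module):
    def __init__(self, path_dim=32, lex_dim=64, hid=50, n_tags=2):
        super().__init__()
        self.path_rnn = nn.LSTM(path_dim, hid, batch_first=True, bidirectional=True)
        self.tag_rnn = nn.LSTM(2 * hid + lex_dim, hid, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hid, n_tags)

    def forward(self, path_feats, lex_feats):   # (B, T, path_dim), (B, T, lex_dim)
        p, _ = self.path_rnn(path_feats)        # per-token path representation
        h, _ = self.tag_rnn(torch.cat([p, lex_feats], dim=-1))
        return self.out(h)                      # (B, T, n_tags): in/out of scope

tags = ScopeDetector()(torch.randn(2, 20, 32), torch.randn(2, 20, 64))
print(tags.shape)  # torch.Size([2, 20, 2])
```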
YU Qian-Cheng, YU Zhi-Wen, WANG Zhu, WANG Xiao-Feng
2018, 29(8):2448-2469. DOI: 10.13328/j.cnki.jos.005558
Abstract: Exchangeability is key to modeling network data with Bayesian models. Exchangeable graph models based on the Aldous-Hoover representation theorem cannot generate sparse networks, whereas empirical studies indicate that many real-world complex networks have power-law degree distributions. Exchangeable graph models based on the Kallenberg representation theorem can exhibit power-law behavior while retaining the desirable exchangeability. This article surveys the emerging literature on the concepts, theory, and methods of sparse exchangeable graph models, taking the Caron-Fox model and the graphex model as examples. First, developments in random graph models, Bayesian nonparametric mixture models, exchangeability representation theorems, Poisson point processes, and discrete nonparametric priors are discussed. Next, the Caron-Fox model is introduced. Then, simulation of sparse exchangeable graph models and related methods, such as the truncated sampler and the marginalized sampler, are summarized, and techniques for posterior inference are reviewed. Finally, the state of the art and prospects for the development of sparse exchangeable graph models are presented.
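For orientation, a hedged sketch of the Caron-Fox generative process as it is usually presented (simplified, with our own notation): sociability weights come from a completely random measure, and edge multiplicities are Poisson in the product of weights.

```latex
\begin{align*}
W &= \sum_i w_i\,\delta_{\theta_i}
  && \text{CRM of sociability weights, e.g. a generalized gamma process}\\
D_{ij} &\sim \mathrm{Poisson}(w_i w_j)
  && \text{latent directed multi-edge counts}\\
z_{ij} &= \mathbf{1}\{D_{ij} + D_{ji} > 0\},\quad
\Pr(z_{ij}=1) = 1 - e^{-2 w_i w_j}\ (i \neq j)
  && \text{observed undirected edge}
\end{align*}
```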
YU Yan-Wei, JIA Zhao-Fei, CAO Lei, ZHAO Jin-Dong, LIU Zhao-Wei, LIU Jing-Lei
2018, 29(8):2470-2484. DOI: 10.13328/j.cnki.jos.005289
Abstract: This paper proposes a simple but efficient density-based clustering algorithm, named CBSCAN, to quickly and effectively discover clusters of arbitrary shape, together with noise, from big location data. First, the notion of a Cell is defined, and a Cell-based distance analysis principle is proposed to quickly find core points in high-density areas, and their density relationships with other points, without computing distances. Second, a Cell-based cluster is presented that maps point-based density clusters to grid-based density clusters; by leveraging exclusion grids and their relationships with adjacent grids, all inclusion grids of a Cell-based cluster can be determined rapidly. Furthermore, a fast density-based algorithm built on the distance analysis principle and Cell-based clusters transforms DBSCAN's point-based expansion into Cell-based expansion. The algorithm exploits the inherent properties of location data to avoid a huge number of distance calculations, improving clustering efficiency significantly. Finally, comprehensive experiments on benchmark datasets demonstrate the clustering effectiveness of the proposed algorithm, and experiments on massive real and synthetic location datasets show that CBSCAN is 525, 30, and 11 times faster than DBSCAN, DBSCAN with a PR-tree index, and DBSCAN with grid index optimization, respectively.
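The Cell idea can be illustrated as follows: with cell side eps/sqrt(2), the cell diagonal equals eps, so any cell containing at least min_pts points makes all of them core points without a single distance computation. This is the generic grid acceleration underlying the approach, not CBSCAN's full algorithm.

```python
# Find "dense cells" whose points are all core points, with no distance math.
import math
from collections import defaultdict

def core_cells(points, eps, min_pts):
    side = eps / math.sqrt(2)                 # cell diagonal == eps
    cells = defaultdict(list)
    for p in points:
        cells[(int(p[0] // side), int(p[1] // side))].append(p)
    return {c: pts for c, pts in cells.items() if len(pts) >= min_pts}

pts = [(0.1 * i, 0.05 * i) for i in range(50)] + [(9.0, 9.0)]
print(len(core_cells(pts, eps=0.5, min_pts=3)))  # dense cells found directly
```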
YANG Yang, YANG Jia-Hai, WEN Hao-Sen
2018, 29(8):2485-2500. DOI: 10.13328/j.cnki.jos.005543
Abstract: SDN (software defined network) based traffic engineering can select routing paths dynamically, using a global view of the data center network, to avoid the risk of congestion. However, such routing strategies often need to change the routing path during packet transmission, especially for elephant flows, which commonly results in packet loss and out-of-order delivery at receivers. To address this problem, an algorithm named FLAT (flowlet-binned algorithm based on timeslots) is proposed. Under centralized control, FLAT gathers link state information and calculates a proper transmission timeslot, which solves the packet loss and reordering problem; meanwhile, efficient and fine-grained traffic balancing is achieved by making substantial use of the redundant links in data centers. Simulation results on the Mininet platform show that, compared with the ECMP and GFF routing strategies, FLAT reduces the packet loss rate by 90% and 80%, respectively, and increases throughput by 44% and 11%, especially under high link load.
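The flowlet idea that FLAT builds on can be sketched simply: if the idle gap between consecutive packets of a flow exceeds a timeslot delta, the next packet starts a new flowlet, which can safely take a different path without causing reordering. The trace and delta below are invented for illustration.

```python
# Split a flow's packet arrival times into flowlets by an idle-gap threshold.
def split_flowlets(arrival_times, delta):
    flowlets, current = [], [arrival_times[0]]
    for prev, t in zip(arrival_times, arrival_times[1:]):
        if t - prev > delta:          # idle gap exceeds the timeslot
            flowlets.append(current)
            current = []
        current.append(t)
    flowlets.append(current)
    return flowlets

trace = [0.0, 0.1, 0.15, 1.2, 1.25, 3.0]   # seconds
print(split_flowlets(trace, delta=0.5))    # [[0.0, 0.1, 0.15], [1.2, 1.25], [3.0]]
```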
2018, 29(8):2501-2510. DOI: 10.13328/j.cnki.jos.005281
Abstract: Existing steganographic cover-selection indicators based on image texture complexity modeling are not compatible with JPEG steganography. To solve this problem, a JPEG steganographic method based on cover selection with Haar wavelet domain indicators is proposed. The method models the relationships among JPEG image pixels with a high-order Haar wavelet transform and calculates the average norm of the decomposed image matrices in each direction to select covers with high undetectability. The proposed indicator models inter-pixel relationships better than most existing models and thus enhances the concealment of JPEG steganography through cover selection. Experimental results show that, in most cases, the proposed method achieves higher concealment than JPEG steganography without cover selection, by an average of about 7.7%, and higher concealment than existing cover-selection indicators by an average of 2.0%. The proposed steganography therefore attains better concealment.
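A sketch of such an indicator under stated assumptions (the level count and scoring rule are illustrative, not the paper's exact definition): decompose the image with a multi-level Haar wavelet transform and average the Frobenius norms of the directional detail subbands; higher scores indicate more texture and thus better cover candidates.

```python
# Texture-complexity score from multi-level Haar wavelet detail subbands.
import numpy as np
import pywt

def texture_score(gray_image, levels=3):
    coeffs = pywt.wavedec2(gray_image, "haar", level=levels)
    # coeffs[0] is the approximation; the rest are (cH, cV, cD) detail tuples.
    norms = [np.linalg.norm(band) for detail in coeffs[1:] for band in detail]
    return float(np.mean(norms))  # average over H/V/D directions and levels

img = np.random.rand(256, 256)   # stand-in for a decoded JPEG image
print(texture_score(img))        # rank candidate covers by this score
```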
ZHANG Zhang-Kai, LI Zhou-Jun, XIA Chun-He, MA Jin-Xin, CUI Jin-Hua
2018, 29(8):2511-2526. DOI: 10.13328/j.cnki.jos.005492
Abstract: Widely used on Android phones, ARM TrustZone technology divides the hardware resources of a phone into two worlds: the non-secure world and the secure world. The Android operating system used by the user runs in the non-secure world, while TrustZone-based introspection systems for the non-secure world (e.g., KNOX, Hypervisor) run in the secure world. These introspection systems have high privilege: they can dynamically check Android kernel integrity and perform memory management of the non-secure world in place of the Android kernel. However, because of the world gap (the introspection systems and the Android system are in different worlds), TrustZone cannot fully introspect the hardware resources of the non-secure world (e.g., the cache), and its limited interception and memory access control capabilities weaken its introspection capability. This article proposes an extensible framework, HTrustZone, which uses a hypervisor to extend TrustZone's introspection capabilities, defeat world-gap attacks, and strengthen interception and memory access control. HTrustZone helps TrustZone make substantial progress in system introspection and offers stronger security protection to the operating system in the non-secure world. HTrustZone is implemented on a Raspberry Pi 2 development board, and experimental results show that its overhead is about 3%.
CONG Run-Min, LEI Jian-Jun, FU Hua-Zhu, WANG Wen-Guan, HUANG Qing-Ming, NIU Li-Jie
2018, 29(8):2527-2544. DOI: 10.13328/j.cnki.jos.005560
Abstract: As a hot topic in the computer vision community, video saliency detection aims to continuously discover motion-related salient objects from video sequences by considering spatial and temporal information jointly. Owing to complex backgrounds, diverse motion patterns, and camera motion in video sequences, video saliency detection is more challenging than image saliency detection. This paper summarizes existing video saliency detection methods, introduces the relevant experimental datasets, and analyzes the performance of some state-of-the-art methods on different datasets. First, low-level-cue-based video saliency detection methods are introduced, including methods based on transform analysis, sparse representation, information theory, and visual priors. Then, learning-based video saliency detection methods, comprising traditional machine learning methods and deep learning methods, are discussed. Subsequently, the commonly used datasets for video saliency detection are presented, and four evaluation measures are introduced. Moreover, some state-of-the-art methods are compared qualitatively and quantitatively on different datasets. Finally, the key issues of video saliency detection are summarized, and future development trends are discussed.