YANG Bo , YU Qian , ZHANG Wei , WU Ji , LIU Chao
2017, 28(6):1330-1342. DOI: 10.13328/j.cnki.jos.005222
Abstract:This paper proposes an approach to analyzing the correlations between these factors. It demonstrates through experiments that certain relationships exist among some of the factors. The paper also offers suggestions on the GitHub open source software development process based on the experiments.
ZHANG Yu-Xia , ZHOU Ming-Hui , ZHANG Wei , ZHAO Hai-Yan , JIN Zhi
2017, 28(6):1343-1356. DOI: 10.13328/j.cnki.jos.005227
Abstract:There are many differences between open source software development approaches and traditional software engineering methods. If commercial organizations want to join the open source community, they must adjust their original software development approaches and business models. In this context, an urgent problem to be solved is what involvement model commercial organizations should adopt to achieve their goals in joining the open source community. This paper first collects project text data from the Internet, using a snowball-sampling collection mechanism, as a basis for qualitative analysis. Then, based on classical grounded theory, it summarizes different commercial organizations' involvement models in open source projects by filtering and analyzing these data. Finally, the study extracts four general involvement models, which can provide decision support and experience references to commercial organizations that want to join open source software projects.
YANG Cheng , FAN Qiang , WANG Tao , YIN Gang , WANG Huai-Min
2017, 28(6):1357-1372. DOI: 10.13328/j.cnki.jos.005230
Abstract:With the deep integration of software collaborative development and social networking, social coding represents a new style of software production and creation paradigm. Due to their flexibility and openness, open source communities attract a large number of external contributors, who play a significant role in open source development. However, online open source development is globalized and distributed cooperative work. If left unsupervised, the contribution process may be inefficient: it takes contributors a lot of time to find suitable projects or tasks among the thousands of open source projects in the communities. In this paper, a new approach, called RepoLike, is proposed for recommending repositories to developers based on linear combination and learning to rank. It utilizes project popularity, technical dependencies among projects and social connections among developers to measure the correlations between a developer and given projects. The experimental results show that the new approach achieves a hit ratio of over 25% when recommending 20 candidates, meaning it can recommend closely correlated repositories to social developers.
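A minimal sketch of the linear-combination scoring idea described above: each candidate repository is scored as a weighted sum of a popularity signal, a technical-dependency signal and a social-connection signal, then candidates are ranked by score. The weights and feature values here are illustrative assumptions, not the paper's learned model.

```python
def repo_score(features, weights=(0.4, 0.3, 0.3)):
    """features = (popularity, tech_dependency, social_connection), each in [0, 1]."""
    return sum(w * f for w, f in zip(weights, features))

def recommend(candidates, top_k=2):
    """Rank candidate repos by combined score and return the top-k names."""
    ranked = sorted(candidates, key=lambda c: repo_score(c[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]

candidates = [
    ("repo-a", (0.9, 0.1, 0.2)),  # popular, but weakly related to the developer
    ("repo-b", (0.5, 0.8, 0.7)),  # moderately popular, strongly related
    ("repo-c", (0.2, 0.2, 0.1)),
]
print(recommend(candidates))  # the strongly related repo outranks the merely popular one
```

In the paper the weights are tuned with learning to rank; here they are fixed only to show how the three signals combine into one ranking score.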
WANG Hao-Yu , GUO Yao , MA Zi-Ang , CHEN Xiang-Qun
2017, 28(6):1373-1388. DOI: 10.13328/j.cnki.jos.005221
Abstract:Third-Party libraries are widely used in mobile applications such as Android apps. Much research on app analysis or access control needs to detect or classify third-party libraries first in order to provide accurate results. Most previous studies use a whitelist to identify third-party libraries and categorize them manually. However, it is impossible to build a complete whitelist of third-party libraries and classify them because:(1) there are too many of them; and (2) common techniques such as library obfuscation and library masquerading cannot be handled with a whitelist. In this paper, an automated approach is proposed to detect and classify frequently-used third-party libraries in Android apps. A multi-level clustering based method is presented to identify third-party libraries, and a machine learning based technique is applied to classify them. Experiments on more than 130 000 apps show that 4 916 third-party libraries can be detected without prior knowledge. Ten-fold cross validation on sampled libraries achieves a classification accuracy of 84.28%. With the trained classifier, the proposed approach is able to classify more than 75% of the 4 916 libraries into six categories with an accuracy of 75%.
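A toy sketch of the intuition behind whitelist-free library detection: package prefixes that recur across many independent apps are likely shared third-party libraries. The paper's actual method is multi-level clustering over code features (robust to obfuscation); this prefix counting is only an illustrative stand-in on invented package names.

```python
from collections import Counter

def candidate_libraries(apps, min_apps=2, depth=2):
    """apps: list of sets of package names; returns prefixes seen in >= min_apps apps."""
    counts = Counter()
    for packages in apps:
        # Count each depth-limited prefix at most once per app.
        prefixes = {".".join(p.split(".")[:depth]) for p in packages}
        counts.update(prefixes)
    return {p for p, c in counts.items() if c >= min_apps}

apps = [
    {"com.example.app1.ui", "com.google.ads.banner"},
    {"org.foo.game", "com.google.ads.video"},
    {"net.bar.tool", "com.google.ads.banner"},
]
print(candidate_libraries(apps))  # the prefix shared by all three apps
```

The app-specific code appears in only one app each, so only the recurring prefix survives the threshold; clustering on richer features generalizes this idea past name-based heuristics.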
XU Pei-Xing , CHEN Wei , WU Guo-Quan , GAO Chu-Shu , WEI Jun
2017, 28(6):1389-1404. DOI: 10.13328/j.cnki.jos.005224
Abstract:Configuration management tools (CMTs), as an essential part of automated system operations, are an important technique for achieving DevOps (development and operations). There are a large number of reusable CMT artifacts in Internet-scale open source communities and repositories. However, the lack of effective hierarchical categorization makes it difficult to retrieve and use those artifacts. This paper addresses the issue by proposing a hierarchical categorization method for CMT artifacts based on their online unstructured descriptions. The method first constructs a category system based on the co-occurrences of tags, and then designs classifiers based on the features of CMT artifacts, including name and description. To mitigate the effect of the unbalanced data set on classification, the method builds a hybrid model to divide the sample data. Finally, extensive experiments are carried out to evaluate the method on more than 11 000 CMT artifacts. The results show that the improved method based on the hybrid model achieves up to 0.81 precision, 0.88 recall and 0.85 F-measure. Compared to traditional approaches, the recall and F-measure of CMT artifact classification improve significantly, verifying the effectiveness of the method.
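A small sketch in the spirit of the first step above, building parent-child category relations from tag co-occurrence: tag A is treated as a parent of tag B when most artifacts tagged B are also tagged A, but not vice versa. The subsumption threshold and the toy tag data are illustrative assumptions, not the paper's.

```python
from collections import Counter
from itertools import combinations

def parent_child_pairs(tagged_artifacts, threshold=0.8):
    """tagged_artifacts: list of tag sets. Returns (parent, child) tag pairs."""
    tag_count = Counter()
    pair_count = Counter()
    for tags in tagged_artifacts:
        tag_count.update(tags)
        for a, b in combinations(sorted(tags), 2):
            pair_count[(a, b)] += 1
    pairs = []
    for (a, b), c in pair_count.items():
        # c / tag_count[b] approximates P(a | b): how often b's artifacts also carry a.
        if c / tag_count[b] >= threshold and c / tag_count[a] < threshold:
            pairs.append((a, b))  # a subsumes b
        elif c / tag_count[a] >= threshold and c / tag_count[b] < threshold:
            pairs.append((b, a))  # b subsumes a
    return pairs

artifacts = [
    {"database", "mysql"}, {"database", "mysql"},
    {"database", "postgres"}, {"database"}, {"web"},
]
print(parent_child_pairs(artifacts))  # "database" subsumes both engine tags
```

Every "mysql" and "postgres" artifact also carries "database" while the reverse does not hold, so "database" becomes their parent category.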
LI Xuan , WANG Qian-Xiang , JIN Zhi
2017, 28(6):1405-1417. DOI: 10.13328/j.cnki.jos.005226
Abstract:Effectively searching a code base for code that accomplishes a specific programming task has become an important research field of software engineering. This paper presents a description reinforcement based code search (DERECS) approach. DERECS first builds a code-description pair corpus, analyzes both the code and its natural language description, and extracts features about method calls and code structure. DERECS then reinforces the description of the code based on the method-call and code-structure features, reducing the gap between code snippets and natural language queries and expanding the search scope. Evaluation is conducted against real-world queries, and the results show that DERECS is significantly better than SNIFF and Krugle.
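An illustrative sketch of the "description reinforcement" idea: the searchable text of each snippet is its natural-language description expanded with method names extracted from the code, so that queries mentioning API names can still match. The simple word-overlap scoring here stands in for DERECS's actual ranking; snippets, regex and scoring are assumptions for illustration only.

```python
import re

CALL_RE = re.compile(r"(\w+)\s*\(")  # crude method-call extractor

def reinforced_index(snippets):
    """snippets: (name, code, description). Index = description words + call names."""
    index = {}
    for name, code, description in snippets:
        calls = set(CALL_RE.findall(code))
        index[name] = set(description.lower().split()) | {c.lower() for c in calls}
    return index

def search(index, query):
    """Return the snippet whose reinforced text overlaps the query the most."""
    q = set(query.lower().split())
    return max(index, key=lambda name: len(index[name] & q))

snippets = [
    ("read_file", "data = open(path).read()", "load text from disk"),
    ("sort_list", "items.sort()", "order the elements"),
]
index = reinforced_index(snippets)
print(search(index, "open and read a file"))
```

The query shares no words with "load text from disk", yet still retrieves the right snippet because the call names `open` and `read` were folded into its description.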
HUANG Yuan , LIU Zhi-Yong , CHEN Xiang-Ping , XIONG Ying-Fei , LUO Xiao-Nan
2017, 28(6):1418-1434. DOI: 10.13328/j.cnki.jos.005225
Abstract:Code commits are among the most important software evolution data, widely used in software review and code comprehension. A commit involving multiple modified classes and code changes makes the review of those changes difficult. By analyzing a large amount of commit data, this study finds that identifying the core modified classes in a commit can speed up commit review for developers. Inspired by the effectiveness of machine learning techniques in classification, the paper models core class identification as a binary classification problem (i.e., core and non-core) and derives discriminative features from a large number of commits to characterize the core modified classes. The experimental results show that the proposed approach achieves 87% accuracy, and using the core class in commit review provides a significant improvement over review without it.
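A toy sketch of casting core-class identification as binary classification: each modified class in a commit is described by a feature vector and a simple linear classifier (here a perceptron) separates core from non-core. The two features (fraction of changed lines, number of changed methods), the training data and the model choice are all invented for illustration; the paper engineers richer features from real commits.

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """samples: ((f1, f2), label) pairs with label 1 = core, 0 = non-core."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, label in samples:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = label - pred  # standard perceptron update on mistakes
            w = [w[0] + lr * err * x[0], w[1] + lr * err * x[1]]
            b += lr * err
    return w, b

def is_core(model, x):
    w, b = model
    return w[0] * x[0] + w[1] * x[1] + b > 0

# (changed-line fraction, changed-method count) -> core (1) / non-core (0)
train = [((0.8, 5), 1), ((0.7, 4), 1), ((0.1, 1), 0), ((0.05, 0), 0)]
model = train_perceptron(train)
print(is_core(model, (0.9, 6)), is_core(model, (0.05, 1)))  # True False
```

The toy data is linearly separable, so the perceptron converges; heavily changed classes are flagged as core, lightly touched ones are not.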
WANG Zi-Yong , WANG Tao , ZHANG Wen-Bo , CHEN Ning-Jiang , ZUO Chun
2017, 28(6):1435-1454. DOI: 10.13328/j.cnki.jos.005223
Abstract:Microservice architecture is being adopted by more and more applications. How to effectively detect and locate faults is a key technology for guaranteeing the performance and reliability of microservices. Current approaches typically monitor physical metrics and manually set alarm rules according to domain knowledge. However, these approaches cannot automatically detect faults or locate root causes at a fine granularity. To address these issues, this work proposes a fault diagnosis approach for microservices based on execution trace monitoring. First, dynamic instrumentation is used to monitor the execution traces crossing service components, and call trees are used to describe the execution traces of user requests. Second, for faults affecting the structure of execution traces, the tree edit distance is used to assess the abnormality degree of request processing, and the method calls leading to failures are located by analyzing the differences between execution traces. Third, for performance anomalies leading to response delay, principal component analysis is used to extract the key method invocations causing unusual fluctuations in performance metrics. Experimental results show that the new approach can accurately characterize the execution traces of request processing and locate the methods that cause system failures and performance anomalies.
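A simplified sketch of scoring how abnormal a request's call tree is against a reference trace: the cost is 0/1 for the root labels plus an edit-distance alignment of the child subtrees, where inserting or deleting a subtree costs its size. This top-down variant is only an approximation of the general tree edit distance the paper relies on, and the call trees are invented examples.

```python
def tree_size(t):
    label, children = t
    return 1 + sum(tree_size(c) for c in children)

def tree_dist(t1, t2):
    """Approximate edit distance between two (label, children) call trees."""
    (l1, c1), (l2, c2) = t1, t2
    cost = 0 if l1 == l2 else 1
    # Sequence edit distance over the child lists; matching a pair recurses.
    m, n = len(c1), len(c2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + tree_size(c1[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + tree_size(c2[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + tree_size(c1[i - 1]),      # delete a subtree
                d[i][j - 1] + tree_size(c2[j - 1]),      # insert a subtree
                d[i - 1][j - 1] + tree_dist(c1[i - 1], c2[j - 1]),  # match
            )
    return cost + d[m][n]

normal = ("handleRequest", [("authenticate", []), ("queryDB", []), ("render", [])])
faulty = ("handleRequest", [("authenticate", []), ("render", [])])  # queryDB missing
print(tree_dist(normal, normal), tree_dist(normal, faulty))  # 0 1
```

A zero distance marks a normal trace; a positive distance both flags the anomaly and, via the alignment, points at the missing `queryDB` call as the suspect.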
HE Ji-Yuan , MENG Zhao-Peng , CHEN Xiang , WANG Zan , FAN Xiang-Yu
2017, 28(6):1455-1473. DOI: 10.13328/j.cnki.jos.005228
Abstract:Software defect prediction can help developers optimize the distribution of test resources by predicting whether or not a software module is defect-prone. Most defect prediction research focuses on within-project defect prediction, which needs sufficient training data from the same project. However, in real software development, a project in need of defect prediction is often new or without historical data. Therefore, cross-project defect prediction, which uses training data from several projects and performs prediction on another, has become a hot topic. The main research challenges in cross-project defect prediction are the difference in data distribution between source and target projects and the class imbalance problem among datasets. Inspired by search based software engineering, this paper proposes a search based semi-supervised ensemble learning approach, S3EL. By adjusting the distribution ratio of the training dataset, several Naïve Bayes classifiers are built as base learners; a small amount of labeled target instances and a genetic algorithm are then used to combine these base classifiers into a final prediction model. S3EL is compared with other up-to-date classical cross-project defect prediction approaches (such as the Burak filter, Peters filter, TCA+, CODEP and HYDRA) on the AEEEM and Promise datasets. The results show that S3EL has the best prediction performance in most cases under the F1 measure.
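A toy sketch of the ensemble-combination step: a genetic algorithm searches for voting weights that combine several base classifiers, with fitness evaluated on a small labeled sample of the target project. The base classifiers below are hand-made stand-ins (the paper trains Naïve Bayes learners on resampled source-project data), and all numbers are illustrative.

```python
import random

random.seed(42)

# Each base classifier maps an instance (a single feature) to a score in [0, 1].
base_learners = [
    lambda x: 1.0 if x > 0.5 else 0.0,  # well matched to this data
    lambda x: 1.0 if x > 0.9 else 0.0,  # too conservative
    lambda x: 0.5,                      # uninformative
]
labeled = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1)]  # labeled target instances

def accuracy(weights):
    """Fitness: accuracy of the weighted vote on the labeled sample."""
    total = sum(weights) or 1e-9
    correct = 0
    for x, y in labeled:
        score = sum(w * f(x) for w, f in zip(weights, base_learners)) / total
        correct += int((score > 0.5) == bool(y))
    return correct / len(labeled)

def evolve(pop_size=20, generations=30):
    pop = [[random.random() for _ in base_learners] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=accuracy, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)
            child = [(x + y) / 2 + random.gauss(0, 0.1) for x, y in zip(a, b)]
            children.append([max(0.0, g) for g in child])  # keep weights non-negative
        pop = parents + children
    return max(pop, key=accuracy)

best = evolve()
print(accuracy(best))
```

The GA settles on weights that favor the informative learner over the conservative and uninformative ones, which is exactly the role the labeled target instances play in S3EL's final combination.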
TSAI Wei-Tek , YU Lian , WANG Rong , LIU Na , DENG En-Yan
2017, 28(6):1474-1487. DOI: 10.13328/j.cnki.jos.005232
Abstract:This paper presents a blockchain definition independent of any digital currency, and describes its characteristics including consensus protocols, design patterns, scalability, databases, and chaincode. The paper then presents a permissioned blockchain, called Beihangchain, with its unique consensus algorithms, interfaces, and design. It also proposes ABC (account blockchain) and TBC (trading blockchain), to be used for a variety of applications including copyright protection and digital payment. Finally, this paper analyzes chaincode requirements and provides guidelines for effective chaincode.
2017, 28(6):1488-1497. DOI: 10.13328/j.cnki.jos.005229
Abstract:The problem frame method typically uses domain knowledge to demonstrate that a software system can satisfy the requirements of stakeholders, by specifying how the machine relates to stakeholders' problems. Qualitatively, such satisfiability arguments can guide a software engineer to make early decisions on what the right solution is to the right problem. However, mobile apps deployed to app stores often not only need to accommodate millions of individual users whose requirements have subtle differences, but may also change at runtime under varying application contexts. Requirements of such apps can no longer be analyzed qualitatively to cover all situations. Big data analysis through deep learning has been increasingly adopted in practice to replace deep requirements analysis. Although effective in making statistically sound decisions, the conclusions of pure big data analysis are merely a set of unexplainable parameters, which cannot be used to show that individual users' requirements are satisfied, nor can they reliably validate trustworthiness and dependability in terms of security and privacy. After all, training with more datasets can only improve statistical significance; it cannot protect software systems from the malicious exploitation of outliers. This paper attempts to follow Jackson's teaching of abstract goal behaviors as an intermediary between requirements and software domains, and proposes an algebraic approach to analyzing the consequences of probabilistic software behavior models, so as to circumvent some blind spots of purely data-driven approaches. Through examples in the security and privacy areas, the challenges and limitations of big data software requirements analysis are discussed.
ZHU Mei-Ling , LIU Chen , WANG Xiong-Bin , HAN Yan-Bo
2017, 28(6):1498-1515. DOI: 10.13328/j.cnki.jos.005220
Abstract:Companion vehicle discovery is a newly emerging intelligent transportation application. To support it, this paper redefines the Platoon companion pattern over a special type of spatio-temporal data stream, ANPR (automatic number plate recognition) data. Accordingly, a PlatoonFinder algorithm is proposed to mine Platoon companions over the ANPR data stream on the fly. First, the Platoon discovery problem is transformed into a frequent sequence mining problem with customized spatio-temporal constraints. Compared to traditional frequent sequence mining algorithms, the new algorithm can effectively handle complex spatio-temporal relationships among sequence elements rather than merely their positions. Second, the new algorithm also integrates several optimization techniques, such as pseudo projection, to greatly improve efficiency. It can efficiently handle high speed and large scale ANPR data streams so as to discover Platoon companions instantly. Experiments show that the latency of the algorithm is significantly lower than that of classic frequent pattern mining algorithms, including Apriori and PrefixSpan, and also lower than the minimum time interval between any two real ANPR data records. Hence, the proposed algorithm can discover Platoon companions effectively and efficiently.
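A toy sketch of the Platoon idea over ANPR-style records: vehicles that pass the same sequence of cameras, each within a small time window of one another, form a travelling companion group. This batch version ignores the streaming and pseudo-projection optimizations of the paper; the records, window and camera threshold are illustrative assumptions.

```python
from collections import defaultdict

def platoons(records, min_cameras=3, window=60):
    """records: (plate, camera, timestamp) triples. Returns companion plate pairs."""
    passages = defaultdict(dict)  # plate -> {camera: passing time}
    for plate, camera, ts in records:
        passages[plate][camera] = ts
    plates = sorted(passages)
    pairs = set()
    for i, a in enumerate(plates):
        for b in plates[i + 1:]:
            # Cameras both vehicles passed within `window` seconds of each other.
            shared = [c for c in passages[a] if c in passages[b]
                      and abs(passages[a][c] - passages[b][c]) <= window]
            if len(shared) >= min_cameras:
                pairs.add((a, b))
    return pairs

records = [
    ("A1", "cam1", 0),   ("A1", "cam2", 100), ("A1", "cam3", 200),
    ("B2", "cam1", 20),  ("B2", "cam2", 130), ("B2", "cam3", 210),
    ("C3", "cam1", 500), ("C3", "cam2", 900),
]
print(platoons(records))  # A1 and B2 travel together; C3 passes much later
```

The spatio-temporal constraint (same cameras, close times) is what distinguishes this from plain frequent sequence mining over camera positions alone.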
2017, 28(6):1516-1528. DOI: 10.13328/j.cnki.jos.005231
Abstract:Big data technology is widely adopted across many disciplines. In order to build sustainable big data application systems and facilitate their rapid development and the delivery of expected value with minimum effort, an innovative software engineering methodology and an integrated development and management platform for big data applications are in dire need. Big data is complex, volatile, weakly correlated and value-sparse by nature, which makes it difficult to form standardized and systematic technological solutions that meet the diversified requirements of big data life cycle management in different application domains. Software engineering in the big data era has to address two major challenges: data life cycle management with an integrated development environment, and software life cycle management using run-time behavior analysis tools. This paper proposes a domain requirements driven approach for developing big data application systems and a run-time support platform covering the entire big data life cycle, including data collection, storage, computation, analysis and visualization, as well as the software systems life cycle. The platform forms a self-managing, self-adaptive and self-optimizing solution. The proposed techniques are applied in specific application domains such as Industry 4.0 and meteorological engineering to illustrate and validate the new platform.
LI Zhi-Yong , HUANG Tao , CHEN Shao-Miao , LI Ren-Fa
2017, 28(6):1529-1546. DOI: 10.13328/j.cnki.jos.005259
Abstract:Constrained optimization evolutionary algorithms, which mainly study how to use evolutionary computation methods to solve constrained optimization problems, are an important research topic in the field of evolutionary computation. Discrete constraints, equality constraints and nonlinear constraints all pose challenges to solving constrained optimization problems. The key to solving such problems is how to handle the relationship between feasible and infeasible solutions. In this study, the definition of the constrained optimization problem is first provided, and the existing constrained optimization approaches are then systematically analyzed. The algorithms are classified into six categories (i.e., penalty function methods, feasibility rules, stochastic ranking, ε-constraint, multi-objective constraint handling, and hybrid methods), and the state-of-the-art constrained optimization evolutionary algorithms (COEAs) are surveyed with respect to constraint-handling techniques. Research progress and challenges of the six categories of constraint handling techniques are discussed in detail. Finally, open issues and research directions of constraint handling techniques are discussed.
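A minimal sketch of the first category above, the static penalty function method: an infeasible candidate's fitness is worsened in proportion to its constraint violation, after which an unconstrained (1+1) evolution strategy can be applied directly. The problem, penalty coefficient and step size are illustrative: minimize x0² + x1² subject to x0 + x1 ≥ 1, whose optimum is (0.5, 0.5).

```python
import random

random.seed(1)

def penalized(x, coeff=1000.0):
    """Objective plus a quadratic penalty on the amount the constraint is broken."""
    objective = x[0] ** 2 + x[1] ** 2
    violation = max(0.0, 1.0 - (x[0] + x[1]))
    return objective + coeff * violation ** 2

def one_plus_one_es(iterations=5000, sigma=0.1):
    """(1+1) evolution strategy: mutate, keep the child only if it is no worse."""
    x = [2.0, 2.0]
    fx = penalized(x)
    for _ in range(iterations):
        y = [xi + random.gauss(0, sigma) for xi in x]
        fy = penalized(y)
        if fy <= fx:
            x, fx = y, fy
    return x, fx

best, fitness = one_plus_one_es()
print(best, fitness)  # settles near (0.5, 0.5) on the constraint boundary
```

The penalty turns the boundary of the feasible region into the fitness minimum, which is why infeasible-but-close candidates are still useful stepping stones; the survey's other five categories handle that feasible/infeasible trade-off differently.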
2017, 28(6):1547-1564. DOI: 10.13328/j.cnki.jos.005260
Abstract:In recent years, with the great success of compressed sensing (CS) in the field of signal processing, matrix completion (MC), derived from CS, has increasingly become a hot research topic in the field of machine learning. Many researchers have carried out fruitful studies on matrix completion problem modeling and its optimization, and have constructed a relatively complete matrix completion theory. In order to better grasp the development of matrix completion, and to facilitate the combination of matrix completion theory with engineering applications, this article reviews the existing matrix completion models and their algorithms. First, it introduces the natural evolution from CS to MC, and shows that the development of CS theory laid the foundation for the formation of MC theory. Second, the article summarizes the existing matrix completion models into four classes from the perspective of relaxing the non-convex, non-smooth rank function, aiming to provide reasonable solutions for specific matrix completion applications. Third, in order to expose the underlying optimization techniques and facilitate the solution of new problem-dependent matrix completion models, the article studies representative optimization algorithms suitable for various matrix completion models. Finally, the article analyzes the problems in current matrix completion techniques, proposes possible solutions, and discusses future work.
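A tiny sketch of matrix completion by low-rank factorization: assuming the underlying matrix has rank 1, alternately refit the row factor u and the column factor v by least squares on the observed entries only, then predict each missing entry as u_i·v_j. The models surveyed above instead relax the rank function (e.g. via the nuclear norm); fixing the rank here is a simplification for illustration.

```python
def complete_rank1(observed, n_rows, n_cols, iterations=50):
    """observed: {(i, j): value}. Returns the fully reconstructed matrix."""
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        # Least-squares update of each u[i] over its observed entries, v fixed.
        for i in range(n_rows):
            num = sum(val * v[j] for (r, j), val in observed.items() if r == i)
            den = sum(v[j] ** 2 for (r, j) in observed if r == i)
            if den:
                u[i] = num / den
        # Symmetric update of each v[j], u fixed.
        for j in range(n_cols):
            num = sum(val * u[i] for (i, c), val in observed.items() if c == j)
            den = sum(u[i] ** 2 for (i, c) in observed if c == j)
            if den:
                v[j] = num / den
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]

# Rank-1 ground truth [[1, 2], [2, 4], [3, 6]] with entry (2, 1) hidden.
observed = {(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 4, (2, 0): 3}
m = complete_rank1(observed, 3, 2)
print(round(m[2][1], 3))  # recovers the hidden value 6
```

Because the observed entries are exactly consistent with a rank-1 matrix, alternating least squares converges to the exact completion; real data needs the relaxed-rank models and the optimization algorithms the survey covers.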
ZHOU Xiao-Ping , LIANG Xun , ZHAO Ji-Chao , LI Zhi-Yu , MA Yue-Feng
2017, 28(6):1565-1583. DOI: 10.13328/j.cnki.jos.005249
Abstract:Social networks (SNs) have become a popular research field in both academia and industry. However, most current studies in this field focus on a single SN. The integration of SNs, termed social network integration (SNI), provides more abundant user behavior data and a more complete network structure for studies on SNs, such as social computing. Additionally, SNI is more effective for excavating and understanding human society through SNs. Thus, exploring problems in SNI has significant theoretical and practical value. Correlating users are the user accounts in different SNs that belong to the same individual. Since such users naturally bridge the SNs, the correlating user mining problem is the fundamental task of SNI and has attracted extensive attention. Due to the unfavorable characteristics of SNs, correlating user mining remains a challenging problem. In this paper, the difficulties of the correlating user mining task are analyzed, and the methods addressing this issue are summarized. Finally, some potential future research directions are suggested.
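An illustrative sketch of one of the simplest signals used in correlating-user mining: username string similarity across two networks. Real systems combine many profile, structural and behavioral features; the normalization, threshold and toy usernames below are assumptions for illustration.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and strip separators so 'john_smith' and 'johnsmith' align."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def match_users(users_a, users_b, threshold=0.85):
    """Return (a, b) pairs whose normalized names are similar enough."""
    pairs = []
    for a in users_a:
        for b in users_b:
            ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if ratio >= threshold:
                pairs.append((a, b))
    return pairs

print(match_users(["john_smith", "alice99"], ["johnsmith", "bob"]))
```

Name similarity alone is noisy (distinct people share handles, one person uses unrelated handles), which is precisely why the surveyed methods fuse it with network-structure and behavior signals.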
ZHANG Xiang-Ling , CHEN Yue-Guo , MA Deng-Hao , CHEN Jun , DU Xiao-Yong
2017, 28(6):1584-1605. DOI: 10.13328/j.cnki.jos.005256
Abstract:Entity search differs from traditional search engines in that the results of traditional search engines are Web pages, whereas the results of entity search are entities, which can enhance the user's search experience. Entity search can be further categorized into the task of related entity search and the task of similar entity search. In this paper, a survey is presented on the techniques of entity search. First, entity search is defined formally, and frequently used evaluation measures are introduced. Second, the algorithms of the two types of entity search on two different data sources (unstructured data and structured data) are reviewed in detail. Finally, open research issues and possible future research directions are discussed.
FENG Jun , ZHANG Li-Xia , LU Jia-Min , WANG Chong
2017, 28(6):1606-1628. DOI: 10.13328/j.cnki.jos.005254
Abstract:Currently, LBS (location-based services) are widely employed on many mobile devices, making the technology for processing moving object data over road networks a research hotspot in the spatio-temporal data processing community. This paper surveys previous work from three aspects: index structures, query approaches and privacy protection. First, the various index structures are classified into three groups, hierarchical, distributed and broadcast, and comparisons are made based on in-depth analysis. Second, the query approaches are divided into four categories by purpose: single-object continuous query, multi-object parallel query, shortest path query and road-network keyword query. For each category, its basic strategies are introduced. In addition, methods for moving object privacy protection are also studied. The challenges facing these technologies are discussed at the end.