Volume 30, Issue 5, 2019 Table of Contents

Program Generation and Code Completion Techniques Based on Deep Learning: Literature Review

2019, 30(5):1206-1223. DOI: 10.13328/j.cnki.jos.005717

Abstract: Automatic software development has long been a research hotspot in software engineering. Internet technology has promoted the growth of open source software and open source communities, and the resulting large-scale code and data create opportunities for automatic software development. At the same time, deep learning is beginning to be applied to various software engineering tasks. How to use deep learning to learn from large-scale code and achieve automatic programming by machines is a shared goal of the artificial intelligence and software engineering communities. A machine that writes programs automatically can assist, or to some extent even replace, programmers, greatly reducing their development burden and improving the efficiency and quality of software development. At present, automatic programming based on deep learning is pursued mainly in two directions: program generation and code completion. This study reviews both directions and the deep learning models they use.

Hybrid Programming System of Differentiable Abstract Machines

ZHOU Peng , WU Yan-Jun , ZHAO Chen

2019, 30(5):1224-1242. DOI: 10.13328/j.cnki.jos.005719

Abstract: Automated programming is one of the central challenges of intelligent software. Learning programs from execution traces or input-output pairs is a typical approach to automatic programming, but such methods cannot bridge the gap between ordinary program elements and neural network components, cannot take programming experience as input, and lack a programming control interface. This paper presents a hybrid programming model that seamlessly combines a high-level programming language with neural network components. A program is composed of a mixture of high-level language elements and neural network components: the language describes a sketch that supplies experience information, while the key complex parts are left as undetermined, learnable neural network components. The program runs on differentiable abstract machines to produce a continuous, differentiable computational graph representation. Input-output pairs are then used to train the graph with differentiable optimization so that the complete program is generated automatically. This programming model provides an automatic program generation method that combines programmer experience with neural network self-learning, bridges the gap between programming language elements and neural networks, and integrates the advantages of procedural and neural network programming. Complex details are generated automatically by neural networks, reducing the difficulty and workload of programming. The experience input acts as a heuristic for learning the undetermined parts and provides an interface for reusing valuable programming experience accumulated over a long period of time.

Automatic Defect Repair and Validation Approach for C/C++ Programs

ZHOU Feng-Shun , WANG Lin-Zhang , LI Xuan-Dong

2019, 30(5):1243-1255. DOI: 10.13328/j.cnki.jos.005729

Abstract: Program defects are inevitable in computer software and are highly likely to cause significant losses. Academia and industry therefore agree that potential defects should be found and eliminated as early as possible. Most current automatic program repair methods follow a process of defect location, candidate generation, and candidate verification. However, their repair rate is low and the repair result cannot be guaranteed. This study proposes a method for automatically repairing defects in C/C++ programs based on program synthesis. Firstly, error patterns and their corresponding repair methods are summarized from a set of programs that satisfy the same specification and are expressed as rewrite rules. On this basis, a defect-location method based on rewrite rules and program spectrum is implemented to obtain possible defect locations in the program. Secondly, repair candidates are generated from the rewrite rules. At the same time, deep learning is used to learn the correct structure of programs, which helps predict the correct statement structure at the error point of the faulty program. Together, these two techniques improve candidate quality and the repair rate. Finally, program synthesis is used in the candidate-verification process, with the sample program serving as a constraint to ensure the correctness of the synthesized code. Based on these methods, the prototype tool AutoGrader is implemented and evaluated on student programs. The experimental results show that the proposed method achieves a high repair rate for defects in student programs and ensures the correctness of the repaired code.
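
To illustrate the program-spectrum side of defect location, the following minimal Python sketch ranks statements by a suspiciousness score computed from test coverage. The Ochiai formula, the function name `ochiai_scores`, and the data layout are assumptions chosen for illustration; the abstract does not state which spectrum formula the tool uses.

```python
import math

def ochiai_scores(coverage, passed):
    """Score each statement by Ochiai suspiciousness.

    coverage: list of sets, the statement ids executed by each test.
    passed:   list of bools, True when the corresponding test passed.
    (Both structures are hypothetical, for illustration only.)
    """
    total_failed = sum(1 for ok in passed if not ok)
    statements = set().union(*coverage)
    scores = {}
    for stmt in statements:
        # ef / ep: failing / passing tests that execute this statement
        ef = sum(1 for cov, ok in zip(coverage, passed) if stmt in cov and not ok)
        ep = sum(1 for cov, ok in zip(coverage, passed) if stmt in cov and ok)
        denom = math.sqrt(total_failed * (ef + ep))
        scores[stmt] = ef / denom if denom else 0.0
    return scores

# Three tests over statements 1..3; the last two tests fail.
scores = ochiai_scores([{1, 2}, {1, 3}, {1, 2, 3}], [True, False, False])
most_suspicious = max(scores, key=scores.get)
```

In this toy spectrum, statement 3 is executed only by failing tests, so it receives the highest score and would be examined first.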

Example-evolution-driven Automatic Repair of Student Programs

WANG Tian-Tian , XU Jia-Huan , WANG Ke-Chao , SU Xiao-Hong

2019, 30(5):1256-1268. DOI: 10.13328/j.cnki.jos.005716

Abstract: Most existing program repair research targets industrial software. Debugging student programs poses many unique problems, such as multiple bugs and complex bug types. Therefore, an automatic repair method is studied for the student programming setting, in which template programs guide the evolution of patches. The genetic programming algorithm is improved in fitness calculation, mutant generation, and the selection of mutation positions and operators to make it more suitable for repairing student programs. A static fault location method based on sample programs is proposed, which identifies the differences between the defective program and the sample program and recognizes the possible mutation operators. It effectively reduces the patch search space and improves the accuracy of program repair. A variable mapping method based on execution value sequences is proposed to reduce compilation errors in mutants and improve the accuracy of program repair. On this basis, an example-evolution-driven system for repairing students' Java programs is designed and implemented. The experimental results show that the method can repair student programs with multiple bugs. On the test set, the repair rate is nearly 100% when a student program has only 1~2 bugs, about 70% when it has 3 bugs, and about 50% when it has 4 or more bugs.

Evolutionary Algorithm for Optimization of Energy Consumption at GCC Compile Time Based on Frequent Pattern Mining

NI You-Cong , WU Rui , DU Xin , YE Peng , LI Wang-Biao , XIAO Ru-Liang

2019, 30(5):1269-1287. DOI: 10.13328/j.cnki.jos.005734

Abstract: Evolutionary algorithms have been used to reduce the energy consumption of the executable code of embedded software by searching for optimal GCC compilation options. However, such algorithms do not consider possible interactions between multiple compilation options, so their solution quality is low and their convergence is slow. To solve this problem, this study designs an evolutionary algorithm based on frequent pattern mining, called GA-FP. During evolution, GA-FP uses frequent pattern mining to obtain sets of compilation options that occur with high frequency and contribute to significant improvements in energy consumption. The derived options are used as heuristic information, and two mutation operators, ADD and DELETE, are designed to increase solution quality and accelerate convergence. Comparative experiments between Tree-EDA and GA-FP are conducted on 8 typical cases in 5 different fields. The experimental results indicate that GA-FP not only reduces software energy consumption more effectively (the average and maximal reduction ratios are 2.5% and 21.1%, respectively), but also converges faster (34.5% faster on average and up to 83.3% faster) when the energy optimization effect obtained by GA-FP is no less than that of Tree-EDA. A correlation analysis of the compilation options in the optimal solutions further validates the effectiveness of the designed mutation operators.
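
One way to picture pattern-guided mutation is the Python sketch below. It assumes a candidate solution is the set of enabled GCC flags and that each mined frequent pattern is a set of flags; the semantics given to ADD and DELETE here (switching a whole pattern on or off) and the function names are illustrative guesses, not the operators defined in the paper.

```python
import random

def mutate_add(solution, patterns, rng):
    """ADD (assumed semantics): enable every option of one randomly chosen pattern."""
    return frozenset(solution) | rng.choice(patterns)

def mutate_delete(solution, patterns, rng):
    """DELETE (assumed semantics): disable every option of one randomly chosen pattern."""
    return frozenset(solution) - rng.choice(patterns)

rng = random.Random(0)  # seeded so the example is repeatable
# Hypothetical mined pattern: two options that tend to appear together.
patterns = [frozenset({"-funroll-loops", "-ftree-vectorize"})]
solution = frozenset({"-finline-functions", "-funroll-loops"})

added = mutate_add(solution, patterns, rng)      # pattern options switched on
deleted = mutate_delete(solution, patterns, rng) # pattern options switched off
```

Treating a pattern as a unit is what lets the mutation respect interactions between options, which single-flag bit flips would ignore.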

Just-in-time Software Defect Prediction: Literature Review

CAI Liang , FAN Yuan-Rui , YAN Meng , XIA Xin

2019, 30(5):1288-1307. DOI: 10.13328/j.cnki.jos.005713

Abstract: Software defect prediction has always been one of the most active research areas in software engineering, and researchers have proposed many defect prediction techniques. According to their granularity, these techniques fall into module-level, file-level, and change-level defect prediction. Change-level defect prediction can predict the defect-proneness of changes when they are initially submitted; hence, it is referred to as just-in-time defect prediction. Recently, just-in-time defect prediction has become a hot topic in defect prediction because of its timeliness and fine granularity. Much has been achieved in this area, but many challenges remain in data labeling, feature extraction, and model evaluation. More advanced and unified theoretical and technical guidelines are needed to advance just-in-time defect prediction. Therefore, this study presents a literature review of prior just-in-time defect prediction studies from three perspectives: data labeling, feature extraction, and model evaluation. In summary, the contributions of this study are: (1) the data labeling methods and their advantages and disadvantages are summarized; (2) the feature categories and computing methods are summarized and classified; (3) the modeling techniques are summarized and classified; (4) the model validation methods and performance measures used in model evaluation are summarized; (5) the current problems in this area are highlighted; and (6) the trends of just-in-time defect prediction are summarized.

Cross-project Defect Prediction Method Based on Feature Transfer and Instance Transfer

NI Chao , CHEN Xiang , LIU Wang-Shu , GU Qing , HUANG Qi-Guo , LI Na

2019, 30(5):1308-1329. DOI: 10.13328/j.cnki.jos.005712

Abstract: In real software development, a project that needs defect prediction may be new or may have little training data. A simple solution is to build the model with training data from other projects (i.e., source projects) and use the trained model to make predictions for the current project (i.e., the target project). However, the datasets of different projects may differ greatly in distribution. To solve this problem, a novel two-phase cross-project defect prediction method, FeCTrA, is proposed, which considers both feature transfer and instance transfer. In the feature transfer phase, FeCTrA uses cluster analysis to select features that have high distribution similarity between the source project and the target project. In the instance transfer phase, FeCTrA uses TrAdaBoost, which selects relevant instances from the source project given some labeled instances in the target project. To verify the effectiveness of FeCTrA, the Relink and AEEEM datasets are chosen as the experimental subjects and F1 as the performance measure. Firstly, FeCTrA is found to outperform single-phase methods that consider only feature transfer or only instance transfer. Then, compared with state-of-the-art baseline methods (i.e., TCA+, Peters filter, Burak filter, and DCPDP), FeCTrA improves performance by 23%, 7.2%, 9.8%, and 38.2%, respectively, on the Relink dataset and by 96.5%, 108.5%, 103.6%, and 107.9%, respectively, on the AEEEM dataset. Finally, the influence of the factors in FeCTrA is analyzed and a guideline for using the method effectively is provided.
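
For readers unfamiliar with the performance measure, the short Python sketch below computes F1 from confusion-matrix counts and a relative improvement over a baseline. The function names and the sample numbers are illustrative assumptions. Because F1 cannot exceed 1, improvements above 100% such as those reported for AEEEM can only be relative gains over a baseline score.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0

# Made-up counts: 8 true positives, 2 false positives, 2 false negatives.
example_f1 = f1_score(tp=8, fp=2, fn=2)
# Made-up scores: a method at 0.6 versus a baseline at 0.3.
example_gain = relative_improvement(0.6, 0.3)
```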

Memory Leak Intelligent Detection Method for C Programs

ZHU Ya-Wei , ZUO Zhi-Qiang , WANG Lin-Zhang , LI Xuan-Dong

2019, 30(5):1330-1341. DOI: 10.13328/j.cnki.jos.005715

Abstract: Memory leaks are a common bug in C programs, which use explicit memory management. At present, the main methods for detecting memory leaks are static analysis and dynamic detection. Dynamic detection incurs huge overhead and depends heavily on test cases. Static analysis is widely used in academia and industry, but it produces a large number of false positives that must be confirmed manually. These false positives stem from inaccurate analysis of pointers, branch statements, and global variables. In this study, an intelligent detection method for memory leaks is proposed. Machine learning algorithms are used to learn the correlation between program features and memory leaks, and the resulting classifier is applied to improve the accuracy of static memory leak analysis. Firstly, a machine learning classifier is trained. Then, static analysis is used to construct the sparse value flow graph (SVFG) starting from memory allocations, and features related to memory leaks are extracted from the SVFG. Lastly, memory leaks are detected using rules and the machine learning classifier. Experimental results show that the proposed method is effective in analyzing pointers, branch statements, and global variables, and can reduce the false positives of memory leak detection. At the end of this paper, the feasibility of future research and the upcoming challenges are presented.

API Misuse Bug Detection Based on Deep Learning

WANG Xin , CHEN Chi , ZHAO Yi-Fan , PENG Xin , ZHAO Wen-Yun

2019, 30(5):1342-1358. DOI: 10.13328/j.cnki.jos.005722

Abstract: Developers often need to use various application programming interfaces (APIs) to reuse existing software frameworks, class libraries, and so on. Because of the complexity of the APIs themselves or a lack of documentation, developers often misuse APIs, which can lead to code defects. To detect API misuse defects automatically, an API usage specification is required, and API usage is checked against the specification. However, API specifications suitable for automatic detection are difficult to obtain, and the cost of writing and maintaining them manually is high. To address this issue, this study applies a recurrent neural network model to the tasks of learning API usage specifications and detecting API misuse defects. Training samples of API usage specifications are extracted by static analysis from a large amount of open source Java code and are then used to train a recurrent neural network to learn the API usage specifications. On this basis, the study makes context-based predictions on API usage code and finds potential API misuse defects by comparing the predictions with the actual code. The method is implemented and evaluated in experiments on Java encryption-related APIs and the code that uses them. The results show that the proposed approach can, to a certain extent, detect API misuse defects automatically.

God Class Detection Approach Based on Deep Learning

BU Yi-Fan , LIU Hui , LI Guang-Jie

2019, 30(5):1359-1374. DOI: 10.13328/j.cnki.jos.005724

Abstract: A god class is a class that has assumed more than one responsibility, which violates the single responsibility principle and consequently harms the maintainability and understandability of a software system. Because god classes are common, research on them, including detection and refactoring, has always attracted attention. This study proposes a neural network based approach to detecting the god class code smell. The approach not only makes use of common software metrics but also exploits the textual information in source code, aiming to reveal the main roles a class plays by mining text semantics. In addition, to supply the massive labeled data required for supervised deep learning, an approach is proposed to construct labeled data from open source code. Finally, the proposed approach is evaluated on an open source data set. The evaluation results show that the proposed approach outperforms the current method; in particular, recall is improved by 35.58%.

Fault Cause Identification Method for Aircraft Equipment Based on Maintenance Log

WANG Rui-Guang , WU Ji , LIU Chao , YANG Hai-Yan

2019, 30(5):1375-1385. DOI: 10.13328/j.cnki.jos.005730

Abstract: During aircraft maintenance, aviation maintenance companies accumulate a large amount of empirical maintenance log data. Used properly, such logs allow machine learning methods to help maintenance staff make correct fault diagnosis decisions. Firstly, in view of the particular characteristics of maintenance logs, an iterative fault diagnosis process is proposed. Secondly, building on traditional text feature extraction techniques, a text feature extraction method based on a convolutional neural network (CNN) that incorporates domain information is proposed for small sample sizes. The method uses the target vector to train word vectors so as to obtain more adequate text features. Finally, a random forest (RF) model is used in combination with other fault characteristics to determine the cause of aircraft equipment failure. The convolutional neural network takes the cause of the failure as its target and pre-trains the word vectors of the fault phenomenon to obtain text features that better reflect the domain. Compared with other text feature extraction methods, the method obtains better results when the sample size is small. In addition, the convolutional neural network and random forest model are applied to identifying the causes of aircraft equipment failure and compared with other text feature extraction methods and machine learning prediction models, which demonstrates the rationality and necessity of the proposed text feature extraction and fault cause identification methods.

Approach of Bug Reports Classification Based on Cost Extreme Learning Machine

ZHANG Tian-Lun , CHEN Rong , YANG Xi , ZHU Hong-Yu

2019, 30(5):1386-1406. DOI: 10.13328/j.cnki.jos.005725

Abstract: Bugs are unavoidable in the development of any software system, and bug reports are a powerful tool that helps developers fix them. However, manually classifying bug reports tends to be time-consuming and uneconomical. It is therefore important to advance automated classification approaches that provide clear guidance on how to assign a reasonable severity to a reported bug. In this study, several algorithms based on the extreme learning machine are proposed to classify bug reports automatically. Concretely, this study focuses on three problems in bug report classification: the imbalanced class distribution in bug report datasets, the insufficient number of labeled samples, and the limited training data available. To solve these problems, three methods are proposed based on cost-sensitive supervised classification, semi-supervised learning, and sample transfer, respectively. Extensive experiments on real bug report datasets are conducted, and the results demonstrate the practicability and effectiveness of the proposed methods.

Hybrid Approach for Linking Related Issues Based on Embedding Models

ZHANG Yang , WANG Tao , WU Yi-Wen , YIN Gang , WANG Huai-Min

2019, 30(5):1407-1421. DOI: 10.13328/j.cnki.jos.005732

Abstract: Social coding facilitates the sharing of knowledge in open source communities. In particular, issue reports, an important form of knowledge in software development, usually contain relevant information and can therefore be linked manually to other related issues. Within a project, identifying and linking issues to potentially related issues gives developers more targeted resources and information when they resolve issues, thus improving issue resolution efficiency. However, the current manual linking approach is generally time-consuming and depends mainly on the experience and knowledge of individual developers. Therefore, investigating how to link related issues in a timely manner is a meaningful task that can improve the development efficiency of open source projects. In this study, the problem of linking related issues is formulated as a recommendation problem, and a hybrid approach is proposed that combines a traditional information retrieval technique, i.e., TF-IDF, with deep learning embedding models, i.e., word embedding and document embedding. The evaluation results show that the proposed approach improves on the performance of traditional approaches and has strong scalability in application.
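
The idea of combining TF-IDF with embeddings for recommendation can be sketched in plain Python as follows. This is only a minimal illustration: the weighted-sum combination, the `alpha` parameter, the function names, and the tiny hand-made embedding vectors are all assumptions, and the abstract does not describe how the paper actually merges the two signals.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # term frequency times a smoothed inverse document frequency
            term: (count / len(tokens))
                  * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors

def cosine_sparse(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def cosine_dense(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(query, texts, embeddings, alpha=0.5, top_k=5):
    """Rank the other issues by a weighted sum of TF-IDF and embedding similarity."""
    tfidf = tfidf_vectors(texts)
    scored = []
    for i in range(len(texts)):
        if i == query:
            continue
        score = (alpha * cosine_sparse(tfidf[query], tfidf[i])
                 + (1 - alpha) * cosine_dense(embeddings[query], embeddings[i]))
        scored.append((score, i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]

issues = ["crash when parsing json config file",
          "json config parsing crash on startup",
          "add dark theme to settings page"]
# Placeholder document embeddings; a real system would take these from a trained model.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
related = recommend(0, issues, vectors)
```

Here issue 1 ranks ahead of issue 2 for query issue 0, since it shares both vocabulary and a nearby embedding with the query.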

Empirical Study of Code Smell Impact on Software Evolution

ZHANG Xiao-Fang , ZHU Can

2019, 30(5):1422-1437. DOI: 10.13328/j.cnki.jos.005735

Abstract: Code smells are poor design patterns or design defects that are considered to have negative impacts on software evolution and maintenance. Many researchers have devoted themselves to studying these effects and correlations in recent years. Previous research indicates that code smells may vary as software evolves. Software evolution normally involves the addition, modification, and deletion of source files. Understanding the correlations between code smells and software evolution therefore helps developers schedule the development process and refactor code. In this study, an extensive empirical study of 13 kinds of code smells is conducted on 104 released versions of 8 popular Java projects. It is found that, as software evolves, the proportion of files that contain code smells shows different characteristics in different projects. Additionally, files containing smells are prone to be modified, while smells are not strongly correlated with the addition or deletion of files. Furthermore, among all the smells studied, certain ones have a significant impact on file changes, and the files affected by these specific smells overlap noticeably. These findings help developers gain an in-depth understanding of code smells, which will lead to better software maintenance.

Reward of Reinforcement Learning of Test Optimization for Continuous Integration

HE Liu-Liu , YANG Yang , LI Zheng , ZHAO Rui-Lian

2019, 30(5):1438-1449. DOI: 10.13328/j.cnki.jos.005714

Abstract: Testing in a continuous integration environment is characterized by constantly changing test sets, limited test time, fast feedback, and so on, and traditional test optimization methods are not suitable for it. Reinforcement learning is an important branch of machine learning whose essence is solving sequential decision problems, so it can be used for test optimization in continuous integration. However, in existing reinforcement learning based methods, the reward function uses only the execution information of a test case in the current integration cycle. This study addresses the problem from two aspects: reward function design and reward strategy. In the design of the reward function, the complete historical execution information of a test case replaces its current execution information, and the total number of historical failures and the historical failure distribution of the test case are also considered. In terms of reward strategy, two strategies are proposed: an overall reward for all test cases in the current execution sequence and a partial reward only for failed test cases. Experiments are conducted on three industrial-level programs. The results show that: (1) compared with existing methods, the proposed reinforcement learning methods based on reward functions with complete historical information greatly improve the error detection ability of test sequences in continuous integration; (2) the historical failure distribution of test cases helps identify potentially failing test cases, which is more important for the design of the reward function in reinforcement learning; (3) the two reward strategies, i.e., overall reward and partial reward, are influenced by various factors of the system under test, so the reward strategy needs to be selected according to the actual situation; and (4) history-based reward functions take longer to compute, though test efficiency is not affected.

Generation Method for Test Oracle Based on Sensitive Variables and Linear Perceptron

MA Chun-Yan , LI Shang-Ru , WANG Hui-Chao , ZHANG Lei , ZHANG Tao

2019, 30(5):1450-1463. DOI: 10.13328/j.cnki.jos.005720

Abstract: Test oracle generation is one of the hotspots in software testing. Existing test oracle generation techniques commonly assume that no historical test case sets are available. However, this assumption does not always hold, and where it does not, an opportunity may be missed: the pre-existing test cases could be used to assist automated oracle generation for new test cases. For the case where a test case set already exists, a method based on sensitive variables and linear perceptrons is proposed to generate test oracles for new test cases automatically. Firstly, the statement coverage and memory value sets of some known test cases are collected during execution, and a set of test cases whose execution coverage is highly similar to that of the new test case is computed. Secondly, an algorithm is given for solving the set of memory sensitive variables of the program at a given breakpoint. Thirdly, the known test case set is used as the training set, and a perceptron is used to solve the threshold value at each breakpoint. On this basis, an automatic oracle generation method for new test cases is proposed. Finally, 129 faulty versions of seven programs are used as experimental subjects to generate test oracles for 14 300 new test cases. The empirical evaluation shows that the generated test oracles achieve 96.2% accuracy on average. The results can produce a "snowball effect" in test case set construction, iteratively and automatically generating test oracles for new test cases.
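
As background for the perceptron step, the sketch below shows the classic perceptron learning rule in Python. Treating each sample as the values of sensitive variables at a breakpoint, and each label as pass (+1) or fail (-1), is an assumption made for illustration; the feature encoding, function names, and data are hypothetical and not taken from the paper.

```python
def train_perceptron(samples, labels, epochs=100, lr=1.0):
    """Classic perceptron rule; labels must be +1 or -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified (or on the boundary)
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
                mistakes += 1
        if mistakes == 0:  # every training sample is classified correctly
            break
    return weights, bias

def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1

# Hypothetical variable values observed in known passing (+1) and failing (-1) runs.
samples = [[1.0, 2.0], [2.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
labels = [1, 1, -1, -1]
weights, bias = train_perceptron(samples, labels)
verdicts = [predict(weights, bias, x) for x in samples]
```

The learned bias plays the role of a decision threshold: a new test case whose weighted variable values fall on the negative side would be judged a likely failure.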

Second-order Mutant Reduction Based on SOM Neural Network

SONG Li , LIU Jing

2019, 30(5):1464-1480. DOI: 10.13328/j.cnki.jos.005723

Abstract: Second-order mutation testing simulates actual complex defects by manually injecting two defects into the original program, which is of great significance in mutation testing. However, the number of second-order mutants formed by combining first-order mutants increases greatly, which brings large execution costs. To reduce the number of second-order mutants and the running time, this study proposes a second-order mutant reduction method based on the SOM neural network. The proposed method first uses a more comprehensive combination strategy to generate feasible second-order mutants on top of traditional first-order mutant generation, then constructs an accurate SOM neural network according to the similarity of intermediate values in the execution of second-order mutants, and finally clusters the mutants with this model to achieve second-order mutant reduction and subtle mutant detection. Benchmark programs and open source projects are used to verify the method. Experimental results show that, on the one hand, although the number of mutants is very large, it is reduced significantly by the SOM neural network, while the second-order mutation score stays at the same level as before clustering. Because the number of second-order mutants executed is significantly reduced, the time cost of mutation testing is much lower than that of executing all mutants. On the other hand, subtle second-order mutants that facilitate the addition of test components are found.

Approach to Searching Software Source Code with Graph Embedding

LING Chun-Yang , ZOU Yan-Zhen , LIN Ze-Qi , XIE Bing , ZHAO Jun-Feng

2019, 30(5):1481-1497. DOI: 10.13328/j.cnki.jos.005721

Abstract: Searching software source code and locating a software's APIs (application programming interfaces) are important research issues in software engineering. As software projects become more and more complex, existing search tools face two main challenges. First, more accurate search results are required for searches based on natural language questions. Second, the relationships between APIs need to be illustrated so that their underlying logic and usage scenarios can be understood more quickly. In this study, a novel approach based on graph embedding is proposed for searching a software project's APIs. It aims to improve the accuracy of natural language based code graph search. A software project's code graph is built automatically from its source code and represented through graph embedding. For a natural language question, a connected code subgraph composed of the relevant APIs and their associated relationships is returned as the best answer. In the experiments, the Apache Lucene and POI projects are selected as examples for a set of API search tasks. Experimental results show that the proposed approach improves the F1-score by 10% over the existing shortest path based approach while significantly reducing the average response time.

Retrieval and Management Technology for Industrial-scale Massive Code

LIU Zhi-Wei , XING Yong-Xu , YU Hao , LI Tao , ZHANG Xiao-Dong

2019, 30(5):1498-1509. DOI: 10.13328/j.cnki.jos.005718

Abstract: In large IT companies such as Google or Baidu, code search is an indispensable and frequent activity in the software development process; it speeds up development by letting developers learn from or reuse existing code. Over the years, many researchers have focused on code search and designed many excellent tools. However, existing research and tools mainly target small-scale or single-language code data sets rather than the actual needs of industry, and the user's query input is also limited; a complete solution for industrial-scale massive code retrieval and management is still lacking. This study proposes a code search engine solution and system implementation for industrial-scale massive data. Oriented to the most direct needs of users in the development process, it completes the index construction and retrieval of a massive code base through offline analysis and online analysis. Offline analysis is responsible for acquiring and analyzing code-related data and building an index cluster. The online process is responsible for transforming the user's query, ranking the search results, and generating summaries. The system is deployed on the Baidu code base, with the index built for dozens of TB-level Git code bases, and the average retrieval time is within 1 s. Since its launch at Baidu, the number of visits has gradually increased, reaching thousands of users and tens of thousands of searches per week. The system is widely praised by Baidu engineers.

Iterative-based Relational Model to Ontology Schema Matching Approach

WANG Feng , WANG Ya-Sha , ZHAO Jun-Feng , CUI Da

2019, 30(5):1510-1521. DOI: 10.13328/j.cnki.jos.005726

Abstract (2595) HTML (3218) PDF 1.43 M (5421) Comment (0) Favorites

Abstract:With the rapid development of the semantic Web, knowledge models in the various fields of the smart city have come to be expressed in the form of ontologies. However, practical semantic Web applications often face a lack of ontology instances. An extremely effective solution is to transform the data in existing relational data sources into ontology instances, which requires relational model to ontology model matching technology to establish the mapping between the data source and the ontology. In addition, schema matching to the ontology model is widely used in data integration, data semantic annotation, ontology-based data access, and other fields. Existing related work tends to combine a variety of schema matching algorithms to calculate the similarity of element pairs in heterogeneous data schemas. However, when multiple matching algorithms fail at the same time, it is difficult to obtain an accurate final matching result. In this study, the weaknesses of single schema matching algorithms are analyzed in depth, the localization features of the data source are identified as an important factor leading to this phenomenon, and an iterative optimization schema matching scheme is proposed. The scheme uses the element pairs already matched during the matching process to optimize each single schema matching algorithm. The optimized algorithm is more compatible with the localization features of the data source, achieves much higher accuracy, and obtains more matching elements. The process continues to iterate until matching ends. Experiments are carried out on a practical case in the field of "food information management", and they show that the proposed approach significantly outperforms the state-of-the-art method, increasing F-measure by up to 50.1%.

Data Quality Problems in Software Development Activity Data

TU Fei-Fei , ZHOU Ming-Hui

2019, 30(5):1522-1531. DOI: 10.13328/j.cnki.jos.005727

Abstract (3013) HTML (3992) PDF 1.19 M (6092) Comment (0) Favorites

Abstract:Software development tools, such as issue tracking systems (ITS) and version control systems (VCS), are widely used in the intelligent development of open source and commercial software. When used to assist software development, these tools produce a substantial amount of data, which is called software development activity data. Data quality has attracted more and more attention as software activity data sources grow richer and are used more widely. Indeed, data is the basis of intelligent development, and data quality influences both research and practice. To alert data users to latent data quality problems in software development activity data, three aspects that may exhibit data quality problems are identified through a literature review and interviews with data users. The data quality problems arise from three phases, i.e., data production, data collection, and data use. Next, to improve the quality of software development activity data, several recommendations are proposed, covering both finding and solving data quality problems. First of all, researchers should have a clear understanding of the context of the data. Next, they may use statistical analysis and data visualization to find latent data quality problems. Finally, they can try to correct particular problems using redundant data or improve data quality through user behavior analysis.

End User Data Query Construction Approach Based on Ontology Reasoning

TANG Shuang , WANG Ya-Sha , ZHAO Jun-Feng , WANG Jiang-Tao , XIA Ding

2019, 30(5):1532-1546. DOI: 10.13328/j.cnki.jos.005728

Abstract (2789) HTML (3617) PDF 1.73 M (5365) Comment (0) Favorites

Abstract:Intelligent decision-making based on data analysis is of great significance for enhancing the competitiveness of enterprises. Querying and obtaining complete data closely related to the problem from internal information system databases is a key step in enterprise data analysis. The ontology-based visual query system (VQS) provides end users with an effective way to access data. In recent years, simple mapping rules have been used to map database tables, fields, foreign key relations, and other elements directly to the concepts, attributes, and relationships in the ontology. This exposes too many technical details of the database design to end users, increasing their burden when using the VQS. Masking database details by manually writing mapping rules is both inefficient and not universal. To this end, this study proposes a reasoning-based end user ontology query construction method. The method uses the semantic expression ability and reasoning ability of the ontology model to inject domain knowledge into the original ontology model that is directly derived from the database. It optimizes the query construction process and enables end users to query and manipulate data from the domain experts' perspective instead of a database design perspective, which improves system usability. It also adds support for group statistics, which extends the application scope of the method. Finally, the method is evaluated by analyzing real cases in the field of "Restaurant Information Management", and the experimental results demonstrate that the proposed approach outperforms other baseline methods, improving usability by 53.44% and expression ability by 20.43%.

Review Analysis Method Based on Support Vector Machine and Latent Dirichlet Allocation

CHEN Qi , ZHANG Li , JIANG Jing , HUANG Xin-Yue

2019, 30(5):1547-1560. DOI: 10.13328/j.cnki.jos.005731

Abstract (2983) HTML (3690) PDF 1.42 M (6157) Comment (0) Favorites

Abstract:For mobile apps (applications), user reviews have become an important feedback resource. Users may raise issues when they use apps, such as system compatibility issues, application crashes, and so on. With the development of mobile apps, users provide a large number of unstructured feedback comments. In order to extract effective information from user complaint comments, a review analysis method based on support vector machine (SVM) and latent Dirichlet allocation (LDA), called RASL, is proposed, which can help developers understand user feedback better and faster. First, features are extracted from users' neutral and negative reviews, and then the SVM is used to assign multiple labels to the comments. Next, the LDA topic model is used to perform topic extraction and representative sentence extraction on the comments under each question type. A total of 5141 original reviews are crawled from two mobile apps. Then the proposed method (RASL) and ASUM are used to process these comments to get new texts. In comparison with the classical approach ASUM, RASL has lower perplexity, better understandability, more complete original review information, and less redundant information.

User Recommendation Algorithm Based on Multi-developer Community

SHI Yu-Cen , YIN Ying , ZHAO Yu-Hai , ZHANG Bin , WANG Guo-Ren

2019, 30(5):1561-1574. DOI: 10.13328/j.cnki.jos.005733

Abstract (2759) HTML (3125) PDF 1.91 M (5912) Comment (0) Favorites

Abstract:Internet technology is developing rapidly. Question-answering based experience sharing in developer communities has become one of the important means by which many developers solve problems encountered in software development and maintenance. How to promptly and accurately recommend a responder to a questioner in the developer community is an important issue with practical needs. Through the collection and analysis of data from two representative mainstream developer communities, Stack Overflow and GitHub, three phenomena are observed that affect the timeliness and accuracy of such recommendation: (1) User label customization. In the developer community, a user's tag information is subjectively defined by the user, rather than objectively calibrated by the system according to the user's historical behavior; (2) Asymmetric activity. A user may be active in one or several developer communities but less active or even inactive in others; (3) Keyword set closure. Question answerers in the developer community are recommended based only on the keywords in the question text, without considering other semantically related keywords. In view of these problems, the user information of the developer communities is integrated, the interactions between users are analyzed, a cross-community developer network is established, and an algorithm based on random walk with restart is proposed to update user tags. Further, a taxonomy is used to expand the range of query keywords for a question. On this basis, recommendation over the user matrix becomes more accurate, and the range of effective users at recommendation time is increased. Finally, experimental results show good F-measure and NDCG, indicating that the approach can effectively improve the efficiency and accuracy of question recommendation.
