WANG Xin , XU Qiang , CHAI Le-Le , YANG Ya-Jun , CHAI Yun-Peng
2019, 30(3):498-514. DOI: 10.13328/j.cnki.jos.005696
Abstract:Knowledge graphs are the main representation form of intelligent data. With the development of knowledge graphs, more and more intelligent data has been released in the form of the resource description framework (RDF). It is known that the semantics of SPARQL correspond to graph homomorphism, which is an NP-complete problem. Therefore, how to efficiently answer SPARQL queries in parallel over big RDF graphs has been widely recognized as a challenging problem. Some research works use the MapReduce computational model to process big RDF graphs. However, these works decompose SPARQL queries into sets of query clauses without considering the semantics and graph structure embedded in the RDF graph, which leads to an excessive number of MapReduce iterations. This study first decomposes the SPARQL query graph into a set of stars, using the semantic and structural information embedded in RDF graphs as heuristics, so that the query can be matched in fewer MapReduce iterations. Meanwhile, a matching order of these stars is given to reduce the intermediate results of the MapReduce iterations. During the matching phase, each round of MapReduce adds one star with a join operation. Extensive experiments are carried out on both the synthetic dataset WatDiv and the real-world dataset DBpedia. The experimental results demonstrate that the proposed star decomposition-based method can answer SPARQL BGP queries efficiently, outperforming SHARD and S2X by one order of magnitude. Finally, extensive experiments show that the optimization algorithms improve on the basic algorithm by 49.63% and 78.71% on the synthetic and real datasets, respectively.
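The star decomposition described in the abstract above can be sketched by grouping a BGP's triple patterns by their subject, so that each group forms a star matchable in one MapReduce iteration. This is an illustrative sketch, not the paper's implementation; the function name and the size-based ordering heuristic are assumptions.

```python
from collections import defaultdict

def decompose_into_stars(bgp):
    """Group the triple patterns of a SPARQL BGP by subject.

    Each group is a 'star': one center (the shared subject) plus all
    triple patterns radiating from it, so each star can be matched in
    a single MapReduce iteration.
    """
    stars = defaultdict(list)
    for s, p, o in bgp:
        stars[s].append((s, p, o))
    # Order stars by descending size as a simple selectivity heuristic:
    # larger stars tend to be more selective and shrink intermediates.
    return sorted(stars.values(), key=len, reverse=True)

# Two stars: one centered on ?x (two patterns), one on ?y (one pattern).
bgp = [("?x", "foaf:knows", "?y"),
       ("?x", "foaf:name", "?n"),
       ("?y", "rdf:type", "foaf:Person")]
stars = decompose_into_stars(bgp)
```

Each star in the resulting order would then be joined in, one per MapReduce round, as the abstract describes.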
LI Zhong-Fei , YANG Ya-Jun , WANG Xin
2019, 30(3):515-536. DOI: 10.13328/j.cnki.jos.005692
Abstract:The shortest path query is very important in applications of graph data. This work studies the rule-based shortest path query, a special kind of shortest path problem. Given a starting vertex and an ending vertex, the rule-based shortest path query finds a shortest path from the starting vertex to the ending vertex that passes through all vertices in a user-specified set, where the visiting order of some vertices must satisfy certain partial-order rules. This problem has been proved to be NP-hard. Existing work focuses on the rule-based shortest path problem on spatial datasets, where the shortest distance between two vertices is expressed by their Euclidean distance; it exhaustively enumerates all paths that satisfy the rules and selects the one with the smallest length as the solution. However, in an actual road network, the distance between two vertices is the length of the shortest path between them, which is often greater than their Euclidean distance. In addition, the exhaustive approach incurs a lot of repetitive computation. Based on these observations, this study designs a forward search algorithm and several optimization techniques to solve the problem. Finally, extensive experiments on different real datasets verify the effectiveness of the algorithms. The experimental results show that the proposed algorithms quickly produce the solution and greatly outperform existing algorithms in efficiency.
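The forward search algorithm itself is not reproduced here, but the exhaustive baseline that the abstract above improves upon can be sketched: compute graph shortest-path distances (not Euclidean ones), then enumerate the visiting orders of the required vertices that respect the partial-order rules. All names below are illustrative.

```python
import heapq
from itertools import permutations

def dijkstra(graph, src):
    """Shortest distances from src; graph is {u: [(v, weight), ...]}."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def rule_shortest_path(graph, start, end, required, rules):
    """Exhaustive baseline: try every visiting order of `required` that
    respects the partial-order `rules` (pairs (a, b) meaning a must be
    visited before b) and keep the shortest route."""
    dist = {v: dijkstra(graph, v) for v in {start, *required}}
    best = (float("inf"), None)
    for order in permutations(required):
        pos = {v: i for i, v in enumerate(order)}
        if any(pos[a] > pos[b] for a, b in rules):
            continue  # order violates a rule
        route = (start, *order, end)
        length = sum(dist[u].get(v, float("inf"))
                     for u, v in zip(route, route[1:]))
        if length < best[0]:
            best = (length, route)
    return best
```

The factorial blow-up of `permutations` is exactly the repetitive computation the paper's forward search avoids.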
FENG Bing-Qing , HU Shao-Lin , GUO Dong , ZHONG Xiao-Ge , LI Pei-Yu
2019, 30(3):537-551. DOI: 10.13328/j.cnki.jos.005684
Abstract:Dynamic information networks are a new and challenging problem in the field of complex networks. Research on network evolution contributes to analyzing network structure, understanding network characteristics, and finding hidden evolution rules, and thus has important theoretical significance and application value. Studying the evolution of network structure is of great importance for a comprehensive understanding of the behavior trends of complex systems. However, network structure is difficult to represent and quantify, and the evolution of dynamic networks is temporal, complex, and changeable, which increases the difficulty of analysis. This study introduces "roles" to quantify the structure of dynamic networks and proposes a role-based model, which provides a new idea for the evolution analysis and prediction of network structure. Two methods to interpret the roles are given for the model. To predict the role distributions of dynamic network nodes at future times, this study transforms the problem of dynamic network structure prediction into role prediction, which can represent the structural features. The model extracts properties from historical network snapshots as training data and predicts the future role distributions of the dynamic network with the vector autoregressive method. This study also proposes a method of dynamic network structure prediction based on latent roles (LR-DNSP). This method overcomes the drawback of existing transfer-matrix-based methods that ignore the time factor, and also takes into account possible dependencies between multiple forecast targets. Experimental results show that LR-DNSP outperforms existing methods in prediction accuracy.
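The vector-autoregressive step can be illustrated with a minimal sketch: fit a first-order VAR over a sequence of role-distribution snapshots by least squares, then project the latest snapshot forward. This is a generic sketch under assumed names; LR-DNSP's actual feature extraction and model are richer.

```python
import numpy as np

def fit_var1(snapshots):
    """Fit a first-order VAR, R_{t+1} = R_t @ A, by least squares over a
    list of (nodes x roles) role-distribution snapshots."""
    X = np.vstack(snapshots[:-1])   # role distributions at time t
    Y = np.vstack(snapshots[1:])    # role distributions at time t+1
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return A

def predict_next(snapshots):
    """Predict the role distribution at the next timestamp."""
    A = fit_var1(snapshots)
    return snapshots[-1] @ A
```

On exactly linear dynamics the fit recovers the transition matrix; on real networks it yields the least-squares approximation used for forecasting.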
2019, 30(3):552-572. DOI: 10.13328/j.cnki.jos.005699
Abstract:Community search aims to find communities containing a given set of nodes and to obtain personalized community information quickly. Since traditional community search algorithms can hardly meet the needs of complex conditions, a new problem called conditional community search is proposed. Solving this problem helps to analyze social networks intelligently and provides users with better community results under complex search conditions. First, based on Boolean expressions, a formal definition of the conditional community search problem is given, which can effectively express requirements such as that a given node must not appear in the community, or that at least one of a set of given nodes must appear in it. Then, a general framework is proposed to solve the conditional community search problem, including simplifying the search condition, conducting multiple single-conditional community searches according to the simplified condition, and combining the results of the single-conditional searches. Meanwhile, a community search plus filtering method and a node-weighting-based method are proposed to carry out the single-conditional community search. Finally, extensive experiments conducted on real-world datasets show the correctness and effectiveness of the proposed methods.
CHENG Yu-Rong , WANG Guo-Ren , LI Bo-Yang , YUAN Ye
2019, 30(3):573-588. DOI: 10.13328/j.cnki.jos.005682
Abstract:In event-based social networks (EBSNs), a typical problem is to plan events of interest for users. Existing work only considers users' preferences for events and plans the events they are most likely to be interested in. However, event holders also hope that the users assigned to their events have high influence and reliability, so that the events can be held successfully and achieve the expected effects. Essentially, the planning problem over EBSNs is a bilateral selection problem, yet existing studies never consider the bilateral preferences between events and users. Thus, this study proposes the bilateral preference stable planning problem to address this bilateral selection. Since this study is the first to propose the problem, no existing algorithms can solve it. Compared with the existing planning problem that only considers user preferences, the bilateral preference stable planning problem is more complex and contains more constraints. Therefore, two baseline algorithms and two improved algorithms are proposed to solve the problem efficiently and effectively. Finally, extensive experiments are conducted to verify the efficiency and effectiveness of the proposed algorithms.
DUAN Xu-Liang , GUO Bing , SHEN Yan , SHEN Yun-Cheng , DONG Xiang-Qian , ZHANG Hong
2019, 30(3):589-603. DOI: 10.13328/j.cnki.jos.005688
Abstract:Data currency is an important factor influencing data quality. The reliability of data currency plays a critical role in data retrieval accuracy and data analysis credibility. Inaccurate data currency and outdated data bring many problems to big data applications and greatly hinder the realization of data value. For data with imprecise time attributes or missing timestamps, exactly repairing the timestamps is often difficult, but it is possible to restore the currency orders according to specific currency rules so as to meet various requirements in data cleaning and applications. Based on an analysis of the application requirements of data currency, this study first introduces the related concepts of data currency and formally defines currency rules on attributes, and then proposes a currency rule discovery algorithm and a currency repair method. The efficiency and repair effect of the algorithms are tested on real datasets, and the factors that affect the accuracy of the algorithms are analyzed. Experimental results show that the proposed methods are efficient and effective.
QI Zhi-Xin , WANG Hong-Zhi , ZHOU Xiong , LI Jian-Zhong , GAO Hong
2019, 30(3):604-619. DOI: 10.13328/j.cnki.jos.005691
Abstract:A cost-sensitive decision tree is a decision tree that minimizes the sum of misclassification costs and test costs. Recently, with the explosive growth of data size, dirty data appears more and more frequently. In the process of cost-sensitive decision tree induction, dirty data in training datasets has negative impacts on the selection of splitting attributes and the division of decision tree nodes. Therefore, dirty data cleaning is necessary before classification tasks. Nevertheless, in practice, many users can only provide an acceptable threshold on data cleaning costs, since the time and expense of data cleaning are high. Therefore, in addition to misclassification cost and test cost, data cleaning cost is also an essential factor in cost-sensitive decision tree induction. However, existing research has not considered data quality in this problem. To fill this gap, this study focuses on cost-sensitive decision tree induction on dirty data and presents three decision tree induction methods integrated with data cleaning algorithms. Experimental results demonstrate the effectiveness of the proposed approaches.
QI Dan-Rui , SONG Shao-Xu , WANG Jian-Min
2019, 30(3):620-647. DOI: 10.13328/j.cnki.jos.005700
Abstract:Database users often fail to obtain the expected answers in query results, since databases are often incomplete with missing data. This is known as the Why-not problem, that is, "why do the expected tuples not appear in the results". Existing methods present explanations of the Why-not problem by enumerating possible values, but the number of explanations produced by enumeration is often too large for users to explore. Integrity constraints, such as functional dependencies, are employed to rule out irrational explanations. Unfortunately, many attributes are simply represented as variables in the reduced explanations, which users may still not understand, and data sparsity also leads to many unreasonable explanations. This work proposes to study the pair-wise relationships of tuples as features for ranking Why-not explanations. First, the format of Why-not explanations is re-defined without variables, for easy understanding by users. Second, the equality/inequality relationships in tuple pairs are represented. Instead of learning over the original data with its sparsity issue, statistical models are learned over the {0,1} representation of tuple pairs. A number of models are employed to infer the probability, including statistical distributions, classification, and regression. Finally, the explanations are evaluated and ranked according to the inferred probability. Experiments show that the pair-wise method returns high-quality explanations for Why-not questions.
WANG Jin-Yan , LIU Chen , FU Xing-Cheng , LUO Xu-Dong , LI Xian-Xian
2019, 30(3):648-666. DOI: 10.13328/j.cnki.jos.005686
Abstract:Frequent pattern mining is an important task in data mining. Nevertheless, mining concise crucial patterns is more promising than mining frequent patterns over data streams, since crucial patterns avoid redundancy to reduce storage space and extract lossless information from frequent patterns. However, mining crucial patterns from data streams, which aggregate information from individuals, is more likely to reveal privacy than in static scenarios, because background knowledge about releases at adjacent time instants can enhance an adversary's inferential ability. This study points out the problems and principles of privacy leakage in mining crucial patterns over data streams, and proposes a differentially private crucial pattern mining algorithm with a two-phase mechanism at every timestamp. Specifically, the two-phase mechanism consists of a dissimilarity calculation phase and a noise-mining phase, and considers not only the tradeoff between privacy and utility but also the tradeoff between mining time and maintenance cost. To improve data utility over successive releases in streams, the dissimilarity is computed in the first phase to decide whether to return a low-noise statistic or an accurately approximated statistic. When the low-noise statistic needs to be returned, the algorithm enters the noise-mining phase, in which the crucial pattern candidate set with a judgment query set is first identified, and then random noise drawn from the Laplace distribution is added to the supports of the patterns to obtain the noisy supports. Finally, strict theoretical analysis and extensive experiments confirm the effectiveness and efficiency of the proposed algorithm.
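The noise-adding step in the noise-mining phase can be illustrated with the standard Laplace mechanism on pattern supports. This is a generic sketch of the mechanism, not the paper's full two-phase algorithm; the function name and parameters are assumptions.

```python
import numpy as np

def laplace_noisy_supports(supports, epsilon, sensitivity=1.0):
    """Add Laplace(sensitivity/epsilon) noise to each pattern's support,
    the standard mechanism for epsilon-differential privacy on counts.
    `supports` maps a pattern to its true support count."""
    scale = sensitivity / epsilon
    return {p: s + np.random.laplace(0.0, scale)
            for p, s in supports.items()}
```

A smaller epsilon gives stronger privacy but noisier supports, which is exactly the privacy-utility tradeoff the two-phase mechanism balances.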
ZHANG Dong-Yue , ZHOU Li-Hua , WU Xiang-Yun , ZHAO Li-Hong
2019, 30(3):667-683. DOI: 10.13328/j.cnki.jos.005693
Abstract:As more and more applications generate data streams, research on data stream clustering has received extensive attention. Grid-based clustering maps data streams into a grid structure to form data summaries, and then clusters the summaries. This method is usually highly efficient, but each grid is processed independently and the interactions between grids are not considered, so the clustering quality needs to be improved. In this study, the coupling relationships between grids are taken into account in the clustering process rather than processing the grids independently, and a grid-coupling-based data stream clustering algorithm is proposed. The proposed approach improves cluster quality, as grid coupling more accurately captures the correlation among the data. Experimental evaluations on synthetic and real data streams illustrate the superiority of the proposed approach over state-of-the-art approaches.
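The grid summarization that such methods build on can be sketched as follows. This is a hypothetical minimal structure, not the paper's algorithm; the coupling computation itself is only hinted at here by the neighbor lookup a coupling-aware method would draw on.

```python
from collections import defaultdict

class GridSummary:
    """Map incoming 2D stream points into grid cells and keep per-cell
    density summaries (count and coordinate sums) for later clustering."""
    def __init__(self, cell_width):
        self.w = cell_width
        self.cells = defaultdict(lambda: [0, 0.0, 0.0])  # n, sum_x, sum_y

    def insert(self, x, y):
        key = (int(x // self.w), int(y // self.w))
        c = self.cells[key]
        c[0] += 1
        c[1] += x
        c[2] += y

    def neighbors(self, key):
        """Non-empty adjacent cells: the candidates whose coupling with
        `key` a coupling-aware algorithm would weigh when clustering."""
        i, j = key
        return [(i + di, j + dj)
                for di in (-1, 0, 1) for dj in (-1, 0, 1)
                if (di, dj) != (0, 0) and (i + di, j + dj) in self.cells]
```

Clustering the summaries instead of raw points is what gives grid-based methods their efficiency; considering `neighbors` jointly rather than each cell alone is the coupling idea.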
XU Zi-Jian , YE Sheng , ZHANG Xiao
2019, 30(3):684-699. DOI: 10.13328/j.cnki.jos.005694
Abstract:In general, read-write separation technology can solve some of the mismatch between reads and writes in the current big data environment, but most current read-write separation techniques are designed for homogeneous databases. Due to inconsistent storage structures, a heterogeneous distributed database system composed of a row-store database and a column-store database faces many difficulties in data synchronization, such as format conversion and mismatched synchronization speeds, compared with a homogeneous distributed database system. This study proposes using the MySQL binary log (Binlog) to perform the TD-Reduction of SQL. It designs and implements the Binlog parser BinParser and the Binlog restorer BinReducer, which are based on the mixed format; different events are parsed and restored according to corresponding rules to generate executable SQL statements. Based on the above techniques, this study implements Cynomys, a distributed database data synchronization tool. In the experimental environment, Cynomys shows sound performance. The method is suitable for data synchronization between other heterogeneous databases with a Binlog-like mechanism.
2019, 30(3):700-717. DOI: 10.13328/j.cnki.jos.005687
Abstract:Gait analysis is an important research topic in the fields of pattern recognition, data mining, and intelligent data analysis. Many different methods, such as recognition of crests and troughs, matching of gait templates, and signal-processing based approaches, have been proposed and applied, but these methods require preset parameters such as gait numbers or gait templates, which limits their feasibility. In this study, a new, highly feasible method based on peak detection and thresholding is proposed, which automatically calculates a preset cycle value and a self-adaptive range to segment gait cycles without the extra information required by other methods. Moreover, a method to filter unrelated data is proposed. Compared with simple FFT and three other representative methods, the proposed approach achieves the best analysis results in the evaluations.
DENG Shi-Zhuo , WANG Bo-Tao , YANG Chuan-Gui , WANG Guo-Ren
2019, 30(3):718-737. DOI: 10.13328/j.cnki.jos.005685
Abstract:Wearable sensor-based human activity recognition (HAR) plays a significant role in current smart applications, with the development of artificial intelligence theory and the popularity of wearable sensors. Salient and discriminative features improve the performance of HAR. To capture the local dependence over time and space on the same axis from multi-location sensor data with convolutional neural networks (CNN), which is ignored by existing methods using 1D and 2D kernels, this study proposes two methods, T-2D and M-2D. They construct three activity images from the axes of multi-location 3D accelerometers and one activity image from the other sensors. The CNN networks T-2DCNN and M-2DCNN, based on T-2D and M-2D respectively, then fuse the four activity-image features for the classifier. To reduce the number of CNN weights, the weight-shared CNNs TS-2DCNN and MS-2DCNN are proposed. Under the default experimental settings on public datasets, the proposed methods outperform existing methods, increasing the F1-value by up to 6.68% and 1.09% on OPPORTUNITY and SKODA, respectively. It is concluded that both the naïve and weight-shared models achieve better F1-values in most settings with different numbers of sensors, as well as on the F1-value of each class.
PEI Wei , XU Yan-Ming , ZHU Yong-Ying , WANG Peng-Qian , LU Ming-Yu , LI Fei
2019, 30(3):738-758. DOI: 10.13328/j.cnki.jos.005695
Abstract:In recent years, the rapid development of unmanned aerial vehicle (UAV) technology has made UAV ground target detection an important research direction in the field of computer vision. UAVs have a wide range of applications in military reconnaissance, traffic control, and other scenarios. Nevertheless, UAV images suffer from low target resolution, scale changes, environmental changes, multi-target interference, and complex backgrounds. Aiming at these difficulties, and starting from the original SSD target detection algorithm, this study replaces the base network with a residual network of stronger representation ability and uses residual learning to reduce the training difficulty and improve detection accuracy. By introducing a skip connection mechanism, the redundancy of the extracted features is reduced, and the performance degradation caused by increasing the number of layers is alleviated. The effectiveness of the algorithm is verified through experimental comparison. Aiming at the repeated detection of targets and the missed detection of small samples in the original SSD algorithm, this study further proposes an aerial target detection algorithm based on feature information fusion. By fusing information from different feature layers, the algorithm effectively bridges the gap between low-level visual features and high-level semantic features in neural networks. Results show that the algorithm performs soundly in both detection accuracy and real-time performance.
FENG Ning , GUO Sheng-Nan , SONG Chao , ZHU Qi-Chao , WAN Huai-Yu
2019, 30(3):759-769. DOI: 10.13328/j.cnki.jos.005697
Abstract:Forecasting traffic flows is a hot issue for researchers and practitioners in the transportation field. Traffic flow forecasting is very challenging due to the high nonlinearity and complexity of the data, and most existing methods cannot effectively capture the spatial-temporal correlations of traffic flow data. This paper proposes a novel deep-learning-based model, multi-component spatial-temporal graph convolution networks (MCSTGCN), to solve the traffic flow forecasting problem. MCSTGCN employs three components to model the recent, daily, and weekly characteristics of traffic flow data, respectively. Each component uses graph convolutions in the spatial dimension and convolutions in the temporal dimension to effectively capture the spatial-temporal correlations of traffic data. Experiments on a public California freeway dataset show that the prediction performance of the MCSTGCN model is better than that of other existing prevalent methods.
SHI Jin , MAO Jia-Li , JIN Che-Qing
2019, 30(3):770-783. DOI: 10.13328/j.cnki.jos.005683
Abstract:Travel time prediction is critical for route planning and traffic monitoring. Due to complex relationships among road segments, spatial-temporal dependency, and other factors, modeling trajectory datasets is challenging. Without incorporating external factors into the modeling, existing methods may introduce incorrect information and ignore road segment dependence, resulting in poor prediction accuracy. A two-phase travel time prediction framework is proposed to solve these issues. In the first phase, trajectory data are mapped to sequences of road segments to generate a low-dimensional vector, which avoids introducing incorrect information while preserving road segment dependence. In the second phase, after integrating the road segment encoding with external factors such as weather and date, a travel time prediction model based on a deep neural network is designed. Detailed experimental results on a real-world taxi trajectory dataset show that the proposed method is more accurate than existing methods.
JIANG Jia-Wei , FU Fang-Cheng , SHAO Ying-Xia , CUI Bin
2019, 30(3):784-798. DOI: 10.13328/j.cnki.jos.005690
Abstract:The gradient boosting decision tree (GBDT) algorithm is widely used in various tasks, such as classification, regression, and ranking, owing to its high accuracy and strong interpretability. With the explosive growth of data volume, distributed GBDT algorithms have become an important research issue. Although a series of distributed GBDT implementations exists, they perform poorly on high-dimensional and multi-classification tasks: the data-parallel strategy they adopt requires the transmission of gradient histograms, and this communication overhead becomes the bottleneck in many such tasks. This study aims at this problem and tries to find an efficient parallel strategy more suitable for these workloads. Data-parallel and feature-parallel strategies are first compared based on a cost model, and it is theoretically proved that feature parallelism is more suitable for high-dimensional and multi-classification tasks. Based on this analysis, the paper proposes a feature-parallel distributed GBDT algorithm named FP-GBDT. FP-GBDT designs an efficient distributed dataset transposition method to partition the training dataset by column. During the construction of gradient histograms, FP-GBDT uses a sparsity-aware method to accelerate histogram construction. When splitting tree nodes, FP-GBDT develops a bitmap compression method to transmit the placement of instances, thereby reducing the communication overhead. This study compares the performance of distributed GBDT algorithms under different parallel strategies through extensive experiments. First, the effectiveness of the proposed optimization methods in FP-GBDT is verified. Then, FP-GBDT is compared with XGBoost, a representative data-parallel implementation. On various datasets, FP-GBDT proves more efficient in high-dimensional and multi-classification tasks, achieving up to 6 times the performance of data-parallel implementations.
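The bitmap compression of instance placement can be illustrated with a simple bit-packing sketch: each instance's left/right placement at a split becomes one bit instead of one integer or byte. This is an assumed encoding for illustration, not FP-GBDT's exact wire format.

```python
def pack_placement(goes_left):
    """Pack a list of booleans (does instance i go to the left child?)
    into a bytes bitmap, 8 instances per byte."""
    out = bytearray((len(goes_left) + 7) // 8)
    for i, b in enumerate(goes_left):
        if b:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def unpack_placement(bitmap, n):
    """Recover the n placement booleans from the bitmap."""
    return [bool(bitmap[i // 8] >> (i % 8) & 1) for i in range(n)]
```

Transmitting one bit per instance instead of an index list is what cuts the per-split communication that would otherwise dominate in feature-parallel training.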
ZHAO Kan-Kan , ZHANG Liang-Fu , ZHANG Jing , LI Cui-Ping , CHEN Hong
2019, 30(3):799-821. DOI: 10.13328/j.cnki.jos.005698
Abstract:The traditional matrix factorization method has a wide range of applications in prediction and recommendation tasks because of its high scalability and good performance. In the big data era, more and more contextual features can be obtained easily, while the traditional matrix factorization approach lacks effective use of context information. Against this background, the factorization machine (FM) was proposed and has become popular. To better grasp the development of the FM model and adapt the FM approach to real applications, this paper reviews existing FM models and their optimization algorithms. First, it introduces the evolution from traditional matrix factorization (MF) to the FM model. Second, it summarizes existing research on the FM method from the perspectives of model accuracy and efficiency. Third, it presents studies of four representative optimization algorithms that are suitable for various FM models. Finally, it analyzes the challenges in current FM models, proposes possible solutions for these problems, and discusses future work.
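The core second-order FM prediction, and the O(kn) reformulation that makes it efficient, can be sketched as follows. This is a generic implementation of the standard FM equation, not tied to any particular model variant surveyed above.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM: w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j,
    computed in O(kn) via the standard reformulation
    0.5 * sum_f [ (sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2 ],
    where V is the (n features x k factors) latent matrix."""
    linear = w0 + x @ w
    s = V.T @ x                   # k-vector: sum_i V_if x_i
    s2 = (V ** 2).T @ (x ** 2)    # k-vector: sum_i V_if^2 x_i^2
    return linear + 0.5 * float(np.sum(s ** 2 - s2))
```

Factorizing the pairwise weights through V is what lets FM estimate interactions between features that never co-occur in sparse data, which plain polynomial regression cannot do.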
YAN Cai-Rong , ZHOU Ling-Jie , ZHANG Qing-Long , LI Xiao-Lin
2019, 30(3):822-844. DOI: 10.13328/j.cnki.jos.005681
Abstract:Since the factorization machine (FM) model can effectively solve the sparsity problem of high-dimensional feature combination with high prediction accuracy and computational efficiency, it has been widely studied and applied in the fields of click-through-rate (CTR) prediction and recommender systems. A review of the subsequent research on FM and its related models helps to promote their further improvement and application. By comparing the relationships between the FM model and the polynomial regression model and factorization models, the flexibility and generality of the FM model are described. For width extension, the strategies, methods, and key technologies are summarized along the dimensions of high-order feature interaction, field-aware feature interaction, and hierarchical feature interaction, as well as feature extraction, combination, intelligent selection, and promotion based on feature engineering. The integration approaches and benefits of combining the FM model with other models, especially with deep learning models, are compared and analyzed, which provides insights into the in-depth extension of traditional models. The learning and optimization methods of FM models and their implementations on different parallel and distributed computing frameworks are summarized, compared, and analyzed. Finally, the difficulties, hot spots, and development trends of the FM model that need further study are forecast.
LIANG Tian-Xin , YANG Xiao-Ping , WANG Liang , HAN Zhen-Yuan
2019, 30(3):845-864. DOI: 10.13328/j.cnki.jos.005689
Abstract:In recent years, reinforcement learning has made great progress in the fields of electronic games, chess, and decision-making control, and it has also driven the rapid development of financial trading systems. Financial trading has become a hot topic in the field of reinforcement learning, with wide application demand and academic research significance in the stock, foreign exchange, and futures markets. This paper summarizes research achievements in trading systems, adaptive algorithms, and trading strategies based on the progress of reinforcement learning models commonly used in the financial field. Finally, the difficulties and challenges of reinforcement learning in financial trading systems are discussed, and future development trends are prospected.