CUI Jian-Wei, ZHAO Zhe, DU Xiao-Yong
2021, 32(3):604-621. DOI: 10.13328/j.cnki.jos.006182
Abstract: Applications drive innovation. Advances in database technology have been made by supporting the development of mainstream applications effectively and efficiently; OLTP, OLAP, and online machine learning modeling all follow this trend today. Machine learning, which extracts knowledge and enables predictive analysis by modeling data, is the main approach of artificial intelligence technology. This work studies the training process of machine learning from the perspective of data management, surveys data management techniques for data selection, data storage, data access, automatic optimization, and system implementation, and analyzes the advantages and disadvantages of these techniques. Based on this analysis, the study identifies key challenges for data management technology in online machine learning.
2021, 32(3):622-635. DOI: 10.13328/j.cnki.jos.006179
Abstract: Among large volumes of changing data, data analysts often care only about the small amount of data with specific prediction results. However, because machine learning algorithm libraries generally assume the data is organized in a single table, users must retrieve all the data by SQL before the inference step, even if most of it will then be dropped. This study points out that if some hints can be obtained from the model in advance, unnecessary data can be quickly eliminated in the data acquisition phase, thus reducing the cost of multi-table joins, inter-process communication, and model prediction. This work takes a specific kind of machine learning model, the decision tree, as an example. First, a pre-filtering and validation execution workflow is proposed. Then, an offline algorithm is used to extract pre-filtering predicates from the decision tree. Finally, the algorithm is tested on a real-world dataset. Experiments show that the proposed method can accelerate the execution of SQL queries containing predicates over decision tree prediction results.
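As a rough illustration of the predicate-extraction step, the sketch below (an assumption for illustration, not the paper's algorithm; it presumes a trained scikit-learn DecisionTreeClassifier and numeric features named after table columns) turns every root-to-leaf path that predicts the target class into a conjunctive SQL predicate and ORs the paths together as a pre-filter:

```python
# Hypothetical sketch: derive a SQL pre-filter from a decision tree so rows that
# cannot yield the target prediction are dropped before joins and inference.
from sklearn.tree import DecisionTreeClassifier

def extract_prefilter(clf: DecisionTreeClassifier, feature_names, target_class):
    t = clf.tree_
    clauses = []

    def walk(node, conds):
        if t.children_left[node] == -1:                  # leaf node
            if t.value[node][0].argmax() == target_class:
                clauses.append(" AND ".join(conds) or "TRUE")
            return
        f, thr = feature_names[t.feature[node]], t.threshold[node]
        walk(t.children_left[node],  conds + [f"{f} <= {thr:.6g}"])
        walk(t.children_right[node], conds + [f"{f} > {thr:.6g}"])

    walk(0, [])
    return " OR ".join(f"({c})" for c in clauses)

# usage: append as a WHERE clause before the join/prediction pipeline runs
# sql = f"SELECT * FROM orders WHERE {extract_prefilter(clf, cols, 1)}"
```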
ZHANG Wen-Tao, YUAN Bin, ZHANG Zhi-Peng, CUI Bin
2021, 32(3):636-649. DOI: 10.13328/j.cnki.jos.006186
Abstract: With the advent of artificial intelligence, graph embedding techniques are increasingly used to mine information from graphs. However, real-world graphs are usually large, so distributed graph embedding is needed. There are two main challenges. (1) Many graph embedding methods exist, but there is no general framework that covers most of these algorithms. (2) Existing distributed implementations of graph embedding suffer from poor scalability and perform badly on large graphs. To tackle these two challenges, a general framework for distributed graph embedding is first presented: sampling and training are separated so that the framework can describe different graph embedding methods. Second, a parameter-server-based model partitioning strategy is proposed: the model is partitioned across both workers and servers, and shuffling is used to ensure that no model exchange occurs among workers. A prototype system is implemented on a parameter server, and extensive experiments show that the partitioning-based strategy achieves better performance than all baseline systems without loss of accuracy.
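The toy sketch below illustrates only the shuffle idea described above (the assignment and rotation scheme is an assumption for illustration, not the paper's implementation): embedding blocks rotate among workers round-robin, so every worker eventually trains against every block without any direct worker-to-worker exchange:

```python
# Hypothetical sketch of partition-and-shuffle: blocks of the embedding matrix
# rotate across workers each round instead of being exchanged between workers.
import numpy as np

n_workers, n_nodes = 4, 1000
blocks = np.array_split(np.arange(n_nodes), n_workers)   # block i starts on worker i
seen = {w: set() for w in range(n_workers)}

for _round in range(n_workers):
    for worker, rows in enumerate(blocks):
        seen[worker].add(int(rows[0]))    # worker trains its samples on emb[rows] here
    blocks = blocks[-1:] + blocks[:-1]    # shuffle: rotate block ownership by one

# after n_workers rounds, every worker has held every block exactly once
assert all(len(s) == n_workers for s in seen.values())
```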
WU An-Biao, YUAN Ye, MA Yu-Liang, WANG Guo-Ren
2021, 32(3):650-668. DOI: 10.13328/j.cnki.jos.006173
Abstract: Compared with traditional graph data analysis methods, graph embedding algorithms provide a new strategy: they aim to encode graph nodes into vectors so that graph analysis and mining tasks can be performed more effectively with neural network techniques. Classic tasks such as node classification, link prediction, and traffic flow prediction have been improved significantly by graph embedding methods. Although plenty of graph embedding work has been proposed, the node embedding problem over temporal graphs has seldom been studied. This study proposes an adaptive temporal graph embedding method, ATGED, which encodes temporal graph nodes into vectors by combining previous work with the information propagation characteristics of the graph. First, an adaptive clustering method is proposed to handle the fact that node activity frequency differs across different types of graphs. Then, a new node walk strategy is designed to preserve the temporal order between nodes; during walking, walk lists are stored in a bidirectional multi-tree so that complete walk lists can be obtained quickly. Finally, based on the walk characteristics and graph topology, an important-node sampling strategy is proposed so that a satisfactory neural network can be trained as quickly as possible. Extensive experiments demonstrate that the proposed method surpasses existing embedding methods in node clustering, reachability prediction, and node classification on temporal graphs.
SHI Ding-Yuan, WANG Yan-Sheng, ZHENG Peng-Fei, TONG Yong-Xin
2021, 32(3):669-688. DOI: 10.13328/j.cnki.jos.006174
Abstract: Learning-to-rank (LTR) models have achieved remarkable success. However, the traditional training scheme for LTR models requires large amounts of text data. Given increasing concerns about privacy protection, collecting text data from multiple data owners as before is becoming infeasible, and data has to be kept separately. The separation turns data owners into data silos that can hardly exchange data, severely compromising LTR training. Inspired by recent progress in federated learning, a novel framework named cross-silo federated learning-to-rank (CS-F-LTR) is proposed, which addresses two unique challenges that LTR faces in the federated scenario. To deal with the cross-party feature generation problem, CS-F-LTR uses a method based on sketches and differential privacy, which is much more efficient than encryption-based protocols while keeping the accuracy loss bounded. To tackle the missing-label problem, CS-F-LTR relies on a semi-supervised learning mechanism that facilitates fast labeling with mutual labelers. Extensive experiments on public datasets verify the effectiveness of the proposed framework.
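To make the sketch-plus-differential-privacy idea concrete, here is a minimal illustration (an assumed construction, not CS-F-LTR's actual protocol; the noise scale in particular is illustrative and would need calibrating to the sketch's true sensitivity): each party summarizes term frequencies in a Count-Min sketch and perturbs the counters with Laplace noise before sharing:

```python
# Hypothetical sketch: a Count-Min sketch whose counters are released with
# Laplace noise, letting another party estimate frequencies without raw text.
# Hash seeds are assumed to be shared between the parties.
import numpy as np

class NoisyCountMinSketch:
    def __init__(self, depth=4, width=2048, epsilon=1.0, seed=0):
        self.depth, self.width, self.epsilon = depth, width, epsilon
        self.table = np.zeros((depth, width))
        self.seeds = np.random.RandomState(seed).randint(0, 2**31, depth)

    def _col(self, row, item):
        return hash((int(self.seeds[row]), item)) % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.table[r, self._col(r, item)] += count

    def release(self):
        # illustrative noise scale; real DP must account for per-item sensitivity
        noise = np.random.laplace(0.0, 1.0 / self.epsilon, self.table.shape)
        return self.table + noise

    def estimate(self, item, released_table):
        return min(released_table[r, self._col(r, item)] for r in range(self.depth))
```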
GAO Fei, SONG Shao-Xu, WANG Jian-Min
2021, 32(3):689-711. DOI: 10.13328/j.cnki.jos.006176
Abstract: As the basis of data management and analysis, data quality has increasingly become a research hotspot in related fields. Good data quality can also optimize and promote big data and artificial intelligence technology. In general, physical failures or technical defects in data collection and recording devices cause anomalies in the collected data. These anomalies have a significant impact on subsequent data analysis and artificial intelligence processes, so the data should be processed and cleaned before use. Existing smoothing-based repairing methods cause a large number of originally correct data points to be over-repaired into wrong values, while constraint-based methods such as sequential dependencies and SCREEN cannot accurately repair data under complex conditions because their constraints are relatively simple. This study proposes a time series data repairing method under multi-speed constraints, based on the principle of minimum change, and uses dynamic programming to compute the optimal repair. Specifically, multiple speed intervals are used to constrain the time series, and a set of repair candidates is formed for each data point according to the speed constraints. The optimal repair is then selected from these candidates by dynamic programming. To verify the feasibility and effectiveness of the method, experiments are conducted on a synthetic dataset, two real datasets, and a third real dataset containing genuine anomalies, under different anomaly rates and data sizes. The experimental results show that, compared with existing smoothing-based or constraint-based methods, the proposed method performs better in terms of RMS error and time cost. In addition, clustering and classification accuracy on several datasets shows the impact of data quality on subsequent data analysis and artificial intelligence: the proposed method can improve the quality of their results.
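The sketch below shows the dynamic-programming skeleton in a deliberately simplified form (one speed interval [smin, smax], candidates drawn from neighboring observed values; both are assumptions, as the paper supports multiple speed intervals and its own candidate construction): a Viterbi-style pass picks, per point, the candidate that minimizes total repair distance while keeping consecutive values speed-feasible:

```python
# Hypothetical sketch: minimum-change repair of a series under one speed
# constraint, solved by dynamic programming over per-point candidate values.
def repair(x, t, smin, smax, window=3):
    """x: observed values; t: timestamps. Assumes a feasible sequence exists."""
    n = len(x)
    cands = [sorted(set(x[max(0, i - window):i + window + 1])) for i in range(n)]
    cost = [{c: abs(c - x[0]) for c in cands[0]}]   # repair cost of first point
    back = [{}]
    for i in range(1, n):
        cost.append({}); back.append({})
        dt = t[i] - t[i - 1]
        for c in cands[i]:
            feasible = [(cost[i - 1][p], p) for p in cost[i - 1]
                        if smin * dt <= c - p <= smax * dt]
            if feasible:
                best, prev = min(feasible)
                cost[i][c] = best + abs(c - x[i])
                back[i][c] = prev
    c = min(cost[-1], key=cost[-1].get)             # cheapest feasible endpoint
    out = [c]
    for i in range(n - 1, 0, -1):
        c = back[i][c]
        out.append(c)
    return out[::-1]

# repair([1.0, 1.1, 9.0, 1.3], [0, 1, 2, 3], -0.5, 0.5) fixes the 9.0 spike
```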
KANG Zhu-Guan, JIN Fu-Sheng, WANG Guo-Ren
2021, 32(3):712-725. DOI: 10.13328/j.cnki.jos.006172
Abstract: Higher-order link prediction is a hot and difficult problem in network analysis research. An excellent higher-order link prediction algorithm can not only mine the potential relationships between nodes in a complex network but also help us understand how the network structure evolves over time. Exploring unknown network relationships has important applications. Most traditional link prediction algorithms consider only the structural similarity between nodes, ignoring the characteristics of higher-order structures and information about network change. This study proposes a higher-order link prediction model based on motif clustering coefficients and time series partitioning (MTLP). The model constructs a representative feature vector by extracting motif clustering coefficients and the structural evolution features of higher-order structures in the network, and uses a multilayer perceptron (MLP) to perform the link prediction task. Experiments on different real-life datasets show that the proposed MTLP model achieves better higher-order link prediction performance than state-of-the-art methods.
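As a rough feel for the feature-construction step, the sketch below (a simplification limited to the triangle motif and a plain common-neighbor count; the paper's motif set and exact features are not reproduced here) builds per-snapshot features for a candidate node pair and hands them to an MLP:

```python
# Hypothetical sketch: per-snapshot motif clustering features for a node pair,
# concatenated over time partitions and classified by an MLP.
import networkx as nx
import numpy as np
from sklearn.neural_network import MLPClassifier

def pair_features(snapshots, u, v):
    feats = []
    for g in snapshots:                      # one graph per time partition
        cc = nx.clustering(g)                # triangle-motif clustering coefficient
        common = (len(list(nx.common_neighbors(g, u, v)))
                  if g.has_node(u) and g.has_node(v) else 0)
        feats += [cc.get(u, 0.0), cc.get(v, 0.0), common]
    return feats

# X = np.array([pair_features(snapshots, u, v) for u, v in candidate_pairs])
# clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X, y)
```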
JIANG Shan, DING Zhi-Ming, ZHU Mei-Ling, YAN Jin, XU Xin-Run
2021, 32(3):726-741. DOI: 10.13328/j.cnki.jos.006170
Abstract: Spatiotemporal graph modeling is fundamental to analyzing the spatial relationships and temporal trends of the elements in a graph-structured system. Traditional spatiotemporal graph modeling methods mine spatial relationships mainly from the explicit structure of nodes and the fixed relationships between them, which severely limits model flexibility; moreover, they cannot capture long-term trends. To overcome these shortcomings, a novel end-to-end neural network model for spatiotemporal graph modeling is proposed: a graph wavelet convolutional neural network for spatiotemporal graph modeling, called GWNN-STGM. A graph wavelet convolutional layer is designed in GWNN-STGM, and a self-adaptive adjacency matrix is introduced into this layer for node embedding learning, so that the model can be used without prior knowledge of the graph structure and automatically discovers hidden structural information from the training data. In addition, GWNN-STGM includes stacked dilated causal convolutional layers, so that its receptive field grows exponentially with the number of layers and long sequences can be handled. GWNN-STGM thus integrates the graph wavelet convolutional layer and the dilated causal convolutional layer. Experimental results on two public transportation network datasets show that GWNN-STGM outperforms the latest baseline models, indicating that the designed network has a strong ability to uncover spatiotemporal structure from the input data.
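Two of the ingredients named above admit compact sketches (the forms below follow common constructions in the spatiotemporal graph literature and are assumptions, not GWNN-STGM's exact design): a self-adaptive adjacency matrix learned from node embeddings, and a dilated causal convolution whose receptive field doubles per layer:

```python
# Hypothetical sketch: learned adjacency (no prior graph needed) and a causal
# dilated 1-D convolution for long input sequences.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAdjacency(nn.Module):
    def __init__(self, n_nodes, dim=10):
        super().__init__()
        self.e1 = nn.Parameter(torch.randn(n_nodes, dim))
        self.e2 = nn.Parameter(torch.randn(n_nodes, dim))

    def forward(self):
        # row-normalized adjacency discovered during training
        return F.softmax(F.relu(self.e1 @ self.e2.t()), dim=1)

def dilated_causal_conv(x, weight, dilation):
    """x: (batch, channels, time); left-pad so outputs never see the future."""
    pad = (weight.shape[-1] - 1) * dilation
    return F.conv1d(F.pad(x, (pad, 0)), weight, dilation=dilation)
```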
LI Chang-Sheng, MIN Qi-Xing, CHENG Yu-Rong, YUAN Ye, WANG Guo-Ren
2021, 32(3):742-752. DOI: 10.13328/j.cnki.jos.006178
Abstract: Recently, unsupervised hashing has attracted much attention in the machine learning and information retrieval communities due to its low storage cost and high search efficiency. Most existing unsupervised hashing methods rely on the local semantic structure of the data as guiding information and require that this semantic structure be preserved in the Hamming space. Thus, precisely representing the local structure of the data in the hash codes becomes key to success. This study proposes a novel hashing method based on self-supervised learning. Specifically, contrastive learning is used to acquire a compact and accurate feature representation for each sample, from which a semantic structure matrix representing the similarity between samples is constructed. Meanwhile, a new loss function, in the spirit of the recently proposed instance discrimination method, is introduced to preserve semantic information and improve discriminative ability in the Hamming space. The proposed framework is end-to-end trainable. Extensive experiments on two large-scale image retrieval datasets show that the proposed method significantly outperforms current state-of-the-art methods.
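A generic version of the contrastive ingredient can be sketched as follows (a standard InfoNCE-style recipe with a tanh relaxation for hash codes; an assumption for illustration, not the paper's exact loss): two augmented views of each image are encoded, relaxed codes of matching views are pulled together, and sign() binarizes at retrieval time:

```python
# Hypothetical sketch: contrastive loss over relaxed hash codes of two views.
import torch
import torch.nn.functional as F

def contrastive_hash_loss(z1, z2, tau=0.2):
    """z1, z2: (batch, bits) encoder outputs for two views of the same images."""
    h = F.normalize(torch.tanh(torch.cat([z1, z2])), dim=1)  # relaxed codes
    sim = h @ h.t() / tau
    sim.fill_diagonal_(float("-inf"))        # a sample is not its own positive
    n = z1.size(0)
    target = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, target)      # view i matches view i + n

# retrieval: binary_codes = torch.sign(torch.tanh(encoder(images)))
```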
2021, 32(3):753-762. DOI: 10.13328/j.cnki.jos.006184
Abstract: In the study of natural language understanding and semantic representation, the fact verification task, which verifies whether a textual statement is supported by given factual evidence, is very important. Existing research mainly deals with textual fact verification, while verification over structured evidence, such as tables, has yet to be explored. TabFact is the latest table-based fact verification dataset, but its baseline methods do not make good use of the structural characteristics of tables. This study exploits these structural characteristics and designs two models, Row-GVM (row-level GNN-based verification model) and Cell-GVM (cell-level GNN-based verification model), which achieve accuracies 2.62% and 2.77% higher than the baseline model, respectively. The results demonstrate that these two methods using table features are indeed effective.
SHEN Zhi-Hong, ZHAO Zi-Hao, WANG Hua-Jin, LIU Zhong-Xin, HU Chuan, ZHOU Yuan-Chun
2021, 32(3):763-780. DOI: 10.13328/j.cnki.jos.006180
Abstract: With the development of big data applications, the demand for fused management and analysis of large-scale structured and unstructured data is becoming increasingly prominent. However, differences in how structured and unstructured data are managed, processed, and retrieved pose challenges for fused management and analysis. This study proposes an extended property graph model for heterogeneous data fusion management and semantic computing, and defines the related property operators and query syntax. Based on this model, the study implements PandaDB, an intelligent heterogeneous data fusion management system, and describes its architecture, storage mechanism, query mechanism, property co-storage, AI algorithm scheduling, and distributed design. Experiments and case studies show that PandaDB's co-storage mechanism and distributed architecture provide good acceleration and that the system can serve intelligent fused-data management scenarios such as entity disambiguation in academic knowledge graphs.
LIU Bao-Zhu, WANG Xin, LIU Peng-Kai, LI Si-Zhuo, ZHANG Xiao-Wang, YANG Ya-Jun
2021, 32(3):781-804. DOI: 10.13328/j.cnki.jos.006181
Abstract: Knowledge graphs are an important cornerstone of artificial intelligence and currently have two main data models: RDF graphs and property graphs, each with its own query languages. The query language for RDF graphs is SPARQL, and the query language for property graphs is mainly Cypher. Over the last decade, various communities have developed different data management methods for the two models; inconsistent data models and query languages hinder the wider application of knowledge graphs. KGDB is a knowledge graph database system with a unified data model and query language. (1) Based on the relational model, a unified storage scheme is proposed that supports efficient storage of both RDF graphs and property graphs and meets the storage and query-load requirements of knowledge graph data. (2) Using a clustering method based on characteristic sets, KGDB handles the storage of untyped triples. (3) KGDB realizes the interoperability of SPARQL and Cypher, two different knowledge graph query languages, enabling them to operate on the same knowledge graph. Extensive experiments on real-world and synthetic datasets show that, compared with existing knowledge graph database management systems, KGDB not only provides more efficient storage management but also achieves higher query efficiency. KGDB saves 30% of storage space on average compared with gStore and Neo4j. Experiments on basic graph pattern matching queries show that, on the real-world dataset, the query efficiency of KGDB is generally higher than that of gStore and Neo4j, by up to two orders of magnitude.
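To illustrate the characteristic-set idea in isolation (a toy reconstruction based on the standard definition of characteristic sets, not KGDB's storage code): subjects that share the same set of predicates are grouped together, and each group maps naturally to one relational table whose columns are those predicates:

```python
# Hypothetical sketch: cluster triples by characteristic set; each cluster
# becomes one relational table (columns = predicates, rows = subjects).
from collections import defaultdict

def characteristic_sets(triples):
    preds = defaultdict(set)                 # subject -> set of its predicates
    for s, p, o in triples:
        preds[s].add(p)
    clusters = defaultdict(list)             # predicate set -> subjects
    for s, ps in preds.items():
        clusters[frozenset(ps)].append(s)
    return clusters

triples = [("alice", "name", "Alice"), ("alice", "age", 30),
           ("bob", "name", "Bob"), ("bob", "age", 25),
           ("pku", "label", "Peking University")]
for cs, subjects in characteristic_sets(triples).items():
    print(sorted(cs), "->", subjects)        # one table per characteristic set
```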
YANG Dong-Hua, ZOU Kai-Fa, WANG Hong-Zhi, WANG Jin-Bao
2021, 32(3):805-817. DOI: 10.13328/j.cnki.jos.006171
Abstract: In recent years, with the large increase in data-centric applications, graph data models have gradually attracted attention, and graph databases are developing rapidly. Users are often most concerned with how efficiently they can use a database. This work studies how to use existing information to predict queries against a graph database, so that data can be preloaded and cached to improve system response time. To make the method portable across datasets and to mine deep connections within the data, this study flattens SPARQL queries into sequences, uses a Seq2Seq model to analyze and predict them, and tests the method on real datasets. Experiments show that the proposed scheme is effective.
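The preprocessing step might look like the sketch below (an assumed tokenization for illustration; the paper's exact sequence encoding is not reproduced): each query in the log is flattened to a token sequence, and consecutive queries form (input, target) pairs for Seq2Seq training:

```python
# Hypothetical sketch: flatten a SPARQL query log into training pairs for a
# next-query Seq2Seq predictor.
import re

def tokenize_sparql(q):
    # variables, prefixed names, punctuation, bare words
    return re.findall(r"\?\w+|\w+:\w+|[{}().]|\w+", q)

log = ["SELECT ?x WHERE { ?x rdf:type ex:Paper }",
       "SELECT ?x WHERE { ?x ex:author ?a }"]
pairs = [(tokenize_sparql(a), tokenize_sparql(b)) for a, b in zip(log, log[1:])]
```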
LI Xiao-Guang, WEI Si-Qi, ZHANG Xin, DU Yue-Feng, YU Ge
2021, 32(3):818-830. DOI: 10.13328/j.cnki.jos.006185
Abstract: The knowledge tracing task tracks changes in students' knowledge state in real time, based on their historical learning behaviors, and predicts their future learning performance. In the learning process, learning behaviors are intertwined with forgetting behaviors, and forgetting has a great impact on knowledge tracing. To model learning and forgetting behaviors accurately, this study proposes LFKT (learning and forgetting behavior modeling for knowledge tracing), a deep knowledge tracing model that combines both. LFKT takes into account four factors that affect forgetting: the interval since the student last learned a knowledge point, the number of times the knowledge point has been learned, the interval between sequential learning interactions, and the student's degree of understanding of the knowledge point. The model uses a deep neural network to predict knowledge state, taking students' answers as indirect feedback on their understanding. Experiments on real online-education datasets show that LFKT traces knowledge and predicts performance better than traditional approaches.
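For concreteness, the four forgetting factors listed above could be assembled per interaction as in the sketch below (the representation and field names are hypothetical; LFKT's network itself is omitted):

```python
# Hypothetical sketch: build the four forgetting-related inputs for one
# interaction from a student's history of (timestamp, skill) events.
def forgetting_features(history, now, skill, mastery):
    """history: past interactions, oldest first; mastery: current estimate."""
    same = [ts for ts, s in history if s == skill]
    repeat_interval = now - same[-1] if same else float("inf")  # time since last repeat
    repeat_count = len(same)                                    # times skill was learned
    seq_interval = now - history[-1][0] if history else float("inf")
    return [repeat_interval, repeat_count, seq_interval, mastery]

# forgetting_features([(0, "frac"), (5, "alg"), (9, "frac")], 12, "frac", 0.7)
# -> [3, 2, 3, 0.7]
```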
XIE Gui-Cai, DUAN Lei, JIANG Wei-Peng, XIAO Shan, XU Yi-Fan
2021, 32(3):831-844. DOI: 10.13328/j.cnki.jos.006183
Abstract: Predicting pedestrian volume in public campus areas is important for maintaining campus safety and improving campus management. In particular, since the outbreak of the epidemic, the resumption of college education has placed higher demands on predicting and controlling pedestrian volume in public areas. Taking college canteens as an example, predicting canteen pedestrian volume helps epidemic-prevention staff with scheduling and arrangements, which not only reduces the risk of crowd gathering but also enables more considerate service based on the distribution of pedestrian volume. Given campus-specific factors such as holidays and course schedules, pedestrian volume prediction in public campus areas is challenging. This study proposes a multi-scale temporal pattern convolutional neural network (MSCNN) based on deep learning that captures short-term dependencies as well as long-term periodicities, and reweights the multi-scale temporal pattern features to predict the pedestrian volume at any given time. The effectiveness and efficiency of MSCNN are verified by experiments on real-world datasets.
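One plausible realization of the multi-scale-and-reweight idea is sketched below (an assumed architecture, not the authors' exact MSCNN; kernel sizes approximate hourly, daily, and weekly patterns on hourly data): parallel 1-D convolutions at several scales are mixed by a learned softmax weighting before the final prediction:

```python
# Hypothetical sketch: multi-scale temporal convolutions with learned
# reweighting of the scales for volume prediction.
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    def __init__(self, in_ch, out_ch, scales=(3, 25, 169)):  # ~hour/day/week
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, out_ch, k, padding=k // 2) for k in scales)
        self.scale_logits = nn.Parameter(torch.zeros(len(scales)))
        self.head = nn.Linear(out_ch, 1)

    def forward(self, x):                     # x: (batch, in_ch, time)
        w = torch.softmax(self.scale_logits, dim=0)
        mixed = sum(wi * b(x) for wi, b in zip(w, self.branches))
        return self.head(mixed.mean(dim=-1))  # predicted pedestrian volume
```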
2021, 32(3):845-858. DOI: 10.13328/j.cnki.jos.006177
Abstract: Databases are important and fundamental system software. With the spread of database applications across all walks of life, a growing number of people are concerned about database stability. Owing to numerous internal and external factors, performance anomalies may emerge while a database is running and can cause huge economic losses. Database anomalies are usually diagnosed by analyzing monitoring metrics; however, there are hundreds of metrics in a system, and ordinary database users cannot extract valuable information from them. Some major companies employ DBAs to manage their databases, but this cost is unacceptable for many others. Achieving automatic, low-cost database monitoring and diagnosis is therefore a challenging problem. Current methods have many limitations, including the high cost of metric collection, narrow applicability, and poor stability. This study proposes an anomaly diagnosis framework, AutoMonitor, deployed on PostgreSQL. The framework contains an LSTM-based anomaly detection module and a root cause diagnosis module based on a modified k-nearest-neighbor algorithm, and consists of an offline training stage and an online diagnosis stage. Evaluations on the datasets show that the proposed framework achieves high diagnosis accuracy with minor overhead to system performance.
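The diagnosis stage can be pictured with the sketch below (an assumed representation, not AutoMonitor's code; summarizing each anomaly as a per-metric deviation vector is a simplification of whatever fingerprint the framework actually uses): a new anomaly is labeled by majority vote among its k nearest stored fingerprints:

```python
# Hypothetical sketch: k-nearest-neighbor root cause diagnosis over labeled
# per-metric deviation fingerprints of past anomalies.
import numpy as np
from collections import Counter

def diagnose(fingerprint, bank, labels, k=5):
    """fingerprint: (n_metrics,); bank: (m, n_metrics); labels: list of causes."""
    dist = np.linalg.norm(bank - fingerprint, axis=1)
    nearest = np.argsort(dist)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# cause = diagnose(current_deviation, past_deviations, past_root_causes)
```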
PEI Wei, LI Zhan-Huai, PAN Wei
2021, 32(3):859-885. DOI: 10.13328/j.cnki.jos.006175
Abstract: In recent years, GPUs have been favored by database vendors and researchers for their ultra-high-speed computing capacity and huge data processing bandwidth, and the corresponding database branch, the GPU-accelerated database or GPU database (GDBMS), is developing vigorously. With high throughput, low response time, high cost-effectiveness, and easy scalability, and integrated with artificial intelligence (AI), business intelligence (BI), spatio-temporal data analysis, and data visualization, GDBMSs have the potential to reshape the landscape of the data analysis field. This study surveys the four core components of a GDBMS: the query compiler, query processor, query optimizer, and storage manager, hoping to promote future research on and commercial application of GDBMSs.