Volume 32, Issue 3, 2021: Table of Contents

  • 1  Survey on Data Management Technology for Machine Learning
    CUI Jian-Wei ZHAO Zhe DU Xiao-Yong
    2021, 32(3):604-621. DOI: 10.13328/j.cnki.jos.006182
    [Abstract](3610) [HTML](3881) [PDF 1.74 M](6436)
    Abstract:
    Applications drive innovation: advances in database technology have come from supporting mainstream applications effectively and efficiently, and OLTP, OLAP, and today's online machine learning modeling all follow this trend. Machine learning, which extracts knowledge and enables predictive analysis by modeling data, is the main approach in artificial intelligence. This work studies the training process of machine learning from the perspective of data management, surveys data management techniques for data selection, data storage, data access, automatic optimization, and system implementation, and analyzes the advantages and disadvantages of these techniques. Based on this analysis, the study identifies the key challenges for data management technology in online machine learning.
    2  In-database AI Model Optimization
    NIU Ze-Ping LI Guo-Liang
    2021, 32(3):622-635. DOI: 10.13328/j.cnki.jos.006179
    [Abstract](2426) [HTML](3268) [PDF 1.46 M](5541)
    Abstract:
    Among large volumes of changing data, data analysts often care about only the small subset that yields specific prediction results. However, because machine learning algorithm libraries assume the data is organized in a single table, users must fetch all the data with SQL before the inference step, even if most of it will then be discarded. This study observes that if hints can be obtained from the model in advance, unnecessary data can be eliminated quickly in the data acquisition phase, reducing the cost of multi-table joins, inter-process communication, and model prediction. The work takes a specific kind of machine learning model, the decision tree, as an example. First, a pre-filtering and validation execution workflow is proposed. Then, an offline algorithm is presented to extract pre-filtering predicates from the decision tree (a sketch follows below). Finally, the algorithm is tested on a real-world dataset. Experiments show that the proposed method accelerates the execution of SQL queries containing predicates on decision tree prediction results.
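    For illustration, a minimal sketch of the predicate-extraction idea: walk every root-to-leaf path of a trained decision tree, keep the paths whose leaf predicts the class the query asks for, and OR the path conditions into a filter that can be pushed into the SQL WHERE clause. The sketch assumes a scikit-learn tree; names such as `extract_prefilter` are illustrative, not from the paper.

```python
# Sketch: extract pre-filtering predicates from a fitted scikit-learn
# DecisionTreeClassifier. Function and variable names are illustrative.
from sklearn.tree import DecisionTreeClassifier

def extract_prefilter(clf: DecisionTreeClassifier, feature_names, target_class):
    t = clf.tree_
    predicates = []                    # one conjunction per qualifying leaf

    def walk(node, conds):
        if t.children_left[node] == -1:                    # reached a leaf
            if t.value[node][0].argmax() == target_class:  # leaf predicts target
                predicates.append(" AND ".join(conds) or "TRUE")
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        walk(t.children_left[node],  conds + [f"{name} <= {thr:.4f}"])
        walk(t.children_right[node], conds + [f"{name} > {thr:.4f}"])

    walk(0, [])
    # OR of the leaf conjunctions: rows failing this filter can never
    # yield the target prediction, so they can be skipped before joins
    # and inference.
    return " OR ".join(f"({p})" for p in predicates)
```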
    3  Distributed Optimization and Implementation of Graph Embedding Algorithms
    ZHANG Wen-Tao YUAN Bin ZHANG Zhi-Peng CUI Bin
    2021, 32(3):636-649. DOI: 10.13328/j.cnki.jos.006186
    [Abstract](2346) [HTML](3235) [PDF 1.76 M](4938)
    Abstract:
    With the advent of artificial intelligence, graph embedding techniques are increasingly used to mine information from graphs. However, real-world graphs are usually large, so distributed graph embedding is needed. Distributed graph embedding faces two main challenges. (1) Many graph embedding methods exist, yet there is no general framework that covers most of them. (2) Existing distributed implementations of graph embedding suffer from poor scalability and perform poorly on large graphs. To tackle these two challenges, a general framework for distributed graph embedding is first presented: the sampling and training processes are decoupled, so that the framework can describe different graph embedding methods (see the sketch after this abstract). Second, a parameter server-based model partitioning strategy is proposed: the model is partitioned across both workers and servers, and shuffling ensures that no model exchange occurs among workers. A prototype system is implemented on a parameter server, and thorough experiments show that the partitioning-based strategy outperforms all baseline systems without loss of accuracy.
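    As an illustration of the sampling/training separation described above, the sketch below (assumed names, not the paper's code) plugs a random-walk sampler into a skip-gram style trainer; changing only the sampler yields a different embedding method under the same framework.

```python
# Sketch of the decoupled framework: `sample_pairs` (the sampling stage)
# can be swapped for any other sampler without touching `train_step`
# (the training stage). The skip-gram update shown is illustrative.
import random
import numpy as np

def sample_pairs(adj, walk_len=6, window=2):
    """Sampling stage: one truncated random walk -> (center, context) pairs."""
    walk = [random.choice(list(adj))]
    for _ in range(walk_len - 1):
        nbrs = adj[walk[-1]]
        if not nbrs:
            break
        walk.append(random.choice(nbrs))
    return [(walk[i], walk[j])
            for i in range(len(walk))
            for j in range(max(0, i - window), min(len(walk), i + window + 1))
            if i != j]

def train_step(emb, ctx, pairs, lr=0.025):
    """Training stage: plain skip-gram SGD on positive pairs
    (negative sampling omitted for brevity)."""
    for u, v in pairs:
        score = 1.0 / (1.0 + np.exp(-emb[u] @ ctx[v]))
        g_u = (score - 1.0) * ctx[v]       # gradient w.r.t. emb[u]
        g_v = (score - 1.0) * emb[u]       # gradient w.r.t. ctx[v]
        emb[u] -= lr * g_u
        ctx[v] -= lr * g_v
```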
    4  Node Embedding Research over Temporal Graph
    WU An-Biao YUAN Ye MA Yu-Liang WANG Guo-Ren
    2021, 32(3):650-668. DOI: 10.13328/j.cnki.jos.006173
    [Abstract](2216) [HTML](3093) [PDF 1.97 M](5890)
    Abstract:
    Compared with traditional graph data analysis methods, graph embedding algorithms provide a new strategy: they encode graph nodes into vectors so that graph analysis and mining tasks can be performed more effectively with neural network techniques. Classic tasks such as node classification, link prediction, and traffic flow prediction have been improved significantly by graph embedding methods. Although plenty of graph embedding work exists, node embedding over temporal graphs has seldom been studied. This study proposes an adaptive temporal graph embedding method, ATGED, which encodes temporal graph nodes into vectors by combining previous work with the information propagation characteristics of temporal graphs. First, an adaptive clustering method is proposed to handle the fact that node activity frequency differs across types of graphs. Then, a new node walk strategy is designed to preserve the time order between nodes (illustrated below); during walking, the walk lists are stored in a bidirectional multi-tree so that complete walk lists can be obtained quickly. Last, based on walk characteristics and graph topology, a sampling strategy for important nodes is proposed so that a satisfactory neural network can be trained as quickly as possible. Extensive experiments demonstrate that the proposed method surpasses existing embedding methods on node clustering, reachability prediction, and node classification in temporal graphs.
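    The time-respecting walk at the core of such methods can be sketched as follows (an assumption-level illustration, not ATGED's exact strategy): each hop may only use an edge whose timestamp is no earlier than the previous hop's, so the walk preserves the temporal order of interactions.

```python
# Illustration of a time-respecting walk over a temporal graph.
import random

def temporal_walk(edges_by_node, start, walk_len=8):
    """edges_by_node[u] is a list of (neighbor, timestamp) pairs."""
    node, t, walk = start, float("-inf"), [start]
    for _ in range(walk_len - 1):
        later = [(v, ts) for v, ts in edges_by_node.get(node, []) if ts >= t]
        if not later:
            break                       # no time-consistent continuation
        node, t = random.choice(later)
        walk.append(node)
    return walk
```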
    5  Cross-Silo Federated Learning-to-Rank
    SHI Ding-Yuan WANG Yan-Sheng ZHENG Peng-Fei TONG Yong-Xin
    2021, 32(3):669-688. DOI: 10.13328/j.cnki.jos.006174
    [Abstract](2591) [HTML](3264) [PDF 2.09 M](6274)
    Abstract:
    Learning-to-rank (LTR) models have achieved remarkable results. However, the traditional training scheme for LTR models requires large amounts of text data. With increasing concerns about privacy protection, it is becoming infeasible to collect text data from multiple data owners as before, so the data must be kept separate. This separation turns data owners into data silos among which data can hardly be exchanged, severely compromising LTR training. Inspired by recent progress in federated learning, a novel framework named cross-silo federated learning-to-rank (CS-F-LTR) is proposed, which addresses two challenges unique to LTR in the federated scenario. To deal with the cross-party feature generation problem, CS-F-LTR uses a method based on sketches and differential privacy, which is far more efficient than encryption-based protocols while still bounding the accuracy loss (see the sketch below). To tackle the missing-label problem, CS-F-LTR relies on a semi-supervised learning mechanism that enables fast labeling with mutual labelers. Extensive experiments on public datasets verify the effectiveness of the proposed framework.
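    A minimal sketch of the sketch-plus-differential-privacy idea for cross-party feature generation, under assumptions of ours (a count-min sketch of term frequencies with Laplace noise added before release; the paper's actual protocol and noise calibration may differ):

```python
# Each silo summarizes its term frequencies in a count-min sketch and
# adds Laplace noise once before sharing. Noise scale is an assumption.
import hashlib
import numpy as np

def _bucket(row_salt: int, term: str, width: int) -> int:
    # deterministic hash so every party buckets terms identically
    return int(hashlib.md5(f"{row_salt}:{term}".encode()).hexdigest(), 16) % width

class PrivateCountMinSketch:
    def __init__(self, width=2048, depth=4, epsilon=1.0):
        self.width, self.depth, self.epsilon = width, depth, epsilon
        self.table = np.zeros((depth, width))

    def add(self, term, count=1):
        for d in range(self.depth):
            self.table[d, _bucket(d, term, self.width)] += count

    def release(self):
        # one occurrence touches `depth` cells, hence scale depth/epsilon
        noise = np.random.laplace(0.0, self.depth / self.epsilon, self.table.shape)
        return self.table + noise

def estimate(noisy_table, term):
    depth, width = noisy_table.shape
    return min(noisy_table[d, _bucket(d, term, width)] for d in range(depth))
```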
    6  Time Series Data Cleaning under Multi-speed Constraints
    GAO Fei SONG Shao-Xu WANG Jian-Min
    2021, 32(3):689-711. DOI: 10.13328/j.cnki.jos.006176
    [Abstract](2138) [HTML](3033) [PDF 2.76 M](5042)
    Abstract:
    As the basis of data management and analysis, data quality has increasingly become a research hotspot in related fields, and improving it can optimize and promote big data and artificial intelligence technology. Physical failures or technical defects in data collection and recording devices generally cause anomalies in the collected data. These anomalies significantly affect subsequent data analysis and artificial intelligence processes, so the data should be processed and cleaned before use. Existing smoothing-based repair methods cause many originally correct data points to be over-repaired into wrong values, while constraint-based methods such as sequential dependencies and SCREEN cannot repair data accurately under complex conditions because their constraints are relatively simple. This study proposes a time series repair method under multi-speed constraints, based on the minimum-change principle, and uses dynamic programming to compute optimal repairs for anomalous data. Specifically, multiple speed intervals constrain the time series, a set of repair candidates is generated for each data point according to the speed constraints, and the optimal repair is then selected from these candidates by dynamic programming (a skeleton is given below). To verify the feasibility and effectiveness of the method, experiments are conducted on an artificial dataset, two real datasets, and another real dataset with real anomalies, under different anomaly rates and data sizes. The results show that, compared with existing smoothing-based or constraint-based methods, the proposed method performs better in terms of RMS error and time cost. In addition, verifying clustering and classification accuracy on several datasets shows the impact of data quality on subsequent data analysis and artificial intelligence; the proposed method improves the quality of their results.
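    A skeleton of the candidate-based dynamic program follows, shown with a single speed interval for brevity (the paper supports multiple intervals and a richer candidate generation); `candidates[i]` stands for the repair candidates of point i derived from the speed constraints.

```python
# ts: timestamps, xs: observed values, [smin, smax]: one speed interval.
def repair(ts, xs, smin, smax, candidates):
    n, INF = len(xs), float("inf")
    cost = [[INF] * len(candidates[i]) for i in range(n)]
    back = [[-1] * len(candidates[i]) for i in range(n)]
    for j, c in enumerate(candidates[0]):
        cost[0][j] = abs(c - xs[0])                 # repair distance of point 0
    for i in range(1, n):
        dt = ts[i] - ts[i - 1]
        for j, c in enumerate(candidates[i]):
            for k, p in enumerate(candidates[i - 1]):
                if smin <= (c - p) / dt <= smax:    # speed-feasible transition
                    new = cost[i - 1][k] + abs(c - xs[i])
                    if new < cost[i][j]:
                        cost[i][j], back[i][j] = new, k
    # trace back the minimum-change repair (assumes a feasible path exists)
    j = min(range(len(candidates[-1])), key=lambda b: cost[-1][b])
    repaired = []
    for i in range(n - 1, -1, -1):
        repaired.append(candidates[i][j])
        j = back[i][j]
    return repaired[::-1]
```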
    7  High-order Link Prediction Method Based on Motif Aggregation Coefficient and Time Series Division
    KANG Zhu-Guan JIN Fu-Sheng WANG Guo-Ren
    2021, 32(3):712-725. DOI: 10.13328/j.cnki.jos.006172
    [Abstract](1961) [HTML](3299) [PDF 2.44 M](5170)
    Abstract:
    High-order link prediction is a hot and difficult problem in network analysis research. A good high-order link prediction algorithm can not only mine potential relationships between nodes in a complex network but also help understand how the network structure evolves over time, and exploring unknown network relationships has important applications. Most traditional link prediction algorithms consider only the structural similarity between nodes, ignoring the characteristics of higher-order structures and information about network changes. This study proposes MTLP, a high-order link prediction model based on motif clustering coefficients and time-series partitioning. The model constructs a representative feature vector by extracting motif clustering coefficient features of high-order structures (see the sketch below) together with features of network structure evolution, and uses a multilayer perceptron (MLP) to perform the link prediction task. Experiments on different real-life datasets show that MTLP achieves better high-order link prediction performance than state-of-the-art methods.
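    As a building block, the motif clustering coefficient for the simplest higher-order structure, the triangle, can be computed as below. This is only an illustration of the static part; MTLP combines such motif statistics with features from time-series partitions.

```python
# Fraction of a node's neighbor pairs that are themselves connected,
# i.e. the share of potential triangles through the node that are closed.
import itertools

def triangle_clustering(adj):
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    coef = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coef[v] = 0.0
            continue
        closed = sum(1 for a, b in itertools.combinations(nbrs, 2) if b in adj[a])
        coef[v] = closed / (k * (k - 1) / 2)
    return coef
```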
    8  Graph Wavelet Convolutional Neural Network for Spatiotemporal Graph Modeling
    JIANG Shan DING Zhi-Ming ZHU Mei-Ling YAN Jin XU Xin-Run
    2021, 32(3):726-741. DOI: 10.13328/j.cnki.jos.006170
    [Abstract](2629) [HTML](4028) [PDF 1.75 M](6371)
    Abstract:
    Spatiotemporal graph modeling is fundamental to analyzing the spatial relationships and temporal trends of the elements in a graph-structured system. Traditional methods mine spatial relationships mainly from the explicit node structure and fixed relationships between nodes, which severely limits model flexibility; in addition, they cannot capture long-term trends. To overcome these shortcomings, a novel end-to-end neural network model for spatiotemporal graph modeling is proposed: the graph wavelet convolutional neural network for spatiotemporal graph modeling, GWNN-STGM. A graph wavelet convolutional layer is designed in GWNN-STGM, into which a self-adaptive adjacency matrix is introduced for node embedding learning, so that the model can be used without prior knowledge of the graph structure; the hidden structural information is discovered automatically from the training data. In addition, GWNN-STGM includes stacked dilated causal convolutional layers, so that the model's receptive field grows exponentially with the number of layers and long sequences can be handled (both ideas are sketched below). GWNN-STGM integrates the graph wavelet convolutional layers and the dilated causal convolutional layers into one model. Experimental results on two public transportation network datasets show that GWNN-STGM outperforms the latest baseline models, indicating that the designed graph wavelet convolutional network can effectively discover spatial-temporal structure in the input data.
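    Two pieces of the design can be made concrete with a short sketch (normalization and sizes are our assumptions): the self-adaptive adjacency built from two learnable node-embedding matrices, and the receptive-field growth of stacked dilated causal convolutions.

```python
import numpy as np

def self_adaptive_adjacency(E1, E2):
    """E1, E2: (num_nodes, dim) learnable embeddings; returns a dense,
    trainable adjacency, so no prior graph structure is required."""
    scores = np.maximum(E1 @ E2.T, 0.0)                      # ReLU
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)                  # row softmax

def receptive_field(kernel_size, num_layers):
    """With dilations 1, 2, 4, ..., the receptive field grows
    exponentially with depth: 1 + (k - 1) * (2**L - 1)."""
    return 1 + (kernel_size - 1) * (2 ** num_layers - 1)
```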
    9  Local Semantic Structure Captured and Instance Discriminated by Unsupervised Hashing
    LI Chang-Sheng MIN Qi-Xing CHENG Yu-Rong YUAN Ye WANG Guo-Ren
    2021, 32(3):742-752. DOI: 10.13328/j.cnki.jos.006178
    [Abstract](2081) [HTML](2820) [PDF 1.20 M](4285)
    Abstract:
    Recently, unsupervised hashing has attracted much attention in the machine learning and information retrieval communities due to its low storage cost and high search efficiency. Most existing unsupervised hashing methods rely on the local semantic structure of the data as guiding information and require preserving this semantic structure in the Hamming space. Thus, how precisely the local structure of the data and the hash codes are represented becomes the key to success. This study proposes a novel hashing method based on self-supervised learning. Specifically, contrastive learning is used to acquire a compact and accurate feature representation for each sample, and a semantic structure matrix is then constructed to represent the similarity between samples (a sketch follows). Meanwhile, a new loss function is proposed to preserve the semantic information and improve discriminative ability in the Hamming space, in the spirit of the recently proposed instance discrimination method. The framework is end-to-end trainable. Extensive experiments on two large-scale image retrieval datasets show that the proposed method significantly outperforms current state-of-the-art methods.
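    A sketch of the semantic-structure construction step, under assumptions of ours (cosine similarity over contrastively learned features, thresholded into similar, dissimilar, and uncertain pairs; the thresholds are illustrative, not the paper's):

```python
import numpy as np

def semantic_structure(features, pos_thr=0.8, neg_thr=0.2):
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                # cosine similarity between all samples
    S = np.zeros_like(sim)
    S[sim >= pos_thr] = 1.0      # confidently similar pairs
    S[sim <= neg_thr] = -1.0     # confidently dissimilar pairs
    return S                     # to be preserved in the Hamming space
```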
    10  Graph Neural Networks for Table-based Fact Verification
    DENG Zhe-Ye ZHANG Ming
    2021, 32(3):753-762. DOI: 10.13328/j.cnki.jos.006184
    [Abstract](2071) [HTML](3130) [PDF 1.19 M](4995)
    Abstract:
    In the study of natural language understanding and semantic representation, fact verification, which checks whether a textual statement is supported by given factual evidence, is an important task. Existing research mainly handles textual fact verification, while verification over structured evidence, such as tables, has yet to be explored. TabFact is the latest table-based fact verification dataset, but its baseline methods do not make good use of the structural characteristics of tables. This study exploits table structure and designs two models, Row-GVM (row-level GNN-based verification model) and Cell-GVM (cell-level GNN-based verification model), which outperform the baseline model by 2.62% and 2.77%, respectively (a row-level graph construction is sketched below). The results confirm that these two methods, which use table features, are indeed effective.
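    A hypothetical illustration in the spirit of Row-GVM (node featurization and the GNN itself are omitted; all names and the exact construction are assumptions of ours): one graph node per table row plus one for the statement, with statement-row edges so a GNN can aggregate row-level evidence.

```python
def table_to_row_graph(table, statement):
    """table: list of {column: value} dicts; statement: str."""
    rows = [" ; ".join(f"{col} is {val}" for col, val in row.items())
            for row in table]
    nodes = [statement] + rows                     # node 0 is the statement
    edges = [(0, i) for i in range(1, len(nodes))]
    edges += [(i, 0) for i in range(1, len(nodes))]
    return nodes, edges
```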
    11  PandaDB: Intelligent Management System for Heterogeneous Data
    SHEN Zhi-Hong ZHAO Zi-Hao WANG Hua-Jin LIU Zhong-Xin HU Chuan ZHOU Yuan-Chun
    2021, 32(3):763-780. DOI: 10.13328/j.cnki.jos.006180
    [Abstract](3020) [HTML](3215) [PDF 2.26 M](6835)
    Abstract:
    With the development of big data applications, the demand for fused management and analysis of large-scale structured and unstructured data is becoming increasingly prominent. However, the differences in managing, processing, and retrieving structured and unstructured data pose challenges for fused management and analysis. This study proposes an extended property graph model for heterogeneous data fusion management and semantic computing, and defines the related property operators and query syntax. Based on this model, the study implements PandaDB, an intelligent heterogeneous data fusion management system, and describes its architecture, storage mechanism, query mechanism, property co-storage, AI algorithm scheduling, and distributed architecture. Experiments and case studies show that PandaDB's co-storage mechanism and distributed architecture provide good performance acceleration, and that the system can serve intelligent fusion-data management scenarios such as entity disambiguation in academic knowledge graphs.
    12  KGDB: Knowledge Graph Database System with Unified Model and Query Language
    LIU Bao-Zhu WANG Xin LIU Peng-Kai LI Si-Zhuo ZHANG Xiao-Wang YANG Ya-Jun
    2021, 32(3):781-804. DOI: 10.13328/j.cnki.jos.006181
    [Abstract](3381) [HTML](3093) [PDF 2.32 M](6127)
    Abstract:
    Knowledge graphs are an important cornerstone of artificial intelligence and currently have two main data models: RDF graphs and property graphs. Each model has its own query languages: SPARQL for RDF graphs and, mainly, Cypher for property graphs. Over the last decade, various communities have developed different data management methods for the two models, and the inconsistent data models and query languages hinder the wider application of knowledge graphs. KGDB is a knowledge graph database system with a unified data model and query language. (1) Based on the relational model, a unified storage scheme is proposed that supports efficient storage of both RDF graphs and property graphs and meets the requirements of knowledge graph storage and query workloads. (2) Using clustering based on characteristic sets, KGDB handles the storage of untyped triples (the idea is sketched below). (3) KGDB realizes the interoperability of SPARQL and Cypher, two different knowledge graph query languages, enabling both to operate on the same knowledge graph. Extensive experiments on real-world and synthetic datasets show that, compared with existing knowledge graph database management systems, KGDB provides more efficient storage management and higher query efficiency. KGDB saves 30% of storage space on average compared with gStore and Neo4j, and on basic graph pattern matching queries over the real-world dataset its query efficiency is generally higher than that of gStore and Neo4j, by up to two orders of magnitude.
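    The characteristic-set idea that KGDB's clustering builds on can be shown compactly: a subject's characteristic set is the set of properties it uses, and subjects sharing a set can be stored in one relation even when they carry no type declaration. A minimal sketch of the clustering step (KGDB's actual storage layout is more involved):

```python
from collections import defaultdict

def characteristic_set_clusters(triples):
    props = defaultdict(set)
    for s, p, o in triples:
        props[s].add(p)                       # characteristic set per subject
    clusters = defaultdict(list)
    for s, ps in props.items():
        clusters[frozenset(ps)].append(s)     # group subjects by their set
    return clusters

# Both untyped subjects land in the {name, author} cluster:
triples = [("b1", "name", "KGDB"), ("b1", "author", "Liu"),
           ("b2", "name", "gStore"), ("b2", "author", "Zou")]
print(characteristic_set_clusters(triples))
```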
    13  SparQL Query Prediction Based on Seq2Seq Model
    YANG Dong-Hua ZOU Kai-Fa WANG Hong-Zhi WANG Jin-Bao
    2021, 32(3):805-817. DOI: 10.13328/j.cnki.jos.006171
    [Abstract](1937) [HTML](2898) [PDF 1.43 M](4735)
    Abstract:
    In recent years, with the large increase in data-centric applications, graph data models have attracted growing attention and graph databases have developed rapidly. Users are often most concerned with how efficiently they can use a database. This work studies how to predict upcoming queries to a graph database from existing information, so that data can be preloaded and cached and the system's response efficiency improved. To make the method portable across datasets and to mine deep connections in the data, this study serializes SparQL queries into sequences (an illustrative encoding is given below), uses a Seq2Seq model to analyze and predict them, and tests the method on real datasets. Experiments show that the proposed scheme is effective.
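    An illustrative serialization of a SparQL query into a token sequence for a Seq2Seq model; the paper's exact encoding is not specified here, and canonicalizing variable names is an assumption of ours so the model learns query structure rather than arbitrary names.

```python
import re

def sparql_to_sequence(query):
    tokens = re.findall(r"\?\w+|\w+:\w*|[{}().]|\w+", query)
    varmap, out = {}, []
    for tok in tokens:
        if tok.startswith("?"):                       # canonicalize variables
            tok = varmap.setdefault(tok, f"?var{len(varmap)}")
        out.append(tok)
    return out

print(sparql_to_sequence("SELECT ?name WHERE { ?p foaf:name ?name . }"))
# ['SELECT', '?var0', 'WHERE', '{', '?var1', 'foaf:name', '?var0', '.', '}']
```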
    14  LFKT: Deep Knowledge Tracing Model with Learning and Forgetting Behavior Merging
    LI Xiao-Guang WEI Si-Qi ZHANG Xin DU Yue-Feng YU Ge
    2021, 32(3):818-830. DOI: 10.13328/j.cnki.jos.006185
    [Abstract](3332) [HTML](3017) [PDF 1.44 M](7950)
    Abstract:
    The knowledge tracing task tracks changes in students' knowledge state in real time, based on their historical learning behaviors, and predicts their future learning performance. In the learning process, learning behaviors are intertwined with forgetting behaviors, and forgetting strongly affects knowledge tracing. To model learning and forgetting behaviors accurately, this study proposes LFKT (learning and forgetting behavior modeling for knowledge tracing), a deep knowledge tracing model that combines the two. LFKT takes into account four factors that affect knowledge forgetting: the interval since a knowledge point was last studied, the number of times it has been studied repeatedly, the interval between sequential learning steps, and the degree to which the knowledge point is understood (see the sketch below). The model uses a deep neural network to predict knowledge state, taking students' answers as indirect feedback on their understanding. In experiments on real online-education datasets, LFKT traces knowledge and predicts performance better than traditional approaches.
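    The four forgetting factors can be read off an interaction log as follows (a sketch with field names we assume; LFKT feeds such signals into a deep network rather than using them directly):

```python
def forgetting_features(history, skill, now, mastery):
    """history: list of (timestamp, skill) pairs, oldest first."""
    same = [t for t, s in history if s == skill]
    return {
        "repeat_interval": now - same[-1] if same else float("inf"),
        "repeat_count": len(same),                  # times this skill was studied
        "sequence_interval": now - history[-1][0] if history else float("inf"),
        "mastery": mastery,                         # current understanding estimate
    }
```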
    15  Pedestrian Volume Prediction for Campus Public Area Based on Multi-scale Temporal Dependency
    XIE Gui-Cai DUAN Lei JIANG Wei-Peng XIAO Shan XU Yi-Fan
    2021, 32(3):831-844. DOI: 10.13328/j.cnki.jos.006183
    [Abstract](2669) [HTML](3783) [PDF 8.58 M](5926)
    Abstract:
    Predicting pedestrian volume in campus public areas is significant for maintaining campus safety and improving campus management. In particular, with the outbreak of the epidemic, the resumption of college education has raised the requirements for predicting and controlling pedestrian volume in public areas. Taking college canteens as an example, predicting canteen pedestrian volume helps epidemic-prevention staff with scheduling and arrangement, which not only reduces the risk of crowd gathering but also allows more considerate service based on the distribution of pedestrian volume in the canteen. Because pedestrian volume is affected by campus management factors such as holidays and course arrangements, prediction in campus public areas is challenging. This study proposes MSCNN, a deep learning model built on multi-scale temporal pattern convolutional neural networks, to capture short-term dependencies as well as long-term periodicities, and it reweights the multi-scale temporal pattern features to predict pedestrian volume at any given time (the multi-scale input construction is sketched below). The effectiveness and efficiency of MSCNN are verified by experiments on real-world datasets.
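    A sketch of the multi-scale input construction, with strides and window length that are illustrative assumptions for an hourly series: the same series is sampled at hourly, daily, and weekly strides so parallel convolution branches can capture short-term dependencies and long-term periodicities, and the branch outputs are then reweighted.

```python
import numpy as np

def multi_scale_views(series, scales=(1, 24, 24 * 7), window=8):
    views = []
    for s in scales:
        idx = np.arange(len(series) - 1, -1, -s)[:window][::-1]
        views.append(series[idx])       # most recent `window` points at stride s
    return views                        # one input per convolution branch
```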
    16  AI-based Database Performance Diagnosis
    JIN Lian-Yuan LI Guo-Liang
    2021, 32(3):845-858. DOI: 10.13328/j.cnki.jos.006177
    [Abstract](2930) [HTML](3556) [PDF 1.71 M](7031)
    Abstract:
    Databases are important and fundamental system software. As database applications spread through all walks of life, a growing number of people are concerned with database stability. Because of numerous internal and external factors, performance anomalies may emerge while a database is running and can cause huge economic losses. Database anomalies are usually diagnosed by analyzing monitoring metrics; however, a system exposes hundreds of metrics, and ordinary database users cannot extract valuable information from them. Some major companies employ DBAs to manage their databases, but this cost is unacceptable for many other companies, so achieving automatic database monitoring and diagnosis at low cost is a challenging problem. Current methods have many limitations, including the high cost of metric collection, a narrow range of application, or poor stability. This study proposes AutoMonitor, an anomaly diagnosis framework deployed on PostgreSQL. The framework contains an LSTM-based anomaly detection module and a root cause diagnosis module based on a modified K-nearest-neighbor algorithm, organized into an offline training stage and an online diagnosis stage (a skeleton of the pipeline is given below). Evaluations show that the framework achieves high diagnosis accuracy with only minor overhead on system performance.
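    A skeleton of the two-stage pipeline with interfaces we assume: `predict` stands in for the offline-trained LSTM, and the nearest-neighbor step shown is the plain kNN, not the paper's modification.

```python
import numpy as np

def detect_anomalies(metrics, predict, threshold):
    """Flag time steps whose prediction error exceeds a calibrated threshold."""
    flags = []
    for i in range(len(metrics) - 1):
        err = np.abs(predict(metrics[: i + 1]) - metrics[i + 1]).mean()
        flags.append(err > threshold)
    return flags

def diagnose_root_cause(anomaly_vec, labeled_cases, k=5):
    """labeled_cases: list of (metric_vector, root_cause_label) from history."""
    nearest = sorted(labeled_cases,
                     key=lambda c: np.linalg.norm(c[0] - anomaly_vec))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)
```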
    17  Survey of Key Technologies in GPU Database System
    PEI Wei LI Zhan-Huai PAN Wei
    2021, 32(3):859-885. DOI: 10.13328/j.cnki.jos.006175
    [Abstract](3141) [HTML](5233) [PDF 2.40 M](8634)
    Abstract:
    In recent years, GPUs have been favored by database vendors and researchers for their ultra-high-speed computing capacity and huge data processing bandwidth, and the corresponding database branch, the GPU-accelerated database or GPU database (GDBMS), is developing vigorously. With high throughput, low response time, high cost-effectiveness, and easy scalability, and integrating artificial intelligence (AI), business intelligence (BI), spatio-temporal data analysis, and data visualization, GDBMSs have the potential to reshape the landscape of the data analysis field. This study surveys the four core components of a GDBMS: the query compiler, query processor, query optimizer, and storage manager, hoping to promote future research and the commercial application of GDBMSs.

    Contact Information
    • Journal of Software
    • Sponsor: Institute of Software, CAS, China
    • Postal Code: 100190
    • Phone: 010-62562563
    • Email: jos@iscas.ac.cn
    • Website: https://www.jos.org.cn
    • ISSN 1000-9825, CN 11-2560/TP
    • Domestic price: 70 RMB