SUN Lu-Ming, ZHANG Shao-Min, JI Tao, LI Cui-Ping, CHEN Hong
2020, 31(3):600-619. DOI: 10.13328/j.cnki.jos.005909
Abstract:Traditional database systems and data management techniques are facing great challenges due to the 3V's of big data. The development of artificial intelligence, with its power in learning, reasoning, and planning, provides a brand-new opportunity for database management systems. By learning from data distributions, query workloads, and query execution performance, systems powered by artificial intelligence are able to forecast future workloads, tune database configurations, partition data blocks, index the proper columns, estimate selectivity, optimize query plans, and control query concurrency automatically. Moreover, some machine learning models can replace core components of a database, such as index structures. This study introduces recent research on database systems powered by artificial intelligence, states the existing problems and potential solutions, and proposes future research directions.
GAO Yuan-Ning, YE Jin-Biao, YANG Nian-Zu, GAO Xiao-Feng, CHEN Gui-Hai
2020, 31(3):620-633. DOI: 10.13328/j.cnki.jos.005910
Abstract:In the era of big data and cloud computing, efficient data access is an important metric for measuring the performance of a large-scale storage system. Therefore, designing a lightweight and efficient index structure that can meet the system's demand for high throughput and low memory footprint is one of the research hotspots in the current database field. Recently, Kraska et al. proposed replacing traditional B-tree indexes with machine learning models and achieved remarkable results on real data sets. However, the proposed model assumes that the workload is static and read-only, failing to handle the index update problem. This study proposes Dabble, a middle-layer-based scalable learned index model, which is used to mitigate the index update problem. Dabble first uses the K-means algorithm to divide the data set into K regions, and trains K neural networks to learn the data distributions of the different regions. During the training phase, it innovatively integrates the data access patterns into the neural networks, which improves the prediction accuracy of the model for hotspot data. For data insertion, it borrows the delayed-update idea of the LSM tree, which greatly improves the data writing speed. In the index update phase, a middle-layer-based mechanism is proposed for model decoupling, thus easing the cost of index updating. The Dabble model is evaluated on two datasets, a Lognormal-distribution dataset and the real-world Weblogs dataset. The experimental results demonstrate the effectiveness and efficiency of the proposed model compared with state-of-the-art methods.
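As a rough illustration of the partition-then-learn idea described in this abstract, the following sketch partitions one-dimensional keys with K-means and fits a per-region model that predicts a key's position, corrected by a bounded local search. The linear models, window size, and synthetic data are illustrative assumptions standing in for Dabble's per-region neural networks and its delayed-update machinery.

```python
# A minimal sketch of the partition-then-learn idea, not Dabble itself:
# K-means splits the key space, and a simple linear model per region
# (standing in for the paper's neural networks) predicts key positions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

keys = np.sort(np.random.lognormal(mean=0.0, sigma=2.0, size=10_000))
positions = np.arange(len(keys))

K = 8
km = KMeans(n_clusters=K, n_init=10).fit(keys.reshape(-1, 1))
models = {r: LinearRegression().fit(keys[km.labels_ == r].reshape(-1, 1),
                                    positions[km.labels_ == r])
          for r in range(K)}

def lookup(key, window=128):
    """Predict a position, then correct it with a bounded local search."""
    r = int(km.predict([[key]])[0])
    guess = int(models[r].predict([[key]])[0])
    lo, hi = max(0, guess - window), min(len(keys), guess + window)
    return lo + int(np.searchsorted(keys[lo:hi], key))  # a real index widens the window on a miss

print(lookup(keys[1234]))  # a position at or near 1234
```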
QIU Tao, WANG Bin, SHU Zhao-Wei, ZHAO Zhi-Bo, SONG Zi-Wen, ZHONG Yan-Hui
2020, 31(3):634-647. DOI: 10.13328/j.cnki.jos.005906
Abstract:Indexing is one of the most effective techniques for relational databases to achieve fast query processing. Intelligent index tuning can effectively adjust the indices of a database instance to obtain efficient query performance. Most existing methods utilize the query log to generate candidate indices and then use manually designed models to select indices, thereby adjusting the indices. However, the candidate indices generated from the query log may not exist in the database instance, so the effects of such indices on query processing cannot be precisely estimated. This study first designs and implements an intelligent index tuning system for relational databases. Secondly, it proposes a learning-based method to model the effect of an index on query processing, so that the query optimization effect of an index can be accurately estimated when selecting optimized indices. Then, an efficient index selection algorithm is designed to select the set of indices with the maximal utility from the candidate indices, subject to a space threshold. Finally, experiments are conducted to test the performance of the proposed system under different settings. The experimental results show that the proposed technique can effectively adjust indices and achieve a significant improvement in query performance for a relational database.
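To make the budgeted index selection step concrete, the sketch below chooses indices under a space budget with a greedy benefit-per-megabyte heuristic. The class, names, and numbers are hypothetical; the paper designs its own selection algorithm, and in its system the benefit values would come from the learned cost model rather than being given constants.

```python
# A minimal sketch of budgeted index selection, assuming a greedy
# benefit-density heuristic; an exact solver would treat this as a knapsack.
from dataclasses import dataclass

@dataclass
class CandidateIndex:
    name: str
    size_mb: float
    est_benefit: float  # estimated query-time saving, e.g. from a learned model

def select_indices(candidates, space_budget_mb):
    chosen, used = [], 0.0
    for c in sorted(candidates, key=lambda c: c.est_benefit / c.size_mb, reverse=True):
        if used + c.size_mb <= space_budget_mb:
            chosen.append(c)
            used += c.size_mb
    return chosen

cands = [CandidateIndex("idx_orders_date", 120, 9.0),
         CandidateIndex("idx_users_email", 40, 5.5),
         CandidateIndex("idx_items_cat", 200, 6.0)]
print([c.name for c in select_indices(cands, space_budget_mb=180)])
# -> ['idx_users_email', 'idx_orders_date']
```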
LAI Yong-Xuan, ZHANG Lu, YANG Fan, LU Wei, WANG Tian
2020, 31(3):648-662. DOI: 10.13328/j.cnki.jos.005901
Abstract:Bus arrival time prediction is an important basis for decision-making assistant systems for bus dispatching. It helps dispatchers find late vehicles in time and make reasonable dispatching decisions. However, bus arrival time is influenced by traffic congestion, weather, and the variable dwell time at stations and travel time between stations. It is a spatio-temporal dependence problem, which is quite challenging. This study proposes a new deep-neural-network-based algorithm called STPM for bus arrival time prediction. The algorithm uses spatio-temporal components, attribute components, and fusion components to predict the total bus arrival time from the starting point to the terminal. The spatio-temporal components are used to learn the internal spatio-temporal dependence, the attribute components learn the influence of external factors, and the fusion components fuse the outputs of the spatio-temporal and attribute components to predict the final result. Experimental results show that STPM can combine the advantages of convolutional and recurrent neural network models to learn the key temporal and spatial features, and that the proposed algorithm outperforms existing methods in terms of error percentage and accuracy of bus arrival time prediction.
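The toy sketch below shows how such a component-wise design can be wired together, assuming PyTorch and made-up tensor shapes: a convolutional layer stands in for the spatial component, a GRU for the temporal component, and a linear layer for the attribute component, with their outputs concatenated by the fusion step. It does not reproduce STPM's actual architecture.

```python
# A minimal three-component fusion sketch (assumed shapes, not STPM itself).
import torch
import torch.nn as nn

class FusionPredictor(nn.Module):
    def __init__(self, n_stations=30, attr_dim=8, hidden=32):
        super().__init__()
        self.spatial = nn.Conv1d(1, 4, kernel_size=3, padding=1)   # station-axis features
        self.temporal = nn.GRU(4 * n_stations, hidden, batch_first=True)
        self.attr = nn.Linear(attr_dim, hidden)                    # weather, day type, ...
        self.out = nn.Linear(hidden * 2, 1)                        # fusion -> arrival time

    def forward(self, seq, attrs):
        # seq: (batch, time_steps, n_stations) of per-segment travel times
        b, t, s = seq.shape
        x = self.spatial(seq.reshape(b * t, 1, s)).reshape(b, t, -1)
        _, h = self.temporal(x)                                    # last hidden state
        fused = torch.cat([h[-1], torch.relu(self.attr(attrs))], dim=1)
        return self.out(fused)

model = FusionPredictor()
pred = model(torch.randn(16, 12, 30), torch.randn(16, 8))          # -> (16, 1)
```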
LI Liang, WU Gang, WANG Guo-Ren
2020, 31(3):663-679. DOI: 10.13328/j.cnki.jos.005902
Abstract:The skiplist is a widely used indexing technique in database systems. Its advantage is an expected search complexity of O(log n). However, in the standard skiplist algorithm, the level of each node is produced by a random generator, so the performance of the skiplist is unstable. In the worst case, the search complexity degrades to O(n), similar to searching a linked list. This is because the classic skiplist does not take data features into account when generating its structure. It is believed that a stable skiplist structure should fully consider the distribution characteristics of the data when determining the number of node levels. This study estimates the cumulative distribution function of the data based on the kernel density estimation method, predicts the position of each datum in the skiplist, and thereby determines the number of node levels. In addition, it is found that a node with a higher level has a higher probability of being accessed. This study therefore also considers access frequency and frequently accessed hot data, ensuring that hot data reside in the upper levels of the skiplist while less frequently accessed cold data stay in the lower levels. Finally, a comprehensive experimental evaluation of six skiplist algorithms is performed on synthetic and real datasets, and the source code is made publicly available. The results show that the best skiplist algorithm can achieve a 60% performance improvement, which points out a promising direction for future researchers and system developers.
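A rough sketch of the CDF-driven idea follows: a kernel density estimate yields an approximate CDF, the CDF maps a key to a predicted rank, and the rank deterministically fixes the node's level. The specific promotion rule below (divisibility by powers of two) is an assumption for illustration, not the paper's rule.

```python
# A minimal sketch of KDE-based, data-aware level assignment.
import numpy as np
from scipy.stats import gaussian_kde

data = np.random.lognormal(0.0, 1.0, 5_000)
kde = gaussian_kde(data)
grid = np.linspace(data.min(), data.max(), 1_000)
cdf = np.cumsum(kde(grid))
cdf /= cdf[-1]                                   # approximate CDF on the grid

def level_for(key, n=len(data), max_level=16):
    """A key whose predicted rank is divisible by 2^k is promoted k levels,
    replacing the classic coin-flipping with a deterministic, data-aware rule."""
    rank = 1 + int(np.interp(key, grid, cdf) * (n - 1))
    level = 1
    while rank % 2 == 0 and level < max_level:
        rank //= 2
        level += 1
    return level

print(level_for(float(np.median(data))))         # level of a mid-rank key
```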
CAI Lei, ZHU Yan-Chao, GUO Qing-Xing, ZHANG Zhao, JIN Che-Qing
2020, 31(3):680-694. DOI: 10.13328/j.cnki.jos.005914
Abstract:The blockchain system is favored by many fields, such as finance and logistics, due to several unique properties, including its decentralized architecture, data immutability, and data traceability. Transactions of the same type are commonly scattered across massive numbers of blocks because all transactions are stored in the chronological order of their commitment, which makes tracing queries that involve a huge number of historical blocks inefficient. Although indexing and materialized views are two typical ways to boost query performance, indexing cannot lower the I/O cost when the data to be processed are widely distributed in the system; fortunately, materialized views suit this scenario well. Furthermore, as traditional materialized view technologies for RDBMSs cannot be directly adopted in blockchains due to the significant differences between them, a set of materialized view mechanisms is proposed for blockchains with the following properties: (1) to lower the impact on the system, view maintenance is executed in parallel with the consensus process; (2) a Trie-tree is used to speed up the maintenance of multiple materialized views across blocks; (3) query results are guaranteed to be credible by using Merkle verification to ensure that the materialized results have not been falsified. After integrating the proposed materialized view maintenance mechanism into a blockchain system, experimental results show that the proposed method is convenient and efficient.
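Property (3) relies on standard Merkle verification; the generic sketch below (not the paper's exact scheme) shows how a client can check that a materialized result was not falsified, given only the result, a short audit path, and a trusted root hash.

```python
# A generic Merkle-root/audit-path check over SHA-256, used here to
# illustrate how materialized results can be verified as untampered.
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:                         # duplicate last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def verify(leaf, proof, root):
    """proof: list of (sibling_hash, sibling_is_left) pairs from leaf to root."""
    node = h(leaf)
    for sibling, is_left in proof:
        node = h(sibling + node) if is_left else h(node + sibling)
    return node == root

results = [b"tx1", b"tx2", b"tx3", b"tx4"]        # a materialized view's rows
root = merkle_root(results)
proof = [(h(b"tx2"), False), (h(h(b"tx3") + h(b"tx4")), False)]
print(verify(b"tx1", proof, root))                # True: result not falsified
```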
SUN Chen-Chen, SHEN De-Rong, LI Yu-Kun, XIAO Ying-Yuan, MA Jian-Hong
2020, 31(3):695-709. DOI: 10.13328/j.cnki.jos.005900
Abstract:Entity resolution (ER) is an important aspect of data integration and data cleaning, and is also a necessary pre-processing step of big data analytics and mining. The overall runtime of traditional batch-based ER is costly and cannot satisfy the (near) real-time requirements of current data applications. Therefore, this study focuses on time-constrained entity resolution (TC-ER), whose core problem is ranking record pairs according to match probability. Information within blocks and information across blocks produced by multi-pass blocking are analyzed respectively, and two basic record match probability methods are proposed. The basic methods are then improved by an advanced record match probability method based on similarity flowing over a bipartite graph. A bipartite graph is constructed from record pairs, blocks, and the relations between them, and similarities iteratively flow between pair nodes and block nodes over the bipartite graph until convergence; the convergence result is computed with fixpoint iterations. An approximate convergence computation method is further proposed to reduce cost, which improves real-time recall in TC-ER. Finally, the proposed methods are evaluated on two datasets, which demonstrates their effectiveness and examines their different aspects.
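The sketch below illustrates the similarity-flowing idea on a tiny pair-block bipartite graph, assuming a damped averaging update as a stand-in for the paper's fixpoint iteration: block scores are averages of their pairs' scores, which in turn feed back into the pair scores until the values stop changing.

```python
# A minimal sketch of similarity flowing over a pair-block bipartite graph.
import numpy as np

# Incidence matrix: pairs x blocks (1 if the pair co-occurs in the block).
A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)
pair_sim = np.array([0.9, 0.2, 0.5])    # initial pairwise similarities
alpha = 0.5                              # damping toward the initial scores

p = pair_sim.copy()
for _ in range(100):
    b = A.T @ p / A.sum(axis=0)          # block score = mean of its pairs
    p_new = alpha * pair_sim + (1 - alpha) * (A @ b / A.sum(axis=1))
    if np.abs(p_new - p).max() < 1e-9:   # fixpoint reached
        break
    p = p_new
print(p)                                  # refined match probabilities for ranking
```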
ZHENG Jiao-Ling, QIAO Shao-Jie, SHU Hong-Ping, YING Guang-Hua, Louis Alberto GUTIERREZ
2020, 31(3):710-725. DOI: 10.13328/j.cnki.jos.005905
Abstract:In a distribution channel system, a product manufacturer often rewards retail traders who make big deals in order to increase sales. On the other hand, to obtain a high reward, retail traders may form an alliance in which a cheating retail trader accumulates the deals of other retail traders. This type of commercial fraud is called deal cheating or cross-region sale. Because the sales data contain many normal big deals, traditional outlier detection methods cannot distinguish normal extreme values from the true outliers generated by deal cheating behavior. Meanwhile, the sparsity of multidimensional sales data prevents outlier detection methods based on multidimensional space from working effectively. To handle these problems, this study proposes deal cheating mining algorithms based on a ratio characteristic and a tensor reconstruction method, combining artificial intelligence and database techniques. Meanwhile, because there are multiple types of deal cheating patterns, this study proposes deal cheating pattern classification methods based on the partially ordered lattice of deal cheating patterns. In experiments on synthetic data, the deal cheating detection algorithm based on the ratio characteristic achieves an average AUC value of 65%, while traditional feature extraction methods only achieve average AUC values of 36% and 30%. In experiments on real data, the results show that the deal cheating detection algorithm is capable of distinguishing normal big deals from the abnormal big deals that may be generated by deal cheating behaviors.
DING Xiao-Ou, YU Sheng-Jian, WANG Mu-Xian, WANG Hong-Zhi, GAO Hong, YANG Dong-Hua
2020, 31(3):726-747. DOI: 10.13328/j.cnki.jos.005907
Abstract:Anomaly detection on multi-dimensional time series is an important research problem in temporal data analysis. In recent years, large-scale industrial time series data have been collected and accumulated by equipment sensors in the Industrial Internet of Things (IIoT). These data exhibit diverse data patterns and workflows, which demands anomaly detection methods with high efficiency, effectiveness, and reliability. Besides, there exist latent correlations between sequences from different dimensions, and this correlation information can be used to identify and explain anomalies in the data. Based on this, this study proposes a correlation-analysis-based anomaly detection method for multi-dimensional time series data. It first computes correlation values among sequences after standardization, and constructs a time series correlation graph model. Time series cliques are then built according to the correlation degrees in the correlation graph, and anomaly detection is performed both within and outside each clique. Experimental results on a real industrial sensor data set show that the proposed method is effective for anomaly detection tasks on high-dimensional time series data. Through contrast experiments, the proposed method is verified to outperform both statistic-based and machine-learning-based baseline methods. This research achieves reliable correlation knowledge mining between time series, which not only saves time costs but also identifies abnormal patterns from complex conditions.
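A rough sketch of the correlation-graph construction follows, assuming Pearson correlation, a fixed threshold, and connected components as a simple stand-in for the paper's clique construction; a dimension that breaks the usual correlation of its group would then be flagged as anomalous.

```python
# A minimal sketch: standardize, compute pairwise correlation, build a graph
# over dimensions, and group strongly correlated dimensions.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
base = rng.normal(size=500)
series = np.stack([base + 0.1 * rng.normal(size=500),    # dim 0
                   base + 0.1 * rng.normal(size=500),    # dim 1, correlated with 0
                   rng.normal(size=500)])                 # dim 2, independent

z = (series - series.mean(axis=1, keepdims=True)) / series.std(axis=1, keepdims=True)
corr = z @ z.T / series.shape[1]                          # pairwise correlations

G = nx.Graph()
G.add_nodes_from(range(len(series)))
for i in range(len(series)):
    for j in range(i + 1, len(series)):
        if abs(corr[i, j]) > 0.8:
            G.add_edge(i, j)
print(list(nx.connected_components(G)))                   # [{0, 1}, {2}]
```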
GUO Jia-Yan, LI Rong-Hua, ZHANG Yan, WANG Guo-Ren
2020, 31(3):748-762. DOI: 10.13328/j.cnki.jos.005903
Abstract:Dynamic graph-structured data is ubiquitous in real-life applications, and mining outliers on dynamic networks is an important problem that is very useful in practice. Most traditional network outlier detection algorithms focus mainly on structural anomalies, ignoring the attributes of nodes and edges as well as time-varying features. This study proposes a graph-neural-network-based network anomaly detection algorithm that captures the attributes of nodes and edges together with time-varying features, and fully uses these features to learn a representation vector for each node. Specifically, the proposed algorithm improves on an unsupervised graph neural network framework called DGI. Based on DGI, a new dynamic algorithm, called Dynamic-DGI, is proposed for dynamic networks. Dynamic-DGI can simultaneously extract the abnormal characteristics of the network itself and the abnormal characteristics of network changes. The experimental results show that the proposed algorithm is better than the state-of-the-art anomaly detection algorithm SpotLight, and is significantly better than traditional network representation learning algorithms. In addition to improving accuracy, the proposed algorithm is also able to mine interesting anomalies in the network.
ZHAO Chao, WANG Teng-Jiang, LIU Shi-Jun, PAN Li, JI Cun
2020, 31(3):763-777. DOI: 10.13328/j.cnki.jos.005912
Abstract:Shapelet-based time series classification algorithms have the characteristics of interpretability, high classification accuracy, and fast classification speed. Among these algorithms, the learning-Shapelet algorithm does not rely on a single classifier and can learn Shapelets that do not appear in the original time series, achieving high classification accuracy while completing Shapelet discovery and classifier construction at the same time. However, if too many Shapelets are generated, the number of dependent parameters increases, resulting in long training times, low classification speed, and difficult dynamic updates; moreover, similar redundant Shapelets reduce the interpretability of the classification. This study proposes a new selective extraction algorithm for choosing the Shapelet candidate set, changes the learning method to accelerate the Shapelet learning process, and puts forward two optimization strategies. By clustering the original training set, Shapelets that are not in the original time series can still be obtained; meanwhile, a voting mechanism is added to the selective extraction algorithm to solve the problem of excessive Shapelet generation. Experiments show that the proposed algorithm improves training speed while maintaining high accuracy.
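The core primitive behind all Shapelet methods is the distance between a candidate subsequence and a time series, i.e., the minimum distance over all sliding windows. The short sketch below shows this primitive; names and normalization are illustrative, not the paper's.

```python
# A minimal sketch of the shapelet-to-series distance primitive.
import numpy as np

def shapelet_distance(series: np.ndarray, shapelet: np.ndarray) -> float:
    m = len(shapelet)
    windows = np.lib.stride_tricks.sliding_window_view(series, m)
    return float(np.min(np.linalg.norm(windows - shapelet, axis=1)) / m)

series = np.sin(np.linspace(0, 6 * np.pi, 200))
shapelet = np.sin(np.linspace(0, np.pi, 20))      # one "bump" shape
print(shapelet_distance(series, shapelet))        # small: the shape occurs
# A classifier then thresholds such distances or uses them as features.
```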
ZHANG Qing-Bo, WANG Bin, CUI Ning-Ning, SONG Xiao-Xu, QIN Jing
2020, 31(3):778-793. DOI: 10.13328/j.cnki.jos.005913
Abstract:In recent years, matrix factorization (MF) has been widely exploited in recommender systems because of its effectiveness and simplicity. However, the data sparsity and cold-start problems mean that the latent user features learned by MF cannot accurately represent users' preferences or the similarity relations among users, which limits the performance of MF. To remedy this, regularized matrix factorization (RMF) has drawn researchers' attention, and the problem demanding a prompt solution in RMF is capturing reliable similarity relations among users. Besides, MF simply regards the inner product between the latent features of the target user and the target item as the score that the user would give the item, ignoring that a user pays different attention to different features of an item. How to analyze a user's attention on item features and capture more accurate user preferences is still a challenge. To address these issues, a model named attention-based regularized matrix factorization (ARMF) is put forward. Specifically, to settle the data sparsity and cold-start problems and obtain reliable similarity relations among users, the model builds a user-item heterogeneous network from the social network and the rating records, from which the similarities among users are derived. Incorporating an attention mechanism into MF makes it possible to analyze users' attention on different item features and capture more accurate user preferences, which further improves the precision of MF. Finally, the proposed model is compared with state-of-the-art models on two real-world datasets, and the results demonstrate the better precision and robustness of ARMF.
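As a toy illustration of weighting the inner product by per-feature attention, the sketch below replaces the plain dot product with a softmax-weighted one; the parameterization is an assumption for illustration and does not reproduce ARMF's formulation.

```python
# A minimal sketch of an attention-weighted inner product in MF.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 16
user = np.random.randn(d)            # latent user features
item = np.random.randn(d)            # latent item features
W = np.random.randn(d, d) * 0.1      # attention parameters (learned in training)

attn = softmax((W @ user) * item)    # user-specific weight per latent feature
score = float(d * np.sum(attn * user * item))   # vs. plain np.dot(user, item)
print(score)
```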
CHEN Bi-Yi, HUANG Ling, WANG Chang-Dong, JING Li-Ping
2020, 31(3):794-805. DOI: 10.13328/j.cnki.jos.005897
Abstract:The combination of explicit and implicit feedback can effectively improve recommendation performance. However, existing recommender systems have shortcomings in integrating the two: either the ability of implicit feedback to reveal hidden preferences from missing values is ignored, or the ability of explicit feedback to reflect users' preferences is not fully utilized. To address this issue, this study proposes a collaborative filtering algorithm based on both explicit and implicit feedback. The algorithm is divided into two stages: the first stage processes the implicit feedback data by weighted low-rank approximation to train implicit user/item vectors, and the second stage introduces a baseline estimate and uses the implicit user/item vectors as supplements to the explicit user/item vectors. Through the combination of explicit and implicit user/item vectors, predictions of users' preferences for items are obtained by training. The proposed algorithm is compared with several typical algorithms on standard datasets, and the results confirm its feasibility and effectiveness.
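The prediction form described here resembles a baseline estimate plus an inner product of combined explicit/implicit vectors; a toy sketch with illustrative symbols (not the paper's notation) follows.

```python
# A minimal sketch: baseline estimate plus combined explicit/implicit vectors.
import numpy as np

d, mu = 8, 3.6                        # latent size, global mean rating
b_u, b_i = 0.2, -0.1                  # user and item biases (baseline estimate)
p_u, q_i = np.random.randn(d), np.random.randn(d)   # explicit vectors (stage 2)
x_u, y_i = np.random.randn(d), np.random.randn(d)   # implicit vectors (stage 1)

r_hat = mu + b_u + b_i + np.dot(p_u + x_u, q_i + y_i)
print(r_hat)                          # predicted preference of user u for item i
```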
CHAI Ming-Ke, FAN Ju, DU Xiao-Yong
2020, 31(3):806-830. DOI: 10.13328/j.cnki.jos.005908
Abstract:Modern database systems follow a general design principle intended to serve various data types and application workloads. While this principle has achieved great success in the last decades, it has a limitation: a database system may not achieve superior performance if it cannot be "customized" to specific data distributions and workload characteristics. To address this problem, learnable database systems, built on the novel idea of using machine learning to optimize database systems, have attracted much attention from both industrial and academic communities, and extensive efforts have recently been made to advance the field. This survey systematically reviews the existing studies from the perspective of database system architecture. A fine-grained taxonomy is provided by categorizing the existing works according to their target learnable database components. To help readers better understand each type of learnable component, its motivation is presented, the underlying insights are demonstrated, and the key techniques are introduced. Finally, a number of promising future research directions for learnable database systems are outlined.
2020, 31(3):831-844. DOI: 10.13328/j.cnki.jos.005899
Abstract:In the big data era, database systems face three challenges. First, traditional empirical optimization techniques (e.g., cost estimation, join order selection, knob tuning) cannot meet the high-performance requirements of large-scale data, varied applications, and diversified users; learning-based techniques are needed to make databases more intelligent. Second, many database applications require AI algorithms, e.g., image search in databases. AI algorithms can be embedded into the database, using database techniques to accelerate them and providing AI capability inside the database. Third, traditional databases focus on general-purpose hardware (e.g., CPUs) but cannot fully utilize new hardware (e.g., ARM, GPUs, AI chips); moreover, besides the relational model, the tensor model can be utilized to accelerate AI operations, so new techniques are needed to make full use of new hardware. To address these challenges, an AI-native database is designed. On one hand, AI techniques are integrated into the database to provide self-configuring, self-optimizing, self-monitoring, self-diagnosing, self-healing, self-assembling, and self-securing capabilities. On the other hand, the database is enabled to provide AI capabilities through declarative languages in order to lower the barrier to using AI. This study introduces five levels of AI-native databases and presents several open challenges in designing them. Autonomous database knob tuning, a deep-reinforcement-learning-based optimizer, machine-learning-based cardinality estimation, and an autonomous index/view advisor are also taken as examples to showcase the superiority of AI-native databases.
ZHANG Yi-Tao, WAN Chang-Xuan, LIU Xi-Ping, JIANG Teng-Jiao, LIU De-Xi, LIAO Guo-Qiong
2020, 31(3):845-865. DOI: 10.13328/j.cnki.jos.005898
Abstract:With the increasing richness of economic activity data, a large number of financial texts have emerged on the Internet, containing factors that influence economic development. How to mine these economic factors from such texts is the key to conducting economic analysis based on unstructured data. Given the limitations of manually selecting economic indicators and the inaccuracy of modelling economic indicators in unstructured texts, the CRF (Chinese restaurant franchise) allocation processes in the HDP topic model are extended into a more effective pattern. To describe the dish style of a restaurant, existing economic taxonomies are used to determine the domain membership of a document. The semantic similarity between words is exploited to define the semantic relevance between words and topics, which reflects the similarity of customers' requirements for dishes. For each word, its representativeness of each topic is employed to evaluate its contribution to that topic, which corresponds to the loyalty of a customer to each dish. By combining documents' domain properties, word semantics, and words' presence in topics with the HDP topic model, a novel model, the PSP_HDP topic model, is proposed. As the PSP_HDP topic model improves the document-topic and topic-word allocation processes, it increases the accuracy of identifying economic topics and the distinctiveness of the topics, leading to more effective mining of economic topics and economic factors. Experimental results show that the proposed model not only achieves better performance in terms of topic diversity, topic perplexity, and topic complexity, but is also effective in finding more cohesive unstructured economic indicators and economic factors.
LIU Rui-Xuan, CHEN Hong, GUO Ruo-Yang, ZHAO Dan, LIANG Wen-Juan, LI Cui-Ping
2020, 31(3):866-892. DOI: 10.13328/j.cnki.jos.005904
Abstract:In the era of big data, rich sources of data drive the development of machine learning technology. However, the risk that a model leaks the privacy of its training data during the data collection and training stages poses essential challenges to data management in the artificial intelligence age. Traditional privacy-preserving methods for data management and analysis cannot cope with the complex privacy problems arising in the various stages and scenarios of machine learning. This study surveys state-of-the-art works on privacy attacks and defenses in machine learning. On the one hand, scenarios of privacy leakage and adversarial models of privacy attacks are illustrated, and specific privacy attacks are classified with respect to adversarial strategies. On the other hand, three main technologies commonly applied in privacy preservation for machine learning are introduced, and the key problems in their application are pointed out. In addition, five defense strategies and their corresponding mechanisms are elaborated. Finally, future directions and challenges of privacy preservation in machine learning are summarized.
WANG Song, PENG Yu-Wei, LAN Hai, LUO Qian-Wen, PENG Zhi-Yong
2020, 31(3):893-908. DOI: 10.13328/j.cnki.jos.005911
Abstract:Data integration plays a very important role in data management and analytics. Although decades have passed since the data integration problem was first proposed, many data integration problems remain unsolved. This study surveys the work in the data integration area from 2001 to the present. By categorizing these papers and their methodologies, it summarizes how the works have developed and how their research topics have shifted over time. Several research topics that have drawn much attention recently are also identified, and hopefully this survey and its conclusions can provide guidance to researchers in the field.