• Volume 29,Issue 3,2018 Table of Contents
    Select All
    Display Type: |
    • >Special Issue's Articles
    • Streamlined Asynchronous Graph Processing Framework

      2018, 29(3):528-544. DOI: 10.13328/j.cnki.jos.005441 CSTR:

      Abstract (4291) HTML (2700) PDF 1.86 M (7263) Comment (0) Favorites

      Abstract:Distributed graph processing is mainstream but suffers from a few unavoidable issues, such as workload imbalancing and the debugging/optimizing difficulties in distributed programs. On the other hand, recent research results show that with a reasonable design of data structure and processing model, graph processing on a single PC can achieve comparable performance as the systems using large number of machines. For example, GraphChi on a single PC can achieve almost the same performance with Spark with 50 nodes. In this paper, a streamlined asynchronous graph processing model, ASP is proposed based on accumulated iterative model and external storage based parallel computing techniques. ASP relies on sequential disk access and allows asynchronous computations on the graph structure data. Based on ASP, a streamlined graph processing framework, S-Maiter is designed and implemented to provide high performance graph processing ability on a single PC. By optimizing I/O threading, memory monitoring, and shard-level priority scheduling, the performance of S-Maiter is greatly improved. Experimental results on a big graph dataset (13 million nodes and 500 million edges) show that, 1-node S-Maiter can achieve comparable performance with distributed Maiter with 16 nodes. Furthermore, S-Maiter is 1.5 times faster than the popular single-PC graph processing system GraphChi.

    • P&D GraphOLAP: Parallel Framework for Large-Scale Multidimensional Network Analysis

      2018, 29(3):545-568. DOI: 10.13328/j.cnki.jos.005443 CSTR:

      Abstract (4087) HTML (3160) PDF 2.76 M (6151) Comment (0) Favorites

      Abstract:Most data in real life can be described as multidimensional networks. How to process the analysis on multidimensional networks from multiple views and multiple granularities is still the focus of current research. Meanwhile, OLAP (online analytical processing) technology has been proven to be an effective tool on relational data. However, it is an enormous challenge to manage and analyze multidimensional heterogeneous networks via OLAP technology to support effective decision making. In this paper, a P&D (path and dimension) graph cube model is proposed. Based on this model, the graph cube materialization is divided into two parts, termed as path related materialization and dimension related materialization, and the corresponding materialization algorithms are designed. Some GraphOLAP operations are also refined to improve the ability of analyzing multidimensional networks. Finally, the algorithms are implemented on Spark and the multidimensional networks are constructed through real datasets. These networks are then analyzed using the framework. The results of experiments validate the effectiveness and scalability of P&D graph cube model and the materialization algorithms.

    • Survey on Technologies of Distributed Graph Processing Systems

      2018, 29(3):569-586. DOI: 10.13328/j.cnki.jos.005450 CSTR:

      Abstract (4735) HTML (4258) PDF 2.59 M (8446) Comment (0) Favorites

      Abstract:Well recognized as a primitive data structure, graph is an abstraction of objects and their pairwise connections. There exists a wide spectrum of applications, including semantic web analysis, social network analysis, biological genetic analysis and information retrieval, which can be modeled as graphs. Therefore, it is of great importance to conduct data analysis over these applications. With the development of information technology such as mobile Internet and Internet of things, the scale of graph data is increasing continuously and rapidly. To provide fast analysis over large-scale graph data, Pregel was first proposed as a distributed graph processing framework by Google. Since then, based on Pregel framework, a variety of optimization techniques and systems have been proposed by academic and industrial communities. Through extensive investigation and analysis, this paper first establishes three optimization objectives for the state-of-the-arts solutions to build distributed graph processing systems. Subsequently, it reviews mainstream optimizing techniques for the state-of-the-arts solutions from the perspective of computation granularity, task scheduling, communication mode and load balance. Finally, the paper discusses some open research problems and possible future research directions in this field.

    • Mining Coteries Trajectory Patterns for Recommending Personalized Travel Routes

      2018, 29(3):587-598. DOI: 10.13328/j.cnki.jos.005452 CSTR:

      Abstract (4593) HTML (2590) PDF 1.38 M (8104) Comment (0) Favorites

      Abstract:Coterie is an asynchronous group pattern that finds the group patterns with similar trajectory behavior under unequal time interval constraints. The traditional trajectory pattern mining algorithm often deals with GPS data with fixed time interval sampling constraints, which cannot be directly used for coterie pattern mining. At the same time, the traditional group pattern mining has the problem of missing semantic information, and thus reduces the completeness and accuracy of individualized tourist routes. To address the issue, two semantic-based tourism route recommendation strategies, distance-aware recommendation strategy based on semantics (DRSS) and conformity-aware recommendation strategy based on semantics (CRSS), are proposed in this paper. In addition, with the increasing size of social network data, the efficiency of traditional group model clustering algorithm is of great challenge. Therefore, in order to deal with large-scale social network trajectory data efficiently, MapReduce programming model with optimized clustering is used to mine the coterie group pattern. The experimental results show that the coterie group pattern mining with optimized clustering and semantic information under the MapReduce programming model achieves better recommendation quality than the traditional group pattern travel route in the personalized tourism route recommendation and can effectively handle the large-scale social network trajectory data.

    • Distance Metric Based Diversified Ranking on Large Graphs

      2018, 29(3):599-613. DOI: 10.13328/j.cnki.jos.005455 CSTR:

      Abstract (3931) HTML (2633) PDF 1.89 M (5390) Comment (0) Favorites

      Abstract:Expansion relevance which combines both relevance and diversity into a single function is resorted to a submodular optimization objective that can be solved by applying the classic cardinality constrained monotone submodular maximization. However, expansion relevance do not directly capture the dis-similarity over a pair of nodes. Existing submodular algorithms are sequential and not easy to take full advantage of the power of distributed cluster computing platform, such as Spark, to significantly improve the efficiency of algorithm. To tackle this issue, in this paper, a distance metric, which is defined by a sum function of personalized PageRank scores over the symmetry difference of neighbors of a pair of nodes, is first introduced to capture the pairwise dis-similarity over pairs of nodes. Then, the problem of diversified ranking on graphs is formulated as a max-sum k-dispersion problem with metrical edge weight. A polynomial time 2-approximate algorithm is proposed to solve the problem. Considering the computational independence of different pairs of nodes, a MapReduce algorithm is further developed to boost the efficiency of the process. Finally, extensive experiments are conducted on real network datasets to verify the effectiveness and efficiency of the proposed algorithm.

    • Signed Network Prediction Method Based on the Client-to-Client Distributed Framework

      2018, 29(3):614-626. DOI: 10.13328/j.cnki.jos.005447 CSTR:

      Abstract (4121) HTML (2731) PDF 1.43 M (5346) Comment (0) Favorites

      Abstract:The edges of a network can be divided into positive and negative relationships according to their potential meanings. When the edges of a network are signed with plus or minus signs respectively, a signed network can be formed. Signed networks are widely used in many fields such as sociology, informatics and biology. Hence, the sign prediction problem in signed networks has become one of research hot spots. In large dataset, the scalability of sign prediction algorithm is still a great challenge. There are many related works in the distributed design of signed network prediction methods, however, the computation efficiency is still limited by the fundamental server/client framework. This paper proposes client to client distributed framework (C2CDF). Compared with traditional server/client framework, C2CDF is a completely new client-to-client framework which can release the bandwidth pressure by abandoning the server node and allowing the communications between the client nodes. The Experiments on sign prediction in signed social networks, prediction in click-through rate and prediction in forest type show that C2CDF is a general approach which can not only be applied in sign prediction in signed network but also be used in the other prediction areas. In these three datasets, C2CDF can achieve better performance than FM inferred by the traditional SGD algorithm. C2CDF also achieves a 2.3-3.3x speed-up over the method implemented under the server/client framework while obtains a better accuracy performance than the method compared against.

    • MapReduce-Based Graph Structural Clustering Algorithm

      2018, 29(3):627-641. DOI: 10.13328/j.cnki.jos.005456 CSTR:

      Abstract (4551) HTML (2788) PDF 1.82 M (6372) Comment (0) Favorites

      Abstract:Graph Clustering is a fundamental task for graph mining which has been widely used in social network analysis related applications. Graph structural clustering (SCAN) is a well-known density-based graph clustering algorithm. SCAN algorithm can not only find the clusters in a graph, but also be able to identify hub nodes and outliers. However, with the growing graph size, the traditional SCAN algorithm is very hard to handle massive graph data, as its time complexity is O(m1.5) (m is the number of edges in the graph). To overcome the scalability issue of SCAN algorithm, this paper proposes a MapReduce based graph structural clustering algorithm, called MRSCAN. Specifically, the paper develops a MapReduce based similarity computation, a core node computation, as well as two clustering merging algorithms. In addition, it conducts extensive experiments over serval real-world graph datasets, and results demonstrate the accuracy, effectiveness, and scalability of the presented algorithm.

    • Nearest Neighbor Query in Road Networks

      2018, 29(3):642-662. DOI: 10.13328/j.cnki.jos.005442 CSTR:

      Abstract (4202) HTML (2959) PDF 2.26 M (6578) Comment (0) Favorites

      Abstract:Nearest neighbor query, as one of the building blocks of location-based service, has become a hot research topic in recent years. Compared with Euclidean space, road network is a more practical model in real applications; hence, nearest neighbor query in road network has received broader research efforts. In road network, tremendous data are generated along with sophisticated data structure, making nearest neighbor query computationally expensive. This poses a major challenge to spatial database community on its effort to effectively improve the query processing efficiency for nearest neighbor query. This work summarizes existing nearest neighbor query techniques in road network, and conducts analysis and comparison among them, from various perspectives including indexing structure and algorithm implementation. Additionally, several variants of nearest neighbor query are also summarized in this work. Finally, future research focus and trend for nearest neighbor query in road network are discussed.

    • Survey on Dynamic Graph Pattern Matching Technologies

      2018, 29(3):663-688. DOI: 10.13328/j.cnki.jos.005444 CSTR:

      Abstract (6123) HTML (3709) PDF 2.76 M (8509) Comment (0) Favorites

      Abstract:With the advent of big data era, the rapid growth of multi-source heterogeneous data has become an open problem. The inherent relationships between these data are usually modeled by the graph model. However, in practical applications, such as network security analysis and public opinion analysis over social networks, the structure and content of the graph data describing relationships between entity objects are usually not fixed. To be specific, the structure of the graph data, and the attributes of the nodes and edges in it will vary over time. Therefore, efficient query and match over dynamically updated graph data currently draws extensive research, where many outstanding research works are proposed. In this paper, the research progress of dynamic graph data matching technologies is reviewed from the aspects of key technologies, representative algorithms and performance evaluation. The state-of-the-art applications, the challenging problems and the research trend of dynamic graph matching technologies are summarized.

    • SQL-Based Solution for Fast Graph Similarity Search

      2018, 29(3):689-702. DOI: 10.13328/j.cnki.jos.005449 CSTR:

      Abstract (4639) HTML (2984) PDF 1.50 M (6211) Comment (0) Favorites

      Abstract:Graphs are widely used to model complicated data in many areas such as social networking, knowledge base, semantic web, bioinformatics and cheminformatics. More and more graph data are collected such that it has become a rather challenging problem to manage such complex data. The database community has had a long-standing interest in querying graph databases, and graph similarity search is one of most popular topics. This paper focuses on the graph similarity search problem with edit distance constraints. Firstly, several state-of-the-art methods are investigated to reveal that all the proposed pruning rules have limitations and none of them can outperform others on various queries. To address this problem, then a novel approach is proposed to support the graph similarity search in the framework of query evaluation using the relational model. The proposed approach develops a novel unified filtering framework by combing all the existing pruning rules. It can avoid limitations on existing pruning rules, and have more widely applications. A series of experiments are also conducted to evaluate the proposed approach. The results show that the new approach can outperform all existing state-of-the-art methods.

    • Privacy Preserving Method for Point-of-Interest Query on Road Network

      2018, 29(3):703-720. DOI: 10.13328/j.cnki.jos.005451 CSTR:

      Abstract (3978) HTML (2268) PDF 2.06 M (5714) Comment (0) Favorites

      Abstract:In recent years, the rapid development of wireless communication technology has promoted the development of locationbased services (LBS), among which the point-of-interest (POI) query is one of the most important applications. A novel privacy preserving method of k-anonymous model is proposed to solve the problem of leaking location privacy during the query process in road network environment. First, the anonymous server uses the set of points of interest to construct the network Voronoi diagram. Then, the whole road network is divided into independent units which are called network Voronoi cell (NVC) without overlapping. Moreover, the anonymous server uses the Hilbert curve to traverse the road network space and sort the points of interest in accordance with Hilbert order. When a user requests a query, the anonymous algorithm selects dispersed k-1 NVCs which have the same query frequency with the NVC that user located in, and then generates dummy locations in the relative road segments corresponding to the user's in each NVC. The reciprocity of the anonymity set can be ensured and the inference attack that traditional k-anonymity can't resist can be avoided through the proposed anonymous algorithm. Finally, the theoretical analysis and experimental results show that the proposed privacy preserving scheme can effectively protect the location privacy.

    • Search of Genes with Similar Phenotype Based on Disease Information Network

      2018, 29(3):721-733. DOI: 10.13328/j.cnki.jos.005445 CSTR:

      Abstract (4038) HTML (2401) PDF 1.41 M (6338) Comment (0) Favorites

      Abstract:The results of Human Genome Project promote the development of bioinformatics. Searching disease genes that have function correlations, also called similar phenotype genes, based on the strategy of disease phenome similarity becomes an emerging research topic due to its important research value and wide range of applications. However, in biomedical field, there is no previous work that applies computer methods to search similar phenotype genes via a network consists of "gene-disease-phenotype" relations. To fill the gap, in this study, a disease information network containing three heterogeneous nodes (i.e., gene, disease, and phenotype) is built by making use of a disease open database. In addition, an algorithm, called gSim-Miner, is designed for the search of similar phenotype genes via the disease information network. Pruning strategies based on the characteristics of disease phenotype data are proposed to improve the efficiency of gSim-Miner. Experiments on real-world data sets demonstrate that the disease information network is feasible, and gSim-Miner is effective, efficient and extensible.

    • Road Network Aware Online Trajectory Compression

      2018, 29(3):734-755. DOI: 10.13328/j.cnki.jos.005438 CSTR:

      Abstract (4313) HTML (2717) PDF 2.69 M (7524) Comment (0) Favorites

      Abstract:With the rapid development of positioning technologies, positioning sensors are widely used in smart phones, car navigation system and other mobile devices. These positioning systems collect data points at certain sampling rates and produce massive trajectories, which further bring the challenges of storage and transmission of the trajectory data. The trajectory compression technique reduces the waste of the network bandwidth and the storage space by removing the redundant trajectory points and preserving the key trajectory points. This paper summarizes the progresses of trajectory compression researches and proposes a road-network aware and error bounded online trajectory compression system, named ROADER. The system includes a distance-bounded Hidden Markov map matching algorithm and error-bounded efficient trajectory compression algorithm. Experiments based on real data sets show that the system is superior to similar systems in terms of compression ratio, error occurrence and running time.

    • Edge Sampling Based Network Embedding Model

      2018, 29(3):756-771. DOI: 10.13328/j.cnki.jos.005435 CSTR:

      Abstract (3939) HTML (2530) PDF 2.34 M (6478) Comment (0) Favorites

      Abstract:With the development of online social networks such as Weibo, WeChat and Facebook, network representation learning has drawn widespread research interests from academia and industry. Traditional network embedding models exploit the spectral properties of matrix representations of graphs, which suffer from both computation and performance bottlenecks when applied to real world networks. Recently, a lot of neural network based embedding models are presented in the literature. They are computationally efficient and preserve the network structure information well. The vertices in the network are connected to various types of relations, which convey rich information. However, such important information are neglected by all existing models. This paper proposes NEES, an unsupervised network embedding model to encode the relations. It first obtains the edge vectors by edge sampling to reflect the relation types of the edges. Then, it uses the edge vectors to learn a low dimension representation for each node in the graph. Extensive experiments are conducted on several social networks and one citation network. The results show that NEES model outperforms the state-of-the-art methods in multi-label classification and link prediction tasks. NEES is also scalable to large-scale networks in the real world.

    • Effective and Efficient Approach for Graph De-Anonymization

      2018, 29(3):772-785. DOI: 10.13328/j.cnki.jos.005436 CSTR:

      Abstract (4253) HTML (2752) PDF 1.81 M (6584) Comment (0) Favorites

      Abstract:Ever since social networks became the focus of a great number of researches, the privacy risks of published network data have also raised considerable concerns. To evaluate users' privacy risks, researchers have developed methods to de-anonymize graphs and identify same person in different graphs, yet the existing algorithms either requires high-quality seed mappings, or have low accuracy and high expense. In this paper, an effective and efficient seedless de-anonymization algorithm, "RoleMatch" is proposed. This algorithm is based on the network topology and consists of (1) a new cross-graph node similarity measurement "RoleSim++" with fast computation method, and (2) an effective node matching algorithm considering both similarities and feedbacks. In experiments, the algorithm is tested with graphs anonymized in several popular anonymization ways, using the data from LiveJournal. In addition to the traditional symmetric experiments, an asymmetric experiment setting is proposed to mimic closer to real-world application. The results from those experiment show that with the proposed algorithm the de-anonymization work achieves superior performance compared with existing de-anonymization algorithms.

    • Graph Embedding by Incorporating Prior Knowledge on Vertex Information

      2018, 29(3):786-798. DOI: 10.13328/j.cnki.jos.005437 CSTR:

      Abstract (4182) HTML (2823) PDF 1.36 M (7833) Comment (0) Favorites

      Abstract:Graph embedding is a fundamental technique for graph data mining. The real-world graphs not only consist of complex network structures, but also contain diverse vertex information. How to integrate the network structure and vertex information into the graph embedding procedure is a big challenge. To deal with this challenge, a graph embedding method, which is based on deep leaning technique while taking into account the prior knowledge on vertices information, is proposed in this paper. The basic idea of the proposed method is to regard the vertex features as the prior knowledge, and learn the representation vector through optimizing an objective function that simultaneously keeps the similarity of network structure and vertex features. The time complexity of the proposed method is O(|V|), where|V|is the count of vertices in the graph. This indicates the proposed method is suitable for large-scale graph analysis. Experiments on several data sets demonstrate that, compared with the state-of-art baselines, the proposed method is able to achieve favorable and stable results for the task of node classification.

    • Database Query Cost Prediction Using Recurrent Neural Network

      2018, 29(3):799-810. DOI: 10.13328/j.cnki.jos.005439 CSTR:

      Abstract (4709) HTML (2310) PDF 1.46 M (8233) Comment (0) Favorites

      Abstract:Query cost models are the key parts of database workload management and performance tuning. Firstly, it is difficult, even impossible, to precisely estimate the costs of different relational operators due to the complexity of database systems and competition of computer resources. Secondly, most existing research work uses general query information without taking advantage of actual operators because of the complexity of query plans. Thirdly, most previous research work does not address the problem of predicting actual execution time of a query but rather predicts the query performance by the cost the like query optimizers generate. To reduce the complexity of workload management, his paper proposes an elaborate cost prediction model based on recurrent neural network through learning from operator behavior and detailed runtime information. In particular, the model uses a special kind of recurrent neural network, called long-short term memory (LSTM). Given an ad-hoc query, the model is able to predict its running time before it starts to run. It is more meaningful than the state-of-the-art query optimizers of existing database systems which only estimate costs in arbitrary units. It is also better than query progress indicators which cannot predict cost before the query runs. This research provides a novel approach to solve the key problem in database workload management. Verified by the experiments, the accuracy of the model is over 71% which shows the method is feasible to some degree.

    • Identifying Users Across Social Networks Based on Global View Features with Crowdsourcing

      2018, 29(3):811-823. DOI: 10.13328/j.cnki.jos.005448 CSTR:

      Abstract (4035) HTML (3274) PDF 1.57 M (5806) Comment (0) Favorites

      Abstract:With the popularity and development of Internet, people like to take part in multiple social networks to enjoy different kinds of services. Consequently, an important task is to identify users in the networks, which is helpful for user recommendation, behavior analysis and impact maximization. Most state-of-the-art works on this issue are mainly based on the user's structure features and attribute features. They prefer to exploit user's local features and are limited by the number of the known matching users. In this paper, a method based on global view features is proposed to align users with crowdsourcing (OCSA). First, crowdsourcing is used to increase the number of known matching users on networks. Then, global view features are used to evaluate the similarity between users to improve the accuracy of user identification. Finally, an iterative two-stage matching method is put forward to answer the user identification. The results of experiments show that the presented method has better performance on precision and recall, especially when the number of known matching users is insufficient.

    • Optimal Task Assignment Algorithm Based on Tree-Decouple in Spatial Crowdsourcing

      2018, 29(3):824-838. DOI: 10.13328/j.cnki.jos.005453 CSTR:

      Abstract (4021) HTML (3323) PDF 1.61 M (6561) Comment (0) Favorites

      Abstract:The ubiquity of mobile devices with high-fidelity sensors and the sharp decreases in the cost of ultra-broadband wireless network flourish the market of spatial crowdsourcing, which has been proposed as a new framework to assign location-aware tasks (e.g., reporting road traffic, delivering food) to workers (i.e., persons equipped with smart device and willing to perform tasks). This paper studies the task assignment problem that concerns the optimal strategy of assigning each task to proper worker such that the total number of completed tasks can be maximized while all workers can go back to their starting point before expected deadlines after performing assigned tasks. It is an intractable problem since optimal assignment for individual worker does not necessarily lead to global optimal results. Observing that the task assignment dependency only exists amongst subsets of workers, this study utilizes tree-decomposition technique to separate workers into independent clusters and develops an efficient depth-first search algorithm with progressive bounds to prune non-promising assignments. Extended experiments demonstrate the effectiveness and efficiency of the proposed solution.

    • Social Relationship Mining Algorithm by Multi-Dimensional Graph Structural Clustering

      2018, 29(3):839-852. DOI: 10.13328/j.cnki.jos.005454 CSTR:

      Abstract (4254) HTML (3068) PDF 2.03 M (7827) Comment (0) Favorites

      Abstract:Social relationship mining is a hot topic in the area of massive graph analysis. Graph clustering algorithms such as SCAN (structural clustering algorithm for networks) can quickly discover the communities from the massive graph data. However, relationships in these communities fail to reflect the ‘real’ social information such as family, colleagues and classmates. In reality, social data is very complex, and there are many types of interaction among each individual, such as calling, meeting, chatting in WeChat, and sending emails. However, traditional SCAN algorithm can only handle single dimensional graph data. Based on the study of multidimensional social graph data and traditional clustering algorithms, this paper first proposes an efficient subspace clustering algorithm named SCA by mining multi-dimensional clusters in subspaces as a mean to explore real social relationships. SCA follows the bottom-up principle and can discover the set of clusters from the social graph data in all dimensions. To improve the efficiency of SCA, the paper also develops a pruning algorithm called SCA+ based on the monotonicity of subspace clustering. Extensive experiments on several real-world multi-dimensional graph data demonstrate the efficiency and effectiveness of the proposed algorithms.

    • Community Based Node Betweenness Centrality Updating Algorithms in Dynamic Networks

      2018, 29(3):853-868. DOI: 10.13328/j.cnki.jos.005457 CSTR:

      Abstract (5035) HTML (3228) PDF 1.99 M (6954) Comment (0) Favorites

      Abstract:With the rapid development of Internet technology, social networks show a trend of explosive growth. As the traditional analysis on static networks becomes more and more difficult to achieve satisfactory results, dynamic network analysis has turned into a research hotspot in the field of social network data management. Node betweenness centrality measures the ability of a node to control the shortest paths between other nodes in the graph, which is useful for mining important nodes in social networks. However, the efficiency will be low if the betweenness centrality of all nodes needs to be calculated each time while the graph structure changes frequently. To address the difficult problem of computing node betweenness centrality in dynamic networks, a community based betweenness centrality updating algorithm is proposed in this paper. By maintaining the shortest distance sets between communities and communities, as well as between communities and nodes, the node pairs which are not affected during the dynamically updating process can be quickly filtered out, thus greatly improving the updating efficiency of node betweenness centrality. Experimental results conducted on real-world datasets and synthetic datasets show the effectiveness of the proposed algorithms.

    • Online Join Method for Skewed Data Streams

      2018, 29(3):869-882. DOI: 10.13328/j.cnki.jos.005440 CSTR:

      Abstract (3583) HTML (2261) PDF 1.72 M (5155) Comment (0) Favorites

      Abstract:Scalable distributed join processing in a parallel environment requires a partitioning policy to transfer data while minimizing the size of migrated statement and the number of communicated messages. Online theta-joins over data streams are more computationally expensive and impose higher memory requirement in distributed data stream management systems (DDSMS) than standalone database management systems (DBMS). The complete bipartite graph-based model can support distributed stream joins, and has the characteristics of memory-efficiency, elasticity and scalability. This is because each relation is stored in its corresponding processing units without data replicas and the units are independent of each other. However, due to the instability of data stream rate and the imbalance of attribute value distribution, the online theta-joins over skewed data streams can lead to the load imbalance of cluster. In this case, the bipartite graph-based model is unable to allocate the query nodes dynamically, and requires to set parameters about the grouping manually. The more serious issue is that the effect of the full-history join is worse. In this paper, a framework for handling skewed stream join is presented for enhancing the adaptability of the join model and minimizing the system cost based on the varying workloads. The proposal includes a mixed key-based and tuple-based partitioning scheme to handle skewed data in each side of the bipartite graph-based model, a strategy for redistribution of query nodes in two sides of this model, and a migration algorithm about state consistency to support full-history joins and adaptive resource management. Experiments with synthetic data and real data show that the presented method can effectively handle skewed data streams and improve the throughput of DDSMS, and it also effective especially on reducing the operational cost in the cloud environment.

    • Vector Referencing Oriented Platform-Oblivious In-Memory Join Optimization Technique

      2018, 29(3):883-895. DOI: 10.13328/j.cnki.jos.005446 CSTR:

      Abstract (3750) HTML (2436) PDF 1.60 M (4919) Comment (0) Favorites

      Abstract:Graph analysis database such as MapD employs the emerging manycore architecture GPU and Phi processors to support high performance analytical processing, where the join operation is still the performance bottleneck when facing complex data schemas. In recent years, as heterogeneous processors come to be main-stream high performance computing platforms, the researches of in-memory join performance extend the focuses from multicore to the emerging manycore platforms. However those efforts have not uncover the inner relationships among join algorithm performance, join dataset size and hardware architectures, and cannot provide sufficient join selection strategies for databases under the future heterogeneous processor platforms. This paper targets in-memory join optimization techniques on multicore, Xeon Phi and GPU processor platforms. By optimizing hash table design, this work uses vector mapping instead of hash mapping to eliminate the hashing overhead effects for performance, so that the in-memory join performance characteristics influenced can be measured by hardware factors such as multicore cache size, Xeon Phi cache size, Xeon Phi simultaneous multi-threading mechanism, and GPU SIMT (single instruction multiple threads) mechanism. The experimental results show that caching and simultaneous massive-threading mechanism are key factors to improve in-memory join algorithm performance. Caching mechanism performs well for cache fit join operations, the simultaneous massive-threading mechanism of GPU does well for big table joins, and Xeon Phi achieves the highest performance for L2 cache fit joins. The experimental results also exploit the relationship between in-memory join performance and heterogeneous processor hardware features, and provide optimization policy for in-memory database query optimizer on future heterogeneous processor platforms.

Current Issue


Volume , No.

Table of Contents

Archive

Volume

Issue

联系方式
  • 《Journal of Software 》
  • 主办单位:Institute of Software, CAS, China
  • 邮编:100190
  • 电话:010-62562563
  • 电子邮箱:jos@iscas.ac.cn
  • 网址:https://www.jos.org.cn
  • 刊号:ISSN 1000-9825
  •           CN 11-2560/TP
  • 国内定价:70元
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063