WANG Jia-Hao, CAI Peng, QIAN Wei-Ning, ZHOU Ao-Ying
2017, 28(3):476-489. DOI: 10.13328/j.cnki.jos.005162
Abstract:Many applications, such as social networking, online shopping and online finance, may receive highly concurrent data access from massive numbers of Internet users. In this scenario, traditional single-node database systems gradually become the system bottleneck, and many successful Internet applications owe their success to cluster-based data management systems. Compared with traditional database systems, cluster-based distributed database systems offer better scalability and availability, and log replication is one of the core components that provide these properties. Master-slave log replication cannot handle uncertain logs when failures occur, which risks inconsistency among replicas. Consensus algorithms cannot be applied directly to database systems because they lack a transaction consistency model, and their leader election suffers from livelock, double-master and continuous-election problems. This paper introduces a log replication strategy and a corresponding recovery technique for cluster environments, which can effectively process uncertain logs and provide two read consistency options, i.e., strong and weak consistency. A lightweight master election algorithm is also presented to avoid the above election problems. The algorithms are implemented in the OceanBase distributed database system and tested with a benchmark tool. Experiments show that the proposed method improves scalability and availability.
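For illustration, a minimal sketch of the two read-consistency options mentioned above, assuming a simple replicated log; the class and rule names are hypothetical and do not reflect OceanBase's actual protocol:

```python
class Replica:
    """Toy replica holding the state it has applied from the replicated log."""
    def __init__(self):
        self.applied_index = 0     # highest log index applied locally
        self.store = {}            # key -> value state machine

    def apply(self, index, key, value):
        self.store[key] = value
        self.applied_index = index

def read(replica, key, leader_commit_index, strong=True):
    """Strong reads require the replica to have caught up with the leader's commit
    point; weak reads return the local (possibly stale) value immediately."""
    if strong and replica.applied_index < leader_commit_index:
        raise RuntimeError("replica lagging; retry or redirect to the master")
    return replica.store.get(key)

r = Replica(); r.apply(1, "x", 42)
print(read(r, "x", leader_commit_index=1))   # strong read succeeds once caught up
```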
ZHANG Yu, ZHANG Yan-Song, CHEN Hong, WANG Shan
2017, 28(3):490-501. DOI: 10.13328/j.cnki.jos.005156
Abstract:The emerging many integrated core (MIC) Xeon Phi coprocessor is becoming a mainstream platform for high performance computing. For database applications, in-memory analytics involves computation-intensive workloads in which in-memory foreign key joins between a big fact table and the dimension tables dominate OLAP performance. This paper focuses on a cache-friendly foreign key join, compared against the cache-conscious radix-partitioning hash join and the cache-oblivious no-partitioning hash join, that adapts to the small LLC and the massive simultaneous multi-threading mechanism of the Xeon Phi coprocessor. By exploiting the characteristics of surrogate keys in OLAP schemas, key-matching hash probing between a fact table and dimension tables under PK-FK reference constraints can be reduced to surrogate key referencing, so that the complex hash table and the CPU-cycle-consuming hash probing are replaced by directly referencing a surrogate vector, with each foreign key mapped to an offset address in the vector. The surrogate-vector-referencing foreign key join is simple and efficient to implement on the many cores of the Xeon Phi coprocessor, whose massive simultaneous multi-threading helps overlap memory access latency. In the experiments, the surrogate-vector-referencing foreign key join and traditional hash join algorithms (NPO and PRO) are compared on both a Xeon E5-2650 v3 10-core CPU platform and a Xeon Phi 5110P 60-core platform. The experimental results provide a comprehensive perspective on how mainstream in-memory foreign key join algorithms perform with different datasets on different platforms.
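A minimal sketch of the surrogate-vector referencing idea, assuming surrogate keys are the offsets 0..n-1 of a dimension attribute vector (data and column names are illustrative):

```python
import numpy as np

n_dim, n_fact = 1_000, 1_000_000
dim_attr  = np.random.randint(0, 50, size=n_dim)      # dimension attribute; surrogate key = offset
fact_fk   = np.random.randint(0, n_dim, size=n_fact)  # foreign key column of the fact table
fact_meas = np.random.rand(n_fact)                    # measure column of the fact table

# The "join" is a positional dereference of the surrogate vector, not a hash probe:
groups = dim_attr[fact_fk]                                      # map each foreign key to its attribute
result = np.bincount(groups, weights=fact_meas, minlength=50)   # aggregate the measure per group
```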
HE Long, CHEN Jin-Chuan, DU Xiao-Yong
2017, 28(3):502-513. DOI: 10.13328/j.cnki.jos.005161
Abstract:SOH (SQL over HDFS) systems usually store data in the Hadoop distributed file system (HDFS) and process queries with the MapReduce computing framework or a distributed database query engine. Benefiting from the fault tolerance and scalability provided by MapReduce and HDFS, SOH systems perform well on analytical queries over big data. However, their efficiency is too low to meet the requirements of selective or interactive queries, which place strict limits on query response time. This paper proposes an HDFS-based index, called HIndex, for SOH systems. HIndex can easily be integrated into existing SOH systems to improve query evaluation efficiency. The process by which SOH systems access data stored in HDFS is analyzed, the important factors affecting the time cost are highlighted, a two-layer index structure is proposed, and both aggregated and non-aggregated index techniques are implemented. In experiments on standard datasets, HIndex performs much better than Hadoop++, a state-of-the-art HDFS-based index.
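A sketch of what a two-layer HDFS index can look like; the exact layout of HIndex is not reproduced here, so the block names, key ranges and per-block index below are assumptions for illustration only:

```python
from bisect import bisect_left

block_ranges = {"blk_0": (0, 999), "blk_1": (1000, 1999), "blk_2": (2000, 2999)}  # block -> key range
block_index  = {"blk_1": [(1000, 0), (1203, 4096), (1777, 8192)]}                 # block -> sorted (key, offset)

def lookup(key):
    # Layer 1: prune HDFS blocks whose key range cannot contain the key.
    for blk, (lo, hi) in block_ranges.items():
        if lo <= key <= hi:
            # Layer 2: binary search inside the block's local index.
            entries = block_index.get(blk, [])
            pos = bisect_left([k for k, _ in entries], key)
            if pos < len(entries) and entries[pos][0] == key:
                return blk, entries[pos][1]    # block to open and byte offset to seek to
    return None

print(lookup(1203))    # ('blk_1', 4096)
```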
SONG Jie, SUN Zong-Zhe, MAO Ke-Ming, BAO Yu-Bin, YU Ge
2017, 28(3):514-543. DOI: 10.13328/j.cnki.jos.005169
Abstract:This paper surveys research advances in MapReduce-based big data processing platforms. First, twelve typical MapReduce-based data processing platforms are described, their implementation principles and application areas are compared, and their commonalities are summarized. Second, MapReduce-based big data processing algorithms, including search algorithms, data cleansing/transformation algorithms, aggregation algorithms, join algorithms, sorting algorithms, optimization algorithms, preference query algorithms, graph algorithms, and data mining algorithms, are studied. These algorithms are classified by their MapReduce implementations, and the factors that affect their performance are analyzed. Finally, big data processing algorithms are abstracted as out-of-core algorithms, whose performance characteristics are analyzed in detail. Considerations, ideas and challenges of universal optimizations for the performance of out-of-core algorithms, including optimizing the algorithms' I/O cost and locality and designing incremental iterative algorithms, are proposed as references for researchers. Compared with current topics such as resource allocation and task scheduling based dynamic optimizations on platforms, parallelization of specific algorithms, and performance optimizations of iterative algorithms, the proposed static optimizations serve as a complement and highlight new areas for researchers.
XIAO Wen-Hua, BAO Wei-Dong, ZHU Xiao-Min, SHAO Yi-Yang, CHEN Chao, Jianhong Wu
2017, 28(3):544-562. DOI: 10.13328/j.cnki.jos.005160
Abstract:Cloud computing has been shown to provide a cost-effective and powerful platform for big data processing. Under this paradigm, a data manager (DM) usually rents geographically distributed datacenters to process its geographically dispersed data sets, for convenience and economy. The data sets are usually generated dynamically and resource prices vary over time, so moving the data from different geographic locations to different datacenters while provisioning suitable computation resources becomes a critical cost-effectiveness issue. In this paper, a pertinent joint stochastic optimization problem is first formulated and then decoupled, via the Lyapunov framework, into two independent subproblems with efficient solutions. Next, an online algorithm based on these solutions is developed. Theoretical analysis shows that the proposed online algorithm can produce a solution arbitrarily close to the offline optimal solution while minimizing data processing delays. Experiments on the WorldCup98 and YouTube datasets validate the proposed algorithms and demonstrate the superiority of the new approach.
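For reference, the generic drift-plus-penalty form used by Lyapunov-based online algorithms of this kind; the symbols are illustrative and not necessarily the paper's notation:

```latex
% Q(t): vector of (virtual) queue backlogs, L(Q(t)) = (1/2) \sum_i Q_i(t)^2 the Lyapunov function.
\Delta(Q(t)) \triangleq \mathbb{E}\big[\, L(Q(t+1)) - L(Q(t)) \mid Q(t) \,\big], \qquad
\text{each slot solves } \min \;\; \Delta(Q(t)) + V \, \mathbb{E}\big[\, \mathrm{cost}(t) \mid Q(t) \,\big].
```

Choosing a larger control parameter V typically drives the time-averaged cost within O(1/V) of the offline optimum at the price of O(V) queue backlog, which is the usual source of an "arbitrarily close to the offline optimal solution" guarantee.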
FANG Jun-Hua, WANG Xiao-Tong, ZHANG Rong, ZHOU Ao-Ying
2017, 28(3):563-578. DOI: 10.13328/j.cnki.jos.005168
Abstract:Along with the popularization of big data applications, scalable and efficient stream join processing plays an increasingly important role in online real-time analysis. Distributed parallel processing frameworks provide an effective solution for processing massive data streams with low latency. For key-based computations, data skew and the inherent features of stream data, such as its real-time nature, dynamics and unpredictable volume, lead to load imbalance in distributed processing systems. Such imbalance degrades performance and wastes hardware resources. There have been two solutions to load imbalance: 1) key-based migration schemes that keep balance among parallel processing nodes; 2) tuple-based partitioning schemes that distribute data randomly to achieve load balance. The former adjusts the system to a defined equilibrium range and resembles the one-dimensional bin-packing problem, while the latter must incur additional memory and network communication cost to maintain the accuracy of key-based operations. This paper presents a novel parallel processing scheme that combines the key-based and tuple-based schemes to partition keys on demand. The proposed scheme adopts a lightweight load balancing algorithm and a partitioning scheme that retains the characteristics of key-based operations, thus achieving the load balance of the tuple-based strategy while reducing the additional cost of fine-grained balancing.
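A lightweight illustration of mixed key/tuple routing, assuming hot keys have already been identified by a load monitor; the names and the round-robin rule are illustrative, not the paper's exact scheme:

```python
from itertools import cycle

N_WORKERS = 8
hot_keys = {"key_42"}                                          # assumed output of load monitoring
hot_dispatch = {k: cycle(range(N_WORKERS)) for k in hot_keys}

def route(key):
    if key in hot_keys:
        # Tuple-level routing: spread this key's tuples round-robin across workers;
        # key-based results must later be merged downstream.
        return next(hot_dispatch[key])
    # Key-level routing preserves key-based semantics cheaply for non-skewed keys.
    return hash(key) % N_WORKERS

print([route("key_42") for _ in range(3)], route("key_7"))
```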
SHEN Yao, QIN Xiao-Lin, BAO Zhi-Feng
2017, 28(3):579-597. DOI: 10.13328/j.cnki.jos.005158
Abstract:As an emerging service paradigm, cloud computing exhibits both advantages and disadvantages when executing scientific dataflows and is receiving more and more attention. One of the main factors behind the performance bottleneck is that the cloud contains many homogeneous, concurrent task packages. This paper focuses on optimizing the scheduling process of dataflows and transforms the optimization objectives into user metrics (makespan and economic cost) and cloud system indicators (network bandwidth, storage constraints and system fairness). An efficient multi-objective game algorithm (MOG) is proposed by formulating the optimization problem as a new cooperative game. The MOG method optimizes the user metrics while satisfying the constraints of the system metrics and ensuring the efficiency and fairness of the cloud resources. Comprehensive experiments demonstrate that, compared with related algorithms, the proposed MOG method has obvious advantages in algorithm complexity O(l·K·M) (an improvement of orders of magnitude), result quality (optimal in some cases) and system-level fairness.
CHEN Xiao-Xu, WU Heng, WU Yue-Wen, LU Zhi-Gang, ZHANG Wen-Bo
2017, 28(3):598-610. DOI: 10.13328/j.cnki.jos.005167
Abstract:Concurrent job execution is a hot topic in large-scale resource scheduling research. Existing efforts employ queueing models with locally optimal solutions to schedule co-located tasks and can therefore only fit specific requirements. How to design a single scheduler that meets diverse requirements is thus challenging. This paper introduces Sirius, a new resource scheduling framework based on a minimum-cost maximum-flow network. The new approach makes it easy to express scheduling requirements, including fairness, priority and placement constraints, in a unified way as a typical graph construction and solution problem. Meanwhile, an incremental algorithm is implemented to speed up the flow network solver, reducing its runtime by 90 percent.
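A toy example of expressing task placement as a minimum-cost maximum-flow problem; the graph construction below is a simplified assumption, not Sirius's actual encoding of fairness, priority and placement constraints:

```python
import networkx as nx

G = nx.DiGraph()
tasks, machines = ["t1", "t2", "t3"], ["m1", "m2"]
for t in tasks:
    G.add_edge("src", t, capacity=1, weight=0)          # each task demands one slot
for m in machines:
    G.add_edge(m, "sink", capacity=2, weight=0)         # each machine offers two slots
costs = {("t1", "m1"): 1, ("t1", "m2"): 3, ("t2", "m1"): 2,
         ("t2", "m2"): 1, ("t3", "m1"): 5, ("t3", "m2"): 1}
for (t, m), c in costs.items():                         # edge cost encodes scheduling preference
    G.add_edge(t, m, capacity=1, weight=c)

flow = nx.max_flow_min_cost(G, "src", "sink")           # cheapest placement of all tasks
assignment = {t: m for t in tasks for m in machines if flow[t].get(m, 0) == 1}
print(assignment)
```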
SONG Tian-Shu, TONG Yong-Xin, WANG Li-Bin, XU Ke
2017, 28(3):611-630. DOI: 10.13328/j.cnki.jos.005166
Abstract:With the rapid development of mobile Internet techniques and online-to-offline (O2O) business models, various spatial crowdsourcing (SC) platforms have become popular. In particular, SC platforms such as Didi taxi and Baidu meal-ordering services play a significant role in people's daily lives. A core issue in SC is task assignment, which assigns real-time tasks to suitable crowd workers. Existing approaches are usually based on unrealistic assumptions and have two drawbacks: (1) They often assume a static scenario, where the spatio-temporal information of all tasks and workers is known before the assignment is conducted; since both tasks and workers appear dynamically and request to be allocated in real time, such works are impractical in real applications. (2) They usually assume that there are only two types of objects, tasks and workers, in SC and ignore the influence of workplaces on task assignment. To address these challenges, this paper formulates a novel dynamic task assignment problem, called online task assignment for three types of objects in spatial crowdsourcing, which not only includes three types of objects, namely tasks, workers and workplaces, but also focuses on dynamic scenarios. Moreover, a random-threshold-based algorithm is designed for the new problem, and a worst-case competitive analysis is provided for it. To further optimize the algorithm, an adaptive threshold algorithm is developed, which is always close to the best possible effectiveness of the random-threshold-based algorithm. Finally, the effectiveness and efficiency of the proposed methods are verified through extensive experiments on a real dataset and on synthetic datasets generated from different distributions.
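A heavily simplified sketch of a random-threshold rule for online acceptance decisions; the paper's algorithm over task/worker/workplace triples and its competitive analysis are considerably more involved, so this only illustrates the general idea:

```python
import random

threshold = random.uniform(0.0, 1.0)          # drawn once, before the stream arrives

def accept(candidate_utility, max_utility):
    """Accept the currently available (task, worker, workplace) match only if its
    normalized utility clears the random threshold; otherwise keep waiting."""
    return candidate_utility / max_utility >= threshold

print(accept(candidate_utility=7.5, max_utility=10.0))
```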
QIAO Shao-Jie, HAN Nan, ZHANG Kai-Feng, ZOU Lei, WANG Hong-Zhi, Louis Alberto GUTIERREZ
2017, 28(3):631-647. DOI: 10.13328/j.cnki.jos.005155
Abstract:Currently, the number of Internet users, along with complex networks including online social networks and electronic commerce networks, is growing explosively. Effectively and efficiently detecting overlapping community structures from complex network big data plays an essential role in point-of-interest recommendation and hotspot propagation. In this study, a new algorithm over complex networks is proposed to detect overlapping communities with a time complexity of O(nlog2(n)). The algorithm applies a new method for updating node and edge modularity based on modularity clustering and graph computing techniques. A balanced binary tree is used to index the modularity increments, and an overlapping community detection approach is provided based on the idea of modularity optimization, which reduces the frequency of node analysis compared with traditional approaches. Experiments are conducted on real complex network big data, and the results show that the DOC algorithm can effectively detect overlapping communities with high accuracy: the normalized mutual information (NMI) reaches 0.97 on large-scale LFR benchmark datasets, and the F-score, a standard measure for overlapping community detection, is higher than 0.91 on average. In addition, its runtime efficiency exceeds that of traditional approaches on complex network big data.
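For reference, the standard modularity definition that modularity-increment methods of this kind build on; the paper's node/edge modularity for overlapping communities extends this basic form:

```latex
% A: adjacency matrix, k_i: degree of node i, m: number of edges,
% \delta(c_i, c_j) = 1 iff nodes i and j share a community.
Q = \frac{1}{2m} \sum_{i,j} \Big( A_{ij} - \frac{k_i k_j}{2m} \Big)\, \delta(c_i, c_j)
```

Modularity-optimization methods grow communities by repeatedly applying the membership change with the largest increment in Q, which is why indexing the increments (here with a balanced binary tree) pays off.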
SHANG Jing-Wen, WANG Chao-Kun, XIN Xin, YING Xiang
2017, 28(3):648-662. DOI: 10.13328/j.cnki.jos.005165
Abstract:Community structure is one of the most important features of complex networks, and community detection is of great significance in exploring network structure. Classical clustering algorithms such as k-means are basic methods for community detection. However, when these classical methods work directly on a high-dimensional matrix, the detection results are often not accurate enough. In this study, a community detection algorithm based on a deep sparse autoencoder (CoDDA) is proposed to improve the accuracy achieved by classical methods on high-dimensional adjacency matrices. First, a hop-based operation on the sparse adjacency matrix is provided to obtain a similarity matrix, which expresses not only the relations between linked nodes but also the relations between nodes that are not linked. Then, a deep sparse autoencoder based on unsupervised deep learning is designed to extract features of the similarity matrix and obtain a low-dimensional feature matrix that represents the network topology better than the similarity matrix. Finally, k-means is used to identify the communities according to the feature matrix. Experimental results show that CoDDA obtains more accurate communities than six baseline methods. In addition, the parameter analysis indicates that CoDDA finds more accurate communities than applying k-means directly to the high-dimensional matrix.
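A rough end-to-end sketch of the pipeline described above (hop-based similarity, sparse autoencoder, then k-means); the similarity definition, layer sizes and regularizer below are assumptions for illustration, not CoDDA's exact settings:

```python
import numpy as np
import networkx as nx
from sklearn.cluster import KMeans
from tensorflow import keras

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
S = A + 0.5 * (A @ A > 0)             # crude 2-hop similarity: links plus common neighbors
S = S / S.max()
n = S.shape[0]

inp = keras.Input(shape=(n,))
hidden = keras.layers.Dense(16, activation="relu",
                            activity_regularizer=keras.regularizers.l1(1e-5))(inp)  # sparsity penalty
out = keras.layers.Dense(n, activation="sigmoid")(hidden)
autoencoder = keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(S, S, epochs=200, batch_size=8, verbose=0)   # reconstruct the similarity matrix

Z = keras.Model(inp, hidden).predict(S, verbose=0)           # low-dimensional node features
labels = KMeans(n_clusters=2, n_init=10).fit_predict(Z)      # communities from the feature matrix
```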
LI Chuan, FENG Bing-Qing, LI Yan-Mei, HU Shao-Lin, YANG Ning, TANG Chang-Jie
2017, 28(3):663-675. DOI: 10.13328/j.cnki.jos.005164
Abstract:Dynamic information networks pose a new and challenging problem in the field of complex networks. The evolution of dynamic networks is temporal, complex and changeable. Structure is a basic characteristic of a network and is also the basis of network modeling and analysis. Studying the evolution of network structure is of great importance for a comprehensive understanding of the behavior trends of complex systems. This paper introduces "role" to quantify the structure of a dynamic network and proposes a role-based model. To predict the role distributions of dynamic network nodes at a future time, the presented framework views role prediction as a multi-target regression problem, extracts properties from historical snapshot sub-networks, and predicts the future role distributions of the nodes. The paper then proposes a multi-target regression based role prediction (MTR-RP) method for dynamic networks. This method not only overcomes the drawback of existing methods that operate on a transfer matrix while ignoring the time factor, but also takes into account possible dependencies between multiple prediction targets. Experimental results show that MTR-RP offers better and more stable prediction than existing methods.
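A minimal sketch of casting role prediction as multi-target regression with an off-the-shelf learner; the features and role-distribution targets below are synthetic placeholders rather than the paper's snapshot-derived properties, and this wrapper fits each target independently, whereas MTR-RP additionally models dependencies between targets:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 8))                        # per-node features from historical snapshots
Y = rng.random((500, 4))
Y = Y / Y.sum(axis=1, keepdims=True)            # role distributions: memberships sum to 1

model = MultiOutputRegressor(GradientBoostingRegressor())   # one regressor per target role
model.fit(X[:400], Y[:400])
pred = model.predict(X[400:])                   # predicted future role distributions
```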
PENG Yun, WAN Chang-Xuan, JIANG Teng-Jiao, LIU De-Xi, LIU Xi-Ping, LIAO Guo-Qiong
2017, 28(3):676-693. DOI: 10.13328/j.cnki.jos.005154
Abstract:With the development of online shopping, the Web has produced a large quantity of product reviews containing abundant evaluation knowledge about products. How to extract aspect and opinion words from the reviews and further obtain the sentiment polarity of products at the aspect level are the key problems in fine-grained sentiment analysis of product reviews. First, considering the features of Chinese product reviews, this paper designs methods to derive semantic relationships among words through syntactic analysis, word sense understanding and context relevance, and embeds them as constraint knowledge into the topic model. Second, a semantic relation constrained topic model called SRC-LDA is proposed to guide LDA in extracting fine-grained topical words. By improving the semantic comprehension and recognition ability of topical words in standard LDA, the proposed model increases the correlation of words within the same topic and the discrimination between different topics, thus revealing more fine-grained aspect words, opinion words and their semantic associations. The experimental results show that SRC-LDA is effective for fine-grained aspect and opinion word extraction.
HUANG Fa-Liang, YU Ge, ZHANG Ji-Lian, LI Chao-Xiong, YUAN Chang-An, LU Jing-Li
2017, 28(3):694-707. DOI: 10.13328/j.cnki.jos.005157
Abstract:Sentiment analysis of micro-blogs is an important task in mining social media and has significant theoretical and application value in personalized recommendation and public opinion analysis. Topic sentiment models have attracted much attention owing to their good performance and their ability to analyze topics and sentiment in micro-blogs simultaneously. However, most existing models simply assume that the topic sentiment distributions of different micro-blogs are independent, which contradicts the reality of micro-blogging and thus leads to unsatisfactory modeling of micro-bloggers' true sentiment. To address this issue, a probabilistic model, SRTSM (social relation topic sentiment model), is proposed. The new model introduces sentiment and micro-bloggers' social relations into the LDA inference framework and achieves joint detection of sentiment and topics in micro-blogs. Extensive experiments on Sina Weibo show that SRTSM significantly outperforms state-of-the-art unsupervised approaches including JST, SLDA and DPLDA in terms of sentiment classification accuracy.
HUANG Lu, LIN Chuan-Jie, HE Jun, LIU Hong-Yan, DU Xiao-Yong
2017, 28(3):708-720. DOI: 10.13328/j.cnki.jos.005163
Abstract:With the rapid growth of mobile applications, users of mobile app platforms are facing the problem of information overload. The large number of apps makes it difficult for users to find appropriate ones, while many long-tail apps are submerged in the resource pool and remain unknown to most users. Meanwhile, existing recommendation methods usually pay more attention to accuracy than to diversity, making popular apps the most recommended items. As a result, the overall exposure rate of mobile apps is low, since the behavior data accumulated by the system becomes gradually biased towards popular apps, which hurts recommendation performance in the long run. To solve this problem, this article first proposes two recommendation methods, named LDA_MF and LDA_CF, to improve existing methods. LDA_MF combines a user topic model and an app topic model with the matrix factorization model MF, and LDA_CF takes both the tag information of apps and user behavior data into consideration. To take advantage of different algorithms and increase the diversity of recommendation results without sacrificing accuracy, a hybrid recommendation algorithm is also provided to combine LDA_MF, LDA_CF and item-based collaborative filtering models. A large real data set is used to evaluate the proposed methods, and the results show that the presented approach achieves better diversity and good recommendation accuracy.
CHEN Ting, ZHU Qing, ZHOU Meng-Xi, WANG Shan
2017, 28(3):721-731. DOI: 10.13328/j.cnki.jos.005159
Abstract:Existing trust-based recommendation algorithms usually assume that users are homogeneous and therefore cannot fully mine the trust relationship information. Moreover, the lack of an efficient model for integrating similarity relations and trust relations greatly affects the accuracy and reliability of those models. To solve this issue, this paper proposes a trust-based recommendation algorithm called Trust-PMF. It combines similarity with trust to model users' preferences and to select the target user's neighbors. Then, the probabilistic matrix factorization model is extended by integrating the memory-based idea and trust information, and a dynamic adaptive weight is used to determine the degree of influence of each part, forming a unified and efficient Trust-PMF model. Finally, experimental results on the FilmTrust and Epinions data sets demonstrate that the proposed method outperforms state-of-the-art methods.
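A simplified sketch of blending similarity-based and trust-based neighbor evidence with an adaptive weight; the weighting rule and function names are assumptions, and the full probabilistic matrix factorization extension is omitted:

```python
def predict(item, sim_neighbors, trust_neighbors, ratings):
    """sim_neighbors / trust_neighbors: lists of (neighbor, weight) pairs for the target user;
    ratings: dict mapping (user, item) to a rating."""
    def weighted(neigh):
        pairs = [(w, ratings[(v, item)]) for v, w in neigh if (v, item) in ratings]
        return sum(w * r for w, r in pairs) / sum(w for w, _ in pairs) if pairs else None

    r_sim, r_trust = weighted(sim_neighbors), weighted(trust_neighbors)
    if r_sim is None:
        return r_trust
    if r_trust is None:
        return r_sim
    # Adaptive weight leaning toward the richer evidence source (an assumption).
    alpha = len(sim_neighbors) / (len(sim_neighbors) + len(trust_neighbors))
    return alpha * r_sim + (1 - alpha) * r_trust
```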
LIU Guang-Ming, REN Yan, LI Chuan, YANG Ning, TANG Chang-Jie
2017, 28(3):732-743. DOI: 10.13328/j.cnki.jos.005170
Abstract:Calculating the information network data cube (InfoNetCube) is the foundation of online analytical processing over information networks. However, unlike the traditional data cube, an InfoNetCube consists of multiple lattices in which each cuboid contains a topic graph (or graph measure), so its storage overhead is two orders of magnitude higher than that of a traditional data cube. How to materialize the specified cuboids or lattices rapidly and efficiently in an information network is a quite challenging research issue. In this paper, a novel InfoNetCube materialization strategy based on dialysis computing is proposed for information networks. By leveraging the anti-monotonicity of the topic graph measure in the information and topology dimensions, a dialysis-based space pruning algorithm is constructed to rapidly dialyze out the hidden subgraphs, cuboids and lattices. Experimental results show that the proposed partial materialization algorithm outperforms the cube-based partial materialization strategy, saving almost 75% of the aggregation time.