• Volume 28,Issue 3,2017 Table of Contents
    Select All
    Display Type: |
    • Log Replication and Recovery in Cluster-Based Database System

      2017, 28(3):476-489. DOI: 10.13328/j.cnki.jos.005162 CSTR:

      Abstract (3004) HTML (1061) PDF 1.68 M (3518) Comment (0) Favorites

      Abstract:Many applications such as social networking, online shopping and online finance may receive highly concurrent data access from massive Internet users. In this scenario, traditional single node database systems gradually become the bottleneck of the system, and the main reason for many successful Internet applications is the use of cluster-based data management systems. Compared with traditional database systems, cluster-based distributed database systems have better scalability and availability, and log replication is one of the core components to build these features. Master-slave based log replication cannot handle the uncertain logs while failure occurs, resulting in the risk of inconsistency among different copies. Consensus algorithms cannot be directly applied to the database system due to the lack of transaction consistency model, and they also have issues in leader election with livelock, as well as double master and continuous election problem. This paper introduces a log replication strategy and corresponding recovery technique for cluster environments, which can effectively process the uncertain logs and provide two read consistency options, i.e. strong and weak consistency. A lightweight master election algorithm is also presented to avoid the master election issues. The algorithms are implemented in the OceanBase distributed database system and tested using benchmark tool. Experiments show that the proposed method can improve the scalability and availability.

    • OLAP Foreign Join Algorithm for MIC Coprocessor

      2017, 28(3):490-501. DOI: 10.13328/j.cnki.jos.005156 CSTR:

      Abstract (2469) HTML (1141) PDF 1.70 M (3177) Comment (0) Favorites

      Abstract:The emerging many integrated core architecture (MIC) Xeon Phi coprocessor becomes the mainstream platform for high performance computing. For database applications, in-memory analytics requires computation intensive workload in which the in-memory foreign key joins between big fact table and dimension tables dominate the OLAP performance. This paper focuses on a cache-friendly foreign key join with respect to cache-conscious radix partitioning oriented hash join and cache-oblivious no-partitioning hash join to adapt to the small LLC size and massive simultaneous multi-threading mechanism of Xeon Phi coprocessor. By exploiting the characteristic of surrogate key in OLAP schema, the key matching oriented hash probing can be further simplified as surrogate key referencing between fact table and dimension tables with PK-FK reference constraint, so that the complex hash table and CPU cycle consuming hash probing can be simplified as directly referencing surrogate vector by mapping foreign key to offset address of surrogate vector. The surrogate vector referencing oriented foreign key join is simple and efficient to be implemented for Xeon Phi coprocessor for more cores, and also offers massive simultaneous multi-threading mechanism to overlap memory access latency. In experiments, the surrogate vector referencing foreign key join algorithm and traditional hash join algorithms (NPO and PRO) are compared on both Xeon E5-2650 v3 10-core CPU platform and Xeon Phi 5110P 60-core platform, the experimental results provide a comprehensive perspective for how the mainstream in-memory foreign key join algorithms perform with different datasets on different platforms.

    • Multi-Layered Index for HDFS-Based Systems

      2017, 28(3):502-513. DOI: 10.13328/j.cnki.jos.005161 CSTR:

      Abstract (2574) HTML (1072) PDF 1.56 M (3523) Comment (0) Favorites

      Abstract:The SOH (SQL over HDFS) systems usually store the data into distributed file system HDFS (Hadoop distributed file system), and process queries by the Map/Reduce computing framework or distributed database query engine. Benefitting from the fault tolerance and scalability provided by Map/Reduce and HDFS, SOH systems perform well in processing analytical queries over big data. However, the efficiency of such systems is too low to meet the requirement of selective queries or interactive queries which have strict limit on the query response time. This paper proposes a HDFS-based index, called HIndex, for SOH systems. HIndex can easily be integrated into the existing SOH systems to improve the efficiency of query evaluation. The process that SOH systems access data stored in HDFS is analyzed, and the important factors affecting the time cost is highlighted, a two-layer index structure is proposed, and both aggregated and non-aggregated index techniques are implemented. According to the experiments conducted on standard datasets, HIndex performs much better than Hadoop++, a state-of-the-art HDFS-based index.

    • Research Advance on MapReduce Based Big Data Processing Platforms and Algorithms

      2017, 28(3):514-543. DOI: 10.13328/j.cnki.jos.005169 CSTR:

      Abstract (4652) HTML (1606) PDF 3.58 M (7406) Comment (0) Favorites

      Abstract:This paper introduces the research advance on MapReduce based big data processing platforms. Frist, twelve typical MapReduce based data processing platforms are descripted, their implementation principles and application areas are compared, and their commonalities are concluded. Second, the MapReduce based big data processing algorithms, including search algorithms, data cleansing/transformation algorithms, aggregation algorithms, join algorithms, sorting algorithms, optimization algorithms, preference query algorithms, graph algorithms, and data mining algorithms, are studied. These algorithms are classified by their MapReduce implementations, and the factors that affect their performance are analyzed. Finally, big data processing algorithms are abstracted as the out-of-core algorithms whose performance features are well analyzed. The considerations, ideas and challenges of universal optimizations on the performance of out-of-core algorithms are proposed as references for researchers. These optimizations include optimizing algorithms' I/O cost and locality, and designing incremental iterative algorithms. Comparing the current topics, such as resource allocation and task scheduling based dynamic optimizations on platform, parallelization for specific algorithms, and performance optimizations on iterative algorithms, the proposed static optimizations serve as complements that highlight new areas for the researchers.

    • Cost Minimization Method for Multi-Source Big Data Processing in Clouds

      2017, 28(3):544-562. DOI: 10.13328/j.cnki.jos.005160 CSTR:

      Abstract (2841) HTML (1060) PDF 2.31 M (3978) Comment (0) Favorites

      Abstract:Cloud computing has shown to provide a cost-effective and powerful platform for big data processing. Under this paradigm, data manager (DM) usually rents geographically distributed datacenters to process their geographically dispersed data set, concerning its convenience and economy. Usually, the data sets are dynamically generated and the resource pricing varies over time, which make it a critical issue of cost effectiveness to move the data from different geographic locations to different datacenters while providing suitable computation resources for processing. In this paper, a pertinent joint stochastic optimization problem is firstly formulated, and then the problem is decoupled into two independent subproblems with efficient solutions via Lyapunov framework. Next, an online algorithm based on the solutions is developed. Theoretical analysis show that the proposed online algorithm can produce a solution which is arbitrarily close to the offline optimal solution while minimizing the data processing delays. Experiments on WorldCup98 and Youtube dataset validate the proposed algorithms and demonstrate the superiority of the new approach.

    • High-Performance Data Distribution Algorithm on Distributed Stream Systems

      2017, 28(3):563-578. DOI: 10.13328/j.cnki.jos.005168 CSTR:

      Abstract (2650) HTML (1184) PDF 1.88 M (5175) Comment (0) Favorites

      Abstract:Along with the popularization of big data applications, scalable and efficient stream join processing plays a more important role in online real-time analysis. The distributed parallel processing framework provides an effective solution which facilitates processing of massive data stream with low latency. For Key-based calculations, data skewness and inherent features of stream data, such as real-time, dynamics and unpredictability on data volume, lead to load imbalance to distributed processing systems. Such phenomenon can produce poor performance and waste hardware resources. There have been two solutions to load imbalance:1) Key-based migration scheme that keeps balance among parallel processing nodes; 2) tuple-based partitioning scheme that distributes data randomly to achieve load balance. The former scheme adjusts system to the defined equilibrium range, which resembles the one-dimensional packing problem. And the latter maintains the accuracy of Key-based operations, which certainly incurs additional memory cost and network communication cost. This paper presents a novel parallel processing scheme that combines both Key-based and tuple-based schemes to partition keys on demand. The proposed scheme adopts a lightweight load balance algorithm and a partitioning scheme which retains the characteristics of Key-based operations, thus realizing the load balance of tuple-base strategy while reducing the additional cost of fine-grained balance.

    • Effective Multi-Objective Scheduling Strategy of Dataflow in Cloud

      2017, 28(3):579-597. DOI: 10.13328/j.cnki.jos.005158 CSTR:

      Abstract (2719) HTML (1182) PDF 2.54 M (3549) Comment (0) Favorites

      Abstract:As a new emerging service provider, cloud computing, exhibiting advantages and disadvantages when executing the scientific data flows, is getting more and more attention. One of the main factors that constitute the performance bottleneck is there are many homogeneous and concurrent task packages in cloud. This paper focuses on optimizing the scheduling process in dataflow and transforming the optimization objectives into user metrics (makespan and economic cost) and indicators of cloud systems (network bandwidth, storage constraints and system fairness). An efficient multi-objective game algorithm (MOG) is proposed by formulating the optimization problem as a new cooperative game. The MOG method is able to optimize the user metrics while satisfying the constraint of the system metrics and ensuring the efficiency and fairness of the cloud resources. Comprehensive experiments demonstrate that compared with other related algorithms, the proposed MOG method has obvious advantages in terms of algorithm complexity O(l·K·M) (improvement of magnitude), result quality (optimum in some cases) and system level fairness.

    • Large-Scale Resource Scheduling Based on Minimum Cost Maximum Flow

      2017, 28(3):598-610. DOI: 10.13328/j.cnki.jos.005167 CSTR:

      Abstract (2622) HTML (1572) PDF 1.82 M (4703) Comment (0) Favorites

      Abstract:Concurrent job execution is a hot topic in large-scale resource scheduling research. Existing efforts employ queueing model with local optimal solution to schedule co-located tasks, thus can only fit specific requirement. Hence, how to design a single scheduler to meet diverse requirements is challenging. This paper introduces Sirius, a new framework for resource scheduling based on minimum cost maximum flow network. This new approach makes it easy to express scheduling requirements, including fairness, priority and placement constraint, on a unified way as a typical graph construction and solution problem. Meanwhile, an incremental algorithm is implemented to speed up the flow network solver, significantly reducing its runtime by 90 percent.

    • Online Task Assignment for Three Types of Objects under Spatial Crowdsourcing Environment

      2017, 28(3):611-630. DOI: 10.13328/j.cnki.jos.005166 CSTR:

      Abstract (3011) HTML (1270) PDF 3.43 M (5175) Comment (0) Favorites

      Abstract:With the rapid development of mobile Internet techniques and Online-to-offline (O2O) business models, various spatial crowdsourcing (SC) platforms become popular. In particular, the SC platforms, such as Didi taxi and Baidu meal-ordering service, play a significant role in people's daily life. A core issue in SC is task assignment, which is to assign real-time tasks to suitable crowd workers. Existing approaches usually are based on infeasible assumptions and have the following two drawbacks:(1) Existing methods often assume to work on the static scenarios, where the spatio-temporal information of all tasks and workers is known before the assignment is conducted. However, since both tasks and workers dynamically appear and request to be allocated in real time, therefore, existing works are impractical in real applications. (2) Existing studies usually assume that there are only two types of objects, tasks and workers, in SC and ignore the influence of workplace for task assignment. To solve the aforementioned challenges, this paper frames a novel dynamic task assignment problem, called online task assignment for three types of objects in spatial crowdsourcing, which not only includes the three types of objects, namely tasks, workers and workplaces, but also focuses on dynamic scenarios. Moreover, a random-threshold-based algorithm is designed for the new problem and a worst-case competitive analysis is provided for the algorithm. Particularly, to further optimize the algorithm, an adaptive threshold algorithm, which is always close to the best possible effectiveness of the random-threshold-based algorithm, is developed. Finally, the effectiveness and efficiency of the proposed methods are verified through extensive experiments on real dataset and synthetic datasets generated by different distributions.

    • Algorithm for Detecting Overlapping Communities from Complex Network Big Data

      2017, 28(3):631-647. DOI: 10.13328/j.cnki.jos.005155 CSTR:

      Abstract (3662) HTML (1693) PDF 2.12 M (7545) Comment (0) Favorites

      Abstract:Currently, the number of Internet users, along with complex networks including online social networks and electronic commerce networks, is growing explosively. To effectively and efficiently detecting overlapping community structure from complex network, big data plays an essential role in point of interest recommendation and hotspot propagation. In this study, a new algorithm over complex networks is proposed to detecting overlapping communities with a time complexity of O(nlog2(n)). The algorithm applies a new method for updating node and edge modularity based on the techniques of modularity clustering and graph computing. Balanced binary tree is used to index the modularity increment, and an overlapping community detection approach is provided based on the idea of modularity optimization to reduce the frequency of node analysis compared to traditional approaches. Experiments are conducted on real complex network big data, and the results show that the DOC algorithm can effectively detect overlapping communities with high accuracy, the normalized mutual information (NMI) can reach to 0.97 in large-scale LFR benchmark datasets, and the overlapping community detecting standard F-score value is averagely higher than 0.91. In addition, the runtime efficiency beats traditional approaches in complex network big data.

    • Community Detection Algorithm Based on Deep Sparse Autoencoder

      2017, 28(3):648-662. DOI: 10.13328/j.cnki.jos.005165 CSTR:

      Abstract (3001) HTML (1484) PDF 1.94 M (5130) Comment (0) Favorites

      Abstract:Community structure is one of the most important features of complex network. Community detection is of great significance in exploring the network structure. Classical clustering algorithms such as k-means are the basic methods for community detection. However, the detection results are often not accurate enough when dealing with high-dimensional matrix when using these classical methods. In this study, a community detection algorithm based on deep sparse autoencoder (CoDDA) is proposed to improve the accuracy of community detection using high-dimensional adjacent matrix with the classical methods. First, a hop-based operation for sparse adjacent matrix is provided to obtain the similarity matrix, which can express not only the relations between nodes that are linked but also the relations between nodes that are not linked. Then, a deep sparse autoencoder based on unsupervised deep learning methods is designed to extract the features of similarity matrix and obtain the low-dimensional feature matrix which can represent the features of network topology better than similarity matrix. Finally, k-means is used to identify the communities according to the feature matrix. Experimental results show that CoDDA can obtain more accurate communities than the six baseline methods. Besides, the parameter analysis indicates that CoDDA can result in more accurate communities than the k-means algorithm which finds the communities according to the high-dimensional matrix directly.

    • Role-Based Structural Evolution and Prediction in Dynamic Networks

      2017, 28(3):663-675. DOI: 10.13328/j.cnki.jos.005164 CSTR:

      Abstract (2350) HTML (1465) PDF 1.79 M (3652) Comment (0) Favorites

      Abstract:Dynamic information network is a new challenging problem in the field of current complex networks. The evolution of dynamic networks is temporal, complex and changeable. Structure is the basic characteristics of the network, and is also the basis of network modeling and analysis. The study of the network structure evolution is of great importance in getting a comprehensive understanding of the behavior trend of complex systems. This paper introduces "role" to quantify the structure of dynamic network and proposes a role-based model. To predict the role distributions of dynamic network nodes in future time, the presented framework views role prediction as a multi-target regression problem, extracts properties from historical snapshot sub-network, and predicts the future role distributions of dynamic network nodes. The paper then proposes a multi-target regression based role prediction (MTR-RP) method for dynamic network. This method not only overcomes the drawback of the existing methods which operate on transfer matrix while ignoring the time factor, but also takes into account of possible dependencies between multiple forecast targets. Experiments results show that MTR-RP has better and more stable prediction capability compared with the existing methods.

    • Extracting Product Aspects and User Opinions Based on Semantic Constrained LDA Model

      2017, 28(3):676-693. DOI: 10.13328/j.cnki.jos.005154 CSTR:

      Abstract (2533) HTML (1817) PDF 2.14 M (5573) Comment (0) Favorites

      Abstract:With the development of online shopping, the Web has produced a large quantity of product reviews containing abundant evaluation knowledge about products. How to extract aspect and opinion words from the reviews and further obtain the sentiment polarity of the products at aspect level is the key problems to solve in fine-grained sentiment analysis of product reviews. First, considering certain features of Chinese product reviews, this paper designs methods to derive semantic relationships among words through syntactic analysis, word meaning understanding and context relevance, and then embed them as constrained knowledge into the topic model. Second, a semantic relation constrained topic model called SRC-LDA is proposed to guide the LDA to extract fine-grained topical words. Through the improvement of semantic comprehension and recognition ability of topical words in standard LDA, the proposed model can increase the words correlation under the same topic and the discrimination under the different topics, thus revealing more fine-grained aspect words, opinion words and their semantic associations. The experimental results show that SRC-LDA is an effective approach for fine-grained aspects and opinion words extraction.

    • Mining Topic Sentiment in Micro-Blogging Based on Micro-Blogger Social Relation

      2017, 28(3):694-707. DOI: 10.13328/j.cnki.jos.005157 CSTR:

      Abstract (2908) HTML (1232) PDF 1.64 M (4462) Comment (0) Favorites

      Abstract:Sentiment analysis in micro-blogging is an important task in mining social media, and has important theoretical and application value in personalized recommendation and public opinion analysis. Topic sentiment models have attracted much attention due to their good performance and ability of synchronized topic and the sentiment analysis in micro-blogs. However, most existing models simply assume that topic sentiment distributions of different micro-blogs are independent, which is contrary to the realistic status in micro-blogging and thus further leads to unsatisfactory modeling of micro-blogger's true sentiment. To address the issues, a probabilistic model, SRTSM (social relation topic sentiment model) is proposed. The new model introduces sentiment and micro-blogger social relation into LDA inference framework and achieves synchronized detection of sentiment and topic in micro-blogging. Extensive experiments on Sina Weibo show that SRTSM outperforms state-of-the-art unsupervised approaches including JST, SLDA and DPLDA significantly in terms of sentiment classification accuracy.

    • Diversified Mobile App Recommendation Combining Topic Model and Collaborative Filtering

      2017, 28(3):708-720. DOI: 10.13328/j.cnki.jos.005163 CSTR:

      Abstract (3284) HTML (1092) PDF 1.55 M (4423) Comment (0) Favorites

      Abstract:With rapid growth of mobile applications, users of mobile app platforms are facing problem of information overload. Large number of apps make it difficult for users to find appropriate ones, while many long tail apps are submerged in the resource pool and are unknown to most users. Meanwhile, existing recommendation methods usually pay more attention to accuracy than diversity, making popular apps as most recommended items. As a result, the overall exposure rate of mobile apps is low as behavior data accumulated by the system is gradually biased towards popular apps, which leads to a poor recommendation performance in the long run. To solve this problem, this article first proposes two recommendation methods, named LDA_MF and LDA_CF, to improve existing methods. LDA_MF combines user topic model and app topic model with matrix factorization model MF, and LDA_CF takes both tag information of apps and user behavior data into consideration. In order to take advantages of different algorithms and increase the diversity of recommendation results without sacrificing accuracy, a hybrid recommendation algorithm is also provided to combine LDA_MF, LDA_CF and item-based collaborative filtering models. A large real data set is used to evaluate the proposed methods, and the results show that the presented approach achieves better diversity and good recommendation accuracy.

    • Trust-Based Recommendation Algorithm in Social Network

      2017, 28(3):721-731. DOI: 10.13328/j.cnki.jos.005159 CSTR:

      Abstract (3053) HTML (1777) PDF 1.38 M (7429) Comment (0) Favorites

      Abstract:The existing trust-based recommendation algorithms usually assume that users are homogeneous, and therefore can't fully mine the trust relationship information. Moreover, the lack of efficient model for integrating similar relationship and trust relationship greatly affects the accuracy and reliability of those models. To solve the issue, this paper first proposes a trust-based recommendation algorithm called Trust-PMF. It combines similarity with trust to build user' preference and selects the target user's neighbors. Then, the probability matrix factorization model is extended by integrating memory-based idea and trust information, and a dynamic adaptive weight is used to determine the degree of influence of each part to form a unified and efficient Trust-PMF model. Finally, experiment results on Filmtrust and Epinions data sets are presented to demonstrate that the proposed method outperforms the state-of-the-art methods.

    • Dialysis Computing: Efficient InfoNetCube Materialization Strategy for OLGP

      2017, 28(3):732-743. DOI: 10.13328/j.cnki.jos.005170 CSTR:

      Abstract (2168) HTML (997) PDF 1.61 M (3106) Comment (0) Favorites

      Abstract:Calculation of the information network data cube (InfoNetCube) is the foundation of information online analytical processing. However, different from the traditional data cube, InfoNetCube consists of multiple lattices in which each cuboid contains a topic graph (or graph measurement), thus the storage consumption overhead is two orders of magnitude more than that of traditional data cube. How to materialize the specified cuboids or lattice rapidly and efficiently in the information network is a quite challenging research issue. In this paper, a novel InfoNetCube materializing strategy for information network is proposed based on dialysis computing. By leveraging the anti-monotonicity of topic graph measurement in the information and topology dimensions, a dialysis based space pruning algorithm is constructed to rapidly dialysis out the hidden sub graph, cuboids and lattices. Experimental results show that the proposed partial materialization algorithm outperform the cube based partial materialization strategy, saving almost 75% aggregation time.

Current Issue


Volume , No.

Table of Contents

Archive

Volume

Issue

联系方式
  • 《Journal of Software 》
  • 主办单位:Institute of Software, CAS, China
  • 邮编:100190
  • 电话:010-62562563
  • 电子邮箱:jos@iscas.ac.cn
  • 网址:https://www.jos.org.cn
  • 刊号:ISSN 1000-9825
  •           CN 11-2560/TP
  • 国内定价:70元
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063