2014, 25(4):693-712. DOI: 10.13328/j.cnki.jos.004551
Abstract:The development of mobile communication and sensing technologies has given rise to location-based big data, bringing revolutionary changes to human life, business models and scientific research. The diversity of usage patterns and the redundancy among the various sources of location-based big data make it impossible for classical location privacy preservation methods to protect privacy systematically. Privacy preservation for location-based big data measures a user's location privacy in all possible aspects and therefore protects privacy in an information-theoretic sense. Starting with an introduction to the concept of location-based big data, its associated privacy threats and a universal measurement-based attack model, this paper surveys the state of the art of privacy preservation techniques for location-based big data. According to their privacy protection strength, these techniques can be categorized into heuristic privacy measurement, probabilistic inference and private-information-retrieval based technologies. The principles, mechanisms and characteristics of the various techniques are described in detail, with special emphasis on an emerging research topic: private-information-retrieval based technology. Following a comprehensive analysis and comparison of existing techniques, privacy protection for location-based big data under situations such as the combination of location and non-location information and attackers with arbitrary background knowledge is highlighted as a future research direction.
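The heuristic privacy measurement family the survey mentions can be illustrated by spatial cloaking for location k-anonymity: the reported region around a user is enlarged until it covers at least k users, so an observer cannot tell which of them issued the query. The grid-free expansion model and all names below are illustrative assumptions, not taken from the surveyed papers.

```python
def cloak(user_xy, all_xy, k):
    """Return the smallest axis-aligned box around user_xy covering >= k users."""
    ux, uy = user_xy
    r = 0.0
    while True:
        box = (ux - r, uy - r, ux + r, uy + r)
        inside = [(x, y) for (x, y) in all_xy
                  if box[0] <= x <= box[2] and box[1] <= y <= box[3]]
        if len(inside) >= k:
            return box, len(inside)
        r += 1.0  # enlarge the cloaking region one step

users = [(0, 0), (1, 1), (2, 0), (5, 5), (6, 5)]
box, n = cloak((0, 0), users, k=3)  # region hiding the user among >= 3 users
```

The heuristic offers no information-theoretic guarantee, which is exactly the gap the survey's stronger probabilistic-inference and private-information-retrieval categories address.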
GUO Chi, LIU Jing-Nan, FANG Yuan, LUO Meng, CUI Jing-Song
2014, 25(4):713-730. DOI: 10.13328/j.cnki.jos.004570
Abstract:Vast amounts of geographical location information, vehicle trajectories and users' application location records have been accumulated from different location-based service (LBS) applications. These records form a location big data resource that facilitates mining human migration patterns, analyzing geographic conditions and building smart cities. Compared with traditional data mining, location big data has its own characteristics, including the variety of its sources, the complexity of the data and the sparsity of its data space. To restore and recreate a data analysis network model from location big data, this study applies data value extraction and cooperative mining to location big data to create trajectory behavior patterns and local geographical features. Three major aspects of analysis methods for location big data are systematically explained as follows: (1) for the variety of sources, how to extract potential contents, generate behavior patterns and discover the transfer features of moving objects in a partial region; (2) for the complexity of the data, how to conduct dimension-reduction analysis on complex location networks at temporal and spatial scales, and thus construct learning and inference methods for the mobility behavior of individuals in communities; (3) for sparsity, how to construct a global model of location big data using collaborative filtering and probabilistic graphical models. Finally, an integrated framework is provided for analyzing location big data with a software engineering approach. Under this framework, location data is used not only for analyzing traffic problems, but also for promoting cognition of a much wider range of human social and economic activities and mastering a better knowledge of nature. This study demonstrates the practical value of location big data.
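Point (3) above, collaborative filtering over a sparse location data space, can be sketched with a toy user-location visit matrix: a user's unknown interest in a place is estimated from users with similar visit histories via cosine similarity. The data, names and weighting scheme are assumptions for illustration only.

```python
from math import sqrt

visits = {                     # user -> {location: visit count}; sparse by design
    "u1": {"park": 3, "mall": 1},
    "u2": {"park": 2, "mall": 1, "museum": 4},
    "u3": {"museum": 5},
}

def cos(a, b):
    """Cosine similarity of two sparse count vectors stored as dicts."""
    common = set(a) & set(b)
    num = sum(a[k] * b[k] for k in common)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def predict(user, loc):
    """Similarity-weighted average of other users' visit counts for loc."""
    num = den = 0.0
    for other, hist in visits.items():
        if other != user and loc in hist:
            s = cos(visits[user], hist)
            num += s * hist[loc]
            den += s
    return num / den if den else 0.0

score = predict("u1", "museum")   # filled-in entry of the sparse matrix
```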
SONG Jie, GUO Chao-Peng, WANG Zhi, ZHANG Yi-Chuan, YU Ge, Jean-Marc PIERSON
2014, 25(4):731-752. DOI: 10.13328/j.cnki.jos.004569
Abstract:To address the new challenges that big data has brought to data storage, management and analysis, distributed file systems and the MapReduce programming model have been widely adopted in both industry and academia. This paper proposes a distributed MOLAP technique, named DOLAP (distributed OLAP), based on the Hadoop distributed file system (HDFS) and the MapReduce programming model. DOLAP adopts a specified multidimensional model to map dimensions and measures. It comprises a dimension coding and traversal algorithm to achieve the roll-up operation on dimension hierarchies, a partition and linearization algorithm to store dimensions and measures, a chunk selection strategy to optimize OLAP performance, and MapReduce to execute OLAP. In addition, the paper describes an application case of scientific data analysis and compares DOLAP's performance with other dominant non-relational data management systems. Experimental results show that DOLAP achieves a large advantage in OLAP performance at the cost of an acceptable performance loss in data loading.
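The linearization idea behind storing multidimensional chunks can be sketched as folding a cell's coordinates into a single row-major offset, so a chunk of measures can live in a flat file such as an HDFS block. The dimension sizes and function names below are made up for illustration; the paper's actual coding scheme may differ.

```python
dims = [4, 3, 5]          # cardinality along each dimension (toy values)

def linearize(coords, dims):
    """Fold per-dimension coordinates into one row-major array offset."""
    offset = 0
    for c, d in zip(coords, dims):
        offset = offset * d + c
    return offset

def delinearize(offset, dims):
    """Invert linearize: recover the coordinates from a flat offset."""
    coords = []
    for d in reversed(dims):
        coords.append(offset % d)
        offset //= d
    return list(reversed(coords))

off = linearize([2, 1, 3], dims)   # cell (2,1,3) -> one position in a flat chunk
```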
ZHU Yue-An, ZHANG Yan-Song, ZHOU Xuan, WANG Shan
2014, 25(4):753-767. DOI: 10.13328/j.cnki.jos.004568
Abstract:Integrating big data and traditional data warehouse (DW) techniques brings demand for real-time big data analysis. This new demand means a DW cannot depend too heavily on optimizations such as materialization and indexing, which consume large amounts of space, but instead needs to enhance its real-time analysis capability to handle big data workloads, which usually issue complex queries over huge data volumes. Such queries typically apply grouping or aggregation operators to the join result of a fact table and one or more dimension tables, and the join and group operations are often the bottlenecks for performance improvement. This paper studies OLAP performance on new hardware platforms in big data environments and develops a new OLAP query execution engine over columnar storage, called CDDTA-MMDB (columnar direct dimensional tuple access for main memory database query execution engine). Optimized materialization allows CDDTA-MMDB to reduce access to base tables and intermediate data structures during the join procedure. CDDTA-MMDB decomposes a query into sub-queries on the fact table and the dimension tables respectively. If a sub-query on a dimension table only serves as a filter, it produces the binary tuple <surrogate,Boolean_value>; otherwise, it produces a triplet of the form <surrogate,key,value>. Thus, by scanning the fact table in a single pass and using the foreign keys in the fact table to directly access the binary tuples or triplets, the executor can accomplish the join, filter and group operations. Full consideration is given to the design principles of main-memory columnar databases. Experimental results show that the system is efficient: it can be 2.5 times faster than MonetDB 5.5 and 5 times faster than the invisible join used by C-Store. Moreover, it scales linearly on multi-core processors.
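The one-pass join idea can be sketched on toy data: the dimension sub-query is evaluated first, yielding vectors indexed by the dimension's surrogate key (the <surrogate,Boolean_value> pairs of the abstract); a single scan of the fact table then uses each foreign key as a direct index into those vectors, completing join, filter and group-by without materializing an intermediate join result. Table contents and variable names are assumptions for illustration.

```python
# dimension column and its sub-query results, indexed by surrogate key
dim_region = ["Asia", "Europe", "Asia", "America"]
dim_filter = [r == "Asia" for r in dim_region]          # <surrogate, Boolean_value>
dim_value  = [r if f else None for r, f in zip(dim_region, dim_filter)]

# fact table columns: foreign key into the dimension, and a measure
fk      = [0, 1, 2, 3, 0, 2]
measure = [10, 20, 30, 40, 50, 60]

groups = {}
for f, m in zip(fk, measure):      # single pass over the fact table
    if dim_filter[f]:              # positional access replaces the join probe
        g = dim_value[f]
        groups[g] = groups.get(g, 0) + m
```

Because the foreign key doubles as an array offset, there is no hash table build or probe, which is the essence of why this style of join suits main-memory columnar engines.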
FU Yan-Yan, ZHANG Min, FENG Deng-Guo, CHEN Kai-Qu
2014, 25(4):768-780. DOI: 10.13328/j.cnki.jos.004565
Abstract:Recent research shows that social structures or non-sensitive attributes of users can increase the risk of sensitive attribute disclosure in social networks. Most existing attribute anonymization schemes have defects such as the lack of a proper model, excessive distortion of attribute distributions, and neglect of the influence of social structure and non-sensitive attributes on sensitive attributes. In this paper, an attribute privacy preservation scheme based on node anatomy is proposed. It allocates an original node's attribute links and social links to new nodes to improve the original node's anonymity, thus protecting users from sensitive attribute disclosure. Meanwhile, it measures the influence of social structure on attribute distribution and splits attributes according to their correlations. Experimental results show that the proposed scheme maintains high data utility and resists private attribute disclosure.
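The basic node-anatomy operation can be sketched as redistributing one user's attribute links and social links over several new nodes, so no single published node carries the user's full combination of ties and attributes. The round-robin split and all data below are simplifying assumptions; the paper's scheme splits according to measured attribute correlations rather than position.

```python
def anatomize(user, attrs, friends, parts):
    """Split one node's attribute links and social links across `parts` new nodes."""
    nodes = [{"id": f"{user}#{i}", "attrs": [], "friends": []} for i in range(parts)]
    for i, a in enumerate(attrs):
        nodes[i % parts]["attrs"].append(a)      # allocate attribute links
    for i, f in enumerate(friends):
        nodes[i % parts]["friends"].append(f)    # allocate social links
    return nodes

attrs = ["age:30", "disease:flu", "city:NY"]
friends = ["u2", "u3", "u4", "u5"]
nodes = anatomize("u1", attrs, friends, parts=2)
```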
CUI Ying-An, LI Xue, WANG Zhi-Xiao, ZHANG De-Yun
2014, 25(4):781-796. DOI: 10.13328/j.cnki.jos.004566
Abstract:Big data from online social media reflects the self-organizing relationships among actors and contains multiple levels of social entity relationships. As an emerging field in recent years, online social media sampling has important research value and practical significance in social computing. However, existing methods suffer from several problems: large Markov chains are difficult to parallelize, sampling easily becomes trapped in local regions, and the Markov chain burn-in process raises concerns. To address these issues, this paper presents a multi-stage cluster sampling method for online social media big data (OSM-MSCS). The proposed method first decomposes the integral cluster into small cohesive subgroups, then uses delayed rejection (DR) to sample typical online social relationships with parallel processing, and finally uses Gibbs sampling to choose interaction relationships in different cohesive subgroups to obtain a random sequence. Experimental results show that OSM-MSCS is an effective method for online social media big data, and that its sampling technique is better than BFS and MHRW.
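The MHRW baseline the paper compares against can be sketched in a few lines: a Metropolis-Hastings random walk accepts a move from node u to neighbor v with probability min(1, deg(u)/deg(v)), correcting the degree bias of a plain random walk toward a uniform node sample. The toy graph and names are assumptions for illustration.

```python
import random

graph = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}

def mhrw(start, steps, rng):
    """Metropolis-Hastings random walk; returns the visited-node sequence."""
    u, sample = start, []
    for _ in range(steps):
        v = rng.choice(graph[u])                 # propose a neighbor
        if rng.random() < min(1.0, len(graph[u]) / len(graph[v])):
            u = v                                # accept with prob min(1, deg(u)/deg(v))
        sample.append(u)                         # rejection repeats the current node
    return sample

rng = random.Random(7)                           # fixed seed for reproducibility
sample = mhrw(1, 200, rng)
```

A single such chain is inherently sequential, which is one of the parallelization problems the multi-stage cluster sampling method is designed to avoid.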
LI Ming-Peng, GAO Hong, ZOU Zhao-Nian
2014, 25(4):797-812. DOI: 10.13328/j.cnki.jos.004567
Abstract:This paper focuses on k-reach query processing based on graph compression. It proposes a k-reach query preserving graph compression algorithm, k-RPC, and a query processing algorithm that can answer queries on the compressed graph without decompression. k-RPC is optimal among all equivalence-class based compression algorithms that support k-reach queries. Since k-RPC is based on a strict equivalence relation, this study further proposes a linear approximate graph compression algorithm, k-GRPC. k-GRPC first removes some edges from the input graph, then utilizes k-RPC to achieve a better compression ratio. Novel linear query processing algorithms that answer k-reach queries on the compressed graph without decompression are introduced. Experiments on real datasets demonstrate that the compression ratios of the two algorithms can reach 45% for sparse graphs and 75% and 67%, respectively, for dense graphs. Compared with query processing on the original graphs, query performance on the compressed graphs is better; for sparse graphs it can be 2.5 times better.
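The k-reach query itself (is t reachable from s within k hops?) can be answered on an uncompressed graph by a depth-bounded BFS, sketched below on toy data. The paper's contribution is answering this directly on the compressed graph; that machinery is not reproduced here.

```python
from collections import deque

graph = {1: [2], 2: [3], 3: [4], 4: [], 5: [1]}   # toy adjacency lists

def k_reach(s, t, k):
    """True iff t is reachable from s within k hops (depth-bounded BFS)."""
    seen, frontier = {s}, deque([(s, 0)])
    while frontier:
        u, d = frontier.popleft()
        if u == t:
            return True
        if d < k:                                  # stop expanding at depth k
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    frontier.append((v, d + 1))
    return False
```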
CI Xiang, MA You-Zhong, MENG Xiao-Feng
2014, 25(4):813-825. DOI: 10.13328/j.cnki.jos.004564
Abstract:Top-K queries are widely used in many modern applications such as search engines and e-commerce. A Top-K query returns the most relevant results for the user from massive data, with the main purpose of eliminating the negative effects of information overload. Top-K queries on big data bring new challenges to data management and analysis. In light of the features of MapReduce, this paper presents an in-depth study of Top-K queries on big data from the perspectives of data partitioning and data filtering. Experimental results show that the proposed approaches deliver better performance and scalability.
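The partition-then-filter idea behind distributed Top-K can be sketched without the MapReduce framework itself: each partition (a map task) emits only its local top k, and the merge step (the reduce) keeps the global top k, so most records never cross the network. Data and function names are illustrative assumptions.

```python
import heapq

def local_topk(partition, k):
    """Map side: each partition forwards only its k largest values."""
    return heapq.nlargest(k, partition)

def global_topk(partitions, k):
    """Reduce side: merge the per-partition candidates into the global top k."""
    candidates = [x for p in partitions for x in local_topk(p, k)]
    return heapq.nlargest(k, candidates)

parts = [[5, 1, 9], [7, 3, 8], [2, 6, 4]]
top3 = global_topk(parts, 3)
```

This is correct because any value in the global top k must also be in the top k of its own partition, so the local filtering discards only values that cannot appear in the answer.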
HE Jin-Rong, DING Li-Xin, LI Zhao-Kui, HU Qing-Hui
2014, 25(4):826-838. DOI: 10.13328/j.cnki.jos.004571
Abstract:A novel supervised linear dimensionality reduction algorithm called margin discriminant projection (MDP) is proposed to extract low-dimensional features with good discriminant performance. MDP aims to minimize the maximum distance between samples belonging to the same class and maximize the minimum distance between samples belonging to different classes, while at the same time preserving the geometrical structure of the data manifold. Compared with classical margin-based algorithms, MDP is good at preserving global properties, such as the geometrical and discriminant structure of the data manifold, and can overcome the small sample size problem. Due to its low computational cost, MDP can be directly applied to dimensionality reduction of ultra-high-dimensional big data. Experimental results on five face data sets show its effectiveness for feature extraction on big data.
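The two quantities MDP's objective trades off can be computed directly on toy labeled data: the largest within-class pairwise distance (to be minimized in the projected space) and the smallest between-class pairwise distance (to be maximized). The projection learning itself is omitted, and the data below is made up for illustration.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]   # toy samples
y = [0, 0, 0, 1, 1]                            # class labels

pairs = [(dist2(X[i], X[j]), y[i] == y[j])
         for i in range(len(X)) for j in range(i + 1, len(X))]
max_intra = max(d for d, same in pairs if same)      # worst same-class spread
min_inter = min(d for d, same in pairs if not same)  # tightest class margin
```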
SUN Da-Wei, ZHANG Guang-Yan, ZHENG Wei-Min
2014, 25(4):839-862. DOI: 10.13328/j.cnki.jos.004558
Abstract:Batch computing and stream computing are two important forms of big data computing. Research and discussion on batch computing in big data environments is comparatively sufficient, but how to efficiently handle stream computing so as to meet requirements such as low latency, high throughput and continuously reliable operation, and how to build efficient stream big data computing systems, remain great challenges in big data computing research. This paper studies the data computing architecture and the key issues of stream computing in big data environments. First, it gives a brief summary of three application scenarios of stream computing, in business intelligence, marketing and public service, and shows the distinctive features of stream computing in big data environments, such as real-time behavior, volatility, burstiness, irregularity and infinity. A well-designed stream computing system is optimized in system structure, data transmission, application interfaces, high availability, and so on. The paper then offers detailed analyses and comparisons of five typical open-source stream computing systems for big data environments. Finally, it addresses new challenges of stream big data systems, such as scalability, fault tolerance, consistency, load balancing and throughput.
WANG Liang, ZHOU Guang-Yan, WANG Li-Wei, PENG Zhi-Yong
2014, 25(4):863-879. DOI: 10.13328/j.cnki.jos.004426
Abstract:In traditional database applications, data is generally considered to be accurate and available. However, data uncertainty often occurs in the real world. Most current methods use provenance information to track data uncertainty while focusing on uncertainty at the tuple level rather than the attribute level. Their main idea is to identify a tuple with a variable and then construct a Boolean expression based on provenance information to compute the tuple's probability. For tuples with many uncertain attributes, these methods cannot help users rapidly and correctly identify the source of the uncertainty. In this paper, attribute expressions are defined and used to construct a lineage expression for each result tuple. With the lineage expression, the new method can not only accurately trace the location where the uncertainty arises, but also compute the probability of the result tuple. Meanwhile, an exchange algorithm for lineage expressions is proposed to guarantee the correctness of the probability computation. To improve the efficiency of the probability computation, a method is also provided to construct shared paths and to compute the probabilities of atomic disjunctions while the shared paths are being constructed. Experiments compare tuple-level lineage expressions with existing methods in both time and cost. The results show the feasibility and validity of the proposed method, and further verify the effectiveness of using shared paths to speed up the probability computation.
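The tuple-level baseline can be sketched on toy data: each source tuple is an independent Boolean variable with an existence probability, and a result tuple's probability is that of its lineage formula, here written in DNF. The exhaustive world enumeration below is exponential, which is exactly the cost the paper's shared-path technique aims to reduce; all names and probabilities are illustrative.

```python
from itertools import product

probs = {"a": 0.9, "b": 0.5, "c": 0.2}   # source-tuple existence probabilities
lineage = [("a", "b"), ("c",)]           # DNF lineage: (a AND b) OR c

def tuple_probability(lineage, probs):
    """Sum the weights of all possible worlds that satisfy the lineage formula."""
    names = sorted(probs)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        w = 1.0
        for n in names:                  # independent tuples: multiply marginals
            w *= probs[n] if world[n] else 1 - probs[n]
        if any(all(world[v] for v in clause) for clause in lineage):
            total += w
    return total

p = tuple_probability(lineage, probs)
```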
SHEN Zhi-Rong, XUE Wei, SHU Ji-Wu
2014, 25(4):880-895. DOI: 10.13328/j.cnki.jos.004554
Abstract:With the rapid development of cloud computing, users are beginning to move their data to cloud servers in order to avoid troublesome local data management and to enjoy convenient services. To protect data security and user privacy, data is usually stored in the cloud in encrypted form, but this creates inconvenience when a user tries to retrieve the files containing certain keywords of interest. Searchable encryption (SE) is a recently developed cryptographic primitive that supports keyword search over encrypted data; it not only saves huge network bandwidth and computation capacity for users, but also migrates the cumbersome search operation to the cloud server to exploit its vast computational resources. This paper first introduces the research background and current development of SE schemes and compares the different features of symmetric-key based and public-key based SE schemes. The research status of the search queries supported by SE schemes is then reviewed, covering single-keyword search queries, conjunctive (multi-keyword) search queries and complex search queries, respectively. Finally, the study presents typical application scenarios of SE schemes and discusses possible development trends.
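The symmetric-key flavor of single-keyword SE can be sketched as a keyed-tag index: the server stores HMAC tags of each document's keywords under the user's secret key, and a search sends only the HMAC of the query keyword (the trapdoor), which the server matches without learning the keyword. This illustrates just the index structure, not a full secure scheme (tags are deterministic here, so equal keywords leak equality); all names and data are assumptions.

```python
import hashlib
import hmac

KEY = b"user-secret-key"   # held by the data owner, never sent to the server

def tag(word):
    """Deterministic keyed tag of a keyword; doubles as the search trapdoor."""
    return hmac.new(KEY, word.encode(), hashlib.sha256).hexdigest()

# server-side index: document id -> set of keyword tags (plaintext never stored)
index = {
    "doc1": {tag("cloud"), tag("privacy")},
    "doc2": {tag("privacy"), tag("search")},
}

def search(trapdoor):
    """Server-side matching: compare tags without seeing the keyword."""
    return sorted(d for d, tags in index.items() if trapdoor in tags)

hits = search(tag("privacy"))
```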
GE Jing-Guo, MI Wei, WU Yu-Lei
2014, 25(4):896-912. DOI: 10.13328/j.cnki.jos.004557
Abstract:As IPv4 address resources become exhausted, the transition from IPv4 to IPv6 is inevitable and fairly urgent. Existing IPv6 transition techniques encounter several critical issues in large-scale deployment; for example, the lack of a unified evaluation criterion makes it hard to select an appropriate transition strategy for a given scenario. This paper investigates and presents potential IPv6 transition scenarios and typical transition mechanisms in ISP networks, with emphasis on providing unified evaluation criteria, in terms of functionality, applications, performance, deployment and security, for IPv6 transition mechanisms, and on evaluating these techniques against the provided criteria. In addition, by virtue of the unified evaluation criteria, this study proposes a practical IPv6 transition deployment strategy for ISPs' core and access networks. Finally, considerations for IPv6 transition based on emerging technologies such as SDN, to ease and accelerate its deployment, are presented.