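Distributed MOLAP Technique for Big Data Analysis
Authors: Song Jie, Guo Chaopeng, Wang Zhi, Zhang Yichuan, Yu Ge, Jean-Marc Pierson
Funding:

National Natural Science Foundation of China (61202088); Fundamental Research Funds for the Central Universities (N120817001); China Postdoctoral Science Foundation, General Program (2013M540232); Ph.D. Programs Foundation of the Ministry of Education of China (20120042110028); MOE-Intel Special Research Fund for Information Technology (MOE-INTEL-2012-06)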
    Abstract:

    The scale of big data poses major challenges for data storage, management, and analysis, and both industry and academia have widely adopted distributed file systems and the MapReduce programming model in response. This paper proposes a distributed MOLAP technique for big data environments, named DOLAP (distributed OLAP), built on the Hadoop distributed file system (HDFS) and the MapReduce programming model. DOLAP uses a specialized multidimensional model to map dimensions and measures; dimension coding and traversal algorithms to implement roll-up and drill-down along dimension hierarchies; data chunking and linearization algorithms to store dimensions and measures in the distributed file system; a chunk selection algorithm to optimize OLAP performance; and MapReduce to execute OLAP operations. The paper also describes an application of DOLAP to scientific data analysis and compares its performance with mainstream non-relational database systems. Experimental results show that, although its data loading performance is somewhat weaker, DOLAP outperforms OLAP implementations built on mainstream non-relational database systems such as HBase, Hive, HadoopDB, and OLAP4Cloud.
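    The abstract names chunking, linearization, and chunk selection without giving their details. The following is a minimal sketch of the general idea only, assuming a dense cube split into fixed-size chunks with row-major ordering inside each chunk. The function names (`linearize`, `locate`, `select_chunks`) and the layout are illustrative assumptions, not the algorithms defined in the paper.

```python
from itertools import product


def linearize(coords, shape):
    """Row-major offset of a cell at `coords` in an array of the given `shape`."""
    offset = 0
    for c, n in zip(coords, shape):
        if not 0 <= c < n:
            raise ValueError(f"coordinate {c} outside [0, {n})")
        offset = offset * n + c
    return offset


def locate(cell, chunk_shape):
    """Split a global cell coordinate into (chunk coordinate, offset inside the chunk)."""
    chunk = tuple(c // s for c, s in zip(cell, chunk_shape))
    local = tuple(c % s for c, s in zip(cell, chunk_shape))
    return chunk, linearize(local, chunk_shape)


def select_chunks(lo, hi, chunk_shape):
    """Chunk coordinates that overlap the inclusive cell range lo..hi on every dimension."""
    ranges = [range(l // s, h // s + 1) for l, h, s in zip(lo, hi, chunk_shape)]
    return list(product(*ranges))


# Hypothetical 2-dimensional cube stored in chunks of 4 x 3 cells.
chunk_shape = (4, 3)
cell_location = locate((5, 4), chunk_shape)
touched = select_chunks((0, 0), (3, 5), chunk_shape)
```

    In this sketch a range query reads only the chunks returned by `select_chunks` rather than the whole cube, which is the intuition behind using chunk selection to reduce I/O. In a MapReduce setting each selected chunk could be handed to one map task, though that mapping is also an assumption here.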

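Cite this article

Song Jie, Guo Chaopeng, Wang Zhi, Zhang Yichuan, Yu Ge, Jean-Marc Pierson. Distributed MOLAP technique for big data analysis. Ruan Jian Xue Bao/Journal of Software, 2014,25(4):731-752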

History
  • Received: 2013-10-15
  • Last revised: 2014-01-27
  • Published online: 2014-03-28