一个基于三元组存储的列式OLAP查询执行引擎
作者:
基金项目:

国家科技重大专项(核高基)(2010ZX01042-001-002);国家自然科学基金(61272138,61232007);中国人民大学研究生科学研究基金(13XNH216)


Column-Oriented Query Execution Engine for OLAP Based on Triplet
Author:
  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [26]
  • |
  • 相似文献 [20]
  • |
  • 引证文献
  • | |
  • 文章评论
    摘要:

    大数据与传统的数据仓库技术相结合产生了大数据实时分析处理需要(volume+velocity),它要求大数据背景下的数据仓库不能过多地依赖物化、索引等高存储代价的优化技术,而要提高实时处理能力来应对大数据分析中数据量大、查询分析复杂等特点.这些查询分析操作一般表现为在事实表和维表之间连接操作的基础上对结果集上进行分组聚集等操作.因此,表连接和分组聚集操作是ROLAP(relational OLAP)性能的两个重要决定因素.研究了新硬件平台下针对大规模数据的OLAP查询的性能,设计新的列存储OLAP查询执行引擎CDDTA-MMDB(columnar direct dimensional tuple access-main memory databasequeryexecutionengine,直接维表元组访问的内存数据库查询执行引擎).基于三元组的物化策略,使得CDDTA-MMDB能够减少内存列存储模型上表连接操作访问基表和中间数据结构的次数.首先,CDDTA-MMDB将查询分解为作用在维表和事实表上的子查询,如果只涉及过滤操作,子查询将生成<代理键,布尔值>二元组;否则,子查询生成<代理键,关键字,值>三元组.然后,只需一趟扫描事实表,利用事实表的外键映射函数直接定位相应三元组或者二元组,完成相应的过滤、连接或聚集操作.CDDTA-MMDB充分考虑了内存列存储数据库的设计原则,尽量减少随机内存访问.实验结果表明:CDDTA-MMDB是高效的,与具代表性的列存储数据库相比,比MonetDB 5.5快2.5倍,比C-store的invisible join快5倍;并且,CDDTA-MMDB在多核处理器上具有线性加速比.

    Abstract:

    Integrating big data and traditional data warehouse (DW) techniques bring demand for real-time big data analysis. The new demand means DW can not depend too much on the optimization such as materialization and indexing which consume large space, but instead needs to enhance ability of real-time analysis to handle big data analysis which usually issues complex queries on huge data volumes. Those queries usually consist in applying group or aggregation operator on the join result between fact table and dimension table(s). The join and group operation often are the bottle-necks for performance improvement. This paper studies the OLAP performance under the new hardware platform and big data environment, and develops a new OLAP query execution engine in columnar storage, called CDDTA-MMDB (columnar direct dimensional tuple access for main memory database query execution engine). The optimized materialization makes CDDTA-MMDB reduce access to base table and intermediate data structure during join procedure. CDDTA- MMDB decomposes the query into sub-queries on the fact table and dimension table respectively. If the sub-query on dimension table only serves as filter, it will produce the binary tuple <surrogate,Boolean_value>; otherwise, it will produce the triplet in the form of <surrogate,key,value>. Thus, by just scanning the fact table one-pass and utilizing the mapping function of foreign keys in fact table to directly access the binary tuples or triplets, the executor can accomplish the join, filter and group operations. Consideration is fully placed on the design principle for the main-memory columnar database. Experimental results show that the system is efficient and can be 2.5 times faster than MonetDB 5.5 and 5 times faster than invisible join used by C-store. Moreover, it scales linearly on multi-core processors.

    参考文献
    [1] Douglas L. The Importance of ‘Big Data': A Definition. Gartner, G00235055, 2012.
    [2] Boncz P, Zukowski M, Nes N. MonetDB/X100: Hyper-Pipelining query execution. In: Ailamaki A, ed. Proc. of the CIDR 2005. Asilomar, 2005. 225-237.
    [3] Boncz P, Grust T, van Keulen M, Manegold S, Rittinger J, Teubner J. MonetDB/XQuery: A fast xquery processor powered by a relational engine. In: Chaudhuri S, Hristidis V, Polyzotis N, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Chicago: ACM Press, 2006. 479-490. [doi: 10.1145/1142473.1142527]
    [4] Larson PA, Hanson EN, Price SL. Columnar storage in SQL server 2012. In: Kementsietsidis A, Vaz Salles MN, eds. Proc. of the IEEE 28th Int'l Conf. on Data Engineering (ICDE 2012).Washington: IEEE Computer Society, 2012. 15-20.
    [5] Boncz PA, Kersten ML. MIL primitives for querying a fragmented world. In: Atkinson MP, Orlowska ME, Valduriez P, Zdonik SB, Brodie ML, eds. Proc. of the 25th Int'l Conf. on Very Large Data Bases (VLDB'99). Edinburgh: Morgan Kaufmann Publishers, 1999. 101-119.
    [6] Plattner H. A common database approach for OLTP and OLAP using an in-memory column database. In: Çetintemel U, Zdonik SB, Kossmann D, Tatbul N, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD 2009). Rhode Island: ACM Press, 2009. 1-2. [doi: 10.1145/1559845.1559846]
    [7] Sikka V, Färber F, Lehner W, Cha SK, Peh T, Bornhövd C. Efficient transaction processing in SAP HANA database—The end of a column store myth. In: Candan KS, Chen Y, Snodgrass RT, Gravano L, Fuxman A, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD 2012). Scottsdale: ACM Press, 2012. 731-741. [doi: 10.1145/2213836.2213946]
    [8] Zhang YS, Jiao M, Wang ZW, Wang S, Zhou X. One-Size-Fits-All OLAP technique for big data analysis. Chinese Journal of Computers, 2011,34(10):1936-1946 (in Chinese with English abstract). [doi: 10.3724/SP.J.1016.2011.01936]
    [9] Wang HJ, Qin XP, Zhang YS, Wang S, Wang ZW. LinearDB: A relational approach to make data warehouse scale like MapReduce. In: Yu JX, Kim MH, Unland R, eds. Proc. of the 16th Int'l Conf. on Database Systems for Advanced Applications (DASFAA 2011). Lecture Notes in Computer Science, Hong Kong, 2011. 306-320. [doi: 10.1007/978-3-642-20152-3_23]
    [10] Wang S, Wang HJ, Qin XP, Zhou X. Architecting big data: Challenges, studies and forecasts. Chinese Journal of Computers, 2011, 34(10):1742-1752 (in Chinese with English abstract). [doi: 10.3724/SP.J.1016.2011.01741]
    [11] Copeland GP, Khoshaian S. A decomposition storage mode. In: Navathe SB, ed. Proc. of the 1985 ACM SIGMOD Int'l Conf. on Management of Data. Austin: ACM Press, 1985. 268-279. [doi: 10.1145/318898.318923]
    [12] Boncz P, Kersten ML, Manegold S. Breaking the memory wall in MonetDB. Communications of the ACM, 2008,51:77-85. [doi: 10.1145/1409360.1409380]
    [13] Stonebraker M, Abadi DJ, Batkin A, Chen XD, Cherniack M, Ferreira M, Lau E, Lin A, Madden S, O'Neil E, O'Neil P, Rasin A, Tran T, Zdonik S. C-Store: A column-oriented DBMS. In: Böhm K, Jensen CS, Haas LM, Kersten ML, Larson PA, Ooi BC, eds. Proc. of the 31st Int'l Conf. on Very Large Data Bases. Trondheim: ACM Press, 2005. 553-564.
    [14] Selinger PG, Astrahan MM, Chamberlin DD, Lorie RA, Price TG. Access path selection in a relational database. In: Bernstein PA, ed. Proc. of the 1979 ACM SIGMOD Int'l Conf. on Management of Data. Boston: ACM Press, 1979. 23-34.
    [15] Hong W, Stonebraker M. Exploiting InteroperatorParallelism in XPRS. In: Stonebraker M, ed. Proc. of the 1992 ACM SIGMOD Int'l Conf. on Management of Data. San Diego: ACM Press, 1992. 19-28.
    [16] Abadi DJ, Madden SR, Hachem N. Column-Stores vs. row-stores: How different are they really. In: Wang JTL, ed. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD 2008). Vancouver: ACM Press, 2008. 967-980. [doi: 10. 1145/1376616.1376712]
    [17] Abadi DJ, Myers DS, DeWitt DJ, Madden SR. Materialization strategies in a column-oriented DBMS. In: Chirkova R, Dogac A, Özsu MT, Sellis TK, eds. Proc. of the 23rd Int'l Conf. on Data Engineering (ICDE 2007). Istanbul: IEEE, 2007. 466-475. [doi: 10. 1109/ICDE.2007.367892]
    [18] Halverson A, Beckmann J, Naughton J. A comparisonof C-store and row-store in a common framework. Technical Report, TR1566, UW Madison Department of CS, 2006.
    [19] Harizopoulos S, Liang V, Abadi DJ, Madden S. Performance tradeoffs in read-optimized databases. In: Dayal U, Whang KY, Lomet DB, Alonso G, Lohman GM, Kersten ML, Cha SK, Kim YK, eds. Proc. of the 32nd Int'l Conf. on Very Large Data Bases. Seoul: ACM Press, 2006. 487-498.
    [20] Harizopoulos S, Abadi D, Madden S, Stonebraker M. OLTP through the looking glass, and what we found there. In: Wang JTL, ed. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD 2008). ACM Press, 2008. 981-992. [doi: 10.1145/ 1376616.1376713]
    [21] DeBrabant J, Pavlo A, Tu S, Stonebraker M, Zdonik S. Anti-Caching: A new approach to database management system architecture. Proc. of the VLDB Endowment, 2013,6(14):1942-1953.
    [22] Youssefiand K, Wong E. Query processing in a relational database Management system. In: Furtado AL, Morgan HL, eds. Proc. of the 5th Int'l Conf. on Very Large Data Bases. Rio de Janeiro: IEEE, 1979. 409-417.
    [23] Held GD, Stonebraker M, Wong E. INGRES—A relational data base management system. In: Proc. of the 1975 National Computer Conf. on American Federation of Information Processing Societies. Anaheim: AFIPS Press, 1975. 409-417. http://dl.acm.org/ citation.cfm?doid=1499949.1500029
    [24] Garcia-Molina H, Ullman JD, Widom J. Database System Implementation. Upper Saddle River: Prentice Hall, 2000.
    [25] O'Neil PE, O'Neil EJ, Chen X. The star schema benchmark (SSB). 2009. http://www.cs.umb.edu/-poneil/StarSchemaB.PDF
    [26] O'Neil PE, Chen X, O'Neil EJ. Adjoined dimension column index to improve star schema query performance. In: Alonso G, Blakeley JA, Chen ALP, eds. Proc. of the 24th Int'l Conf. on Data Engineering (ICDE 2008). Cancún: IEEE, 2008. 1409-1411.
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

朱阅岸,张延松,周烜,王珊.一个基于三元组存储的列式OLAP查询执行引擎.软件学报,2014,25(4):753-767

复制
分享
文章指标
  • 点击次数:6363
  • 下载次数: 8105
  • HTML阅读次数: 2557
  • 引用次数: 0
历史
  • 收稿日期:2013-10-13
  • 最后修改日期:2014-01-27
  • 在线发布日期: 2014-03-28
文章二维码
您是第19831091位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号