Survey of Intelligent Partition and Layout Technology in Database System

CLC Number: TP311

    Abstract:

    In the era of big data, a growing number of analytical applications are driven by large-scale data. Quickly and efficiently extracting the information needed for analysis and decision-making from such massive data poses a great challenge to database systems. At the same time, modern business analysis and decision-making demand real-time data, which requires a database system to process both ACID transactions and complex analytical queries. However, traditional data partitioning is too coarse-grained to adapt to the dynamic changes of complex analytical workloads, and traditional data layouts are uniform and cannot cope with the growing number of hybrid transactional and analytical application scenarios. To address these problems, "intelligent data partition and layout" has become a current research hotspot. It extracts effective workload features through data mining, machine learning, and other techniques, and designs appropriate partitioning strategies that avoid scanning large amounts of irrelevant data and guide layout design to suit different types of workloads. This paper first introduces the background of data partition and layout techniques, and then elaborates on the research motivation, development trends, and key technologies of intelligent data partition and layout. Finally, it summarizes the field and discusses future research prospects.
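
As a rough illustration of why a workload-aware partitioning can avoid scanning irrelevant data, the following minimal Python sketch compares two layouts of the same column using per-partition min/max metadata (a zone map). It is not the method of any surveyed system: the names `Partition`, `make_partitions`, and `partitions_scanned`, the synthetic data, and the query range are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Partition:
    """A block of values plus the min/max metadata used for skipping."""
    rows: List[int]

    @property
    def zone(self) -> Tuple[int, int]:
        return min(self.rows), max(self.rows)


def make_partitions(values: List[int], size: int) -> List[Partition]:
    """Cut the values into fixed-size partitions in the given order."""
    return [Partition(values[i:i + size]) for i in range(0, len(values), size)]


def partitions_scanned(partitions: List[Partition], lo: int, hi: int) -> int:
    """Count partitions whose [min, max] range overlaps the query range [lo, hi]."""
    return sum(1 for p in partitions if p.zone[0] <= hi and p.zone[1] >= lo)


# Hypothetical data: 100 distinct values arriving in a scattered order.
values = [(i * 7) % 100 for i in range(100)]
query = (20, 29)  # hypothetical range predicate: 20 <= value <= 29

# Layout 1: partition in arrival order, ignoring the workload.
arrival_layout = make_partitions(values, 10)
# Layout 2: partition after sorting on the column the query filters on.
sorted_layout = make_partitions(sorted(values), 10)

scanned_arrival = partitions_scanned(arrival_layout, *query)
scanned_sorted = partitions_scanned(sorted_layout, *query)
print(f"arrival order: {scanned_arrival} of {len(arrival_layout)} partitions scanned")
print(f"sorted layout: {scanned_sorted} of {len(sorted_layout)} partitions scanned")
```

Sorting on a single column is only the simplest possible stand-in for a partitioning strategy. The point of the sketch is that the same min/max check skips far more partitions once the layout matches the predicates that the workload actually issues.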

Get Citation

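Liu Huan, Liu Pengju, Wang Tianyi, He Yuqi, Sun Luming, Li Cuiping, Chen Hong. Survey of Intelligent Partition and Layout Technology in Database System. Ruan Jian Xue Bao/Journal of Software, 2022, 33(10): 3819-3843.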

History
  • Received: January 19, 2021
  • Revised: April 15, 2021
  • Online: August 02, 2021
  • Published: October 06, 2022