Strategy of Parallel Merge Join Based on Prune in Distributed Database

doi:10.13328/j.cnki.jos.005579

微信服务号

微信订阅号

2025-4-16- 7

Home > Archive>Volume 30, Issue 11, 2019 >3364-3381. DOI:10.13328/j.cnki.jos.005579

PDF HTML XML Export Cite reminder

Strategy of Parallel Merge Join Based on Prune in Distributed Database
DOI:
                        10.13328/j.cnki.jos.005579
                    
Author:
                        GAO Jin-TaoGAO Jin-Tao
School of Computer, Northwestern Polytechnical University, Xian 710129, China
Find this author on CNKI
Find this author on BaiDu
Search for this author on this site
LI Zhan-HuaiLI Zhan-Huai
School of Computer, Northwestern Polytechnical University, Xian 710129, China
Find this author on CNKI
Find this author on BaiDu
Search for this author on this site
DU Hong-TaoDU Hong-Tao
School of Computer, Northwestern Polytechnical University, Xian 710129, China
Find this author on CNKI
Find this author on BaiDu
Search for this author on this site
LIU Wen-JieLIU Wen-Jie
School of Computer, Northwestern Polytechnical University, Xian 710129, China
Find this author on CNKI
Find this author on BaiDu
Search for this author on this site

                    
Affiliation:
Clc Number:TP311
Fund Project:National Natural Science Foundation of China (61732014, 61672432, 61672434, 61472321); Natural Science Basic Research Plan of Shaanxi Province (2017JM6104)

Article

Figures

Metrics

Reference [31]

Related [20]

Cited by

Materials

Comments

Abstract:

Sort-merge join is an important implementation method of join in database system, and is more widely used than hash join. Under distributed environment, data is sharded and distributed across many nodes, and usually needed to be transmitted by network which is very expensive. Therefore, it is far more challenging to efficiently process sort-merge join in distributed database. Traditional strategy firstly sorts data, and then carries out merge-join based on sorted data, which are both related with original data. But original data usually has useless data blocks, which does not participate in join, but will increase the extra cost during join including network cost. The bigger of data size, the higher of possibility of useless data blocks. Traditional sort-merge join strategy does not prehandle these useless data. In this study, a parallel sort-merge join is proposed based on prune, called Pr_PSMJ, which can efficiently prune the useless data ahead from join data, and improve the efficiency of join. Firstly, a bilateral adjacency list (BAL) is constructed by the statistic information from shards of join data. Using BAL, the useless data of join data can be pruned and the correctness of final join result is guaranteed. Secondly, after the pruning, the optimal local-join executing place can be computed by BAL, and the quantity of data mitigating among nodes is minimized. Finally, during the join step, for the independence of local-join guaranteed by BAL, the executing of sort-merge join can be easily parallelled, and in every executing node, it is natural to parallel the local-partitial-join using multi-core environment. The final result is achieved by merging local-result. Because high efficient prune operation is done before executing join, Pr_PSMJ is almost fit for every sort-merge strategy, and it is a good lesson for other join strategies. The correctness, efficiency, and adaptability of algorithm are analyzed based on Pr_PSMJ. By experiments, it is proved that under distributed environment, orienting large data, Pr_PSMJ can effectively decrease the overhead of network and improve the join efficiency than other strategies.

Key words:distribution;sort-merge join;prune;bilateral adjacency list;parallel

Reference

[1] Merrett TH. Why sort-merge gives the best implementation of the natural join. ACM SIGMOD Record, 1983,13(2):39-51.[doi:10.1145/984523.984526]

[2] Graefe G. Sort-merge-join:An idea whose time has(h) passed? In:Proc. of the 10th Int'l Conf. on Data Engineering. IEEE, 1994.406-417.[doi:10.1109/ICDE.1994.283062]

[3] Yang ZK. The architecture of OceanBase relational database system. Journal of East China Normal University (Natural Sciences), 2014,2014(5):141-148,163(in Chinese with English abstract).[doi:10.3969/j.issn.1000-5641.2014.05.012]

[4] Barthels C, Müleler I, Schneider T, Alonso G, Hoefler T. Distributed join algorithms on thousands of cores. Proc. of the VLDB Endowment, 2017,10(5):517-528.[doi:10.14778/3055540.3055545]

[5] Albutiu MC, Kemper A, Neumann T. Massively parallel sort-merge joins in main memory multi-core database systems. Proc. of the VLDB Endowment, 2012,5(10):1064-1075.[doi:10.14778/2336664.2336678]

[6] Balkesen C, Alonso G, Teubner J, Özsu TM. Multi-core, main-memory joins:Sort vs. hash revisited. Proc. of the VLDB Endowment, 2013,7(1):85-96.[doi:10.14778/2732219.2732227]

[7] Kim C, Kaldewey T, Lee VW, Sedlar E, Nguyen AD, Satish N, Chhugani J, Blas AD, Dubey P. Sort vs. Hash revisited:fast join implementation on modern multi-core CPUs. Proc. of the VLDB Endowment, 2009,2(2):1378-1389.[doi:10.14778/1687553.1687564]

[8] Mishra P, Eich MH. Join processing in relational databases. ACM Computing Surveys (CSUR), 1992,24(1):63-113.[doi:10.1145/128762.128764]

[9] Shapiro LD. Join processing in database systems with large main memories. ACM Trans. on Database Systems (TODS), 1986,11(3):239-264.[doi:10.1145/6314.6315]

[10] Graefe G. Query evaluation techniques for large databases. ACM Computing Surveys (CSUR), 1993,25(2):73-169.[doi:10.1145/152610.152611]

[11] Haas PJ, Hellerstein JM. Ripple joins for online aggregation. ACM SIGMOD Record, 1999,28(2):287-298.[doi:10.1145/304181.304208]

[12] Lin X, Zeng X, Pu X, Sun Y. A cardinality estimation approach based on two level histograms. Journal of Information Science & Engineering, 2015,31(5):1733-1756.

[13] Dittrich JP, Seeger B, Taylor DS, Widmayer P. Progressive merge join:A generic and non-blocking sort-based join algorithm. In:Proc. of the 28th Int'l Conf. on Very Large Data Bases. VLDB Endowment, 2002.299-310.

[14] Luo G, Naughton JF, Ellmann CJ. A non-blocking parallel spatial join algorithm. In:Proc. of the 18th Int'l Conf. on Data Engineering. IEEE, 2002.697-705.[doi:10.1109/ICDE.2002.994786]

[15] Mokbel MF, Lu M, Aref WG. Hash-merge join:A non-blocking join algorithm for producing fast and early join results. In:Proc. of the 20th Int'l Conf. on Data Engineering. IEEE, 2004.251-262.[doi:10.1109/ICDE.2004.1320002]

[16] Vitorovic A, Elseidy M, Koch C. Load balancing and skew resilience for parallel joins. In:Proc. of the 2016 IEEE 32nd Int'l Conf. on Data Engineering (ICDE). IEEE, 2016.313-324.[doi:10.1109/ICDE.2016.7498250]

[17] Wilschut AN, Apers PMG. Pipelining in query execution. In:Proc. of the Int'l Conf. on Databases, Parallel Architectures and Their Applications (PARBASE'90). IEEE, 1990.562.[doi:10.1109/PARBSE.1990.77227]

[18] Aslam A, Ansari MS, Varshney S. Non-partitioning merge-sort:Performance enhancement by elimination of division in divide-and-conquer algorithm. In:Proc. of the 2nd Int'l Conf. on Information and Communication Technology for Competitive Strategies. ACM Press, 2016.1-6.[doi:10.1145/2905055.2905092]

[19] Perl Y, Itai A, Avni H. Interpolation search-A log logN search. Communications of the ACM, 1978,21(7):550-553.[doi:10.1145/359545.359557]

[20] Andersson A, Mattsson C. Dynamic interpolation search in o(loglogn) time. In:Proc. of the Int'l Colloquium on Automata, Languages & Programming. 1993.15-27.[doi:10.1007/3-540-56939-1_58]

[21] Corbett JC, Dean J, Epstein M, Fikes A, Frost C, Furman JJ, Gubarev A, Heiser C, Hochschild P, Hsieh W, Kanthak S, Kogan E, Li H, Lloyd A, Melnik S, Mwaura D, Nagle D, Quinlan S, Rao R, Rolig L, Saito Y, Szymaniak M, Taylor C, Wang R, Woodford D. Spanner:Google's globally distributed database. ACM Trans. on Computer Systems (TOCS), 2013,31(3):251-264.[doi:10.1145/2491245]

[22] Fan QS, Zhou MQ, Zhou AY. A distributed join algorithm on separated data storage. Chinese Journal of Computers, 2016,39(10):2102-2113(in Chinese with English abstract).[doi:10.11897/SP.J.1016.2016.02102]

[23] Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE. Bigtable:A distributed storage system for structured data. ACM Trans. on Computer Systems (TOCS), 2008,26(2):205-218.[doi:10.1145/1365815.1365816]

[24] Silberschatz A, Korth HF, Sudarshan S. Database System Concepts. 6th ed., New York:McGraw-Hill, 2010.

[25] Vengerov D, Menck AC, Zait M, et al. Join size estimation subject to filter conditions. Proc. of the VLDB Endowment, 2015,8(12):1530-1541.[doi:10.14778/2824032.2824051]

[26] Daenen J, Neven F, Tan T, Vansummeren S. Parallel evaluation of multi-semi-joins. Proc. of the VLDB Endowment, 2016,9(10):732-743.[doi:10.14778/2977797.2977800]

[27] Chu S, Balazinska M, Suciu D. From theory to practice:Efficient join query evaluation in a parallel database system. In:Proc. of the 2015 ACM SIGMOD Int'l Conf. on Management of Data. ACM Press, 2015.63-78.[doi:10.1145/2723372.2750545]

[28] TaoBao. OceanBase. 2018. https://github.com/alibaba/oceanbase

附中文参考文献:

[3] 阳振坤.OceanBase关系数据库架构.华东师范大学学报(自然科学版),2014,2014(5):141-148,163.[doi:10.3969/j.issn.1000-5641.2014.05.012]

[22] 樊秋实,周敏奇,周傲英.基线与增量数据分离架构下的分布式连接算法.计算机学报,2016,39(10):2102-2113.[doi:10.11897/SP.J.1016.2016.02102]

Get Citation

高锦涛,李战怀,杜洪涛,刘文洁.分布式数据库下基于剪枝的并行合并连接策略.软件学报,2019,30(11):3364-3381

Copy

Article Metrics

Abstract:3304
PDF: 4576
HTML: 1689
Cited by: 0

History

Received:July 27,2017
Revised:September 29,2017
Adopted:
Online: April 16,2018
Published:

You are the first2035257Visitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address：4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code：100190
Phone：010-62562563 Fax：010-62562533 Email：jos@iscas.ac.cn
Technical Support：Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063

微信服务号

微信订阅号

Get Citation

Share

微信扫一扫：分享

Article Metrics

History