主页期刊介绍编委会编辑部服务介绍道德声明在线审稿编委办公编辑办公English
2018-2019年专刊出版计划 微信服务介绍 最新一期:2019年第11期
     
在线出版
各期目录
纸质出版
分辑系列
论文检索
论文排行
综述文章
专刊文章
美文分享
各期封面
E-mail Alerts
RSS
旧版入口
中国科学院软件研究所
  
投稿指南 问题解答 下载区 收费标准 在线投稿
高锦涛,李战怀,杜洪涛,刘文洁.分布式数据库下基于剪枝的并行合并连接策略.软件学报,2019,30(11):3364-3381
分布式数据库下基于剪枝的并行合并连接策略
Strategy of Parallel Merge Join Based on Prune in Distributed Database
投稿时间:2017-07-27  修订日期:2017-09-29
DOI:10.13328/j.cnki.jos.005579
中文关键词:  分布式  排序合并连接  剪枝  双边邻接表  并行
英文关键词:distribution  sort-merge join  prune  bilateral adjacency list  parallel
基金项目:国家自然科学基金(61732014,61672432,61672434,61472321);陕西省基础研究计划(2017JM6104)
作者单位E-mail
高锦涛 西北工业大学 计算机学院, 陕西 西安 710129 gaojintao@mail.nwpu.edu.cn 
李战怀 西北工业大学 计算机学院, 陕西 西安 710129  
杜洪涛 西北工业大学 计算机学院, 陕西 西安 710129  
刘文洁 西北工业大学 计算机学院, 陕西 西安 710129  
摘要点击次数: 1075
全文下载次数: 577
中文摘要:
      排序合并连接是数据库系统一种重要的连接实现方式,比哈希连接有更广泛的应用.分布式环境下,数据分片、分布存储,面对昂贵的网络代价,进行高效排序合并连接的挑战巨大.传统策略首先针对连接数据进行排序,然后基于排好序的数据执行合并连接.这两部分操作均基于原始数据进行操作,通常情况下,原始连接数据存在无用数据块,这些数据块无需连接,但会增加额外开销,包括网络开销.随着数据量的增多,出现无用数据块的概率增大,额外开销随之增多.传统策略没有预先处理这些无用数据块.针对这个问题,提出一种分布式环境下基于剪枝的并行排序合并连接策略(parallel sort-merge join based on prune,简称Pr_PSMJ).其特点是,连接发生之前高效完成对连接对象无用数据块的剪枝处理,提高整体连接效率.基本思想是,根据连接对象对应的连接分区数据统计信息,构造一种双边邻接表(bilateral adjacency list,简称BAL),用来对连接数据中无用数据块进行剪枝,并保证最终连接结果的正确性;剪枝完成后,利用BAL计算出各个最佳本地连接执行点,并指导分区数据的迁移,使数据移动量最小;在连接阶段,由于BAL保证本地连接执行节点的独立性,因此能够轻松并行执行整个连接过程,并在每个连接点本地利用多核环境完成局部并行排序合并连接;最后,将局部结果合并成最终结果.由于Pr_PSMJ中的高效剪枝策略是在连接执行之前完成的,因此几乎适合任何合并连接操作,并且对于其他连接策略也有借鉴作用.给出了基于Pr_PSMJ的算法的正确性、效率性以及适应性分析,并且给出实验验证,证明了在分布式大数据量排序合并连接情况下,Pr_PSMJ相对于其他策略能够有效减少网络开销,并提高连接效率.
英文摘要:
      Sort-merge join is an important implementation method of join in database system, and is more widely used than hash join. Under distributed environment, data is sharded and distributed across many nodes, and usually needed to be transmitted by network which is very expensive. Therefore, it is far more challenging to efficiently process sort-merge join in distributed database. Traditional strategy firstly sorts data, and then carries out merge-join based on sorted data, which are both related with original data. But original data usually has useless data blocks, which does not participate in join, but will increase the extra cost during join including network cost. The bigger of data size, the higher of possibility of useless data blocks. Traditional sort-merge join strategy does not prehandle these useless data. In this study, a parallel sort-merge join is proposed based on prune, called Pr_PSMJ, which can efficiently prune the useless data ahead from join data, and improve the efficiency of join. Firstly, a bilateral adjacency list (BAL) is constructed by the statistic information from shards of join data. Using BAL, the useless data of join data can be pruned and the correctness of final join result is guaranteed. Secondly, after the pruning, the optimal local-join executing place can be computed by BAL, and the quantity of data mitigating among nodes is minimized. Finally, during the join step, for the independence of local-join guaranteed by BAL, the executing of sort-merge join can be easily parallelled, and in every executing node, it is natural to parallel the local-partitial-join using multi-core environment. The final result is achieved by merging local-result. Because high efficient prune operation is done before executing join, Pr_PSMJ is almost fit for every sort-merge strategy, and it is a good lesson for other join strategies. The correctness, efficiency, and adaptability of algorithm are analyzed based on Pr_PSMJ. By experiments, it is proved that under distributed environment, orienting large data, Pr_PSMJ can effectively decrease the overhead of network and improve the join efficiency than other strategies.
HTML  下载PDF全文  查看/发表评论  下载PDF阅读器
 

京公网安备 11040202500064号

主办单位:中国科学院软件研究所 中国计算机学会 京ICP备05046678号-4
编辑部电话:+86-10-62562563 E-mail: jos@iscas.ac.cn
Copyright 中国科学院软件研究所《软件学报》版权所有 All Rights Reserved
本刊全文数据库版权所有,未经许可,不得转载,本刊保留追究法律责任的权利