应对倾斜数据流在线连接方法

doi:10.13328/j.cnki.jos.005440

微信服务号

微信订阅号

2025年8月13日 4:28 星期三

首页 > 过刊浏览>2018年第29卷第3期 >869-882. DOI:10.13328/j.cnki.jos.005440

PDF HTML阅读 XML下载导出引用引用提醒

应对倾斜数据流在线连接方法
DOI:
                        10.13328/j.cnki.jos.005440
                    
CSTR:
                        
                    
作者:
                        王春凯王春凯
中国人民大学 信息学院, 北京 100872
在期刊界中查找
在百度中查找
在本站中查找
孟小峰孟小峰
中国人民大学 信息学院, 北京 100872
在期刊界中查找
在百度中查找
在本站中查找

                    
作者单位:
作者简介:王春凯(1981-),男,河南开封人,博士,主要研究领域为分布式数据流管理;孟小峰(1964-),男,博士,教授,博士生导师,CCF会士,主要研究领域为Web数据管理,移动数据管理,XML数据管理,云数据管理,隐私保护.
通讯作者:孟小峰,E-mail:xfmeng@ruc.edu.cn
中图分类号:TP311
基金项目:国家自然科学基金（61532016，61379050，61532010，91646203，61762082）；国家重点研发计划（2016YFB1000602，2016YFB1000603）；中国人民大学科学研究基金（11XNL010）；河南省科技开放合作项目（172106000077）

Online Join Method for Skewed Data Streams

Author:

WANG Chun-Kai
WANG Chun-Kai
School of Information, Renmin University of China, Beijing 100872, China
在期刊界中查找
在百度中查找
在本站中查找
MENG Xiao-Feng
MENG Xiao-Feng
School of Information, Renmin University of China, Beijing 100872, China
在期刊界中查找
在百度中查找
在本站中查找

Affiliation:

Fund Project:

National Natural Science Foundation of China (61532016, 61379050, 61532010, 91646203, 61762082);The National Key Research and Development Program of China (2016YFB1000602, 2016YFB1000603);The Research Funds of Renmin University (11XNL010);the Science and Technology Opening up Cooperation project of He'nan Province (172106000077)

摘要

图/表

访问统计

参考文献 [30]

相似文献 [20]

引证文献

资源附件

文章评论

摘要:

并行环境下的分布式连接处理要求制定划分策略以减少状态迁移和通信开销.相对于数据库管理系统而言，分布式数据流管理系统中的在线θ连接操作需要更高的计算成本和内存资源.基于完全二部图的连接模型可支持分布式数据流的连接操作.因为连接操作的每个关系仅存放于二部图模型的一侧处理单元，无需复制数据，且处理单元相互独立，因此该模型具有内存高效、易伸缩和可扩展等特性.然而，由于数据流速的不稳定性和属性值分布的不均衡性，导致倾斜数据流的连接操作易出现集群负载不均衡的现象.针对倾斜数据流的连接操作，模型无法动态分配查询节点，并需要人工干预数据分组的参数设置.尤其是应对全部历史数据的连接查询，模型效率更低.基于上述问题，提出了管理倾斜数据流连接的框架，使用基于键值和元组混合的划分样式，有效应对二部图模型的各侧倾斜数据.设计了重新动态分配查询节点的策略和状态迁移算法，以支持全历史数据的连接查询和自适应的资源管理.针对合成数据和真实数据的实验结果表明，该方案可有效应对倾斜数据的连接操作，并进一步提升分布式数据流管理系统的吞吐率，特别是降低云环境中的计算成本.

关键词:分布式数据流管理系统;在线连接;数据倾斜;状态迁移;二部图连接模型

Abstract:

Scalable distributed join processing in a parallel environment requires a partitioning policy to transfer data while minimizing the size of migrated statement and the number of communicated messages. Online theta-joins over data streams are more computationally expensive and impose higher memory requirement in distributed data stream management systems (DDSMS) than standalone database management systems (DBMS). The complete bipartite graph-based model can support distributed stream joins, and has the characteristics of memory-efficiency, elasticity and scalability. This is because each relation is stored in its corresponding processing units without data replicas and the units are independent of each other. However, due to the instability of data stream rate and the imbalance of attribute value distribution, the online theta-joins over skewed data streams can lead to the load imbalance of cluster. In this case, the bipartite graph-based model is unable to allocate the query nodes dynamically, and requires to set parameters about the grouping manually. The more serious issue is that the effect of the full-history join is worse. In this paper, a framework for handling skewed stream join is presented for enhancing the adaptability of the join model and minimizing the system cost based on the varying workloads. The proposal includes a mixed key-based and tuple-based partitioning scheme to handle skewed data in each side of the bipartite graph-based model, a strategy for redistribution of query nodes in two sides of this model, and a migration algorithm about state consistency to support full-history joins and adaptive resource management. Experiments with synthetic data and real data show that the presented method can effectively handle skewed data streams and improve the throughput of DDSMS, and it also effective especially on reducing the operational cost in the cloud environment.

Key words:distributed data stream management system;online join;data skew;state migration;bipartite graph-based join model

参考文献

[1] Sun DW, Zhang GY, Zheng WM. Big data stream computing:Technologies and instances. Ruan Jian Xue Bao/Journal of Software, 2014,25(4):839-862(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/4558.htm[doi:10.13328/j.cnki.jos. 004558]

[2] Cui XC, Yu XH, Liu Y, Lü ZY. Distributed stream processing:A survey. Journal of Computer Research and Development, 2015, 52(2):318-332(in Chinese with English abstract).[doi:10.7544/issn1000-1239.2015.20140268]

[3] Wang CK, Meng XF. Relational query techniques for distributed data stream:A survey. Computer Journal of Computers, 2016, 39(1):80-96(in Chinese with English abstract).[doi:10.11897/SP.J.1016.2016.00080]

[4] Neumeyer L, Robbins B, Nair A, Kesari A. S4:Distributed stream computing platform. In:Proc. of the 10th EEE Int'l Conf. on Data Mining Workshops. Sydney:IEEE, 2010. 170-177.[doi:10.1109/ICDMW.2010.172]

[5] Toshniwal A, Taneja S, Shukla A, Ramasamy K, Pater JM. Storm@Twitter. In:Proc. of the 2014 ACM Int'l Conf. on Management of Data. Snowbird:ACM Press, 2014. 147-156.[doi:10.1145/2588555.2595641]

[6] Squall. https://github.com/epfldata/squall/

[7] Calcite. http://calcite.apache.org/

[8] Hwang JH, Balazinska M, Rasin A, Cetintemel U, Stonebraker M. High-Availability algorithms for distributed stream processing. In:Proc. of the 21st Int'l Conf. on Data Engineering. Tokyo:IEEE, 2005. 779-790.[doi:10.1109/ICDE.2005.72]

[9] Fernandez RC, Migliavacca M, Kalyvianaki E, Pietzuch P. Integrating scale out and fault tolerance in stream processing using operator state management. In:Proc. of the 2013 ACM SIGMOD Int'l Conf. on Management of Data. New York:ACM Press, 2013. 725-736.[doi:10.1145/2463676.2465282]

[10] Walton CB, Dale AG, Jenevein RM. A taxonomy and performance model of data skew effects in parallel joins. In:Proc. of the 17th Int'l Conf. on Very Large Data Bases. Barcelona:Morgan Kaufmann Publishers, 1991. 537-548.

[11] Elseidy M, Elguindy A, Vitorovic A, Koch C. Scalable and adaptive online joins. Proc. of the VLDB Endowment, 2014,7(6):441-452.[doi:10.14778/2732279.2732281]

[12] Lin Q, Ooi BC, Wang Z, Yu C. Scalable distributed stream join processing. In:Proc. of the 2015 ACM SIGMOD Int'l Conf. on Management of Data. Melbourne:ACM Press, 2015. 811-825.[doi:10.1145/2723372.2746485]

[13] Ananthanarayanan R, Basker V, Das S, Gupta A, Jiang H. Photon:Fault-tolerant and scalable joining of continuous data streams. In:Proc. of the ACM 2013 Int'l Conf. on Management of Data. New York:ACM Press, 2013. 577-588.[doi:10.1145/2463676. 2465272]

[14] Zaharia M, Das T, Li H, Hunter T, Shenker S. Discretized streams:Fault-tolerant streaming computation at scale. In:Proc. of the ACM SIGOPS 24th Symp. on Operating Systems Principles. Farmington:ACM Press, 2013. 423-438.[doi:10.1145/2517349. 2522737]

[15] Spark streaming. https://spark.apache.org/streaming/

[16] Spark. http://spark.apache.org/

[17] Zaharia M, Chowdhury M, Das T, Dave A, Ma J. Resilient distributed datasets:A fault-tolerant abstraction for in-memory cluster computing. In:Proc. of the 9th USENIX Symp. on Networked Systems Design and Implementation. San Jose:USENIX, 2012. 15-28.

[18] Vitorovic A, Elseidy M, Koch C. Load balancing and skew resilience for parallel joins. In:Proc. of the 32nd IEEE Int'l Conf. on Data Engineering. Helsinki:IEEE, 2016. 313-324.[doi:10.1109/ICDE.2016.7498250]

[19] Fang JH, Zhang R, Wang XT, Fu TZJ, Zhang ZJ, Zhou AY. Cost-Effective stream join algorithm on cloud system. In:Proc. of the 25th ACM Int'l Conf. on Information and Knowledge Management. Indianapolis:ACM Press, 2016. 1773-1782.[doi:10.1145/2983323.2983773]

[20] Fang JH, Zhang R, Fu TZJ, Zhang ZJ, Zhou AY, Zhu JH. Parallel stream processing against workload skewness and variance. In:Proc. of the 26th Int'l Symp. on High-Performance Parallel and Distributed Computing. Washington:ACM Press, 2017. 15-26.[doi:10.1145/3078597.3078613]

[21] Karmarkar N, Karp RM. An efficient approximation scheme for the one-dimensional bin-packing problem. In:Proc. of the 23rd Annual Symp. on Foundations of Computer Science. Chicago:IEEE, 1982. 312-320.[doi:10.1109/SFCS.1982.61]

[22] Kafka. http://kafka.apache.org/

[23] Li HY, Ghodsi A, Zaharia M, Shenker S, Stoica I. Tachyon:Reliable, memory speed storage for cluster computing frameworks. In:Proc. of the 2014 ACM Symp. on Cloud Computing. Seattle:ACM Press, 2014. 1-15.[doi:10.1145/2670979.2670985]

[24] Ding JB, Fu TZJ, Ma RTB, Winslett M, Yang Y, Zhang ZJ, Chao HY. Optimal operator state migration for elastic data stream processing. HAL-INRIA, 2015,22(3):1-8.

[25] TPCH. http://www.tpc.org/tpch

[26] Arasu A, Cheniack M, Maier D, Maskey AS, Ryvkina E, Stonebraker M, Tibbetts R. Linear road:A stream data management benchmark. In:Proc. of the 30th Int'l Conf. on Very Large Data Bases. Toronto:Morgan Kaufmann Publishers, 2004. 480-491.[doi:10.1016/B978-012088469-8.50044-9]

附中文参考文献:

[1] 孙大为,张广艳,郑纬民.大数据流式计算:关键技术及系统实例.软件学报,2014,25(4):839-862. http://www.jos.org.cn/1000-9825/4558.htm[doi:10.13328/j.cnki.jos.004558]

[2] 崔星灿,禹晓辉,刘洋,吕朝阳.分布式流处理技术综.计算机研究与发展,2015,52(2):318-332.[doi:10.7544/issn1000-1239.2015. 20140268]

[3] 王春凯,孟小峰.分布式数据流关系查询技术研究.计算机学报,2016,39(1):80-96.[doi:10.11897/SP.J.1016.2016.00080]

引用本文

王春凯,孟小峰.应对倾斜数据流在线连接方法.软件学报,2018,29(3):869-882

复制

文章指标

点击次数:3644
下载次数: 6221
HTML阅读次数: 2987
引用次数: 0

历史

收稿日期:2017-07-31
最后修改日期:2017-09-05
录用日期:
在线发布日期: 2017-12-05
出版日期:

微信服务号

微信订阅号

引用本文

相关视频

分享

文章指标

历史

文章二维码

微信服务号

微信订阅号

引用本文

相关视频

分享

微信扫一扫：分享

文章指标

历史

文章二维码