[关键词]
[摘要]
新兴分布式计算框架Apache Flink支持在集群上执行大规模的迭代程序,但其默认的静态资源分配机制导致无法进行合理的资源配置来使迭代作业按时完成.针对这一问题,应该依靠用户来主动表达性能约束而不是被动地进行资源保留,故提出了一种基于运行时间预测的动态资源分配策略RABORP (resource allocation based onruntime prediction),来为具有明确运行时限的Flink迭代作业制定动态资源分配计划并实施.其主要思想是:通过预测各个迭代超步的运行时间,然后根据预测结果在迭代作业提交时和超步间的同步屏障处分别进行资源的初始分配和动态调整,以保证可使用最小资源集,使迭代作业在用户规定的运行时限内完成.通过在不同数据集下执行多种典型的Flink迭代作业进行了相关对比实验,实验结果表明,所建立的运行时间预测模型能够对各个超步的运行时间进行准确预测,而且在单作业和多作业场景下,采用所提出的动态资源分配策略相比于目前最先进算法在各项性能指标上都有所提升.
[Key word]
[Abstract]
Apache Flink, an emerging distributed computing framework, supports the execution of large-scale iterative programs on the cluster, but its default static resource allocation mechanism makes it impossible to carry out reasonable resource allocation to make iterative jobs complete on time. In response to this problem, that users should be relied on to actively express performance constraints rather than passively retain resources. RABORP, a dynamic resource allocation strategy based on runtime prediction is proposed to develop and implement a dynamic resource allocation plan for Flink iterative jobs with clear runtime limits. The main idea is to predict the runtime of each iteration superstep, and then the initial allocation and dynamic adjustment of resources are performed at the time of the iterative job submission and the synchronization barrier between the supersteps according to the predicted results, to ensure that the minimum set of resources can be used to complete the iterative job within the runtime limit specified by the user. A variety of typical Flink iterative jobs were executed under the dataset to carry out relevant comparative experiments. Experimental results show that the established runtime prediction model can accurately predict the runtime of each superstep, and compared with the current state-of-the-art algorithms, the proposed dynamic resource allocation strategy used in single-job and multi-job scenarios has improved various performance indicators.
[中图分类号]
[基金项目]
国家重点研发计划(2018YFB1004402);国家自然科学基金(61772124)