High-Throughput Optimization Techniques for Streaming Analytics Applications on Apache Flink
CLC number: TP311

Fund project: National Key Research and Development Program of China (2021YFB2600301)


High Throughput Optimization Technique for Apache Flink
    Abstract:

    In the big data era, massive volumes of user data power many data-driven industry applications, such as intelligent transportation, smart grids, and product recommendation. In scenarios with strict real-time requirements, the business value of data decays rapidly over time, so data analysis systems need both high throughput and low latency; stream processing systems for big data, exemplified by Apache Flink, are therefore widely used. Flink scales throughput horizontally by parallelizing computing tasks across the compute nodes of a cluster. However, existing studies point out that Flink suffers from weak single-node performance and poor horizontal scalability across the cluster. To raise the throughput of stream processing systems, researchers have optimized control-plane design, operator implementation, and vertical scalability, but existing work pays little attention to the data flow of streaming analytics applications. A streaming analytics application is driven by event streams and uses stateful processing functions; examples include low-voltage detection in smart grids and advertising campaign analysis in product recommendation. This study analyzes the data flow characteristics of typical streaming analytics applications, identifies three horizontal scalability bottlenecks, and proposes a corresponding optimization strategy for each: key-level watermarks, dynamic load distribution, and key-value-based data exchange. Based on these techniques, the study extends the Flink framework into a prototype system, Trilink, and evaluates it against existing work on two applications with real-world datasets (low-voltage detection and bridge arch crown monitoring) and on the Yahoo Streaming Benchmark, a typical streaming analytics benchmark. Experimental results show that, compared with Flink, Trilink achieves more than a 5-fold increase in throughput in a single-machine environment and improves the horizontal scaling speedup by more than 1.6 times on 8 nodes.
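The abstract names a key-level watermark strategy but does not describe how it works. The sketch below is only one plausible reading, not Trilink's implementation: it assumes that a shared watermark advances at the pace of the slowest key, while a key-level watermark lets each key close its own windows independently. The function `fired_windows`, the constants `WINDOW` and `ALLOWED_LATENESS`, and the event data are all hypothetical names chosen for illustration.

```python
from collections import defaultdict

WINDOW = 10            # hypothetical tumbling-window size, in time units
ALLOWED_LATENESS = 2   # hypothetical bound on out-of-order arrival


def fired_windows(events, per_key):
    """Return (key, window_start, count) tuples in the order windows fire.

    events  : iterable of (key, event_time) pairs in arrival order.
    per_key : True  -> each key has its own watermark (assumed key-level idea).
              False -> one watermark shared by all keys, held back by the
                       key whose event time lags the most.
    Simplification: events arriving after their window fired are not handled.
    """
    max_ts = {}                 # key -> largest event time seen so far
    counts = defaultdict(int)   # (key, window_start) -> buffered event count
    fired = []
    for key, ts in events:
        max_ts[key] = max(ts, max_ts.get(key, ts))
        counts[(key, ts - ts % WINDOW)] += 1
        shared_wm = min(max_ts.values()) - ALLOWED_LATENESS
        # sorted() copies the keys, so popping inside the loop is safe.
        for k, start in sorted(counts):
            wm = max_ts[k] - ALLOWED_LATENESS if per_key else shared_wm
            if wm >= start + WINDOW:
                fired.append((k, start, counts.pop((k, start))))
    return fired


# Key "b" lags far behind key "a".
events = [("a", 1), ("b", 1), ("a", 5), ("a", 13), ("b", 3)]
print(fired_windows(events, per_key=True))   # [('a', 0, 2)]
print(fired_windows(events, per_key=False))  # []
```

With the shared watermark, the lagging key "b" keeps the window of key "a" open even though "a" has already moved past it. Under the per-key assumption, that window fires as soon as the watermark of "a" itself passes the window end.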

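Cite this article

Qin Zheng, Xu Lijie, Chen Wei, Wang Yi, Wu Mingchao, Zeng Hongbin, Wang Wei. High-Throughput Optimization Techniques for Streaming Analytics Applications on Apache Flink. Journal of Software,,():1-25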

History
  • Received: 2024-02-03
  • Last revised: 2024-03-29
  • Published online: 2024-11-20