High-Throughput Optimization Techniques for Streaming Analytics Applications on Apache Flink

doi:10.13328/j.cnki.jos.007235

2025, Volume 36, Issue 7, pp. 3184-3208

CSTR: 32375.14.jos.007235
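Authors:
秦政
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
许利杰
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190, China; University of Chinese Academy of Sciences, Nanjing, Nanjing 211135, China
陈伟
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190, China; University of Chinese Academy of Sciences, Nanjing, Nanjing 211135, China
王毅
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
吴铭钞
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
曾鸿斌
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
王伟
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190, China; University of Chinese Academy of Sciences, Nanjing, Nanjing 211135, China
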
CLC number: TP311
Funding: National Key Research and Development Program of China (2021YFB2600301)

High Throughput Optimization Technique for Apache Flink

Author:

QIN Zheng
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
XU Li-Jie
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190, China; University of Chinese Academy of Sciences, Nanjing, Nanjing 211135, China
CHEN Wei
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190, China; University of Chinese Academy of Sciences, Nanjing, Nanjing 211135, China
WANG Yi
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
WU Ming-Chao
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
ZENG Hong-Bin
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
WANG Wei
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190, China; University of Chinese Academy of Sciences, Nanjing, Nanjing 211135, China

Abstract:

With the advent of the big data era, massive volumes of user data have empowered numerous data-driven industry applications, such as intelligent transportation, smart grids, and product recommendation. In scenarios with strict real-time requirements, the business value of data diminishes rapidly over time, so data analysis systems need both high throughput and low latency, and stream processing systems exemplified by Apache Flink have been widely adopted. Flink scales throughput horizontally by parallelizing computing tasks across the nodes of a cluster. However, existing research indicates that Flink suffers from weak single-node performance and poor horizontal scalability. To improve the throughput of stream processing systems, researchers have optimized control plane design, operator implementation, and vertical scalability, but existing work still pays little attention to the data flow of streaming analytics applications. Streaming analytics applications are driven by event streams and use stateful processing functions; examples include low-voltage detection in smart grids and advertising campaign analysis in product recommendation. This study analyzes the data flow characteristics of typical streaming analytics applications, identifies three horizontal-scalability bottlenecks, and proposes a corresponding optimization strategy for each: key-level watermarks, a dynamic load distribution strategy, and a key-based data exchange strategy. Based on these strategies, this study extends the Flink framework into a prototype system, Trilink, and compares it with existing work on real-world workloads (a low-voltage detection application and a bridge arch crown monitoring application) and on a typical streaming analytics benchmark, the Yahoo Streaming Benchmark. Experimental results show that, compared with Flink, Trilink improves throughput by more than 5 times in a single-machine environment and improves the horizontal scaling speedup by more than 1.6 times on 8 nodes.

Key words: stream processing; distributed system; performance optimization; big data system
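
To make the first of the three strategies named in the abstract more concrete, the following is a minimal sketch of the general idea of a key-level watermark: each key tracks its own event-time progress, so a slow key does not delay window results for other keys. The abstract does not describe how Trilink implements this, so everything here is an assumption made for illustration: the class name `KeyedWatermarkWindows`, the tumbling-window size, the out-of-orderness bound, and the sum aggregation are all hypothetical, and the code does not use the Flink API.

```python
from collections import defaultdict

WINDOW_MS = 10_000    # hypothetical tumbling-window size
MAX_DELAY_MS = 2_000  # hypothetical bound on event out-of-orderness


class KeyedWatermarkWindows:
    """Tumbling-window sums in which every key advances its own watermark.

    Illustrative only: not Trilink's or Flink's actual implementation.
    """

    def __init__(self, window_ms=WINDOW_MS, max_delay_ms=MAX_DELAY_MS):
        self.window_ms = window_ms
        self.max_delay_ms = max_delay_ms
        self.watermarks = {}                   # key -> watermark (ms)
        self.open_windows = defaultdict(dict)  # key -> {window_start: sum}

    def process(self, key, event_time_ms, value):
        """Ingest one event and return the windows of this key that it closes."""
        watermark = self.watermarks.get(key, float("-inf"))
        window_start = event_time_ms - event_time_ms % self.window_ms
        if window_start + self.window_ms <= watermark:
            return []  # late event: its window was already emitted for this key
        windows = self.open_windows[key]
        windows[window_start] = windows.get(window_start, 0) + value
        # The watermark depends only on events of this key, not on other keys.
        watermark = max(watermark, event_time_ms - self.max_delay_ms)
        self.watermarks[key] = watermark
        closed = sorted(s for s in windows if s + self.window_ms <= watermark)
        return [(key, s, windows.pop(s)) for s in closed]
```

In this sketch, an event for one key at time 12 500 ms closes that key's window [0, 10 000) even if another key has only produced events near time 500 ms; with a single operator-wide watermark, the slower key would typically hold that result back.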

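Cite this article

QIN Zheng, XU Li-Jie, CHEN Wei, WANG Yi, WU Ming-Chao, ZENG Hong-Bin, WANG Wei. High Throughput Optimization Technique for Apache Flink. Journal of Software (软件学报), 2025, 36(7): 3184-3208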

History

Received: 2024-02-03
Last revised: 2024-03-29
Published online: 2024-11-20
