面向Apache Flink流式分析应用的高吞吐优化技术
作者:
中图分类号:

TP311

基金项目:

国家重点研发计划(2021YFB2600301)


High Throughput Optimization Technique for Apache Flink
Author:
  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [52]
  • | |
  • 引证文献
  • | |
  • 文章评论
    摘要:

    随着大数据时代的到来, 海量的用户数据赋能了众多数据驱动的行业应用, 例如智慧交通、智能电网、商品推荐等. 在数据实时性要求高的应用场景下, 数据中的业务价值随时间增长快速降低, 因此数据分析系统需要具有高吞吐和低延迟能力, 以Apache Flink为代表的流式大数据处理系统得到广泛应用. Flink通过在集群的计算节点上并行化计算任务, 水平扩展系统吞吐量. 然而, 已有研究指出, Flink存在单点性能弱, 集群水平可扩展性差的问题. 为了提高流式大数据处理系统的吞吐量, 研究者在控制平面设计、系统算子实现和垂直可扩展性等方面开展优化, 但现有工作尚缺乏对流式分析应用数据流的关注. 流式分析应用是由事件流驱动并使用有状态处理函数的应用, 例如智能电网场景下的低电压检测应用、商品推荐场景下的广告活动分析应用等. 对典型的流式分析应用的数据流特征进行分析, 总结其中存在的3个水平可扩展性瓶颈并给出相应的优化策略, 包括: 键级水位线, 动态负载分发策略和基于键值的数据交换策略. 基于上述优化技术, 对Flink框架进行扩展并形成原型系统Trilink, 选取真实场景数据集: 低电压检测应用, 桥梁拱顶监测应用和典型流式分析测试基准Yahoo Streaming Benchmark, 与现有工作进行测试比较. 实验结果表明, 相较于Flink, Trilink在单机环境下吞吐率提升了5倍以上, 8节点下水平扩展加速比提高了1.6倍以上.

    Abstract:

    With the advent of the big data era, massive volumes of user data have empowered numerous data-driven industry applications, such as smart grids, intelligent transportation, and product recommendations. In scenarios where real-time data is crucial, the business value embedded within data rapidly diminishes over time. Consequently, data analysis systems require high throughput and low latency. Stream processing systems in big data, exemplified by Apache Flink, have been widely applied. Flink enhances system throughput by parallelizing computing tasks across cluster nodes. However, current research indicates that Flink has weak single-point performance and poor cluster scalability. To improve the throughput of stream processing systems, researchers have focused on optimizations in designing control planes, implementing system operators, and improving vertical scalability. However, there is still a lack of attention to the data flow in streaming analysis applications. These applications are driven by event streams and employ stateful processing functions, including low voltage detection in smart grids and advertising recommendation. This study analyzes the data flow characteristics of typical streaming analysis applications, identifies three bottlenecks in optimizing scalability, and proposes corresponding optimization strategies: the key-level watermark strategy, the dynamic load distribution strategy, and the the key-value based exchange strategy. Based on these optimization strategies, this study implements Trilink based on Flink and applies it to various applications such as low voltage detection, bridge arch crowns monitoring, and the Yahoo Streaming Benchmark. Experimental results show that the modified system, Trilink, achieves more than a 5-fold increase in throughput in a single-machine environment and over a 1.6-fold improvement in horizontal scalability acceleration in an 8-node setup, compared to Flink.

    参考文献
    [1] Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH. Big data: The next frontier for innovation, competition, and productivity. Technical Report, McKinsey Global Institute, 2011.
    [2] Yaqoob I, Hashem IAT, Gani A, Mokhtar S, Ahmed E, Anuar NB, Vasilakos AV. Big data: From beginning to future. Int’l Journal of Information Management, 2016, 36(6): 1231–1247.
    [3] 傅宣登, 吴志林. Apache Flink复杂事件处理语言的形式语义. 软件学报. http://www.jos.org.cn/1000-9825/6968.htm
    Fu XD, Wu ZL. Formal semantics of Apache Flink complex event processing language. Ruan Jian Xue Bao/Journal of Software (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/6968.htm
    [4] Smutny P, Schreiberova P. Chatbots for learning: A review of educational chatbots for the Facebook Messenger. Computers & Education, 2020, 151: 103862.
    [5] Cao W, Gao YS, Li FF, Wang S, Lin BC, Xu K, Feng XJ, Wang YC, Liu ZJ, Zhang GJ. Timon: A timestamped event database for efficient telemetry data processing and analytics. In: Proc. of the 2020 ACM SIGMOD Int’l Conf. on Management of Data. Portland: ACM, 2020. 739–753. [doi: 10.1145/3318464.3386136]
    [6] Apache Flink?. 2024. https://flink.apache.org/
    [7] Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: Cluster computing with working sets. In: Proc. of the 2nd USENIX Conf. on Hot Topics in Cloud Computing. Boston: USENIX Association, 2010. 10.
    [8] Fragkoulis M, Carbone P, Kalavri V, Katsifodimos A. A survey on the evolution of stream processing systems. The VLDB Journal, 2024, 33(2): 507–541.
    [9] 孙大为, 张广艳, 郑纬民. 大数据流式计算: 关键技术及系统实例. 软件学报, 2014, 25(4): 839–862. http://www.jos.org.cn/1000-9825/4558.htm
    Sun DW, Zhang GY, Zheng WM. Big data stream computing: Technologies and instances. Ruan Jian Xue Bao/Journal of Software, 2014, 25(4): 839–862 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/4558.htm
    [10] 徐志榛, 徐辰, 丁光耀, 陈梓浩, 周傲英. 支持实时流计算应用的关键技术研究进展. 软件学报, 2024, 35(1): 430–454. http://www.jos.org.cn/1000-9825/6917.htm
    Xu ZZ, Xu C, Ding GY, Chen ZH, Zhou AY. Research progress on key technologies towards real-time stream processing applications. Ruan Jian Xue Bao/Journal of Software, 2024, 35(1): 430–454 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/6917.htm
    [11] Isah H, Abughofa T, Mahfuz S, Ajerla D, Zulkernine F, Khan S. A survey of distributed data stream processing frameworks. IEEE Access, 2019, 7: 154300–154316.
    [12] Chintapalli S, Dagit D, Evans B, Farivar R, Graves T, Holderbaugh M, Liu Z, Nusbaum K, Patil K, Peng BJ, Poulosky P. Benchmarking streaming computation engines: Storm, Flink and spark streaming. In: Proc. of the 2016 IEEE Int’l Parallel and Distributed Processing Symp. Workshops (IPDPSW). Chicago: IEEE, 2016. 1789–1792. [doi: 10.1109/IPDPSW.2016.138]
    [13] Terry D, Goldberg D, Nichols D, Oki B. Continuous queries over append-only databases. ACM SIGMOD Record, 1992, 21(2): 321–330.
    [14] Javed MH, Lu XY, Panda DK. Characterization of big data stream processing pipeline: A case study using Flink and Kafka. In: Proc. of the 4th IEEE/ACM Int’l Conf. on Big Data Computing, Applications and Technologies. Austin: ACM, 2017. 1–10.
    [15] Chandrasekaran S, Cooper O, Deshpande A, Franklin MJ, Hellerstein JM, Hong W, Krishnamurthy S, Madden SR, Reiss F, Shah MA. TelegraphCQ: Continuous dataflow processing. In: Proc. of the 2003 ACM SIGMOD Int’l Conf. on Management of Data. San Diego: ACM, 2003. 668. [doi: 10.1145/872757.872857]
    [16] Gedik B, Andrade H, Wu KL, Yu PS, Doo M. SPADE: The system s declarative stream processing engine. In: Proc. of the 2008 ACM SIGMOD Int’l Conf. on Management of Data. Vancouver: ACM, 2008. 1123–1134. [doi: 10.1145/1376616.1376729]
    [17] Arasu A, Babcock B, Babu S, Cieslewicz J, Datar M, Ito K, Motwani R, Srivastava U, Widom J. STREAM: The Stanford data stream management system. In: Garofalakis M, Gehrke J, Rastogi R, eds. Data Stream Management. Berlin: Springer, 2016. 317–336.
    [18] Abadi DJ, Carney D, ?etintemel U, Cherniack M, Convey C, Lee S, Stonebraker M, Tatbul N, Zdonik S. Aurora: A new model and architecture for data stream management. The VLDB Journal, 2003, 12(2): 120–139.
    [19] Abadi DJ, Ahmad Y, Balazinska M, ?etintemel U, Cherniack M, Hwang JH, Lindner W, Maskey A, Rasin A, Ryvkina E, Tatbul N, Xing Y, Zdonik SB. The design of the borealis stream processing engine. In: Proc. of the 2nd Biennial Conf. on Innovative Data Systems Research. Asilomar, 2005. 277–289.
    [20] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 2008, 51(1): 107–113.
    [21] Apache Storm. 2023. https://storm.apache.org/
    [22] Armbrust M, Das T, Torres J, Yavuz B, Zhu SX, Xin R, Ghodsi A, Stoica I, Zaharia M. Structured streaming: A declarative API for real-time applications in Apache Spark. In: Proc. of the 2018 Int’l Conf. on Management of Data. Houston: ACM, 2018. 601–613.
    [23] Carbone P, Ewen S, Fóra G, Haridi S, Richter S, Tzoumas K. State management in Apache Flink?: Consistent stateful distributed stream processing. Proc. of the VLDB Endowment, 2017, 10(12): 1718–1729.
    [24] Zeuch S, Monte BD, Karimov J, Lutz C, Renz M, Traub J, Bre? S, Rabl T, Markl V. Analyzing efficient stream processing on modern hardware. Proc. of the VLDB Endowment, 2019, 12(5): 516–530.
    [25] Koliousis A, Weidlich M, Fernandez RC, Wolf AL, Costa P, Pietzuch P. SABER: Window-based hybrid stream processing for heterogeneous architectures. In: Proc. of the 2016 Int’l Conf. on Management of Data. San Francisco: ACM, 2016. 555–569.
    [26] Mai L, Zeng K, Potharaju R, Xu L, Suh S, Venkataraman S, Costa P, Kim T, Muthukrishnan S, Kuppa V, Dhulipalla S, Rao S. Chi: A scalable and programmable control plane for distributed stream processing systems. Proc. of the VLDB Endowment, 2018, 11(10): 1303–1316.
    [27] Varga B, Balassi M, Kiss A. Towards autoscaling of Apache Flink jobs. Acta Universitatis Sapientiae, Informatica, 2021, 13(1): 39–59.
    [28] Arkian HR, Pierre G, Tordsson J, Elmroth E. Model-based stream processing auto-scaling in geo-distributed environments. In: Proc. of the 2021 Int’l Conf. on Computer Communications and Networks (ICCCN). Athens: IEEE, 2021. 1–10.
    [29] He CL, Huang Y, Wang CY, Wang N. Dynamic data partitioning strategy based on heterogeneous Flink cluster. In: Proc. of the 5th Int’l Conf. on Artificial Intelligence and Big Data (ICAIBD). Chengdu: IEEE, 2022. 355–360. [doi: 10.1109/ICAIBD55127.2022.9820336]
    [30] Tangwongsan K, Hirzel M, Schneider S. Optimal and general out-of-order sliding-window aggregation. Proc. of the VLDB Endowment, 2019, 12(10): 1167–1180.
    [31] Shahvarani A, Jacobsen HA. Parallel index-based stream join on a multicore CPU. In: Proc. of the 2020 ACM SIGMOD Int’l Conf. on Management of Data. Portland: ACM, 2020. 2523–2537. [doi: 10.1145/3318464.3380576]
    [32] Karimov J, Rabl T, Markl V. AStream: Ad-hoc shared stream processing. In: Proc. of the 2019 Int’l Conf. on Management of Data. Amsterdam: ACM, 2019. 607–622. [doi: 10.1145/3299869.3319884]
    [33] Karimov J, Rabl T, Markl V. AJoin: Ad-hoc stream joins at scale. Proc. of the VLDB Endowment, 2019, 13(4): 435–448.
    [34] McSherry F, Lattuada A, Schwarzkopf M, Roscoe T. Shared arrangements: Practical inter-query sharing for streaming dataflows. arXiv:1812.02639, 2020.
    [35] Zhang XQ, Ma K. Toward sliding time window of low watermark to detect delayed stream arrival. In: Proc. of the 16th EAI Int’l Conf. on Collaborative Computing: Networking, Applications and Worksharing. Shanghai: Springer, 2021. 444–454. [doi: 10.1007/978-3-030-67540-0_28]
    [36] 岳晓飞, 史岚, 赵宇海, 季航旭, 王国仁. 面向Flink迭代作业的动态资源分配策略. 软件学报, 2022, 33(3): 985–1004. http://www.jos.org.cn/1000-9825/6447.htm
    Yue XF, Shi L, Zhao YH, Ji HX, Wang GR. Dynamic resource allocation strategy for Flink iterative jobs. Ruan Jian Xue Bao/Journal of Software, 2022, 33(3): 985–1004 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/6447.htm
    [37] Shaikh SA, Mariam K, Kitagawa H, Kim KS. GeoFlink: A distributed and scalable framework for the real-time processing of spatial streams. In: Proc. of the 29th ACM Int’l Conf. on Information & Knowledge Management. Virtual Event: ACM, 2020. 3149–3156. [doi: 10.1145/3340531.3412761]
    [38] Putatunda S, Laha AK. Travel time prediction in real time for GPS taxi data streams and its applications to travel safety. Human-centric Intelligent Systems, 2023, 3: 381–401.
    [39] Apache Kafka. 2023. https://kafka.apache.org/
    [40] Redis. 2024. https://redis.io/
    [41] Akidau T, Begoli E, Chernyak S, Hueske F, Knight K, Knowles K, Mills D, Sotolongo D. Watermarks in stream processing systems: Semantics and comparative analysis of apache Flink and google cloud dataflow. Proc. of the VLDB Endowment, 2021, 14(12): 3135–3147.
    [42] Wilmanns PS, Geuns SJ, Hausmans JPHM, Bekooij MJG. Buffer sizing to reduce interference and increase throughput of real-time stream processing applications. In: Proc. of the 18th IEEE Int’l Symp. on Real-time Distributed Computing. Auckland: IEEE, 2015. 9–18. [doi: 10.1109/ISORC.2015.14]
    [43] Gulisano V, Palyvos-Giannas D, Havers B, Papatriantafilou M. The role of event-time order in data streaming analysis. In: Proc. of the 14th ACM Int’l Conf. on Distributed and Event-based Systems. Montreal: ACM, 2020. 214–217. [doi: 10.1145/3401025.3404088]
    [44] Dahlgaard S, Knudsen MBT, Thorup M. Practical hash functions for similarity estimation and dimensionality reduction. In: Proc. of the 31st Int’l Conf. on Neural Information Processing Systems. Long Beach: Curran Associates Inc., 2017. 6618–6628.
    [45] Aumayr D, Marr S, Gonzalez Boix E, M?ssenb?ck H. Asynchronous snapshots of actor systems for latency-sensitive applications. In: Proc. of the 16th ACM SIGPLAN Int’l Conf. on Managed Programming Languages and Runtimes. Athens: ACM, 2019. 157–171. [doi: 10.1145/3357390.3361019]
    [46] Chandy KM, Lamport L. Distributed snapshots: Determining global states of distributed systems. ACM Trans. on Computer Systems (TOCS), 1985, 3(1): 63–75.
    [47] Performance Tuning. 2023. https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/table/tuning/
    [48] MySQL. 2023. https://www.mysql.com/
    相似文献
    引证文献
    引证文献 [0] 您输入的地址无效!
    没有找到您想要的资源,您输入的路径无效!

    网友评论
    网友评论
    分享到微博
    发 布
引用本文

秦政,许利杰,陈伟,王毅,吴铭钞,曾鸿斌,王伟.面向Apache Flink流式分析应用的高吞吐优化技术.软件学报,,():1-25

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2024-02-03
  • 最后修改日期:2024-03-29
  • 在线发布日期: 2024-11-20
文章二维码
您是第19799526位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号