CLC number: TP18
Supported by the Major Innovation Project of Shandong Province (2021CXGC010101) and the National Natural Science Foundation of China (62302489)
In recent years, research achievements in deep learning have been widely applied worldwide. To improve the training efficiency of large-scale deep learning models, the industry typically builds GPU clusters equipped with efficient task schedulers. However, deep learning training jobs exhibit complex performance characteristics, such as performance heterogeneity and sensitivity to placement topology. Performance-agnostic scheduling can therefore lead to low resource utilization and poor training efficiency. To address this challenge, a large number of deep learning training job schedulers based on performance modeling have emerged recently. By constructing accurate performance models, these schedulers gain insight into the complex performance characteristics of jobs and, on that basis, design better scheduling algorithms that yield more efficient scheduling plans. This survey first categorizes and reviews the performance modeling methods used by current schedulers from the perspective of modeling design. It then systematically analyzes existing job scheduling work according to how schedulers exploit performance modeling to optimize scheduling. Finally, it discusses future research directions for performance modeling and scheduling.
杨紫超, 吴恒, 吴悦文, 张文博. 基于性能建模的深度学习训练任务调度综述 (Survey on performance modeling-based scheduling of deep learning training jobs). 软件学报 (Journal of Software), 2025, 36(4): 1570-1589