基于性能建模的深度学习训练任务调度综述
作者:
作者单位:

作者简介:

通讯作者:

中图分类号:

TP18

基金项目:

山东省重大创新工程 (2021CXGC010101); 国家自然科学基金 (62302489).


Survey on Task Scheduling of Deep Learning Training Based on Performance Modeling
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    近年来, 深度学习研究成果在全球范围内得到广泛应用. 为了提高大规模深度学习模型的训练效率, 业界通常采用建设GPU集群并配置高效的任务调度器的策略. 然而, 深度学习训练任务具有性能异构性和放置拓扑敏感性等复杂性能特性. 对性能无感知的调度容易导致资源利用率低下、训练效率差等问题. 为了应对这一挑战, 近期涌现出大量基于性能建模的深度学习训练任务调度器. 这些调度器通过构建精确的性能模型, 深入了解任务的复杂性能特性, 并据此设计更优化的调度算法, 从而形成更高效的调度方案. 本文首先基于建模设计思路, 对目前调度器使用的性能建模方法进行分类综述. 随后, 根据调度器利用性能建模的调度优化途径, 对现有的任务调度工作进行了系统性的分析. 最后, 对性能建模与调度在未来的研究方向进行了展望.

    Abstract:

    In recent years, research achievements in deep learning have found widespread applications globally. To enhance the training efficiency of large-scale deep learning models, industry practices often involve constructing GPU clusters and configuring efficient task schedulers. However, deep learning training tasks exhibit complex performance characteristics such as performance heterogeneity and placement topological sensitivity. Scheduling without considering performance can lead to issues such as low resource utilization and poor training efficiency. In response to this challenge, a great number of schedulers of deep learning training tasks based on performance modeling have emerged. These schedulers, by constructing accurate performance models, delve into the intricate performance characteristics of tasks. Based on this understanding, they design more optimized scheduling algorithms, thereby forming more efficient scheduling solutions. This study begins with a modeling design perspective, providing a categorized review of the performance modeling methods employed by current schedulers. Subsequently, based on the optimized scheduling approaches from performance modeling by schedulers, a systematic analysis of existing task scheduling efforts is presented. Finally, this study outlines prospective research directions for performance modeling and scheduling in the future.

    参考文献
    相似文献
    引证文献
引用本文

杨紫超,吴恒,吴悦文,张文博.基于性能建模的深度学习训练任务调度综述.软件学报,,():1-22

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2023-09-25
  • 最后修改日期:2023-11-06
  • 录用日期:
  • 在线发布日期: 2024-06-20
  • 出版日期:
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号