National Key Research and Development Program of China (2018YFB1004403); National Natural Science Foundation of China (61832001, U1936104); Peking University-Tencent Collaborative Innovation Laboratory Project; CCF-Baidu Songguo Fund
Graph neural networks (GNNs) have recently attracted wide attention owing to their powerful and flexible representation ability. As graph data grow in scale while GPU memory capacity remains limited, training GNNs on traditional general-purpose deep learning systems can no longer meet the requirements and fails to fully exploit the performance of GPU devices. Efficiently utilizing GPU hardware for GNN training has therefore become one of the important research problems in this field. Traditional approaches implement the computation of GNNs with sparse matrix multiplication; when the GPU memory capacity is limited, they distribute the computation tasks to each device via distributed matrix multiplication. Their main shortcomings are as follows: (1) sparse matrix multiplication ignores the sparse distribution characteristics of the graph data itself, resulting in low computation efficiency; (2) these approaches ignore the computation and memory-access characteristics of GPUs and cannot make full use of the hardware. To improve training efficiency, some existing studies use graph sampling to reduce the computation cost and memory requirement of each iteration, and sampling also supports flexible distributed scaling; however, owing to sampling randomness and variance, these methods often degrade model accuracy. To this end, this study proposes a high-performance GNN training framework for multiple GPUs. To guarantee model accuracy, the framework trains on the full graph. Different multi-GPU GNN partitioning schemes are explored, the influence of different graph data layouts on GPU performance during GNN computation is investigated, and a sparse-block-aware GPU memory-access optimization technique is proposed. A prototype system is implemented in C++ with cuDNN. Experiments on four large-scale GNN datasets show that (1) the graph reordering optimization improves the GPU cache hit rate by about 40% and achieves a computation speedup of up to 2x; (2) compared with the existing system DGL, the proposed system achieves an overall speedup of 5.8x.
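The two key ideas in the abstract, expressing a GNN layer as sparse matrix multiplication (SpMM) and reordering the graph to improve memory locality, can be sketched as follows. This is a minimal Python/SciPy illustration, not the paper's C++/cuDNN implementation: the toy graph is made up, and reverse Cuthill-McKee is used only as a stand-in for the paper's reordering scheme.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Toy directed graph with 6 nodes (illustrative only).
rows = np.array([0, 0, 1, 2, 3, 4, 5, 5])
cols = np.array([3, 5, 4, 5, 0, 1, 0, 2])
A = csr_matrix((np.ones(len(rows), dtype=np.float32), (rows, cols)),
               shape=(6, 6))

# One GNN propagation step expressed as SpMM plus a dense GEMM:
#   H' = ReLU(A @ H @ W)
# where A is the (sparse) adjacency matrix, H the node features,
# and W the layer weights.
H = np.random.rand(6, 4).astype(np.float32)   # node feature matrix
W = np.random.rand(4, 4).astype(np.float32)   # layer weight matrix
H_next = np.maximum(A @ H @ W, 0.0)

# Graph reordering: permute node IDs so that connected nodes end up
# with nearby indices, improving the cache locality of the SpMM.
perm = reverse_cuthill_mckee(A, symmetric_mode=False)
A_reordered = A[perm][:, perm]   # same graph, relabeled nodes
```

Generic SpMM like `A @ H` treats the matrix as uniformly sparse; the abstract's point is that exploiting the actual sparsity distribution (e.g., via reordering and sparse-block-aware memory access) is what recovers GPU efficiency.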