Abstract:In recent years, graph neural networks (GNNs) have attracted wide attention due to their powerful and flexible representation ability. Considering the increasing scale of graph data and the limitation of the video memory capacity, it becomes more challenging to train GNNs with traditional general deep learning systems, and such training cannot give full play to the performance of GPU devices. To achieve efficient use of GPU hardware for GNN training is one of the important research issues in this field. Traditional approaches employ sparse matrix multiplication for the calculation process of GNNs. When the video memory capacity of GPU devices is limited, the computation tasks are distributed to each device by distributed matrix multiplication. Their shortcomings are mainly as follows: (1) Sparse matrix multiplication ignores the sparse distribution of the graph data, which results in low computation efficiency. (2) These methods ignore the computation and memory access characteristics of GPU and fail to utilize the hardware resources. To improve the training efficiency, some studies propose to reduce the costs of each iteration and storage requirements through graph sampling techniques, which also support flexible distributed scaling. Due to the stochastics and variance, however, these methods often affect the model accuracy. Therefore, this study proposes a high-performance GNN training framework for multi-GPUs. Different GNN partition strategies for multi-GPUs are explored, and the influence of different graph ordering patterns on the GPU performance during the calculation process of GNNs is investigated to ensure the accuracy of the model. Moreover, block-sparsity-aware optimization methods are put forward for GPU memory access. The prototype system is achieved using C++ and CuDNN. The experiments on four large-scale GNN datasets demonstrate that (1) the graph re-ordering method improves the cache hit rate of GPU by around 40% and doubles the computation speedup; (2) compared to the existing system DGL, the proposed system achieves a total speedup of 5.8x.