ChattyGraph: Highly Scalable Graph Computing System for Heterogeneous Multi Accelerators

doi:10.13328/j.cnki.jos.006732

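About the authors: Jiang Xiaobin (蒋筱斌) (1996-), male, master's degree; main research areas: operating systems, cloud native, graph computing. Wu Yanjun (武延军) (1979-), male, Ph.D., research professor, doctoral supervisor, CCF distinguished member; main research area: operating systems. Xiong Yixiang (熊轶翔) (1996-), male, master's degree; main research areas: operating systems, cloud native, high-performance computing. Zhao Chen (赵琛) (1967-), male, Ph.D., research professor, doctoral supervisor, CCF senior member; main research areas: programming languages, compiler technology. Zhang Heng (张珩) (1990-), male, Ph.D., CCF professional member; main research areas: distributed and parallel computing, big data processing, operating systems.
Funding: National Natural Science Foundation of China (62002350)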

ChattyGraph: Highly Scalable Graph Computing System for Heterogeneous Multi Accelerators

Abstract:

As data sets grow in scale and diversify in structure, how to use heterogeneous multi-accelerators connected by modern interconnect links to provide a real-time, reliable parallel runtime for large-scale data processing has become a research hotspot in the high-performance computing and database communities. Modern servers equipped with multiple accelerators (GPUs), that is, multi-GPU servers, have become the preferred high-performance platform for analyzing large-scale, irregular graph data. Existing graph computing systems and algorithms designed for multi-GPU server architectures (such as breadth-first traversal and shortest-path algorithms) already significantly outperform multi-core CPU environments. However, in these systems the transfer of graph partition data between GPUs is limited by PCI-E bus bandwidth and local latency. As a result, adding GPU devices does not yield near-linear growth in overall system performance and can even cause severe latency jitter, so these systems cannot meet the high-scalability requirements of large-scale parallel graph computing. A series of benchmark experiments reveals two classes of shortcomings in existing systems. (1) The hardware of inter-GPU data links evolves rapidly (for example, NVLink-V1 and NVLink-V2), and link bandwidth and latency have improved greatly, yet existing systems still exchange data blocks over PCI-E and cannot fully exploit modern GPU link resources (including link topology, connectivity, and routing). (2) When handling irregular graph data sets, these systems often adopt a single strategy for organizing and moving data between devices, which incurs a large amount of unnecessary synchronization between GPUs over the PCI-E bus and leaves local computation waiting too long for synchronization.
Therefore, a highly scalable, high-performance graph computing system that fully exploits the communication link architectures of modern multi-GPU servers is urgently needed. To achieve high scalability for multi-GPU graph computing, this study proposes fine-grained communication based on hybrid awareness: it probes the link architecture in advance, applies modular data links and communication strategies to graph-structured data, and selects the best data exchange method for large-scale graph data (both structural data and application data). Building on these optimization strategies, this study proposes and designs ChattyGraph, a parallel graph computing system for multi-GPU platforms. By optimizing GPU graph data buffers and using OpenMP and NCCL to coordinate multi-GPU computation, ChattyGraph adaptively and efficiently supports a variety of parallel graph computing applications and algorithms on multi-GPU HPC platforms. Experiments on several real-world graph data sets on an 8-GPU NVIDIA DGX server show that ChattyGraph significantly improves graph computing efficiency and scalability and outperforms state-of-the-art competitors, including WS-VR and Groute: computing efficiency improves by 1.2×-1.5× on average, and the speedup ratio improves by 2×-3× on average.
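
The abstract does not spell out how link topology, connectivity, and routing are taken into account, so the following is only a minimal sketch of one way a link-aware route choice could be expressed, not ChattyGraph's actual algorithm. It assumes a hypothetical bandwidth matrix between GPUs and picks, for each pair, the route whose slowest hop is fastest (a "widest path", computed with a Floyd-Warshall variant). The names `chain_topology`, `widest_routes`, and `route`, and the bandwidth constants, are illustrative placeholders rather than measurements or APIs from the paper.

```python
from typing import List, Tuple

# Hypothetical link bandwidths in GB/s. Placeholder values for illustration only.
PCIE_BW = 16.0
NVLINK_BW = 50.0


def chain_topology(n: int) -> List[List[float]]:
    """Build a toy topology: every GPU pair has a PCI-E path, and
    neighbouring GPUs (i, i+1) additionally share a faster direct link."""
    bw = [[0.0 if i == j else PCIE_BW for j in range(n)] for i in range(n)]
    for i in range(n - 1):
        bw[i][i + 1] = bw[i + 1][i] = NVLINK_BW
    return bw


def widest_routes(bw: List[List[float]]) -> Tuple[List[List[float]], List[List[int]]]:
    """Floyd-Warshall variant that maximizes the bottleneck bandwidth.

    bw[i][j] is the direct bandwidth from GPU i to GPU j (0.0 if no link).
    Returns (best, nxt): best[i][j] is the bottleneck bandwidth of the chosen
    route, and nxt[i][j] is the first hop on it (-1 if unreachable).
    """
    n = len(bw)
    best = [row[:] for row in bw]
    nxt = [[j if i != j and bw[i][j] > 0 else -1 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if i == j or i == k or j == k:
                    continue
                # Bandwidth of a route through k is limited by its slower half.
                via = min(best[i][k], best[k][j])
                if via > best[i][j]:
                    best[i][j] = via
                    nxt[i][j] = nxt[i][k]
    return best, nxt


def route(nxt: List[List[int]], src: int, dst: int) -> List[int]:
    """Expand the first-hop table into the full list of GPUs on the route."""
    if src == dst:
        return [src]
    if nxt[src][dst] == -1:
        return []
    path = [src]
    while src != dst:
        src = nxt[src][dst]
        path.append(src)
    return path
```

On the four-GPU toy chain, `widest_routes(chain_topology(4))` prefers the multi-hop route `[0, 1, 2, 3]` between GPU 0 and GPU 3 (bottleneck 50.0) over the direct PCI-E path (16.0). A real system would also have to weigh per-hop latency and contention, which this sketch ignores.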


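Cite this article

Jiang Xiaobin (蒋筱斌), Xiong Yixiang (熊轶翔), Zhang Heng (张珩), Wu Yanjun (武延军), Zhao Chen (赵琛). ChattyGraph: Highly Scalable Graph Computing System for Heterogeneous Multi Accelerators. Journal of Software (软件学报), 2023, 34(4): 1977-1996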


History

Received: 2021-09-12
Last revised: 2022-04-20
Published online: 2022-07-22
Published: 2023-04-06
