As data grows in scale and becomes more structurally diverse, using heterogeneous multi-coprocessor servers with modern interconnects to provide a real-time, reliable parallel runtime for large-scale data processing has become a research hotspot in both the high-performance computing and database communities. Servers equipped with multiple GPU coprocessors (multi-GPU servers) have become the preferred high-performance platform for analyzing large-scale, irregular graph data. Existing graph computing systems and algorithms designed for multi-GPU servers (such as breadth-first traversal and shortest-path algorithms) already significantly outperform multi-core CPU environments. However, in these systems the transfer of graph partitions between GPUs is limited by PCI-E bus bandwidth and local latency. As a result, adding GPU devices does not yield near-linear growth in overall system performance and can even cause severe latency jitter, so these systems cannot meet the high-scalability requirements of large-scale parallel graph computing. A series of benchmark experiments reveals two classes of defects in existing systems: 1) The hardware of inter-GPU data links is evolving rapidly (e.g., NVLink-V1 and NVLink-V2), with greatly improved link bandwidth and latency, yet existing systems still exchange data partitions over the PCI-E bus and cannot make full use of modern GPU link resources (including link topology, connectivity, and routing). 2) When handling irregular graph datasets, these systems often adopt a single strategy for organizing and moving data between devices, which incurs a large amount of unnecessary inter-GPU synchronization over the PCI-E bus and makes local computation wait excessively for synchronization.
Designing a highly scalable, high-performance graph computing system that fully exploits the communication link architectures of modern multi-GPU servers is therefore an urgent problem. To achieve high scalability for graph computing on multiple GPUs, this paper proposes fine-grained communication based on hybrid perception. The approach senses the link architecture in advance, applies modular data links and communication strategies to graph-structured data, and selects the best data exchange method for large-scale graph data (both structural data and application data). Combining these optimization strategies, this paper proposes and designs ChattyGraph, a parallel graph computing system for multi-GPU platforms. By optimizing GPU graph data buffers and using OpenMP and NCCL to optimize cooperative multi-GPU computation, ChattyGraph adaptively and efficiently supports a wide range of parallel graph computing applications and algorithms on multi-GPU HPC platforms. Experiments on a variety of real-world graph datasets on an 8-GPU NVIDIA DGX server show that ChattyGraph significantly improves graph computing efficiency and scalability and outperforms other state-of-the-art systems, including WS-VR and Groute: computing efficiency improves by 1.2-1.5X on average and speedup improves by 2-3X on average.
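
The idea of sensing the link topology and then choosing a data exchange path per GPU pair can be illustrated with a small sketch. This is not ChattyGraph's implementation: the helper names `build_topology` and `best_route`, the bandwidth constants, the example link layout, and the restriction to at most one relay GPU are all assumptions made purely for illustration.

```python
# Illustrative sketch only: pick, for each GPU pair, either the direct link or
# a one-hop relay through another GPU, whichever has the higher bottleneck
# bandwidth. All numbers are placeholder values, not measurements.

PCIE_GBPS = 16.0    # assumed bandwidth of a PCI-E path between two GPUs
NVLINK_GBPS = 25.0  # assumed bandwidth of a single NVLink link


def build_topology(num_gpus, nvlink_pairs):
    """Build a symmetric bandwidth matrix.

    nvlink_pairs is a list of (gpu_a, gpu_b, num_links) tuples (hypothetical
    input format). Pairs without NVLink fall back to the PCI-E bandwidth.
    """
    bw = [[PCIE_GBPS] * num_gpus for _ in range(num_gpus)]
    for i in range(num_gpus):
        bw[i][i] = 0.0  # no transfer needed from a GPU to itself
    for a, b, links in nvlink_pairs:
        bw[a][b] = bw[b][a] = NVLINK_GBPS * links
    return bw


def best_route(bw, src, dst):
    """Return (path, bottleneck_bandwidth) for src != dst, using at most one relay."""
    best_path, best_bw = [src, dst], bw[src][dst]
    for relay in range(len(bw)):
        if relay in (src, dst):
            continue
        bottleneck = min(bw[src][relay], bw[relay][dst])
        if bottleneck > best_bw:
            best_path, best_bw = [src, relay, dst], bottleneck
    return best_path, best_bw


# Hypothetical 4-GPU ring: GPUs 0 and 2 have no direct NVLink connection.
bw = build_topology(4, [(0, 1, 1), (1, 2, 2), (2, 3, 1), (0, 3, 2)])
print(best_route(bw, 0, 2))  # relays through GPU 1 instead of using PCI-E
print(best_route(bw, 0, 1))  # keeps the direct NVLink connection
```

In this toy layout the transfer from GPU 0 to GPU 2 avoids the PCI-E path by relaying over two NVLink hops, while directly connected pairs keep their direct link. A real system would obtain the topology from the hardware rather than from a hand-written list.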