Abstract:Nowadays, the mainstream supercomputers in the world adopt heterogeneous systems with accelerators more and more. The increase of float point computation performance of the accelerators requires other components to match its speed, including CPU, memory, bus, and network. High performance Linpack (HPL) is the traditional benchmark for high performance computers. Complex heterogeneous systems have brought both opportunities and challenges to the benchmarking with HPL. Therefore, for heterogeneous supercomputers, a new task partitioning scheme between the CPU and the accelerators is proposed, using the balance point theory to guide the optimization of HPL. For optimizing HPL, a look-ahead algorithm is proposed to coordinate the collaboration of CPU and the accelerators, as well as a contiguous row-swap algorithm, enabling the parallelism among CPU, accelerators, and network. Besides, new panel factorization and row-swap implementations have been designed for the system with accelerators, improving the effectiveness and efficiency of the usage of accelerators. With the configuration of 4 GPUs on each computing node, HPL efficiency of 79.51% on a single node.