Abstract: As heterogeneous systems become one of the most important choices for building supercomputers, orchestrating CPUs and accelerators to exploit the computing power of such systems is of great significance. HPL (High-Performance Linpack) is the most important benchmark in the HPC field, but the traditional HPL algorithm, which targets CPU-only systems, cannot achieve high performance merely by offloading the matrix multiplication workload to accelerators. To solve this problem, this work proposes an HPL performance model and a multithreaded, fine-grained pipelining algorithm for a heterogeneous system built from a domestic processor and a domestic accelerator. In addition, a lightweight cross-platform heterogeneous framework is implemented to support a cross-platform HPL algorithm. The proposed performance model accurately predicts HPL performance on similar heterogeneous systems. On the NVIDIA platform, the proposed HPL algorithm outperforms NVIDIA's proprietary counterpart by 9%. On the domestic-processor, domestic-accelerator platform, the fully optimized Linpack program achieves 2.3 PFLOPS on 512 nodes, with a floating-point efficiency of 71.1%.