Abstract:HPL (high performance Linpack) is a widely used benchmark for measuring computer performance. Over the decades, the practice of optimizing and tuning of HPL has constantly drawn great attention in both industrial and academic circle, to evaluate the performance of contemporary cutting-edge computer platforms. For current heterogeneous HPC platforms with multiple accelerating co-processors, an approach of high-performance HPL benchmark, Hetero-HPL, is proposed in this paper. In Hetero-HPL, the mapping between process set and (co-) processor set becomes adjustable, so that the computation within each computing node may avoid inter-process message exchange, and each important procedure of the HPL algorithm may make full use of the hardware resources of the computing node, such as memory, CPU cores, co-processors, and PCI-e bus etc.Without redundant computation and communication, the working set of Hetero-HPL is not restricted by the limit of pinned memory size in a single allocation, and is distributed in a way that the workload is balanced among all the co-processors and massive fine-grained parallelism can be exploited. On one experimental platform with four co-processors, Heter-HPL can reach an efficiency of 76.5% (the efficiency of function dgemm is 84%) in one computing node, and further experiment suggests that Hetero-HPL is also a feasible approach in distributed environment.