HPL Approach for Heterogeneous Computer Platforms

CLC Number:

TP303

Fund Project:

National Key Research and Development Program of China (2018YFB0204404); Strategic Priority Research Program of the Chinese Academy of Sciences (Category C) (XDC01030200)

    Abstract:

    HPL (High Performance Linpack) is a widely used benchmark for measuring computer performance. For decades, optimizing and tuning HPL has drawn sustained attention in both industry and academia as a way to evaluate the performance of contemporary cutting-edge computer platforms. This paper proposes Hetero-HPL, a high-performance HPL approach for current heterogeneous HPC platforms with multiple accelerating co-processors. In Hetero-HPL, the mapping between the process set and the (co-)processor set is adjustable, so that computation within each computing node can avoid inter-process message exchange, and each important procedure of the HPL algorithm can make full use of the node's hardware resources, such as memory, CPU cores, co-processors, and the PCI-e bus. Without redundant computation or communication, the working set of Hetero-HPL is not restricted by the limit on pinned memory size in a single allocation, and it is distributed so that the workload is balanced among all the co-processors and massive fine-grained parallelism can be exploited. On an experimental platform with four co-processors, Hetero-HPL reaches an efficiency of 76.5% in one computing node (the efficiency of the dgemm function is 84%), and further experiments suggest that Hetero-HPL is also a feasible approach in distributed environments.
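
    To make the efficiency figures above concrete, the following is a minimal sketch of how an HPL efficiency is commonly computed: the operation count credited to a run of order n, (2/3)n^3 + (3/2)n^2, is divided by the wall time and by the theoretical peak rate. The function names and the sample problem size, run time, and peak rate are hypothetical and are not taken from the paper; only the 76.5% and 84% figures come from the abstract.

```python
def hpl_flop_count(n: int) -> float:
    """Operation count conventionally credited to an HPL solve of order n."""
    return (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2


def hpl_efficiency(n: int, seconds: float, peak_flops: float) -> float:
    """Achieved rate divided by theoretical peak rate (both in FLOP/s)."""
    achieved = hpl_flop_count(n) / seconds
    return achieved / peak_flops


# Hypothetical run: order 1000, 1 second, on a machine with 1 GFLOP/s peak.
example = hpl_efficiency(1000, 1.0, 1.0e9)
print(f"example efficiency: {example:.3f}")

# Figures from the abstract: whole-benchmark efficiency relative to dgemm.
hpl_eff = 0.765
dgemm_eff = 0.84
dgemm_relative = hpl_eff / dgemm_eff
print(f"HPL relative to dgemm: {dgemm_relative:.3f}")  # prints 0.911
```

    Read this way, the reported single-node result corresponds to roughly 91% of the rate that dgemm alone sustains on the same platform.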

Get Citation

Sun Q, Sun JC, Ma WJ, Zhao YW. HPL approach for heterogeneous computer platforms. Journal of Software, 2021,32(8):2329-2340

History
  • Received: August 22, 2019
  • Revised: December 05, 2019
  • Online: August 05, 2021
  • Published: August 06, 2021