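CPU-side High Performance BLAS Library Optimization in Heterogeneous HPL Algorithm
(Original title: 异构HPL算法中CPU端高性能BLAS库优化)

Author biographies:

Cai Yu (蔡雨, born 1988), male, senior principal engineer. Main research areas: CPU architecture, performance optimization.
Liu Zixing (刘子行, born 1977), male, senior principal engineer. Main research area: security software.
Sun Chengguo (孙成国, born 1985), male, senior principal engineer. Main research areas: high-performance computing, performance optimization.
Kang Mengbo (康梦博, born 1989), male, senior engineer. Main research area: performance optimization.
Du Zhaohui (杜朝晖, born 1975), male, chief engineer. Main research area: security software.
Li Shuangshuang (李双双, born 1984), male, senior engineer. Main research area: math libraries.

Corresponding author:

Sun Chengguo, E-mail: sunchengguo@hygon.cn

CLC number:

TP303
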
    Abstract:

    Improving the efficiency of heterogeneous HPL (High-Performance Linpack) requires fully exploiting the computing power of both the accelerators and the general-purpose CPU. The accelerators integrate more computing cores and carry out most of the computation, while the CPU handles task scheduling and also takes part in the computation. Provided that tasks are divided sensibly and the load is balanced, optimizing CPU-side computing performance is particularly important for overall efficiency. Optimizing the BLAS (Basic Linear Algebra Subprograms) functions for the architecture of a specific platform often makes fuller use of the CPU's computing capability and improves overall system efficiency. BLIS (BLAS-like Library Instantiation Software) is an open-source framework for BLAS functions that is easy to develop with, portable, and modular. Based on the architecture of the heterogeneous platform and the characteristics of the HPL algorithm, this study optimizes the BLAS functions of every level that are called on the CPU side, using the three-level cache, vectorized instructions, and multi-threaded parallelism, and applies auto-tuning to optimize the matrix blocking parameters, resulting in the HygonBLIS library. Compared with MKL, HygonBLIS improves the overall performance of HPL by 11.8% in the heterogeneous environment.
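The abstract mentions cache-aware matrix blocking and auto-tuning of the blocking parameters. The sketch below illustrates that general idea only; it is not code from HygonBLIS or BLIS. It is written in plain Python for readability, so it uses no vector instructions or threads. The parameter names `mc`, `kc`, `nc`, the helper names, and the "time every candidate once" search are all assumptions made for this example, standing in for whatever tuning procedure the paper actually uses.

```python
import random
import time


def naive_gemm(A, B, m, n, k):
    """Reference C = A * B on row-major lists of lists."""
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def blocked_gemm(A, B, m, n, k, mc, kc, nc):
    """C = A * B computed block by block.

    mc, kc, nc are hypothetical blocking parameters: each step works on an
    mc x kc block of A and a kc x nc block of B, sizes that a real library
    would choose so the blocks stay resident in cache.
    """
    C = [[0.0] * n for _ in range(m)]
    for jc in range(0, n, nc):              # block of columns of B and C
        j_end = min(jc + nc, n)
        for pc in range(0, k, kc):          # block of the shared dimension
            p_end = min(pc + kc, k)
            for ic in range(0, m, mc):      # block of rows of A and C
                i_end = min(ic + mc, m)
                for i in range(ic, i_end):
                    row_a, row_c = A[i], C[i]
                    for p in range(pc, p_end):
                        a_ip, row_b = row_a[p], B[p]
                        for j in range(jc, j_end):
                            row_c[j] += a_ip * row_b[j]
    return C


def autotune(A, B, m, n, k, candidates):
    """Time each (mc, kc, nc) candidate once and return the fastest one."""
    best, best_time = None, float("inf")
    for mc, kc, nc in candidates:
        start = time.perf_counter()
        blocked_gemm(A, B, m, n, k, mc, kc, nc)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = (mc, kc, nc), elapsed
    return best
```

A call such as `autotune(A, B, m, n, k, [(32, 32, 32), (64, 32, 128)])` returns whichever candidate ran fastest on the current machine. In an optimized library the innermost loops would be replaced by a vectorized micro-kernel and the block loops distributed across threads, and the candidate set would be derived from the sizes of the cache levels.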

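Cite this article

Cai Yu, Sun Chengguo, Du Zhaohui, Liu Zixing, Kang Mengbo, Li Shuangshuang. CPU-side high performance BLAS library optimization in heterogeneous HPL algorithm. Ruan Jian Xue Bao/Journal of Software, 2021,32(8):2289-2306.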

History
  • Received: 2019-07-25
  • Last revised: 2020-03-19
  • Published online: 2021-08-05
  • Publication date: 2021-08-06