Many-core Parallel Optimization of Level-3 BLAS Functions on the Domestic SW26010-Pro Processor

Authors: Hu Yi, Chen Daokun, Yang Chao, Ma Wenjing, Liu Fangfang, Song Chaobo, Sun Qiang, Shi Junda

CLC Number: TP303

Fund Project: National Key R&D Program of China (2020YFB0204601)
    Abstract:

    Basic linear algebra subprograms (BLAS) is one of the most fundamental and important underlying math libraries. Within a standard BLAS library, the matrix-matrix operations covered by the level-3 BLAS functions are particularly important and are widely invoked in many large-scale scientific and engineering computing applications. Moreover, level-3 BLAS functions are compute-intensive and play a vital role in fully exploiting the computing performance of a processor. This study investigates many-core parallel optimization techniques for level-3 BLAS functions on SW26010-Pro, a domestic processor. Specifically, according to the memory hierarchy of SW26010-Pro, a multi-level blocking algorithm is designed to exploit the parallelism of matrix operations. On this basis, a data-sharing scheme based on the remote memory access (RMA) mechanism is proposed to improve the efficiency of data transmission among the computing processing elements (CPEs). Furthermore, triple buffering and parameter tuning are employed to optimize the algorithm comprehensively, hiding the memory access cost of direct memory access (DMA) and the communication overhead of RMA. In addition, by exploiting the two hardware pipelines and several vectorized compute and memory-access instructions of SW26010-Pro, hand-written assembly kernels are developed for operations such as matrix-matrix multiplication, triangular matrix equation solving, and matrix transposition, which improves the floating-point computing efficiency of the level-3 BLAS functions. Experimental results show that the proposed parallel optimization techniques deliver significant performance improvements for level-3 BLAS functions on SW26010-Pro: the floating-point performance reaches up to 92% of the peak on a single core group and up to 88% of the peak on multiple core groups.
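    The multi-level blocking idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `blocked_gemm` and the block sizes `mc`, `nc`, `kc` are hypothetical stand-ins for tiles sized to each level of the SW26010-Pro memory hierarchy (main memory down to the CPE's local data memory), and the innermost tile product stands in for the hand-optimized assembly micro-kernel.

    ```python
    # Illustrative sketch of multi-level blocking for GEMM (C += A @ B).
    # Block sizes mc/nc/kc are hypothetical; on SW26010-Pro they would be
    # chosen so each tile fits the local data memory (LDM) of a CPE.
    import numpy as np

    def blocked_gemm(A, B, C, mc=64, nc=64, kc=64):
        """Accumulate A @ B into C tile by tile, mimicking cache/LDM blocking."""
        m, k = A.shape
        k2, n = B.shape
        assert k == k2, "inner dimensions must match"
        for j0 in range(0, n, nc):          # column panels of B and C
            for p0 in range(0, k, kc):      # panels along the shared dimension
                for i0 in range(0, m, mc):  # row panels of A and C
                    # In the real library this small product is the assembly
                    # micro-kernel running on data already staged into LDM.
                    C[i0:i0+mc, j0:j0+nc] += (
                        A[i0:i0+mc, p0:p0+kc] @ B[p0:p0+kc, j0:j0+nc]
                    )
        return C
    ```

    The loop order here keeps one panel of B resident while row panels of A stream past it, which is the usual rationale for this nesting; the DMA double/triple buffering described in the paper would overlap the staging of the next tile with the computation on the current one.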

Cite this article:

Hu Y, Chen DK, Yang C, Ma WJ, Liu FF, Song CB, Sun Q, Shi JD. Many-core parallel optimization of level-3 BLAS functions on the domestic SW26010-Pro processor. Journal of Software, 2024, 35(3): 1569-1584 (in Chinese).

History
  • Received: 2021-11-22
  • Revised: 2022-02-23
  • Published online: 2023-05-10
  • Published in print: 2024-03-06