Many-core Parallel Optimization of Level-3 BLAS Functions on the Domestic SW26010-Pro Processor

Authors: Hu Yi, Chen Daokun, Yang Chao, Ma Wenjing, Liu Fangfang, Song Chaobo, Sun Qiang, Shi Junda

CLC Number: TP303

Fund Project: National Key R&D Program of China (2020YFB0204601)
    Abstract:

    Basic linear algebra subprograms (BLAS) is one of the most fundamental and important underlying math libraries. Within a standard BLAS library, the matrix-matrix operations covered by the level-3 BLAS functions are particularly important and are widely invoked in many large-scale scientific and engineering computing applications. Moreover, level-3 BLAS functions are compute-intensive and play a vital role in fully exploiting the computing performance of a processor. This study investigates many-core parallel optimization techniques for level-3 BLAS functions on SW26010-Pro, a domestic processor. Specifically, according to the memory hierarchy of SW26010-Pro, a multi-level blocking algorithm is designed to exploit the parallelism of matrix operations. On this basis, a data-sharing scheme based on the remote memory access (RMA) mechanism is proposed to improve the efficiency of data transmission among the computing processing elements (CPEs). Furthermore, triple buffering and parameter tuning are employed to optimize the algorithm comprehensively, hiding the memory access cost of direct memory access (DMA) and the communication overhead of RMA. In addition, by exploiting the two hardware pipelines and several vectorized compute and memory-access instructions of SW26010-Pro, hand-written assembly kernels are developed for operations such as matrix-matrix multiplication, triangular matrix equation solving, and matrix transposition, which improves the floating-point computing efficiency of the level-3 BLAS functions. Experimental results show that the proposed parallel optimization techniques deliver significant performance improvements for level-3 BLAS functions on SW26010-Pro: the floating-point performance reaches up to 92% of the peak on a single core group and up to 88% of the peak on multiple core groups.
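    The multi-level blocking idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `blocked_gemm` and the block sizes `mc`, `nc`, `kc` are hypothetical stand-ins for tiles sized to each level of the SW26010-Pro memory hierarchy (main memory down to the CPE's local data memory), and the innermost tile product stands in for the hand-optimized assembly micro-kernel.

    ```python
    # Illustrative sketch of multi-level blocking for GEMM (C += A @ B).
    # Block sizes mc/nc/kc are hypothetical; on SW26010-Pro they would be
    # chosen so each tile fits the local data memory (LDM) of a CPE.
    import numpy as np

    def blocked_gemm(A, B, C, mc=64, nc=64, kc=64):
        """Accumulate A @ B into C tile by tile, mimicking cache/LDM blocking."""
        m, k = A.shape
        k2, n = B.shape
        assert k == k2, "inner dimensions must match"
        for j0 in range(0, n, nc):          # column panels of B and C
            for p0 in range(0, k, kc):      # panels along the shared dimension
                for i0 in range(0, m, mc):  # row panels of A and C
                    # In the real library this small product is the assembly
                    # micro-kernel running on data already staged into LDM.
                    C[i0:i0+mc, j0:j0+nc] += (
                        A[i0:i0+mc, p0:p0+kc] @ B[p0:p0+kc, j0:j0+nc]
                    )
        return C
    ```

    The loop order here keeps one panel of B resident while row panels of A stream past it, which is the usual rationale for this nesting; the DMA double/triple buffering described in the paper would overlap the staging of the next tile with the computation on the current one.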

Cite this article:

Hu Y, Chen DK, Yang C, Ma WJ, Liu FF, Song CB, Sun Q, Shi JD. Many-core parallel optimization of level-3 BLAS functions on the domestic SW26010-Pro processor. Journal of Software, 2024, 35(3): 1569-1584 (in Chinese).

History
  • Received: 2021-11-22
  • Revised: 2022-02-23
  • Published online: 2023-05-10
  • Published in print: 2024-03-06