Many-core Parallel Optimization of Level 1 and Level 2 BLAS Routines on SW26010-Pro
About the authors:

Hu Yi (1995-), female, Ph.D. candidate; research interests: high-performance computing, heterogeneous parallelism, BLAS libraries, and algorithms for dense matrices. Chen Daokun (1994-), male, Ph.D. candidate; research interests: high-performance computing, heterogeneous parallelism, and algorithms for sparse matrices. Yang Chao (1979-), male, Ph.D., professor, doctoral supervisor; research interests: high-performance computing, scientific and engineering computing. Liu Fangfang (1982-), female, senior engineer, CCF professional member; research interests: high-performance extended math libraries and supercomputer benchmarking software. Ma Wenjing (1981-), female, associate research fellow, CCF professional member; research interests: high-performance computing, code generation and optimization. Yin Wanwang (1980-), male, associate research fellow; research interests: high-performance computing, numerical simulation, and parallel debugging. Yuan Xinhui (1989-), male, assistant research fellow; research interests: hardware-software co-design, parallel algorithm design and optimization. Lin Rongfen (1984-), female, engineer; research interests: high-performance computing and its applications.

Corresponding author:

Yang Chao, E-mail: chao_yang@pku.edu.cn

Funding:

National Key Research and Development Program of China (2020YFB0204601)



    Abstract:

    BLAS (basic linear algebra subprograms) is an important module of high-performance extended math libraries and is widely used in scientific and engineering computing. Level 1 BLAS provides vector-vector operations, and Level 2 BLAS provides matrix-vector operations. This study designs and implements highly optimized Level 1 and Level 2 BLAS routines for SW26010-Pro, a domestic many-core processor. A reduction strategy among the computing processing elements (CPEs) is designed on top of the RMA communication mechanism, improving the reduction efficiency of many Level 1 and Level 2 BLAS routines. For TRSV, TPSV, and other routines with data dependencies, an efficient parallel algorithm is proposed: it maintains the data dependencies through point-to-point synchronization and employs a task mapping mechanism tailored to triangular matrices, which effectively reduces the number of point-to-point synchronizations and improves execution efficiency. Adaptive optimization, vector compression, data reuse, and other techniques further raise the memory bandwidth utilization of the Level 1 and Level 2 BLAS routines. Experimental results show that the memory bandwidth utilization of the Level 1 BLAS routines reaches up to 95%, with an average above 90%, and that of the Level 2 BLAS routines reaches up to 98%, with an average above 80%. Compared with the widely used open-source library GotoBLAS, the proposed Level 1 and Level 2 BLAS routines achieve average speedups of 18.78 times and 25.96 times, respectively. By calling the proposed routines, LU decomposition, QR decomposition, and the symmetric eigenvalue problem achieve an average speedup of 10.99 times.
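The RMA-based CPE reduction itself is Sunway-specific and not reproduced here; as a hedged illustration of the communication pattern only, the sketch below models a binary-tree reduction over 64 partial sums (one per CPE of a core group), which finishes in log2(64) = 6 rounds instead of 63 serial additions. The function name `tree_reduce` is chosen for illustration and does not appear in the paper.

```python
# Hypothetical sketch: binary-tree combination of per-CPE partial sums.
# In the real routine each combine step would be an RMA get/put between
# CPEs; here plain list accesses stand in for that communication.
def tree_reduce(partials):
    vals = list(partials)
    step = 1
    while step < len(vals):
        # In round k, CPEs whose index is a multiple of 2*step receive
        # and accumulate the partial sum held step positions away.
        for i in range(0, len(vals), 2 * step):
            if i + step < len(vals):
                vals[i] += vals[i + step]
        step *= 2
    return vals[0]  # CPE 0 holds the final reduction result
```

For 64 CPEs this performs the combines in 6 rounds, which is why a tree scheme shortens the critical path of the reduction relative to funneling every partial sum through a single CPE.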
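The dependency chain that the point-to-point synchronization scheme for TRSV/TPSV must preserve can be seen in a plain serial reference solve (a minimal sketch, not the paper's SW26010-Pro implementation; `trsv_lower` is a name chosen here for illustration):

```python
# Hypothetical reference: forward substitution for a non-transposed,
# non-unit-diagonal lower-triangular solve L x = b (dense TRSV).
# Entry x[i] consumes every earlier x[j], j < i, so a blocked parallel
# version must synchronize before a block reads results produced by
# the blocks above it.
def trsv_lower(L, b):
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        for j in range(i):          # depends on all previously solved entries
            s -= L[i][j] * x[j]
        x[i] = s / L[i][i]
    return x
```

Because each diagonal block can start only after the blocks to its left have been applied, mapping blocks to CPEs so that producers and consumers synchronize pairwise (rather than through a global barrier) is what reduces the synchronization count the abstract refers to.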

Cite this article:

Hu Y, Chen DK, Yang C, Liu FF, Ma WJ, Yin WW, Yuan XH, Lin RF. Many-core parallel optimization of Level 1 and Level 2 BLAS routines on SW26010-Pro. Journal of Software, 2023, 34(9): 4421-4436 (in Chinese with English abstract).

History
  • Received: 2021-07-02
  • Revised: 2021-09-22
  • Published online: 2022-11-30
  • Published: 2023-09-06