Abstract: BLAS (basic linear algebra subprograms) is an important module of high-performance extended math libraries and is widely used in scientific and engineering computing. Level 1 BLAS provides vector-vector operations, and Level 2 BLAS provides matrix-vector operations. This study designs and implements highly optimized Level 1 and Level 2 BLAS routines for SW26010-Pro, a domestic many-core processor. A reduction strategy among CPEs is designed based on the RMA communication mechanism, which improves the reduction efficiency of many Level 1 and Level 2 BLAS routines. For TRSV, TPSV, and other routines with data dependencies, a series of efficient parallelization algorithms is proposed. These algorithms preserve data dependencies through point-to-point synchronization and employ an efficient task-mapping mechanism suited to triangular matrices, which effectively reduces the number of point-to-point synchronizations and improves execution efficiency. In this study, adaptive optimization, vector compression, data reuse, and other techniques further improve the memory bandwidth utilization of the Level 1 and Level 2 BLAS routines. The experimental results show that the memory bandwidth utilization of the Level 1 BLAS routines reaches up to 95%, with an average of more than 90%, while that of the Level 2 BLAS routines reaches up to 98%, with an average of more than 80%. Compared with the widely used open-source linear algebra library GotoBLAS, the proposed Level 1 and Level 2 BLAS routines achieve average speedups of 18.78 times and 25.96 times, respectively. With the optimized Level 1 and Level 2 BLAS routines, LQ decomposition, QR decomposition, and eigenvalue problems achieve an average speedup of 10.99 times.