Abstract: Basic Linear Algebra Subprograms (BLAS) is one of the most fundamental and important math libraries. The matrix-matrix operations covered by the level-3 BLAS functions are particularly significant for a standard BLAS library and are widely used in large-scale scientific and engineering computing applications. Level-3 BLAS functions are also compute-intensive and play a vital role in fully exploiting the computing performance of processors. This study investigates multi-core parallel optimization techniques for level-3 BLAS functions on SW26010-Pro, a domestically designed processor. Based on the memory hierarchy of SW26010-Pro, a multi-level blocking algorithm is designed to exploit the parallelism of matrix operations. A data-sharing scheme based on the remote memory access (RMA) mechanism is then proposed to improve the efficiency of data transmission among the computing processing elements (CPEs). Furthermore, triple buffering and parameter tuning are employed to fully optimize the algorithm and hide both the memory access cost of direct memory access (DMA) and the communication overhead of RMA. The study also exploits the two hardware pipelines and the vectorized arithmetic and memory access instructions of SW26010-Pro, improving the floating-point computing efficiency of the level-3 BLAS functions through hand-written assembly code for matrix-matrix multiplication, matrix equation solving, and matrix transposition. Experimental results show that the proposed parallel optimizations significantly improve the performance of the level-3 BLAS functions on SW26010-Pro: the floating-point computing efficiency of the single-core level-3 BLAS reaches up to 92% of the peak performance, and that of the multi-core level-3 BLAS reaches up to 88%.
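As a minimal illustration of the multi-level blocking idea summarized above, the following portable C sketch partitions a matrix-matrix multiplication into fixed-size blocks. The block sizes MC/NC/KC, the routine name dgemm_blocked, and the plain loop micro-kernel are illustrative assumptions only, not the paper's actual SW26010-Pro implementation, which would DMA each block into a CPE's local data memory and invoke hand-written assembly kernels.

#include <stddef.h>

/* Hypothetical block sizes; on SW26010-Pro these would be tuned to the
 * capacity of each CPE's local data memory. Values here are illustrative. */
#define MC 64
#define NC 64
#define KC 64

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked C += A * B for row-major M x K (A) and K x N (B) matrices.
 * A real SW26010-Pro kernel would transfer each block to on-chip memory
 * and run a vectorized micro-kernel; this portable loop nest only
 * illustrates the blocking structure. */
void dgemm_blocked(size_t M, size_t N, size_t K,
                   const double *A, const double *B, double *C)
{
    for (size_t jc = 0; jc < N; jc += NC)
        for (size_t pc = 0; pc < K; pc += KC)
            for (size_t ic = 0; ic < M; ic += MC) {
                size_t nb = min_sz(NC, N - jc);
                size_t kb = min_sz(KC, K - pc);
                size_t mb = min_sz(MC, M - ic);
                /* Compute on the current MC x KC x NC block. */
                for (size_t i = 0; i < mb; ++i)
                    for (size_t p = 0; p < kb; ++p) {
                        double a = A[(ic + i) * K + (pc + p)];
                        for (size_t j = 0; j < nb; ++j)
                            C[(ic + i) * N + (jc + j)] +=
                                a * B[(pc + p) * N + (jc + j)];
                    }
            }
}

In a multi-level scheme of the kind the abstract describes, such a loop nest would be nested once per level of the memory hierarchy, with each level's block sizes chosen so that the working set fits the corresponding memory tier.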