Abstract: BLAS (basic linear algebra subprograms) is an important module of high-performance extended math libraries and is widely used in scientific and engineering computing. Level 1 BLAS provides vector-vector operations, and Level 2 BLAS provides matrix-vector operations. This study designs and implements highly optimized Level 1 and Level 2 BLAS routines for SW26010-Pro, a domestic many-core processor. A reduction strategy among CPEs is designed based on the RMA communication mechanism, which improves the reduction efficiency of many Level 1 and Level 2 BLAS routines. For TRSV, TPSV, and other routines with data dependencies, a series of efficient parallelization algorithms is proposed. These algorithms preserve data dependencies through point-to-point synchronization and employ an efficient task-mapping mechanism suited to triangular matrices, which effectively reduces the number of point-to-point synchronizations and improves execution efficiency. In this study, adaptive optimization, vector compression, data reuse, and other techniques further improve the memory bandwidth utilization of the Level 1 and Level 2 BLAS routines. The experimental results show that the memory bandwidth utilization of the Level 1 BLAS routines reaches up to 95%, with an average of more than 90%, while that of the Level 2 BLAS routines reaches up to 98%, with an average of more than 80%. Compared with the widely used open-source linear algebra library GotoBLAS, the proposed Level 1 and Level 2 BLAS routines achieve average speedups of 18.78 times and 25.96 times, respectively. With the optimized Level 1 and Level 2 BLAS routines, LQ decomposition, QR decomposition, and eigenvalue problems achieve an average speedup of 10.99 times.