Abstract:The fastest supercomputer in the world-Sunway TaihuLight with performance of more than 100P has been released. It makes use of heterogeneous many-core processors which is different from the existing pure CPU, CPU-MIC, CPU-GPU architecture. Each processor has 4 core groups (CGs), with each including one management processing element (MPE) and one computing processing element (CPE) cluster of 64 CPEs. The peak performance of single processor is 3TFlops/s, the memory bandwidth is 130GB/s. Sparse matrix-vector multiplication is a very important kernel in scientific and engineering computing, which is bandwidth limited and subject to indirect memory access. Implementing an efficient SpMV kernel is a big challenge in Sunway processor. This paper proposes a general SpMV heterogeneous manycore algorithm for the traditional sparse matrix storage format CSR, which divides the task and LDM space in detail, a cache mechanism of dynamic and static buffers to improve the hit rate of vector x, and a dynamic-static task scheduling method to achieve load balancing. In addition, several key factors affecting the performance of SpMV are analyzed, and adaptive optimization is carried out to further enhance the performance. Finally 16 matrix from matrix market collection are used to perform tests. The experimental results show that the algorithm achieves bandwidth of 86% and average bandwidth utilization of 47%. Compared with the implementation of the controller core, the speedup can be up to 10x, and average speedup is 6.51x.