稠密矩阵乘法(GEMM)是很多科学与工程计算应用中大量使用的函数, 也是很多代数函数库中的基础函数, 其性能高低对整个应用往往有决定性的影响. 另外, 因其计算密集的特点, 矩阵乘法效率往往也是体现硬件平台性能的重要指标. 针对国产申威1621处理器, 对稠密矩阵乘法进行了系统性地优化. 基于对各部分开销的分析, 以及对体系结构特点与指令集的充分利用, 对DGEMM函数从循环与分块方案, 打包方式, 核心计算函数实现, 数据预取等方面进行了深入优化. 此外, 开发了代码生成器, 为不同的输入参数生成不同版本的汇编代码和C语言代码, 配合自动调优脚本, 选取最佳参数. 经过优化和调优, 单线程DGEMM性能达到了单核浮点峰值性能的85%, 16线程DGEMM性能达到16核浮点峰值性能的80%. 对DGEMM函数的优化不仅提高了申威1621平台BLAS函数库性能, 也为国产申威系列多核处理器上稠密数据计算优化提供了重要参考.
General matrix multiply (GEMM) is one of the most used functions in scientific and engineering computation, and it is also the base function of many linear algebra libraries. Its performance usually has essential influence on the whole application. Besides, because of its intensity in computation, its efficiency is often considered as an important metric of the hardware platform. This study conducts systematic optimization to dense GEMM on the domestic SW1621 processor. Based on analysis of the baseline code and profiling of various overhead, as well as utilization of the architectural features and instruction set, optimization for DGEMM is carefully designed and performed, including blocking scheme, packing mechanism, kernel function implementation, data prefetch, etc. Besides, a code generator is developed, which can generate different assembly and C code according to the input parameters. Using the code generator, together with auto-tuning scripts, it is able to find the optimal values for the tunable parameters. After applying the optimizations and tuning, the proposed single thread DGEMM achieved 85% of the peak performance of a single core, and 80% of the performance of the entire chip of 16 cores. The optimization to DGEMM not only improves the performance of BLAS on SW1621, but also provides an important reference for optimizing dense data computation on SW series multi-core machines.