Research on Matrix Multiplication Optimization on the SW1621 (申威1621) Processor

doi:10.13328/j.cnki.jos.006519


Journal issue: 2023, Vol. 34, No. 7, pp. 3451-3463


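About the authors: Yan Hao (闫昊) (1996-), male, master's degree, main research area: high-performance numerical computing; Liu Fangfang (刘芳芳) (1982-), female, professor-level senior engineer, CCF professional member, main research areas: high-performance extended math libraries and supercomputer benchmarking software; Ma Wenjing (马文静) (1981-), female, associate research fellow, CCF professional member, main research areas: high-performance computing, code generation and optimization; Chen Daokun (陈道琨) (1994-), male, Ph.D., main research areas: high-performance computing, heterogeneous parallelism, and algorithms related to sparse matrices
Corresponding author: Ma Wenjing, E-mail: wenjing@iscas.ac.cn
CLC number: TP303
Funding: National Key Research and Development Program of China (2020YFB0204601)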

Optimization of GEMM on SW1621 Processors



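Abstract (translated from the Chinese):

Dense matrix multiplication (GEMM) is a function used heavily in many scientific and engineering computing applications and a basic function in many algebra libraries, so its performance often has a decisive impact on the whole application. In addition, because it is compute-intensive, the efficiency of matrix multiplication is often an important indicator of a hardware platform's performance. This work systematically optimizes dense matrix multiplication for the domestic SW1621 (申威1621) processor. Based on an analysis of the cost of each part, and making full use of the architectural features and the instruction set, the DGEMM function is optimized in depth in four respects: the loop and blocking scheme, the packing method, the implementation of the core compute kernel, and data prefetching. A code generator is also developed that produces different versions of assembly and C code for different input parameters; together with auto-tuning scripts, it selects the best parameters. After optimization and tuning, single-threaded DGEMM reaches 85% of the single-core peak floating-point performance, and 16-threaded DGEMM reaches 80% of the 16-core peak floating-point performance. The optimization of DGEMM not only improves the performance of the BLAS library on the SW1621 platform, but also provides an important reference for optimizing dense data computation on the domestic SW series of multi-core processors.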

Abstract:

General matrix multiplication (GEMM) is one of the most heavily used functions in scientific and engineering computation, and it is a basic building block of many linear algebra libraries, so its performance usually has a decisive influence on the whole application. Moreover, because GEMM is compute-intensive, its efficiency is often regarded as an important metric of a hardware platform. This study systematically optimizes dense GEMM on the domestic SW1621 processor. Based on an analysis of the baseline code and profiling of the various overheads, and making full use of the architectural features and the instruction set, optimizations for DGEMM are carefully designed and applied, covering the blocking scheme, the packing mechanism, the kernel function implementation, and data prefetching. In addition, a code generator is developed that generates different assembly and C code according to the input parameters. Used together with auto-tuning scripts, it finds the best values for the tunable parameters. After optimization and tuning, single-threaded DGEMM achieves 85% of the peak floating-point performance of a single core, and 16-threaded DGEMM achieves 80% of the peak floating-point performance of all 16 cores. The optimization of DGEMM not only improves the performance of BLAS on SW1621, but also provides an important reference for optimizing dense data computation on SW series multi-core processors.
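The blocking, packing, and kernel steps named in the abstract can be illustrated with a minimal portable C sketch. This is not the authors' SW1621 implementation: the function names (`dgemm_blocked`, `pack_a`, `pack_b`, `micro_kernel`) and the tile sizes `MC`, `KC`, `NC`, `MR`, `NR` are hypothetical and chosen only for illustration. The sketch assumes row-major storage without padding and leaves out prefetching, assembly kernels, and multithreading.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile sizes for illustration only; real values would be
   selected by tuning. MC must be a multiple of MR, and NC of NR. */
enum { MC = 32, KC = 64, NC = 48, MR = 4, NR = 4 };

/* Pack an mc x kc block of row-major A into panels of MR rows.
   Inside a panel, the MR values for one k index are contiguous.
   Rows past the edge are zero-padded. */
static void pack_a(const double *a, size_t lda, size_t mc, size_t kc,
                   double *buf)
{
    for (size_t i = 0; i < mc; i += MR)
        for (size_t p = 0; p < kc; ++p)
            for (size_t r = 0; r < MR; ++r)
                *buf++ = (i + r < mc) ? a[(i + r) * lda + p] : 0.0;
}

/* Pack a kc x nc block of row-major B into panels of NR columns,
   zero-padding columns past the edge. */
static void pack_b(const double *b, size_t ldb, size_t kc, size_t nc,
                   double *buf)
{
    for (size_t j = 0; j < nc; j += NR)
        for (size_t p = 0; p < kc; ++p)
            for (size_t q = 0; q < NR; ++q)
                *buf++ = (j + q < nc) ? b[p * ldb + j + q] : 0.0;
}

/* Micro-kernel: accumulate an MR x NR tile over kc rank-1 updates,
   then add the valid mr x nr part into C. */
static void micro_kernel(size_t kc, const double *pa, const double *pb,
                         double *c, size_t ldc, size_t mr, size_t nr)
{
    double acc[MR][NR] = {{0.0}};
    for (size_t p = 0; p < kc; ++p)
        for (size_t r = 0; r < MR; ++r)
            for (size_t q = 0; q < NR; ++q)
                acc[r][q] += pa[p * MR + r] * pb[p * NR + q];
    for (size_t r = 0; r < mr; ++r)
        for (size_t q = 0; q < nr; ++q)
            c[r * ldc + q] += acc[r][q];
}

/* C (m x n) += A (m x k) * B (k x n), all row-major and unpadded.
   The static pack buffers make this sketch single-threaded only. */
void dgemm_blocked(size_t m, size_t n, size_t k,
                   const double *a, const double *b, double *c)
{
    static double pa[MC * KC], pb[KC * NC];
    assert(a && b && c);
    for (size_t jc = 0; jc < n; jc += NC) {
        size_t nc = n - jc < NC ? n - jc : NC;
        for (size_t pc = 0; pc < k; pc += KC) {
            size_t kc = k - pc < KC ? k - pc : KC;
            pack_b(b + pc * n + jc, n, kc, nc, pb);
            for (size_t ic = 0; ic < m; ic += MC) {
                size_t mc = m - ic < MC ? m - ic : MC;
                pack_a(a + ic * k + pc, k, mc, kc, pa);
                for (size_t jr = 0; jr < nc; jr += NR)
                    for (size_t ir = 0; ir < mc; ir += MR) {
                        size_t mr = mc - ir < MR ? mc - ir : MR;
                        size_t nr = nc - jr < NR ? nc - jr : NR;
                        micro_kernel(kc, pa + ir * kc, pb + jr * kc,
                                     c + (ic + ir) * n + jc + jr, n,
                                     mr, nr);
                    }
            }
        }
    }
}
```

Calling `dgemm_blocked(m, n, k, a, b, c)` accumulates the product into `c`, so the caller zeroes `c` first if a plain product is wanted. Packing copies each block into contiguous, zero-padded panels so that the micro-kernel reads memory sequentially and needs no edge cases in its inner loop. In a tuned library this inner loop is where architecture-specific code would go.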


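Cite this article

Yan Hao (闫昊), Liu Fangfang (刘芳芳), Ma Wenjing (马文静), Chen Daokun (陈道琨). Optimization of GEMM on SW1621 Processors. Journal of Software (软件学报), 2023, 34(7): 3451-3463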


History

Received: 2021-06-07
Last revised: 2021-08-07
Published online: 2022-11-30
