SDAA: Runtime System for Shenwei AI Acceleration Card

doi:10.13328/j.cnki.jos.007084

Authors: 赵玉龙, 张鲁飞, 许国春, 李宇轩, 孙茹君, 刘鑫
CLC number: TP303
Funding: National Key Research and Development Program of China (2018ZX01028102)

Abstract:

The domestically developed Shenwei AI acceleration card carries a Shenwei many-core processor enhanced with systolic arrays. Its AI computing capability is comparable to that of mainstream GPUs, but it still lacks supporting basic software. To lower the barrier to using the card and to support AI application development effectively, this study designs SDAA, a runtime system for the Shenwei AI acceleration card whose semantics are consistent with the mainstream CUDA runtime. For critical paths such as memory management, data transfer, and kernel launch, a hardware-software co-design approach is used to implement four techniques: an on-card multi-level memory allocation algorithm that combines segments and pages, a multi-threaded, multi-channel transfer model for pageable memory, an adaptive data transfer algorithm across multiple heterogeneous components, and a fast kernel launch method based on on-chip array communication. Together these make SDAA's runtime performance better than that of mainstream GPUs. Experimental results show that, relative to the corresponding NVIDIA V100 interfaces, SDAA allocates memory 120 times faster, incurs half the data transfer overhead, and reaches 1.7 times the data transfer bandwidth, while its kernel launch time is comparable. The SDAA runtime already supports the efficient execution of mainstream frameworks and real model training on the Shenwei AI acceleration card.
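
The abstract names a multi-level allocation scheme that combines segments and pages but gives no further detail. Purely as an illustration of that general idea, and not as SDAA's actual algorithm, the following C++ sketch manages a simulated device address space in two levels: requests that fill a whole segment take a free segment, and smaller requests take consecutive pages inside a partly used segment. The class name `TwoLevelAllocator`, its parameters, and the policy are all hypothetical; freeing and multi-segment requests are left out to keep the sketch short.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical two-level allocator over a simulated device address space.
// Level 1: fixed-size segments. Level 2: fixed-size pages inside a segment.
// Names, sizes, and policy are illustrative assumptions, not SDAA's design.
class TwoLevelAllocator {
public:
    TwoLevelAllocator(std::size_t segments, std::size_t pages_per_segment,
                      std::size_t page_bytes)
        : pages_per_segment_(pages_per_segment),
          page_bytes_(page_bytes),
          used_pages_(segments, 0) {}

    // Returns a byte offset into the simulated memory, or nullopt on failure.
    std::optional<std::uint64_t> allocate(std::size_t bytes) {
        if (bytes == 0) return std::nullopt;
        // Round the request up to a whole number of pages.
        const std::size_t pages = (bytes + page_bytes_ - 1) / page_bytes_;
        if (pages > pages_per_segment_) return std::nullopt;  // not handled here
        if (pages == pages_per_segment_) return take_segment();
        return take_pages(pages);
    }

private:
    // Segment-level path: claim the first completely unused segment.
    std::optional<std::uint64_t> take_segment() {
        for (std::size_t s = 0; s < used_pages_.size(); ++s) {
            if (used_pages_[s] == 0) {
                used_pages_[s] = pages_per_segment_;
                return base(s);
            }
        }
        return std::nullopt;
    }

    // Page-level path: bump-allocate pages from the first segment with room.
    std::optional<std::uint64_t> take_pages(std::size_t pages) {
        for (std::size_t s = 0; s < used_pages_.size(); ++s) {
            if (used_pages_[s] + pages <= pages_per_segment_) {
                const std::uint64_t offset =
                    base(s) + static_cast<std::uint64_t>(used_pages_[s]) * page_bytes_;
                used_pages_[s] += pages;
                return offset;
            }
        }
        return std::nullopt;
    }

    // Byte offset at which segment s starts.
    std::uint64_t base(std::size_t s) const {
        return static_cast<std::uint64_t>(s) * pages_per_segment_ * page_bytes_;
    }

    std::size_t pages_per_segment_;
    std::size_t page_bytes_;
    std::vector<std::size_t> used_pages_;  // pages already handed out per segment
};
```

With two segments of four 4096-byte pages each, small requests share the first segment while a 16384-byte request claims the second segment outright. Both paths scan only per-segment counters, which is one plausible reason a scheme of this kind could be cheap on the allocation path.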

Cite this article

赵玉龙, 张鲁飞, 许国春, 李宇轩, 孙茹君, 刘鑫. SDAA: Runtime System for Shenwei AI Acceleration Card. Journal of Software (软件学报): 1-15.

History

Received: 2023-03-15
Last revised: 2023-08-18
Published online: 2024-03-27
