[关键词]
[摘要]
异构众核架构具有超高的能效比, 已成为超级计算机体系结构的重要发展方向. 然而, 异构系统的复杂性给应用开发和优化提出了更高要求, 其在发展过程中面临好用性和可编程性等众多技术挑战. 我国自主研制的神威新一代超级计算机采用了国产申威异构众核处理器SW26010Pro. 为了发挥新一代众核处理器的性能优势, 支撑新兴科学计算应用的开发和优化, 设计并实现面向SW26010Pro平台的优化编译器swLLVM. 该编译器支持Athread和SDAA双模态异构编程模型, 提供多级存储层次描述及向量操作扩展, 并且针对SW26010Pro架构特点实现控制流向量化、基于代价的节点合并以及针对多级存储层次的编译优化. 测试结果表明, 所设计并实现的编译优化效果显著, 其中, 控制流向量化和节点合并优化的平均加速比分别为1.23和1.11, 而访存相关优化最高可获得2.49倍的性能提升. 最后, 使用SPEC CPU2006标准测试集从多个维度对swLLVM进行了综合评估, 相较于SWGCC的相同优化级别, swLLVM整型课题性能平均下降0.12%, 浮点型课题性能平均提升9.04%, 整体性能平均提升5.25%, 编译速度平均提升79.1%, 代码尺寸平均减少1.15%.
[Key word]
[Abstract]
Heterogeneous many-core architectures offer very high energy efficiency and have become an important direction in supercomputer architecture. However, the complexity of heterogeneous systems places higher demands on application development and optimization, and such systems face many technical challenges, including usability and programmability. China's independently developed new-generation Sunway supercomputer is built on the domestic Sunway heterogeneous many-core processor SW26010Pro. To exploit the performance of this new-generation many-core processor and support the development and optimization of emerging scientific computing applications, this study designs and implements swLLVM, an optimizing compiler for the SW26010Pro platform. The compiler supports the dual-mode heterogeneous programming models Athread and SDAA and provides multi-level memory hierarchy descriptions and vector operation extensions. In addition, it implements control-flow vectorization, cost-based node combination, and compiler optimizations for the multi-level memory hierarchy, tailored to the architectural characteristics of SW26010Pro. Experimental results show that these optimizations are effective: control-flow vectorization and node combination achieve average speedups of 1.23 and 1.11, respectively, and the memory-access optimizations deliver up to a 2.49-fold performance improvement. Finally, swLLVM is comprehensively evaluated along multiple dimensions using the SPEC CPU2006 standard benchmark suite. Compared with SWGCC at the same optimization level, swLLVM shows an average decrease of 0.12% in integer benchmark performance, an average increase of 9.04% in floating-point benchmark performance, an average increase of 5.25% in overall performance, an average increase of 79.1% in compilation speed, and an average reduction of 1.15% in code size.
[CLC number]
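[Funding]
National Key Research and Development Program of China (2018YFB0204200); Major Project of the Science and Technology Department of Zhejiang Province (2022C01250)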