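Memory Access and Communication Fusion Compiler Optimization for Sunway Many-core Processors

About the authors:

Fang Yanfei (方燕飞, 1980-), female, senior engineer. Main research areas: parallel languages and compiler optimization.
Li Yanbing (李雁冰, 1989-), male, Ph.D., assistant research fellow. Main research area: parallel compilation.
Dong Enming (董恩铭, 1988-), male, Ph.D., assistant research fellow. Main research area: high-performance computing software.
Wang Yunfei (王云飞, 1995-), male, master's degree. Main research area: software engineering.
Liu Qi (刘齐, 1992-), male, assistant research fellow. Main research areas: parallel languages and compiler optimization.

Corresponding author:

Fang Yanfei, E-mail: flyyaj@163.com

Funding:

Fund of the Laboratory for Advanced Computing and Intelligence Engineering (national level); National Key Research and Development Program of China, key special project (2021YFB0301100)
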
Abstract:

The on-chip multi-level memory hierarchy of Sunway many-core processors is an important structure for alleviating the many-core “memory wall”. The fully software-managed scratchpad memory (SPM) and the on-chip remote memory access (RMA) communication mechanism offer many opportunities to improve application performance, but they also make applications considerably harder to develop, optimize, and port. To exploit the on-chip memory hierarchy for better application performance while reducing the programming and optimization burden on users, this study proposes a compiler optimization method that fuses memory access and communication across the multi-level memory hierarchy. The method first designs fusion compiler directives that pass high-level program information to the compiler. It then builds a compiler optimization benefit model and an iterative framework for heuristically solving for a loop optimization scheme; the compiler both solves for the scheme and performs the corresponding code transformation. Through compiler-generated DMA and RMA bulk data transfers, key data that resides in lower levels of the memory hierarchy, where access latency is high, is buffered in bulk into higher levels, where access latency is low. Experiments and analysis on three typical test cases show that the proposed optimization delivers performance comparable to manual optimization and a significant improvement over the unoptimized programs.
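The bulk-buffering idea in the abstract can be illustrated with a minimal C sketch. It is not the paper's directive syntax, benefit model, or generated code: the tile size `TILE_ELEMS`, the helpers `bulk_get` and `bulk_put`, and the example kernel are all hypothetical, and `memcpy` stands in for a DMA transfer so that the sketch runs on any machine. RMA transfers between cores are not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tile size in elements; not a figure taken from the paper. */
#define TILE_ELEMS 256

/* Stand-in for a bulk "get" (main memory -> local buffer).
 * Assumption: on real hardware this would be a DMA transfer;
 * memcpy is used only so the sketch is portable. */
static void bulk_get(double *local_dst, const double *mem_src, size_t n) {
    memcpy(local_dst, mem_src, n * sizeof(double));
}

/* Stand-in for a bulk "put" (local buffer -> main memory). */
static void bulk_put(double *mem_dst, const double *local_src, size_t n) {
    memcpy(mem_dst, local_src, n * sizeof(double));
}

/* Baseline loop: every element access goes to the large, slow memory. */
void scale_add_naive(double *y, const double *x, double a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Transformed loop: strip-mined into tiles that fit the local buffers.
 * Each tile is fetched in bulk, computed on locally, and written back. */
void scale_add_tiled(double *y, const double *x, double a, size_t n) {
    double xbuf[TILE_ELEMS];  /* plays the role of an SPM buffer for x */
    double ybuf[TILE_ELEMS];  /* plays the role of an SPM buffer for y */

    for (size_t base = 0; base < n; base += TILE_ELEMS) {
        size_t len = (n - base < TILE_ELEMS) ? (n - base) : TILE_ELEMS;

        bulk_get(xbuf, x + base, len);
        bulk_get(ybuf, y + base, len);

        for (size_t i = 0; i < len; i++) {
            ybuf[i] = a * xbuf[i] + ybuf[i];
        }

        bulk_put(y + base, ybuf, len);
    }
}
```

Both functions compute the same result; the tiled version only changes where the data lives while the inner loop runs. Choosing the tile size and deciding which arrays to buffer is the kind of decision the abstract assigns to the compiler's benefit model, which this sketch does not attempt to reproduce.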

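Cite this article

Fang YF, Li YB, Dong EM, Wang YF, Liu Q. Memory access and communication fusion compiler optimization for Sunway many-core processors. Ruan Jian Xue Bao/Journal of Software, 2024, 35(6): 2648-2667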

History
  • Received: 2023-09-11
  • Last revised: 2023-10-30
  • Published online: 2024-01-05
  • Publication date: 2024-06-06