Parallel Code Generation for Sunway Heterogeneous Architecture

CLC Number: TP311

    Abstract:

    Heterogeneous architectures dominate high-performance computing. However, they also complicate programming, because their hardware and memory hierarchies are more complex than those of homogeneous architectures. One of the most promising solutions is to rely on optimizing compilers, which help programmers develop high-performance code for the target machine and thereby reduce the difficulty of programming. The polyhedral model is widely studied because it can generate efficient code and is portable across targets: it first converts a program into an intermediate representation and then composes loop transformations with hardware binding strategies. This paper presents a source-to-source parallel code generator, based on the polyhedral model, that targets the domestic heterogeneous architecture of the Sunway machine. In particular, computation deployment and memory management are performed automatically for the Sunway architecture, minimizing the amount of data movement between the management processing element and the computing processing elements of the target. The experiments are conducted on 13 linear algebra applications extracted from the PolyBench benchmarks. The results show that the proposed approach generates efficient code executable on the Sunway heterogeneous architecture, providing a mean speedup of 539.16× on 64 threads over the sequential implementation executed on a management processing element.
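To make the idea of composing loop transformations with explicit memory management more concrete, the following is a minimal sketch in plain C, not the generator's actual output. It tiles a matrix multiplication, distributes output tiles across workers, and stages each tile through small local buffers so that every output tile is loaded and written back only once. The names `TILE`, `NUM_WORKERS`, `load_tile`, `store_tile`, `gemm_tiled_worker`, and `gemm_tiled` are hypothetical, `memcpy` merely stands in for a data transfer between main memory and a local store, and the workers are emulated sequentially rather than spawned on real computing processing elements.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 8          /* hypothetical tile edge, small enough for three local buffers */
#define NUM_WORKERS 64  /* the abstract reports results on 64 threads */

/* Reference version: the untransformed triple loop computing C += A * B. */
void gemm_seq(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = C[i * n + j];
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

/* Copy one TILE x TILE block from a row-major n x n matrix into a dense
 * local buffer. memcpy is a stand-in for a real transfer primitive. */
static void load_tile(size_t n, const double *src, size_t r0, size_t c0,
                      double *buf) {
    for (size_t r = 0; r < TILE; r++)
        memcpy(buf + r * TILE, src + (r0 + r) * n + c0, TILE * sizeof(double));
}

/* Copy a dense local buffer back into the n x n matrix. */
static void store_tile(size_t n, double *dst, size_t r0, size_t c0,
                       const double *buf) {
    for (size_t r = 0; r < TILE; r++)
        memcpy(dst + (r0 + r) * n + c0, buf + r * TILE, TILE * sizeof(double));
}

/* Work assigned to one worker: output tiles whose linear index is
 * congruent to worker_id modulo NUM_WORKERS. */
void gemm_tiled_worker(size_t n, const double *A, const double *B, double *C,
                       size_t worker_id) {
    assert(n % TILE == 0);  /* assumption: no partial tiles in this sketch */
    double a[TILE * TILE], b[TILE * TILE], c[TILE * TILE];
    size_t tiles = n / TILE;
    for (size_t t = worker_id; t < tiles * tiles; t += NUM_WORKERS) {
        size_t i0 = (t / tiles) * TILE, j0 = (t % tiles) * TILE;
        load_tile(n, C, i0, j0, c);          /* output tile loaded once */
        for (size_t k0 = 0; k0 < n; k0 += TILE) {
            load_tile(n, A, i0, k0, a);
            load_tile(n, B, k0, j0, b);
            for (size_t i = 0; i < TILE; i++)
                for (size_t j = 0; j < TILE; j++)
                    for (size_t k = 0; k < TILE; k++)
                        c[i * TILE + j] += a[i * TILE + k] * b[k * TILE + j];
        }
        store_tile(n, C, i0, j0, c);         /* and written back once */
    }
}

/* Sequential emulation of all workers; a real port would launch them in
 * parallel through the target's threading interface. */
void gemm_tiled(size_t n, const double *A, const double *B, double *C) {
    for (size_t w = 0; w < NUM_WORKERS; w++)
        gemm_tiled_worker(n, A, B, C, w);
}
```

Calling `gemm_tiled(n, A, B, C)` with `n` a multiple of `TILE` gives the same result as `gemm_seq`. Keeping the output tile resident in the local buffer across the whole `k0` loop is one simple way to cut traffic between main memory and a worker's local store, which is the kind of data movement the abstract says the generator tries to minimize.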

Citation

Tao XH, Zhu Y, Pang JM, Zhao J, Xu JL. Parallel code generation for Sunway heterogeneous architecture. Ruan Jian Xue Bao/Journal of Software, 2023, 34(4):1570-1593.

History
  • Received: November 25, 2021
  • Revised: February 2, 2022
  • Online: April 4, 2023
  • Published: April 6, 2023