Parallel Structured Sparse Triangular Solver for GPU Platform

Authors:
Author biographies:

CHEN Dao-Kun (1994-), male, PhD, assistant researcher; his research interests include high-performance parallel algorithms and related automatic compilation and optimization techniques. YANG Chao (1979-), male, PhD, professor, doctoral supervisor, CCF senior member; his research interests include high-performance computing and scientific and engineering computing. LIU Fang-Fang (1982-), female, professor-level senior engineer, CCF professional member; her research interests include high-performance math libraries, sparse iterative solvers, and heterogeneous many-core parallelism. MA Wen-Jing (1981-), female, associate researcher, CCF professional member; her research interests include high-performance computing and code generation and optimization.

Corresponding author:

YANG Chao, chao_yang@pku.edu.cn

CLC number:

TP30

Funding:

National Key Research and Development Program of China, High Performance Computing Key Project (2020YFB0204601)


Abstract:

Sparse triangular solve (SpTRSV) is a vital operation in preconditioners. Structured SpTRSV problems in particular arise frequently in scientific computing programs that solve partial differential equations with iterative methods, and they are often a performance bottleneck such programs must address. On GPU platforms, commercial mathematical libraries represented by CUSPARSE parallelize SpTRSV with the level-scheduling method, which not only requires time-consuming preprocessing but also suffers from severe GPU thread idling on structured SpTRSV problems. This study proposes a parallel algorithm tailored to structured SpTRSV problems. The algorithm exploits the regular non-zero distribution of structured SpTRSV problems to partition tasks, thereby avoiding any preprocessing analysis of the non-zero structure of the input problem. It also improves the element-wise processing strategy of existing level-scheduling methods, which effectively alleviates GPU thread idling and additionally hides part of the memory access latency of the matrix non-zero elements. Based on the task partitioning characteristics of the algorithm, a state variable compression technique is adopted, significantly improving the cache hit rate of state variable operations. On this basis, GPU hardware features such as predicated execution are further exploited to comprehensively optimize the implementation. Measured on an NVIDIA V100 GPU, the proposed algorithm achieves an average speedup of 2.71× over CUSPARSE and a peak effective memory-access bandwidth of 225.2 GB/s. The improved element-wise processing strategy, combined with a series of GPU-specific tuning measures, delivers a pronounced optimization effect, raising the effective memory-access bandwidth of the algorithm by about 1.15×.
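For readers unfamiliar with the level-scheduling baseline that the abstract contrasts against, the following is a minimal CUDA sketch of that conventional approach. It is not the authors' structured algorithm and not CUSPARSE's internal implementation; the matrix is a lower triangular factor in CSR format, and the arrays levelRows and levelPtr, as well as the kernel and function names, are hypothetical and assumed to come from a separate preprocessing pass over the non-zero structure.

// A minimal sketch of level-scheduling SpTRSV (forward substitution on a
// lower triangular CSR matrix). Rows inside one level do not depend on each
// other, so each level is solved by one kernel launch; earlier levels have
// already finished, hence every x[col] read below is final.
#include <cuda_runtime.h>

__global__ void sptrsv_level_kernel(const int *rowPtr, const int *colIdx,
                                    const double *val, const double *b,
                                    double *x, const int *levelRows,
                                    int levelBegin, int levelEnd) {
    int t = levelBegin + blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= levelEnd) return;
    int row = levelRows[t];            // row assigned to this thread
    double sum = b[row];
    double diag = 1.0;
    for (int p = rowPtr[row]; p < rowPtr[row + 1]; ++p) {
        int col = colIdx[p];
        if (col == row) diag = val[p]; // diagonal entry of this row
        else sum -= val[p] * x[col];   // x[col] was solved in an earlier level
    }
    x[row] = sum / diag;
}

// Host driver: one launch per level on the default stream, so launches run in
// order; levelPtr (host array) gives the range of each level inside levelRows
// (device array). Error checking is omitted for brevity.
void sptrsv_level_scheduling(const int *d_rowPtr, const int *d_colIdx,
                             const double *d_val, const double *d_b,
                             double *d_x, const int *d_levelRows,
                             const int *h_levelPtr, int numLevels) {
    const int threads = 256;
    for (int lvl = 0; lvl < numLevels; ++lvl) {
        int begin = h_levelPtr[lvl], end = h_levelPtr[lvl + 1];
        int n = end - begin;
        if (n == 0) continue;          // a narrow level under-utilizes the GPU
        int blocks = (n + threads - 1) / threads;
        sptrsv_level_kernel<<<blocks, threads>>>(d_rowPtr, d_colIdx, d_val,
                                                 d_b, d_x, d_levelRows,
                                                 begin, end);
    }
    cudaDeviceSynchronize();
}

The sketch makes the two costs mentioned in the abstract visible: the level arrays must be built by analyzing the non-zero structure in advance, and a level containing only a few rows launches far fewer threads than the GPU can keep busy. The proposed structured solver avoids both by deriving the dependency pattern directly from the regular non-zero distribution of structured problems.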

Cite this article:

Chen DK, Yang C, Liu FF, Ma WJ. Parallel structured sparse triangular solver for GPU platform. Journal of Software, 2023, 34(11): 4941–4951 (in Chinese).

History
  • Received: 2021-08-13
  • Revised: 2022-03-08
  • Published online: 2023-04-27
  • Published: 2023-11-06