Parallel Structured Sparse Triangular Solver for GPU Platform

CLC Number: TP30

Abstract:

    Sparse triangular solve (SpTRSV) is a vital operation in preconditioners. In particular, in scientific computing programs that iteratively solve systems of partial differential equations, structured SpTRSV is a common problem type and often a performance bottleneck. Commercial mathematical libraries for the graphics processing unit (GPU) platform, represented by CUSPARSE, parallelize SpTRSV by level-scheduling methods. When applied to structured SpTRSV problems, however, this approach suffers from time-consuming preprocessing and severe GPU thread idling. This study proposes a parallel algorithm tailored to structured SpTRSV problems. The proposed algorithm exploits the special non-zero distribution pattern of structured SpTRSV problems during task allocation, which allows it to skip the preprocessing step that analyzes the non-zero structure of the input problem. Furthermore, the element-wise operation strategy used in existing level-scheduling methods is modified, which effectively alleviates GPU thread idling and hides the memory-access latency of some non-zero elements of the matrix. Based on the task allocation characteristics of the proposed algorithm, a state-variable compression technique is adopted, significantly improving the cache hit rate of state-variable operations. Additionally, several GPU hardware features, including predicated execution, are exploited to comprehensively optimize the implementation. The proposed algorithm is evaluated on an NVIDIA V100 GPU, achieving an average 2.71× speedup over CUSPARSE and a peak effective memory-access bandwidth of 225.2 GB/s. The modified element-wise operation strategy, combined with a series of other GPU hardware optimizations, yields a nearly 115% increase in the effective memory-access bandwidth of the proposed algorithm.
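As background for the level-scheduling idea the abstract contrasts against: in a structured problem such as the lower-triangular part of a 2D 5-point stencil system, each unknown x[i][j] depends only on x[i-1][j] and x[i][j-1], so all cells on an anti-diagonal i + j = const form one level and are mutually independent. Because this level structure follows directly from the grid, it need not be discovered by preprocessing the non-zero pattern. The sketch below is an illustration of this principle, not the paper's algorithm; the stencil coefficients `diag`, `west`, and `south` are hypothetical parameters, and the levels are swept sequentially on the CPU where a GPU would process each level in parallel.

```python
def structured_sptrsv(n, diag, west, south, b):
    """Forward substitution for the lower-triangular part of a 2D
    5-point stencil system on an n-by-n grid:

        diag*x[i][j] + west*x[i][j-1] + south*x[i-1][j] = b[i][j].

    Cells on the same anti-diagonal (i + j = const) do not depend on
    each other, so each level could be solved concurrently (one GPU
    thread per cell); here the levels are traversed sequentially.
    """
    x = [[0.0] * n for _ in range(n)]
    for level in range(2 * n - 1):              # anti-diagonals of the grid
        for i in range(max(0, level - n + 1), min(level, n - 1) + 1):
            j = level - i
            s = b[i][j]
            if j > 0:
                s -= west * x[i][j - 1]         # west neighbor, solved at level-1
            if i > 0:
                s -= south * x[i - 1][j]        # south neighbor, solved at level-1
            x[i][j] = s / diag
    return x
```

On a GPU, a level-scheduling solver assigns each level to a batch of threads; the thread idling the abstract refers to arises when a level holds fewer cells than the hardware can occupy, which is why the first and last anti-diagonals of a structured grid are the least efficient.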

Get Citation

Chen DK, Yang C, Liu FF, Ma WJ. Parallel structured sparse triangular solver for GPU platform. Journal of Software, 2023, 34(11): 4941–4951 (in Chinese).
History
  • Received: August 13, 2021
  • Revised: March 08, 2022
  • Online: April 27, 2023
  • Published: November 06, 2023
Copyright: Institute of Software, Chinese Academy of Sciences