基于批量LU分解的矩阵求逆在GPU上的有效实现
作者:
作者单位:

作者简介:

刘世芳(1994-),女,博士,主要研究领域为数值并行算法库的实现,异构并行机下并行算法库的优化;赵永华(1966-),男,博士,研究员,博士生导师,主要研究领域为并行算法与软件,并行编程模型,谱聚类数据分析;黄荣锋(1990-),男,博士,主要研究领域为基础数学库在异构平台上的高效实现;于天禹(1985-)男,博士,主要研究领域为数值并行算法,求解器;张馨尹(1994-),女,博士,主要研究领域为高性能计算,张量计算,高光谱图像处理

通讯作者:

赵永华,yhzhao@sccas.cn

中图分类号:

TP301

基金项目:

国家重点研发计划(2020YFB0204802, 2017YFB0202202); 中国科学院战略性先导科技专项(C类)(XDC05000000); 光合基金A类(20210701)


Effective Implementation of Matrix Inversion Based on Batched LU Decomposition on GPU
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    给出批量矩阵的LU分解和批量求逆算法在GPU上实现及优化方法. 针对批量LU分解问题, 分析Left-looking和Right-looking等常用LU分解块算法在GPU上实现时对全局内存的数据读写次数, 针对GPU架构特点, 选择具有较少访存数据量的Left-looking块算法. 在LU分解的选主元过程, 采用适合GPU架构的并行二叉树搜索算法. 此外, 为了降低选主元引起的行交换过程对算法性能的影响, 提出Warp分组行交换和行交换延迟2个优化技术. 针对LU分解后的批量求逆问题, 分析矩阵求逆过程中修正方法, 为了减少修正过程对全局内存的访问, 在批量求逆的GPU实现中采用延迟修正的矩阵求逆块算法. 同时, 为了加快数据读写速度, 采用更多利用寄存器和共享内存的优化方法和减少访存数据量的列交换优化方法. 另外, 为了避免线程的闲置和共享内存等GPU资源浪费, 提出运行时动态GPU资源分配方法, 相较于一次性分配的静资源分配方法性能得到明显提升. 最终, 在TITAN V GPU上, 对10000个规模在33–190之间的随机矩阵进行测试, 测试的数据类型为单精度复数、双精度复数、单精度实数和双精度实数. 所实现的批量LU分解算法的浮点计算性能分别可达到约2 TFLOPS、1.2 TFLOPS、1 TFLOPS、0.67 TFLOPS, 与CUBLAS中的实现相比加速比最高分别达到了约9×、8×、12×、13×, 与MAGMA中的实现相比加速比分别达到了约1.2×–2.5×、1.2×–3.2×、1.1×–3×、1.1×–2.7×. 批量求逆算法的浮点计算性能分别可达到约4 TFLOPS、2 TFLOPS、2.2 TFLOPS、1.2 TFLOPS, 与CUBLAS中的实现相比加速比最高分别达到了约5×、4×、7×、7×, 与MAGMA中的实现相比加速比分别达到了约2×–3×、2×–3×、2.8×–3.4×、1.6×–2×.

    Abstract:

    This study presents the existing and optimized implementation methods for batched lower-upper (LU) matrix decomposition and batched inversion algorithms on the graphics processing unit (GPU). For batched LU decomposition, the study analyzes the number of reads and writes to the global memory when the Left-looking, Right-looking, and other commonly used blocked LU decomposition algorithms are implemented on the GPU. The blocked Left-looking algorithm with less memory access data is selected due to the characteristics of the GPU architecture. In the process of pivoting during LU decomposition, a parallel binary tree search algorithm suitable for the GPU architecture is adopted. In addition, to reduce the impact of the row interchange process caused by the pivoting on the performance of the algorithm, this study proposes two optimization techniques, namely, the Warp-based packet row interchange and row interchange delay. For batched inversion after LU decomposition, this study investigates the correction method employed in the matrix inversion process. When batched inversion is implemented on the GPU, a blocked matrix inversion algorithm with delayed correction is adopted to reduce access to the global memory during the correction. Furthermore, to speed up data reading and writing, the study adopts the optimization method of using more registers and shared memory and that of performing column interchange to reduce memory access data. In addition, a method of dynamic GPU resource allocation during operation is proposed to avoid the idleness of threads and the waste of shared memory and other GPU resources. Compared with the static one-time resource allocation method, the dynamic allocation method improves the performance of the algorithm significantly. Finally, 10000 random matrices with sizes between 33 and 190 data are tested on the TITAN V GPU, and the types of the tested data are single-precision complex, double-precision complex, single-precision real, and double-precision real. The floating-point arithmetic performance of the batched LU decomposition algorithm implemented in this study reaches about 2 TFLOPS, 1.2 TFLOPS, 1 TFLOPS, and 0.67 TFLOPS, respectively. This algorithm achieves the highest speedup of about 9×, 8×, 12×, and 13×, respectively, compared with the implementation in CUBLAS. The highest speedup achieved is about 1.2×–2.5×, 1.2×–3.2×, 1.1×–3×and 1.1×–2.7×, respectively, compared with the implementation in MAGMA. The floating-point arithmetic performance of the proposed batched inversion algorithm can reach about 4 TFLOPS, 2 TFLOPS, 2.2 TFLOPS, and 1.2 TFLOPS, respectively. This algorithm achieves the highest speedup of about 5×, 4×, 7×, and 7×, respectively, compared with the implementation in CUBLAS. The speedup is about 2×–3×, 2×–3×, 2.8×–3.4×and 1.6×–2×, respectively, compared with the implementation in MAGMA.

    参考文献
    相似文献
    引证文献
引用本文

刘世芳,赵永华,黄荣锋,于天禹,张馨尹.基于批量LU分解的矩阵求逆在GPU上的有效实现.软件学报,2023,34(11):4952-4972

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2020-12-28
  • 最后修改日期:2022-01-07
  • 录用日期:
  • 在线发布日期: 2023-05-18
  • 出版日期:
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号