Abstract: This study presents existing and optimized implementations of batched lower-upper (LU) matrix decomposition and batched inversion algorithms on the graphics processing unit (GPU). For batched LU decomposition, the study analyzes the number of reads from and writes to global memory when the Left-looking, Right-looking, and other commonly used blocked LU decomposition algorithms are implemented on the GPU. The blocked Left-looking algorithm, which transfers less data to and from global memory, is selected as the best fit for the GPU architecture. For pivoting during LU decomposition, a parallel binary tree search algorithm suited to the GPU architecture is adopted. In addition, to reduce the performance cost of the row interchanges that pivoting causes, this study proposes two optimization techniques, namely, the Warp-based packet row interchange and row interchange delay. For batched inversion after LU decomposition, this study investigates the correction method employed in the matrix inversion process. When batched inversion is implemented on the GPU, a blocked matrix inversion algorithm with delayed correction is adopted to reduce global memory accesses during the correction. Furthermore, to speed up data reads and writes, the study makes greater use of registers and shared memory and performs column interchange to reduce the volume of data accessed in memory. In addition, a method that allocates GPU resources dynamically at run time is proposed to avoid idle threads and wasted shared memory and other GPU resources. Compared with static one-time resource allocation, the dynamic allocation method improves the performance of the algorithm significantly.
Finally, 10000 random matrices with sizes between 33 and 190 are tested on the TITAN V GPU for four data types: single-precision complex, double-precision complex, single-precision real, and double-precision real. The floating-point performance of the batched LU decomposition algorithm implemented in this study reaches about 2 TFLOPS, 1.2 TFLOPS, 1 TFLOPS, and 0.67 TFLOPS, respectively. Compared with the implementation in CUBLAS, this algorithm achieves a highest speedup of about 9×, 8×, 12×, and 13×, respectively. Compared with the implementation in MAGMA, the highest speedup achieved is about 1.2×–2.5×, 1.2×–3.2×, 1.1×–3×, and 1.1×–2.7×, respectively. The floating-point performance of the proposed batched inversion algorithm reaches about 4 TFLOPS, 2 TFLOPS, 2.2 TFLOPS, and 1.2 TFLOPS, respectively. Compared with the implementation in CUBLAS, this algorithm achieves a highest speedup of about 5×, 4×, 7×, and 7×, respectively. Compared with the implementation in MAGMA, the speedup is about 2×–3×, 2×–3×, 2.8×–3.4×, and 1.6×–2×, respectively.
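
As an illustration of the parallel binary tree search used for pivoting, the following minimal Python sketch simulates a pairwise (tree) reduction that finds the entry of largest magnitude in a column. It is not the study's GPU kernel: the function name `tree_argmax_abs`, the padding sentinel, and the lowest-index tie-breaking rule are assumptions made for this example, and the inner loop runs serially where a GPU would assign one thread to each comparison.

```python
def tree_argmax_abs(column):
    """Return (index, value) of the largest-magnitude entry of `column`.

    Hypothetical helper: simulates a binary tree reduction in which each
    level halves the number of candidates, giving log2(n) levels.
    """
    n = len(column)
    if n == 0:
        raise ValueError("column must not be empty")

    # Pad to the next power of two with a sentinel magnitude of -1.0,
    # which can never win because real magnitudes are non-negative.
    size = 1
    while size < n:
        size *= 2
    mags = [abs(v) for v in column] + [-1.0] * (size - n)
    idx = list(range(size))

    stride = size // 2
    while stride > 0:
        # On a GPU, the iterations of this loop would run in parallel,
        # one thread per slot t, followed by a synchronization.
        for t in range(stride):
            other = t + stride
            better = mags[other] > mags[t]
            # Assumed tie-break: prefer the lower original index.
            tie = mags[other] == mags[t] and idx[other] < idx[t]
            if better or tie:
                mags[t] = mags[other]
                idx[t] = idx[other]
        stride //= 2

    pivot = idx[0]
    return pivot, column[pivot]


if __name__ == "__main__":
    row, value = tree_argmax_abs([1.0, -7.5, 3.0, 2.0, 7.5])
    print(row, value)
```

Because `abs` also handles Python complex numbers, the same sketch applies to complex columns; a real kernel would keep `mags` and `idx` in shared memory or registers rather than in Python lists.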