Abstract: The rapid development of quantum computers poses a significant threat to existing cryptographic systems, which makes the implementation of, and migration to, post-quantum cryptographic algorithms a pressing concern. Among these algorithms, NTRU lattice-based schemes have gained attention for their simplicity and computational efficiency. The CTRU-Prime scheme, based on NTRU lattices, stands out for its security, bandwidth, and implementation efficiency. Because GPUs are well suited to large-scale parallel workloads, this study presents the first high-throughput implementation of CTRU-Prime using Tensor Cores and Compute Unified Device Architecture (CUDA) cores. The underlying algebraic structure of CTRU-Prime is the large-Galois-group prime-degree prime-ideal number field (LPPNF), which resists attacks targeting cyclotomic rings but makes polynomial multiplication harder to implement. First, two GPU implementations of polynomial multiplication over LPPNF are proposed. The CUDA core-based pseudo-Mersenne incomplete NTT polynomial multiplication uses layer fusion to optimize memory access patterns, achieving a throughput speedup of 256.98 times. The Tensor Core-based schoolbook polynomial multiplication converts polynomial multiplication into matrix operations and leverages low-precision matrix-multiply-and-accumulate (MMA) operations, achieving a throughput speedup of 177.24 times. Next, a throughput-oriented overall architecture for CTRU-Prime on the GPU platform is presented. This architecture combines batch and single modes with multi-stream and multi-threading techniques. Fused kernels, coalesced global memory access, and optimized memory access patterns are employed to accelerate both memory access and computation in the individual kernel functions.
Experimental results show that, on the RTX 3060 platform, CTRU-Prime-653, CTRU-Prime-761, and CTRU-Prime-1277 perform key generation at 63000, 54000, and 16000 operations per second, respectively; key encapsulation at 635000, 2745000, and 1601000 operations per second, respectively; and key decapsulation at 351000, 2622000, and 1524000 operations per second, respectively. Relative to the C implementation, these rates are 68.85, 79.78, and 66.84 times higher for key generation; 10.32, 46.57, and 46.81 times higher for key encapsulation; and 11.43, 89.19, and 90.32 times higher for key decapsulation. Compared with the latest Kyber implementation, the key encapsulation throughput is 1.46 times higher and the key decapsulation throughput is 1.74 times higher. The implementation is also 26 times more efficient than other high-throughput NTRU lattice-based GPU implementations.
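
To illustrate how schoolbook polynomial multiplication can be converted into a matrix operation, the following is a minimal CPU-side sketch in Python, not the GPU kernels described above. It assumes the NTRU Prime-style ring Z_q[x]/(x^n - x - 1); the function names and the toy parameters n = 3, q = 7 are hypothetical and chosen only for readability. One operand is expanded into a Toeplitz matrix, so the product becomes a matrix-vector multiplication followed by a reduction using x^n = x + 1.

```python
def toeplitz_matrix(a, n):
    """Build the (2n-1) x n matrix T with T[i][j] = a[i-j], zero elsewhere."""
    return [[a[i - j] if 0 <= i - j < n else 0 for j in range(n)]
            for i in range(2 * n - 1)]


def matvec_mod(T, b, q):
    """Matrix-vector product with every entry reduced modulo q."""
    return [sum(t * x for t, x in zip(row, b)) % q for row in T]


def reduce_mod_xn_x_1(c, n, q):
    """Reduce a polynomial of degree <= 2n-2 modulo x^n - x - 1 (assumed ring)."""
    c = list(c)
    for k in range(len(c) - 1, n - 1, -1):
        # x^k = x^(k-n) * x^n = x^(k-n+1) + x^(k-n)
        c[k - n + 1] = (c[k - n + 1] + c[k]) % q
        c[k - n] = (c[k - n] + c[k]) % q
        c[k] = 0
    return c[:n]


def polymul(a, b, n, q):
    """Multiply a and b (coefficient lists, lowest degree first) in the ring."""
    unreduced = matvec_mod(toeplitz_matrix(a, n), b, q)
    return reduce_mod_xn_x_1(unreduced, n, q)


if __name__ == "__main__":
    # (1 + 2x + 3x^2) * (4 + 5x + 6x^2) in Z_7[x]/(x^3 - x - 1)
    print(polymul([1, 2, 3], [4, 5, 6], n=3, q=7))
```

When many multiplications share one operand or are processed as a batch, the right-hand vectors can be stacked as columns, which turns the matrix-vector product into a matrix-matrix product. That is the general shape of work MMA hardware accepts, although the actual tiling, precision splitting, and reduction strategy used on Tensor Cores are not shown here.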