基于CUDA Core和Tensor Core的CTRU-Prime高吞吐量实现

doi:10.13328/j.cnki.jos.007559

微信小程序

微信服务号

微信订阅号

首页 > 过刊浏览>年第卷第期 >1-20. DOI:10.13328/j.cnki.jos.007559

PDF HTML阅读 XML下载导出引用引用提醒

基于CUDA Core和Tensor Core的CTRU-Prime高吞吐量实现
DOI:
                        10.13328/j.cnki.jos.007559
                    
CSTR:
                        
                    
作者:
                        
                        
                    
作者单位:
作者简介:
通讯作者:
中图分类号:TP309
基金项目:上海市协同创新基金(XTCX-KJ-2023-54); 上海市科委区块链关键技术攻关专项基金(23511100300)

CTRU-Prime High-throughput Implementation Based on CUDA Core and Tensor Core

Author:

Affiliation:

Fund Project:

摘要

图/表

访问统计

参考文献

相似文献

引证文献

资源附件

文章评论

摘要:

量子计算机的迅猛发展对现存密码体制造成了极大的威胁, 后量子密码算法的实现和迁移部署尤为重要. 其中, 基于NTRU格的密码方案因结构简洁、计算效率高等优点备受瞩目. CTRU-Prime方案基于NTRU格构造, 鉴于其在安全性、带宽和实现效率上的出色表现和GPU在大规模并行处理任务上的强大能力, 在Tensor Core和CUDA (compute unified device architecture) Core的基础上给出了CTRU-Prime的首个高吞吐量实现. CTRU-Prime的底层代数结构为素阶数域, 在抵御针对分圆环攻击的同时, 也为多项式乘法的实现带来挑战. 首先, 提出两种素阶数域上多项式乘法的GPU实现方案. 基于CUDA Core的伪梅森数不完整NTT的多项式乘法使用层融合技术优化访存模式, 能够达到256.98倍吞吐量, 基于Tensor Core的教科书式多项式乘法, 将多项式乘法转化为矩阵操作, 利用低精度MMA (matrix-multiply-and-accumulate)操作实现, 能够达到177.24倍吞吐量. 接着, 结合批量模式和单一模式、多流技术和多线程技术, 给出了GPU平台上面向吞吐量的CTRU-Prime总体架构, 使用融合内核、合并全局内存访问、优化访存模式等优化策略, 加快各个核函数的访存和计算速度. 实验结果表明, 基于RTX3060平台, CTRU-Prime-653、CTRU-Prime-761、CTRU-Prime-1277每秒钟可以分别进行密钥生成6.3、5.4、1.6万次, 密钥封装63.5、274.5、160.1万次, 密钥解封装35.1、262.2、152.4万次, 是C实现版密钥生成吞吐量的68.85、79.78、66.84倍, 密钥封装吞吐量的10.32、46.57、46.81倍, 密钥解封装吞吐量的11.43、89.19、90.32倍. 同最新实现的Kyber相比, 密钥封装吞吐量达到1.46倍, 密钥解封装达到1.74倍, 是其他NTRU格基GPU高吞吐量实现的26倍.

Abstract:

The rapid development of quantum computers poses significant threats to existing cryptographic systems. The implementation and migration of post-quantum cryptographic algorithms are therefore of utmost importance. Among these, NTRU lattice-based cryptographic schemes have gained attention due to their simplicity and computational efficiency. The CTRU-Prime scheme, based on NTRU lattices, stands out for its excellent performance in security, bandwidth, and implementation efficiency. Given the powerful capabilities of GPUs in handling large-scale parallel processing tasks, this study presents the first high-throughput implementation of CTRU-Prime using Tensor Core and compute unified device architecture (CUDA) Core. The underlying algebraic structure of CTRU-Prime is large-Galois-group prime-degree prime-ideal number field (LPPNF), which not only resists attacks targeting cyclotomic rings but also presents challenges for the implementation of polynomial multiplication. First, two GPU implementations of polynomial multiplication over LPPNF are proposed. The CUDA Core-based Pseudo-Mersenne incomplete NTT polynomial multiplication uses layer fusion techniques to optimize memory access patterns, achieving a throughput of 256.98 times. The Tensor Core-based schoolbook polynomial multiplication converts polynomial multiplication into matrix operations, leveraging low-precision matrix-multiply-and-accumulate (MMA) operations, achieving a throughput of 177.24 times. Next, an overall architecture for CTRU-Prime on the GPU platform is presented, focusing on throughput. This architecture combines batch mode and single mode, multi-stream technology, and multi-thread techniques. Optimization strategies such as fused kernels, coalesced global memory access, and optimized memory access patterns are employed to accelerate memory access and computation speeds of various kernel functions. Experimental results show that, on the RTX 3060 platform, CTRU-Prime-653, CTRU-Prime-761, and CTRU-Prime-1277 can perform key generation at rates of 63000, 54000, and 16000 times per second, respectively; key encapsulation at rates of 635000, 2745000, and 1601000 times per second, respectively; and key decapsulation at rates of 351000, 2622000, and 1524000 times per second, respectively. These rates are 68.85, 79.78, and 66.84 times higher for key generation, 10.32, 46.57, and 46.81 times higher for key encapsulation, and 11.43, 89.19, and 90.32 times higher for key decapsulation compared to the C implementation. Compared to the latest Kyber implementation, the key encapsulation throughput is 1.46 times higher, and the key decapsulation throughput is 1.74 times higher, making it 26 times more efficient than other high-throughput NTRU lattice-based GPU implementations.

参考文献

相似文献

引证文献

引用本文

胡晓雯,邹恒川,沈诗羽,李文倩,赵运磊.基于CUDA Core和Tensor Core的CTRU-Prime高吞吐量实现.软件学报,,():1-20

复制

文章指标

点击次数:
下载次数:
HTML阅读次数:
引用次数:

历史

收稿日期:2024-11-26
最后修改日期:2025-03-17
录用日期:
在线发布日期: 2026-01-14
出版日期:

微信小程序

微信服务号

微信订阅号

引用本文

相关视频

分享

文章指标

历史

文章二维码