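Optimal Design of NUMA-aware Persistent Memory Storage Engine
Authors: Tu Yaofeng, Chen Hedui, Wang Hanyi, Yan Zongshuai, Kong Lu, Chen Bing
Author biographies: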

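Tu Yaofeng (屠要峰, 1972-), male, Ph.D. candidate, research fellow, CCF senior member. Main research interests: big data, databases, machine learning, cloud computing.
Yan Zongshuai (闫宗帅, 1987-), male, master's degree. Main research interests: databases, distributed systems.
Chen Hedui (陈河堆, 1972-), male, master's degree, senior engineer, CCF professional member. Main research interests: databases, distributed systems, data mining and analysis.
Kong Lu (孔鲁, 1989-), male, master's degree. Main research interests: databases, distributed systems.
Wang Hanyi (王涵毅, 1982-), male, master's degree, CCF professional member. Main research interests: databases, distributed systems.
Chen Bing (陈兵, 1970-), male, professor, doctoral supervisor, CCF distinguished member. Main research interests: big data, cloud computing, cognitive radio networks.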

Corresponding author:

Tu Yaofeng, E-mail: 13605151819@qq.com

Funding:

National Key Research and Development Program of China (2019YFB2102002); Key Research and Development Program of Jiangsu Province (BE2019012)


    Abstract:

    Persistent memory (PM) is non-volatile, byte-addressable, low-latency, and high-capacity. It blurs the boundary between traditional main memory and external storage and has a disruptive impact on existing software architectures. However, current PM hardware still suffers from problems such as uneven wear and asymmetric read and write performance; in particular, I/O performance degrades severely when PM is accessed across NUMA (non-uniform memory access) nodes. This paper proposes a NUMA-aware optimization design for a PM storage engine and applies it to GoldenX, ZTE's new-generation database system, where it significantly reduces the overhead of accessing persistent memory across NUMA nodes. The main contributions are as follows. First, a data space distribution strategy and a distributed access model across NUMA nodes are proposed for a hybrid DRAM+PM memory architecture, enabling efficient use of the PM data space. Second, to address the high overhead of cross-NUMA PM access, an I/O proxy routine access method is proposed that converts the cost of a cross-NUMA PM access into the cost of one remote DRAM memory copy plus a local PM access; in addition, a Cache Line Area (CLA) cache page mechanism is designed to alleviate I/O write amplification and improve the efficiency of local PM access. Third, the traditional tablespace concept is extended so that each tablespace has both its own table data storage and its own dedicated WAL (write-ahead logging) storage. For this distributed WAL storage architecture, a transaction processing mechanism based on global sequence numbers is proposed, which removes the single-point WAL performance bottleneck, and NUMA-aware optimization mechanisms and algorithms for transaction processing, checkpointing, and disaster recovery are implemented. Experimental results show that the proposed method effectively improves the performance of the PM storage engine under the NUMA architecture, by 105%-317% across various YCSB test scenarios and by 90%-134% in the TPC-C scenario.

    Keywords: database; storage engine; persistent memory; NUMA (non-uniform memory access); WAL (write-ahead logging)
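
The abstract does not describe how the global sequence numbers are assigned or consumed, so the following is only a rough illustration of the general idea of per-tablespace WALs ordered by a shared sequence number, not the GoldenX implementation. All names (`GlobalSequence`, `TablespaceWal`, `replay_order`) are hypothetical, the logs are plain in-memory lists rather than PM regions, and NUMA placement is represented only by an integer label.

```python
import heapq
import itertools
import threading


class GlobalSequence:
    """Hypothetical shared counter that hands out global sequence numbers (GSNs)."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            return next(self._counter)


class TablespaceWal:
    """Hypothetical per-tablespace log, imagined as local to one NUMA node."""

    def __init__(self, numa_node, sequence):
        self.numa_node = numa_node      # label only; no real NUMA binding here
        self._sequence = sequence
        self._lock = threading.Lock()
        self.records = []               # list of (gsn, payload) tuples

    def append(self, payload):
        # The GSN is taken while holding this log's lock, so records inside
        # one log are always stored in increasing GSN order.
        with self._lock:
            gsn = self._sequence.next()
            self.records.append((gsn, payload))
            return gsn


def replay_order(wals):
    """Merge the per-tablespace logs into one stream ordered by GSN."""
    return list(heapq.merge(*(wal.records for wal in wals)))


# Two tablespaces on different (assumed) NUMA nodes share one sequence.
seq = GlobalSequence()
wal0 = TablespaceWal(numa_node=0, sequence=seq)
wal1 = TablespaceWal(numa_node=1, sequence=seq)

wal0.append("T1: update row a")
wal1.append("T2: update row b")
wal0.append("T1: commit")
wal1.append("T2: commit")

for gsn, payload in replay_order([wal0, wal1]):
    print(gsn, payload)
```

In this sketch, writers append only to the log of their own tablespace, so no single log serializes all transactions; the shared counter is the only global point of contact, and a recovery pass can rebuild one total order by merging the logs on their GSNs.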

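Cite this article

Tu YF, Chen HD, Wang HY, Yan ZS, Kong L, Chen B. Optimal design of NUMA-aware persistent memory storage engine. Ruan Jian Xue Bao/Journal of Software, 2022, 33(3):891-908 (in Chinese with English abstract).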

History
  • Received: 2021-06-30
  • Last revised: 2021-07-31
  • Published online: 2021-10-21
  • Publication date: 2022-03-06