榫卯：一种可组合的定制化内存分配框架

doi:10.13328/j.cnki.jos.006830

微信服务号

微信订阅号

2025年6月1日 15:55 星期日

首页 > 过刊浏览>2024年第35卷第4期 >2076-2098. DOI:10.13328/j.cnki.jos.006830

PDF HTML阅读 XML下载导出引用引用提醒

榫卯：一种可组合的定制化内存分配框架
DOI:
                        10.13328/j.cnki.jos.006830
                    
CSTR:
                        
                    
作者:
                        欧阳湘臻欧阳湘臻
西北工业大学 计算机学院, 陕西 西安 710072
在期刊界中查找
在百度中查找
在本站中查找
朱怡安朱怡安
西北工业大学 计算机学院, 陕西 西安 710072
在期刊界中查找
在百度中查找
在本站中查找
史先琛史先琛
西北工业大学 计算机学院, 陕西 西安 710072
在期刊界中查找
在百度中查找
在本站中查找

                    
作者单位:
作者简介:
通讯作者:
中图分类号:
基金项目:国家重点研发计划(2021YFC2802503, 2020YFB1712201)；陕西省重点研发计划(2021ZDLGY05-05)

Mortise: Composable and Customizable Memory Allocator Framework

Author:

OUYANG Xiang-Zhen
OUYANG Xiang-Zhen
School of Computer Science, Northwestern Polytechnical University, Xian 710072, China
在期刊界中查找
在百度中查找
在本站中查找
ZHU Yi-An
ZHU Yi-An
School of Computer Science, Northwestern Polytechnical University, Xian 710072, China
在期刊界中查找
在百度中查找
在本站中查找
SHI Xian-Chen
SHI Xian-Chen
School of Computer Science, Northwestern Polytechnical University, Xian 710072, China
在期刊界中查找
在百度中查找
在本站中查找

Affiliation:

Fund Project:

摘要

图/表

访问统计

参考文献 [46]

相似文献 [20]

引证文献

资源附件

文章评论

摘要:

动态内存分配器是现代应用程序重要组成部分, 它负责管理空闲内存并处理用户内存请求. 现代通用动态内存分配器能够提供较为平衡的性能与内存利用率, 但考虑到不同应用场景的内存使用情况和优化目标不同, 使用通用内存分配器并非最优解. 针对应用场景定制的专用内存分配器通常能够更好地满足系统需要, 然而编写专用内存分配器较为费时, 也容易出错. 开发者通常使用内存分配框架搭建专用动态内存分配器. 然而, 现有的内存分配框架存在抽象能力较差, 组合性与定制性不足的问题. 为此, 从函数式编程视角审视动态内存分配过程, 基于函数可组合性提出了一种可组合的定制化动态内存分配器框架榫卯. 榫卯框架将系统内存分配抽象为多个互不耦合的内存分配层级函数的组合, 这些层级函数能够扩展出策略槽, 以提供更高的定制性和组合性. 榫卯框架基于标准C实现, 依赖C预处理器的元编程特性实现层级函数组合的零性能开销. 开发者能够通过组合与定制分配器的层级函数, 快速构建出适合应用场景的内存分配器. 为了证明榫卯框架的有效性, 使用榫卯框架构建了3种不同的内存分配器实例: tlsfcc, hslab与wfslab, 其中tlsfcc针对多核嵌入式应用场景, 通过替换同步策略优化并发吞吐率; hslab是核心感知的slab式分配器, 通过定制线程缓存优化在异构硬件的性能; wfslab是低延迟的无等待/无锁分配器. 为了评估这3种内存分配器实例, 通过运行基准测试对比现有内存分配器. 实验分别在8核x86/64平台和8核异构aarch64嵌入式平台进行. 实验表明tlsfcc与原始tlsf分配器相比, 在上述两个平台上分别取得了平均1.76和1.59的加速比; 对比hslab与类似架构的tcmalloc, 它在两个平台的平均执行时间仅为tcmalloc的69.6%和85.0%; wfslab则取得了参与实验对比的内存分配器中最小的最差情况内存请求延迟, 其中包括目前最先进的无锁内存分配器mimalloc和snmalloc.

关键词:内存分配;阻塞式同步;异构系统;操作系统;函数式编程

Abstract:

Dynamic memory allocators are fundamental components of modern applications. They manage free memory and handle user memory requests. Modern general-purpose dynamic memory allocators ensure the balance of performance and memory footprint. However, in view of different memory footprints and optimization goals in application scenarios, a general-purpose memory allocator is not the optimal solution. Special-purpose memory allocators for specific application scenarios usually can better satisfy system requirements. However, they are time-consuming and error-prone to implement. Developers often use the memory allocation framework to build special-purpose dynamic memory allocators. However, the existing memory allocator framework has the problems of poor abstraction ability and insufficient composability and customizability. For this reason, this study proposes a composable and customizable dynamic memory allocator framework, namely mortise, based on function composability by reviewing the dynamic memory allocation process from the perspective of functional programming. The framework abstracts system memory allocation as a composition of hierarchical functions of several multiple decoupled memory allocations, and these functions can provide policies to ensure higher customizability and composability. Mortise is implemented by using standard C. To achieve zero performance overhead of hierarchical function composition, mortise uses the metaprogramming features offered by the C preprocessor. Developers can quickly build a memory allocator for targeted application scenarios by composing and customizing the hierarchical function of allocators. In order to prove the effectiveness of mortise, this study presents three different memory allocator instances, namely tlsfcc, hslab, and wfslab, by using mortise. Specifically, tlsfcc is designed for multi-core embedded application scenarios, which improves the parallel throughput by replacing the synchronization strategy; hslab is a core-aware slab-type allocator, which optimizes performance on heterogeneous hardware by customizing thread cache; wfslab is a low-latency and wait-free/lock-free allocator. This study runs benchmarks to compare these allocators with several existing memory allocators. The experiments are carried out on an 8-core x86/64 platform and an 8-core heterogeneous aarch64 embedded platform, and the experimental results show that tlsfcc achieves a mean speedup of 1.76 and 1.59 on the two platforms compared with the original tlsf allocator; hlsab achieves only 69.6% and 85.0% execution time compared with the tcmalloc with a similar architecture; the worst-case memory request latency of wfslab is the smallest among all memory allocators in the experiment, including the state-of-art lock-free memory allocators: mimalloc and snmalloc.

Key words:memory allocation;blocking synchronization;heterogeneous system;operating systems;functional programming

参考文献

[1] Donahue SM, Hampton MP, Deters M, Nye JM, Cytron R, Kavi KM. Storage allocation for real-time, embedded systems. In:Proc. of the 1st Int'l Workshop on Embedded Software. Tahoe City:Springer, 2001. 131-147.

[2] Puaut I. Real-time performance of dynamic memory allocation algorithms. In:Proc. of the 14th Euromicro Conf. on Real-time Systems. Euromicro RTS 2002. New York:IEEE, 2002. 41-49.

[3] Kanev S, Darago JP, Hazelwood K, Ranganathan P, Moseley T, Wei GY, Brooks D. Profiling a warehouse-scale computer. In:Proc. of the 42nd Annual Int'l Symp. on Computer Architecture. Oregon:Association for Computing Machinery, 2015. 158-169.

[4] Hunter AH, Kennelly C, Turner P, Gove D, Moseley T, Ranganathan P. Beyond malloc efficiency to fleet efficiency:A hugepage-aware memory allocator. In:Proc. of the 15th USENIX Symp. on Operating Systems Design and Implementation. New York:USENIX Association, 2021. 257-273.

[5] Berger ED, Zorn BG, McKinley KS. Composing high-performance memory allocators. In:Proc. of the 2001 ACM SIGPLAN Conf. on Programming Language Design and Implementation. Utah:Association for Computing Machinery, 2001. 114-124.

[6] Yun H, Mancuso R, Wu ZP, Pellizzoni R. PALLOC:Dram bank-aware memory allocator for performance isolation on multicore platforms. In:Proc. of the 19th IEEE Real-time and Embedded Technology and Applications Symp. Berlin:IEEE, 2014. 155-166.

[7] Herter J, Backes P, Haupenthal F, Reineke J. CAMA:A predictable cache-aware memory allocator. In:Proc. of the 23rd Euromicro Conf. on Real-time Systems. Porto:IEEE, 2011. 23-32.

[8] 邱杰凡, 华宗汉, 范菁, 刘磊. 内存体系划分技术的研究与发展. 软件学报, 2022, 33(2):751-769. http://www.jos.org.cn/1000-9825/6370.htm

Qiu JF, Hua ZH, Fan J, Liu L. Evolution of memory partitioning technologies:Case study through page coloring. Ruan Jian Xue Bao/Journal of Software, 2022, 33(2):751-769 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/6370.htm

[9] Roghanchi S, Eriksson J, Basu N. Ffwd:Delegation is (much) faster than you think. In:Proc. of the 26th Symp. on Operating Systems Principles. Shanghai:Association for Computing Machinery, 2017. 342-358.

[10] Hendler D, Incze I, Shavit N, Tzafrir M. Flat combining and the synchronization-parallelism tradeoff. In:Proc. of the 32nd Annual ACM Symp. on Parallelism in Algorithms and Architectures. Santorini:Association for Computing Machinery, 2010. 355-364.

[11] Fatourou P, Kallimanis ND. Revisiting the combining synchronization technique. In:Proc. of the 17th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. Louisiana:Association for Computing Machinery, 2012. 257-266.

[12] Dice D, Marathe VJ, Shavit N. Flat-combining NUMA locks. In:Proc. of the 33rd Annual ACM Symp. on Parallelism in Algorithms and Architectures. California:Association for Computing Machinery, 2011. 65-74.

[13] Luchangco V, Nussbaum D, Shavit N. A hierarchical CLH queue lock. In:Proc of the 12th Int'l Conf. on Parallel Processing. Dresden:Springer, 2006. 801-810.

[14] Lozi JP, David F, Thomas G, Lawall JL, Muller G. Remote core locking:Migrating critical-section execution to improve the performance of multithreaded applications. In:Proc. of the 2012 USENIX Conf on Annual Technical Conf. Boston:USENIX Association, 2012. 65-76.

[15] Mellor-Crummey JM, Scott ML. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 1991, 9(1):21-65.[doi:10.1145/103727.103729]

[16] Bracha G, Cook W. Mixin-based inheritance. In:Proc. of the 1990 European Conf. on Object-oriented Programming Systems, Languages, and Applications. Ottawa:ACM, 1990. 303-311.

[17] Masmano M, Ripoll I, Crespo A, Real J. TLSF:A new dynamic memory allocator for real-time systems. In:Proc. of the 16th Euromicro Conf. on Real-time Systems, 2004. ECRTS 2004. Catania:IEEE, 2004. 79-88.

[18] Berger ED, McKinley KS, Blumofe RD, Wilson PR. Hoard:A scalable memory allocator for multithreaded applications. In:Proc. of the 9th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems. Cambridge:Association for Computing Machinery, 2000. 117-128.

[19] Kukanov A, Voss MJ. The foundations for scalable multi-core software in Intel threading building blocks. Intel Technology Journal, 2007, 11(4):309-322.

[20] Leijen D, Zorn BG, de Moura L. Mimalloc:Free list sharding in action. In:Proc. of the 17th Asian Symp. on Programming Languages and Systems. Nusa Dua:Springer, 2019. 244-265.

[21] Liétar P, Butler T, Clebsch S, Drossopoulou S, Franco J, Parkinson MJ, Shamis A, Wintersteiger CM, Chisnall D. Snmalloc:A message passing allocator. In:Proc. of the 2019 ACM SIGPLAN Int'l Symp. on Memory Management. Phoenix:Association for Computing Machinery, 2019. 122-135.

[22] Berger ED, Zorn BG. Diehard:Probabilistic memory safety for unsafe languages. In:Proc. of the 27th ACM SIGPLAN Conf. on Programming Language Design and Implementation. Ottawa:Association for Computing Machinery, 2006. 158-168.

[23] Dice D, Garthwaite A. Mostly lock-free malloc. In:Proc. of the 3rd Int'l Symp. on Memory Management. Berlin:Association for Computing Machinery, 2002. 163-174.

[24] Michael MM. Scalable lock-free dynamic memory allocation. In:Proc. of the 2004 ACM SIGPLAN Conf. on Programming Language Design and Implementation. Washington:Association for Computing Machinery, 2004. 35-46.

[25] Aigner M, Kirsch CM, Lippautz M, Sokolova A. Fast, multicore-scalable, low-fragmentation memory allocation through large virtual memory and global data structures. In:Proc. of the 2015 ACM SIGPLAN Int'l Conf. on Object-oriented Programming, Systems, Languages, and Applications. Pittsburgh:Association for Computing Machinery, 2015. 451-469.

[26] Kuszmaul BC. SuperMalloc:A super fast multithreaded malloc for 64-bit machines. In:Proc. of the 2015 Int'l Symp. on Memory Management. Portland:ACM, 2015. 41-55.

[27] Hendler D, Shavit N, Yerushalmi L. A scalable lock-free stack algorithm. Journal of Parallel and Distributed Computing, 2010, 70(1):1-12.[doi:10.1016/j.jpdc.2009.08.011]

[28] Michael MM, Scott ML. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In:Proc. of the 15th Annual ACM Symp. on Principles of Distributed Computing. Philadelphia:Association for Computing Machinery, 1996. 267-275.

[29] Timnat S, Petrank E. A practical wait-free simulation for lock-free data structures. In:Proc. of the 19th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programmings. Orlando:ACM, 2014. 357-368.

[30] Kogan A, Petrank E. Wait-free queues with multiple enqueuers and dequeuers. In:Proc. of the 16th ACM Symp. on Principles and Practice of Parallel Programming. San Antonio:Association for Computing Machinery, 2011. 223-234.

[31] Fatourou P, Kallimanis ND. Highly-efficient wait-free synchronization. Theory of Computing Systems, 2014, 55(3):475-520.[doi:10.1007/s00224-013-9491-y]

[32] Kogan A, Petrank E. A methodology for creating fast wait-free data structures. In:Proc. of the 17th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. New Orleans:Association for Computing Machinery, 2014. 141-150.

[33] Peng YQ, Hao ZY. FA-stack:A fast array-based stack with wait-free progress guarantee. IEEE Transactions on Parallel and Distributed Systems, 2018, 29(4):843-857.[doi:10.1109/TPDS.2017.2770121]

[34] Yang CR, Mellor-Crummey J. A wait-free queue as fast as fetch-and-add. In:Proc. of the 21st ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. Barcelona:Association for Computing Machinery, 2016. 16.

[35] Wen HS, Izraelevitz J, Cai WT, Beadle HA, Scott ML. Interval-based memory reclamation. In:Proc. of the 23rd ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. Vienna:Association for Computing Machinery, 2018. 1-13.

[36] Michael MM. Hazard pointers:Safe memory reclamation for lock-free objects. IEEE Transactions on Parallel and Distributed Systems, 2004, 15(6):491-504.

[37] Nikolaev R, Ravindran B. Universal wait-free memory reclamation. In:Proc. of the 25th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. San Diego:Association for Computing Machinery, 2020. 130-143.

[38] Powers B, Tench D, Berger ED, McGregor A. Mesh:Compacting memory management for C/C++ applications. In:Proc. of the 40th ACM SIGPLAN Conf. on Programming Language Design and Implementation. Phoenix:Association for Computing Machinery, 2019. 333-346.

[39] Herlihy M. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 1991, 13(1):124-149.[doi:10.1145/114005.102808]

[40] Petrank E, Musuvathi M, Steesngaard B. Progress guarantee for parallel programs via bounded lock-freedom. In:Proc. of the 30th ACM SIGPLAN Conf. on Programming Language Design and Implementation. Dublin:Association for Computing Machinery, 2009. 144-154.

[41] Woo SC, Ohara M, Torrie E, Pal Singh J, Gupta A. The SPLASH-2 programs:Characterization and methodological considerations. In:Proc. of the 22nd Annual Int'l Symp. on Computer Architecture. New York:Association for Computing Machinery, 1995. 24-36.

[42] Grunwald D, Zorn B, Henderson R. Improving the cache locality of memory allocation. In:Proc. of the 1993 ACM SIGPLAN Conf. on Programming Language Design and Implementation. Albuquerque:Association for Computing Machinery, 1993. 177-186.

[43] Correia A, Ramalhete P, Felber P. A wait-free universal construction for large objects. In:Proc. of the 25th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. San Diego:Association for Computing Machinery, 2020. 102-116.

[44] Ramalhete P, Correia A. POSTER:A wait-free queue with wait-free memory reclamation. In:Proc. of the 22nd ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. Austin:Association for Computing Machinery, 2017. 453-454.

[45] Larson PA, Krishnan M. Memory allocation for long-running server applications. In:Proc. of the 1st Int'l Symp. on Memory Management. Vancouver:Association for Computing Machinery, 1998. 176-185.

引用本文

欧阳湘臻,朱怡安,史先琛.榫卯：一种可组合的定制化内存分配框架.软件学报,2024,35(4):2076-2098

复制

文章指标

点击次数:730
下载次数: 2202
HTML阅读次数: 1212
引用次数: 0

历史

收稿日期:2022-04-17
最后修改日期:2022-08-02
录用日期:
在线发布日期: 2023-06-28
出版日期: 2024-04-06

微信服务号

微信订阅号

引用本文

相关视频

分享

文章指标

历史

文章二维码

微信服务号

微信订阅号

引用本文

相关视频

分享

微信扫一扫：分享

文章指标

历史

文章二维码