支持乱序执行的Raft协议
作者:
作者简介:

谷晓松(1997-),男,学士,CCF学生会员,主要研究领域为分布式数据一致性,分布式系统,形式化方法.
乔磊(1982-),男,博士,研究员,CCF专业会员,主要研究领域为航天器嵌入式操作系统设计及验证.
魏恒峰(1986-),男,博士,CCF专业会员,主要研究领域为分布数据一致性,形式化方法.
黄宇(1982-),男,博士,教授,博士生导师,CCF专业会员,主要研究领域为分布式算法,分布式系统,网络化软件系统.

通讯作者:

魏恒峰,Email:hfwei@nju.edu.cn;乔磊,Email:fly2moon@aliyun.com

基金项目:

国家自然科学基金(61932021,61772258);五〇二所空间先进计算与电子信息专业实验室开放基金(OBCandETL-2020-04)


Raft with Out-of-order Executions
Author:
Fund Project:

National Natural Science Foundation of China (61932021, 61772258); Space Advanced Computing and Electronic Information Laboratory of BICE (OBCandETL-2020-04)

  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [39]
  • |
  • 相似文献 [20]
  • | | |
  • 文章评论
    摘要:

    PolarFS是阿里巴巴开发的分布式文件系统,它实现了分布式共识协议Raft的一种变体,称为ParallelRaft.ParallelRaft突破了Raft中顺序提交、顺序执行的限制,允许状态机乱序执行用户命令.然而文献表明:ParallelRaft并未开源,仅有简短的文字描述,更缺乏严格的形式化规约.更进一步,它的正确性也尚未经过必要的数学论证或形式化检验.旨在为ParallelRaft提供严格的形式化规约并证明其正确性,主要贡献包括:首先,为了理清ParallelRaft与Raft之间的关系,提出了允许乱序提交、顺序执行的ParallelRaft-SE (sequential execution)协议,并建立了从ParallelRaft-SE到Multi-Paxos的精化关系;其次,现有的ParallelRaft描述忽略了可能会违反状态一致性的“幽灵日志”问题,因此在ParallelRaft-SE的基础上提出了ParallelRaft-CE (concurrent execution)协议.ParallelRaft-CE限制了ParallelRaft-SE在乱序提交阶段的并行度,避免了“幽灵日志”问题,支持乱序执行,并保证乱序执行下的状态机一致性.证明了ParallelRaft-CE的正确性.最后,使用TLA+给出了ParallelRaft-SE和ParallelRaft-CE的形式化规约,并对协议参与者数量较小的情形,使用TLC模型检验与模拟测试工具验证了从ParallelRaft-SE到Multi-Paxos的精化关系以及ParallelRaft-CE的正确性.

    Abstract:

    PolarFS is a distributed file system developed by Alibaba Inc. with ultra-low latency and high availability. It implemented a variant of the Raft consensus protocol, called ParallelRaft. ParallelRaft breaks Raft's strict serialization restrictions in the commitment and execution of log entries and enables state machines to commit and execute log entries in an out-of-order way. However, ParallelRaft is not open-sourced. It has only a brief description, lacking formal specification. Moreover, the correctness of ParallelRaft has not been manually proven or formally checked. The purpose of the study is to provide a precise formal specification for ParallelRaft and to prove its correctness. Specifically, the following main contributions are accomplished. First, to clarify the relationship between Raft and ParallelRaft, ParallelRaft-SE (sequential execution) is proposed, which allows out-of-order commitment but prohibits out-of-order executions. Also, a refinement mapping from ParallelRaft-SE to Multi-Paxos is established. Second, it is discovered that ParallelRaft, according to its brief description in the literature, neglects the so-called "ghost log entries" phenomenon, which may violate the consistency among state machines. Therefore, based on ParallelRaft-SE, ParallelRaft-CE (concurrent execution) is proposed. ParallelRaft-CE avoids the "ghost log entries" phenomenon and ensures the consistency among state machine when executing concurrently by limiting parallelism in the commitment of log entries. The correctness of ParallelRaft-CE is proved manually. Finally, the formal specifications of ParallelRaft-SE and ParallelRaft-CE using TLA+ (TLA stands for temporal logic of actions) are provided, and the refinement mapping from ParallelRaft-SE to Multi-Paxos and the correctness of ParallelRaft-CE are verified using the TLC model checker when the number of participants of the protocols is small.

    参考文献
    [1] Fischer MJ, Lynch NA, Paterson MS. Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM), 1985,32(2):374-382.[doi:https://doi.org/10.1145/3149.214121]
    [2] Herlihy M. Wait-Free synchronization. ACM Trans. on Programming Languages and Systems (TOPLAS), 1991,13(1):124-149.[doi:https://doi.org/10.1145/114005.102808]
    [3] Weil SA. Ceph:Reliable, scalable, and high-performance distributed storage[Ph.D. Thesis]. Santa Cruz:University of California, 2007.
    [4] Corbett JC, Dean J, Epstein M, Fikes A, Frost C, Furman JJ, Ghemawat S, Gubarev A, Heiser C, Hochschild P, Hsieh W. Spanner:Google's globally distributed database. ACM Trans. on Computer Systems (TOCS), 2013,31(3):1-22.
    [5] https://www.mysql.com/
    [6] Zheng J, Lin Q, Xu J, Wei C, Zeng C, Yang P, Zhang Y. PaxosStore:High-availability storage made practical in WeChat. Proc. of the VLDB Endowment, 2017,10(12):1730-1741.
    [7] Cao W, Liu Z, Wang P, Chen S, Zhu C, Zheng S, Wang Y, Ma G. PolarFS:An ultra-low latency and failure resilient distributed file system for shared storage cloud database. Proc. of the VLDB Endowment, 2018,11(12):1849-1862.
    [8] Lamport L. Paxos made simple. ACM Sigact News, 2001,32(4):18-25.
    [9] Lamport L. The part-time parliament. ACM Trans. on Computer Systems (TOCS), 1998,16(2):133-169.
    [10] Ongaro D, Ousterhout J. In search of an understandable consensus algorithm. In:Proc. of the 2014 USENIX Annual Technical Conf. ({USENIX}{ATC} 2014). 2014. 305-319.
    [11] Schneider, Fred B. Implementing fault-tolerant services using the state machine approach:A tutorial. ACM Computing Surveys, 1990,22(4):299-319.
    [12] Abadi M, Lamport L. The existence of refinement mappings. Theoretical Computer Science, 1991,82(2):253-284.
    [13] Lamport L. Specifying Systems:The TLA+ Language and Tools for Hardware and Software Engineers. Boston:Addison-Wesley Longman Publishing Co., Inc., 2002.
    [14] Lamport L. The TLA+ hyperbook. 2015. http://research.microsoft.com/enus/um/people/lamport/tla/hyperbook.html
    [15] Lamport L. The temporal logic of actions. ACM Trans. on Programming Languages and Systems (TOPLAS), 1994,16(3):872-923.
    [16] Yu Y, Manolios P, Lamport L. Model checking TLA+ specifications. In:Proc. of the Advanced Research Working Conf. on Correct Hardware Design and Verification Methods. Berlin, Heidelberg:Springer-Verlag, 1999. 54-66.
    [17] https://github.com/HappyCS-Gu/Parallel-Raft-tla
    [18] Yi XC, Wei HF, Huang Y, Qiao L, Lü J. TPaxos consensus protocol in PaxosStore:Derivation, specification, and refinement. Ruan Jian Xue Bao/Journal of Software, 2020,31(8):2336-2361(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5964.htm[doi:10.13328/j.cnki.jos.005964]
    [19] Ji Y, Wei HF, Huang Y, Lü J. Specifying and verifying CRDT protocols using TLA+. Ruan Jian Xue Bao/Journal of Software, 2020,31(5):1332-1352(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5956.htm[doi:10.13328/j.cnki.jos. 005956]
    [20] Lamport L. Summary of tla+. http://lamport.azurewebsites.net/tla/summary-standalone.pdf
    [21] Yuan D, Luo Y, Zhuang X, Rodrigues GR, Zhao X, Zhang Y, Jain PU, Stumm M. Simple testing can prevent most critical failures:An analysis of production failures in distributed data-intensive systems. In:Proc. of the 11th USENIX Symp. on Operating Systems Design and Implementation (OSDI). 2014. 249-265.
    [22] Ongaro D. Consensus:Bridging theory and practice[Ph.D. Thesis]. Stanford University, 2014.
    [23] Lampson B. The ABCD's of Paxos. In:Proc. of the 20th Annual ACM Symp. on Principles of Distributed Computing (PODC), Vol.1. 2001. 13.
    [24] Van Renesse R, Altinbuken D. Paxos made moderately complex. ACM Computing Surveys (CSUR), 2015,47(3):1-36.
    [25] Gafni E, Lamport L. Disk Paxos. Distributed Computing, 2003,16(1):1-20.
    [26] Lamport L, Massa M. Cheap Paxos. In:Proc. of the Int'l Conf. on Dependable Systems and Networks. 2004. 307-314.
    [27] Lamport L. Fast Paxos. Distributed Computing, 2006,19(2):79-103.
    [28] Lamport L. Generalized consensus and Paxos. Technical Report, Microsoft Research, 2005.
    [29] Lamport L, Malkhi D, Zhou L. Stoppable Paxos. Technical Report, Microsoft Research, 2008.
    [30] Lamport L, Malkhi D, Zhou L. Vertical Paxos and primary-backup replication. In:Proc. of the 28th ACM Symp. on Principles of Distributed Computing. 2009. 312-313.
    [31] Lamport L. Byzantizing Paxos by refinement. In:Proc. of the Int'l Symp. on Distributed Computing. 2011. 211-224.
    [32] Moraru I, Andersen DG, Kaminsky M. There is more consensus in egalitarian parliaments. In:Proc. of the 24th ACM Symp. on Operating Systems Principles. 358-372.
    [33] Wang Z, Zhao C, Mu S, Chen H, Li J. On the parallels between Paxos and Raft, and how to port optimizations. In:Proc. of the 2019 ACM Symp. on Principles of Distributed Computing. 2019. 445-454.
    [34] Guo Z, Hong C, Yang M, Zhou D, Zhou L, Zhuang L. Paxos made parallel. Technical Report, Microsoft Research Asia, 2012. 118.
    [35] Lamport L, Merz S, Doligez D. A TLA + specification of Paxos and its refinement. 2019. https://github.com/tlaplus/Examples/tree/master/specifications/Paxos
    [36] Chaudhuri K, Doligez D, Lamport L, Merz SA. TLA+ proof system. In:Proc. of the LPAR Workshops, CEUR Workshop. 2008. 17-37.
    附中文参考文献:
    [18] 易星辰,魏恒峰,黄宇,乔磊,吕建.PaxosStore中共识协议TPaxos的推导、规约与精化.软件学报,2020,31(8):2336-2361. http://www.jos.org.cn/1000-9825/5964.htm[doi:10.13328/j.cnki.jos.005964]
    [19] 纪业,魏恒峰,黄宇,吕建.CRDT协议的TLA+描述与验证.软件学报,2020,31(5):1332-1352. http://www.jos.org.cn/1000-9825/31/5956.htm[doi:10.13328/j.cnki.jos.005956]
    引证文献
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

谷晓松,魏恒峰,乔磊,黄宇.支持乱序执行的Raft协议.软件学报,2021,32(6):1748-1778

复制
分享
文章指标
  • 点击次数:2382
  • 下载次数: 5681
  • HTML阅读次数: 3116
  • 引用次数: 0
历史
  • 收稿日期:2020-08-30
  • 最后修改日期:2021-01-18
  • 在线发布日期: 2021-02-07
  • 出版日期: 2021-06-06
文章二维码
您是第19791708位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号