基于TLA+形式化规约的Raft系统测试
作者:
作者单位:

中国科学院软件研究所

基金项目:

国家自然科学基金项目(面上项目,重点项目,重大项目)


Testing Raft Systems Based on TLA+ Formal Specification
Fund Project:

The National Natural Science Foundation of China (General Program, Key Program, Major Research Plan)

  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [57]
  • | |
  • 引证文献
  • | |
  • 文章评论
    摘要:

    Raft是最为流行的分布式共识协议之一.自2014年被提出以来,Raft协议及其变体在各种分布式系统中被广泛应用.为了证明Raft协议的正确性,开发者使用TLA+形式化规约对协议设计进行了建模和验证.但由于抽象的形式化规约与实际的系统实现源码间存在鸿沟,实现Raft协议的分布式系统中仍然会违背协议设计并引入复杂的缺陷.本文设计了基于TLA+形式化规约的测试方法来检测Raft系统中的缺陷.具体而言,我们将形式化规约匹配到相应的Raft系统实现,并用形式化规约所定义的状态空间来指导Raft系统的测试过程.为了评估该方法的可行性和有效性,我们针对两个不同的Raft系统进行了系统化测试,并发现了3个未知缺陷.

    Abstract:

    Raft is one of the most popular distributed consensus protocols. Since it was proposed in 2014, Raft and its variants have been widely used in different kinds of distributed systems. To prove the correctness of the Raft protocol, developers use the TLA+ formal specification to model and verify its design. However, due to the gap between the abstract formal specification and its practical implementation, distributed systems that implement the Raft protocol (i.e., Raft systems) can still violate the protocol design and introduce intricate bugs. In this paper, we propose a novel testing technique based on TLA+ formal specification to unearth bugs in Raft systems. To be specific, we map Raft’s formal specification to the corresponding Raft system, and then use the specification-defined state space to guide the testing in the Raft system. To evaluate the feasibility and effectiveness of our approach, we apply it on two different Raft systems, and find 3 previously-unknown bugs.

    参考文献
    [1] TiDB. (2017). https://github.om/pingcap/tidb.
    [2] CockroachDB. (2020). https://github.com/cockroachdb/cockroach.
    [3] Hunt P, Konar M, Junqueira FP, Reed B. ZooKeeper: wait-free coordination for internet-scale systems. Proceedings of USENIX Annual Technical Conference (USENIX ATC). 2010: 145–158.
    [4] Ongaro D. Consensus: bridging theory and practice. Stanford University, 2014.
    [5] Ethereum. (2022). https://ethereum.org/en/.
    [6] The home of the fabric mod development toolchain. (2020). https://fabricmc.net/.
    [7] Lamport L. The part-time parliament. ACM Transactions on Computer Systems, 1998, 16(2): 133–169.
    [8] Ongaro D, Ousterhout J. In search of an understandable consensus algorithm. Proceedings of the USENIX Annual Technical Conference (USENIX ATC). 2014: 305–319.
    [9] Cao W, Liu Z, Wang P, Chen S, Zhu C, Zheng S, Wang Y, Ma G. PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database. Proceedings of the VLDB Endowment, 2018, 11(12): 1849–1862.
    [10] Raft-java.(2017). https://github.com/wenweihu86/raft-java.
    [11] Xraft.(2018). https://github.com/xnnyygn/xraft.
    [12] Lamport L. The tla+ home page. (1999). https://lamport.azurewebsites.net/tla/tla.html.
    [13] TLA+ specification for the raft consensus algorithm. (2014). https://github.com/ongardie/raft.tla.
    [14] Newcombe C, Rath T, Zhang F, Munteanu B, Brooker M, Deardeuff M. How amazon web services uses formal methods. 2015, 58(4).
    [15] Foundations of azure cosmosdb (multi-master) with dr. leslie lamport. (2018). https://www.youtube.com/watch?v=kYX6UrY_ooA.
    [16] 谷晓松,魏恒峰,乔磊,黄宇.支持乱序执行的Raft协议.软件学报,2021,32(6):1748-1778.http://www.jos.org.cn/1000-9825/6248.htm
    [17] Fonseca P, Zhang K, Wang X, Krishnamurthy A. An empirical study on the correctness of formally verified distributed systems. Proceedings of the 12th European Conference on Computer Systems (EuroSys). 2017: 328–343.
    [18] Hawblitzel C, Howell J, Kapritsos M, Lorch JR, Parno B, Roberts ML, Setty S, Zill B. IronFleet: proving practical distributed systems correct. Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP). 2015: 1–17.
    [19] Wilcox JR, Woos D, Panchekha P, Tatlock Z, Wang X, Ernst MD, Anderson T. Verdi: a framework for implementing and formally verifying distributed systems. Proceedings of the 36th Annual ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 2015, 50(6): 357–368.
    [20] Michael E, Woos D, Anderson T, Ernst MD, Tatlock Z. Teaching rigorous distributed systems with efficient model checking. Proceedings of European Conference on Computer Systems (EuroSys), 2019.
    [21] Yang J, Chen T, Wu M, Xu Z, Liu X, Lin H, Yang M, Long F, Zhang L, Zhou L. MODIST: transparent model checking of unmodified distributed systems. Proceedings of the 6th USENIX symposium on Networked systems design and implementation (NSDI). 2009: 213–228.
    [22] Simsa J, Bryant R, Gibson G. DBug?: systematic evaluation of distributed systems. Proceedings of the 5th international conference on Systems software verification (SSV). 2010: 3–3.
    [23] Yabandeh M, Kne?evi? N, Kosti? D, Kuncak V. Crystalball: predicting and preventing inconsistencies in deployed distributed systems. Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI). 2009: 229–244.
    [24] Leesatapornwongsa T, Hao M, Joshi P, Lukman JF, Gunawi HS. SAMC: semantic-aware model checking for fast discovery of deep bugs in cloud systems. Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 2014: 399–414.
    [25] Lukman JF, Ke H, Stuardo CA, Suminto RO, Kurniawan DH, Simon D, Priambada S, Tian C, Ye F, Leesatapornwongsa T, Gupta A, Lu S, Gunawi HS. FlyMC: highly scalable testing of complex interleavings in distributed systems. Proceedings of the 14th European Conference on Computer Systems (EuroSys). 2019.
    [26] The raft consensus algorithm. (2014). https://raft.github.io.
    [27] Hazelcast.(2008). https://github.com/hazelcast/hazelcast.
    [28] SOFAJRaft.(2019). https://github.com/sofastack/sofa-jraft.
    [29] Zhao C. The practice on developing the distributed consensus algorithm. Peking University Press, 2020.
    [30] The tla+ toolbox. (2010). http://lamport.azurewebsites.net/tla/toolbox.html.
    [31] ASM. . https://asm.ow2.io.
    [32] Python networkx. . https://networkx.org.
    [33] Xraft issue: duplicate vote response can make illegal leader without a quorum. (2022). https://github.com/xnnyygn/xraft/issues/27.
    [34] Xraft commit: handle with canceled votes. (2022). https://github.com/xnnyygn/xraft/pull/28/commits/a48000080b6590402fbf45dd1a06af001d558830.
    [35] Xraft issue: votedfor is not stored when a node is candidate and receives an appendentriesrpc. (2022). https://github.com/xnnyygn/xraft/issues/29.
    [36] Raft-java issue#3.(2017). https://github.com/wenweihu86/raft-java/issues/3.
    [37] Raft-java issue#19.(2019). https://github.com/wenweihu86/raft-java/issues/19.
    [38] The coq proof assistant. (1984). https://coq.inria.fr/.
    [39] Gomes VBF, Kleppmann M, Mulligan DP, Beresford AR. Verifying strong eventual consistency in distributed systems. Proceedings of ACM International Conference on Object Oriented Programming Systems Languages Applications (OOPSLA). 2017(OOPSLA): 1–28.
    [40] Lesani M, Bell CJ, Chlipala A. Chapar: certified causally consistent distributed key-value stores. Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). 2016, 51(1): 357–370.
    [41] Sergey I, Wilcox JR, Tatlock Z. Programming and proving with distributed protocols. Proceedings of the ACM on Programming Languages (POPL), 2018(POPL): 1–30.
    [42] Deligiannis P, Donaldson AF, Ketema J, Lal A, Thomson P. Asynchronous programming, analysis and testing with state machines. Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 2015, 2015-June: 154–164.
    [43] 纪业, 魏恒峰, 黄宇, 吕建. CRDT协议的TLA+描述与验证.软件学报, 2020, 31(5): 1332–1352.http://www.jos.org.cn/1000-9825/5956.htm
    [44] Anand S, Burke EK, Chen TY, Clark J, Cohen MB, Grieskamp W, Harman M, Harrold MJ, McMinn P. An orchestrated survey of methodologies for automated software test case generation. Journal of Systems and Software (JSS)), 2013, 86(8): 1978–2001.
    [45] Bishop S, Fairbairn M, Mehnert H, Norrish M, Ridge T, Sewell P, Smith M, Wansbrough K. Engineering with logic: rigorous test-oracle specification and validation for tcp/ip and the sockets api. Journal of the ACM (JACM), 2018, 66(1).
    [46] Li Y, Pierce BC, Zdancewic S. Model-based testing of networked applications. Proceedings of ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). 2021: 529–539.
    [47] Kim BH, Kim T, Lie D. Modulo?: finding convergence failure bugs in distributed systems with divergence resync models. Proceedings of USENIX Annual Technical Conference (USENIX ATC). 2022: 383–398.
    [48] Liu H, Li G, Lukman JF, Li J, Lu S, Gunawi HS, Tian C. DCatch?: automatically detecting distributed concurrency bugs in cloud systems cloud systems. Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 2017: 677–691.
    [49] Liu H, Wang X, Li G, Lu S, Ye F, Tian C. FCatch?: automatically detecting time-of-fault bugs in cloud systems. Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 2018: 419–431.
    [50] Lu J, Liu C, Li L, Feng X, Tan F, Yang J, You L. Crashtuner: detecting crash-recovery bugs in cloud systems via meta-info analysis. Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP). 2019: 114–130.
    [51] Chen H, Dou W, Wang D, Qin F. CoFI: consistency-guided fault injection for cloud systems. Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). 2020: 536–547.
    [52] ChaosMonkey.. https://github.com/Netflix/chaosmonkey.
    [53] Jepsen. . https://jepsen.io/.
    [54] Gunawi HS, Do T, Joshi P, Alvaro P, Hellerstein JM. Fate and destini: a framework for cloud recovery testing. Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI). 2011: 249–262.
    [55] Joshi P, Gunawi HS, Sen K. PreFail: a programmable tool for multiple-failure injection. Proceedings of the ACM international conference on Object oriented progamming systems languages and applications (OOPLSA). 2011: 171–188.
    [56] Alquraan A, Takruri H, Alfatafta M, Al-Kiswany S. An analysis of network-partitioning failures in cloud systems. Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 2018: 51–68.
    [57] Alagappan R, Ganesan A, Patel Y, Pillai TS, Arpaci-Dusseau AC, Arpaci-Dusseau RH. Correlated crash vulnerabilities. Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI). 2016: 151–167.
    相似文献
    引证文献
    网友评论
    网友评论
    分享到微博
    发 布
引用本文
分享
文章指标
  • 点击次数:103
  • 下载次数: 0
  • HTML阅读次数: 0
  • 引用次数: 0
历史
  • 收稿日期:2022-12-12
  • 最后修改日期:2023-06-28
  • 录用日期:2023-10-19
文章二维码
您是第20036746位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号