基于TLA+形式化规约的Raft系统测试

微信服务号

微信订阅号

2025年6月12日 21:39 星期四

基于TLA+形式化规约的Raft系统测试
DOI:
                        
                    
CSTR:
                        
                    
作者:
                        王栋,窦文生,高钰,吴陈傲,魏峻,黄涛王栋,窦文生,高钰,吴陈傲,魏峻,黄涛
中国科学院软件研究所
在期刊界中查找
在百度中查找
在本站中查找

                    
作者单位:中国科学院软件研究所
作者简介:
通讯作者:
中图分类号:
基金项目:国家自然科学基金项目（面上项目，重点项目，重大项目）

Testing Raft Systems Based on TLA+ Formal Specification

Author:

WANG DONG,DOU WENSHENG,GAO YU,WU CHENAO,WEI JUN,HUANG TAO
WANG DONG,DOU WENSHENG,GAO YU,WU CHENAO,WEI JUN,HUANG TAO

在期刊界中查找
在百度中查找
在本站中查找

Affiliation:

Fund Project:

The National Natural Science Foundation of China (General Program, Key Program, Major Research Plan)

摘要

图/表

访问统计

参考文献 [57]

相似文献

引证文献

资源附件

文章评论

摘要:

Raft是最为流行的分布式共识协议之一.自2014年被提出以来,Raft协议及其变体在各种分布式系统中被广泛应用.为了证明Raft协议的正确性,开发者使用TLA+形式化规约对协议设计进行了建模和验证.但由于抽象的形式化规约与实际的系统实现源码间存在鸿沟,实现Raft协议的分布式系统中仍然会违背协议设计并引入复杂的缺陷.本文设计了基于TLA+形式化规约的测试方法来检测Raft系统中的缺陷.具体而言,我们将形式化规约匹配到相应的Raft系统实现,并用形式化规约所定义的状态空间来指导Raft系统的测试过程.为了评估该方法的可行性和有效性,我们针对两个不同的Raft系统进行了系统化测试,并发现了3个未知缺陷.

关键词:Raft;分布式系统;软件测试;模型检查;TLA+

Abstract:

Raft is one of the most popular distributed consensus protocols. Since it was proposed in 2014, Raft and its variants have been widely used in different kinds of distributed systems. To prove the correctness of the Raft protocol, developers use the TLA+ formal specification to model and verify its design. However, due to the gap between the abstract formal specification and its practical implementation, distributed systems that implement the Raft protocol (i.e., Raft systems) can still violate the protocol design and introduce intricate bugs. In this paper, we propose a novel testing technique based on TLA+ formal specification to unearth bugs in Raft systems. To be specific, we map Raft’s formal specification to the corresponding Raft system, and then use the specification-defined state space to guide the testing in the Raft system. To evaluate the feasibility and effectiveness of our approach, we apply it on two different Raft systems, and find 3 previously-unknown bugs.

Key words:Raft; distributed system; software testing; model checking; TLA+

参考文献

[1] TiDB. (2017). https://github.om/pingcap/tidb.

[2] CockroachDB. (2020). https://github.com/cockroachdb/cockroach.

[3] Hunt P, Konar M, Junqueira FP, Reed B. ZooKeeper: wait-free coordination for internet-scale systems. Proceedings of USENIX Annual Technical Conference (USENIX ATC). 2010: 145–158.

[4] Ongaro D. Consensus: bridging theory and practice. Stanford University, 2014.

[5] Ethereum. (2022). https://ethereum.org/en/.

[6] The home of the fabric mod development toolchain. (2020). https://fabricmc.net/.

[7] Lamport L. The part-time parliament. ACM Transactions on Computer Systems, 1998, 16(2): 133–169.

[8] Ongaro D, Ousterhout J. In search of an understandable consensus algorithm. Proceedings of the USENIX Annual Technical Conference (USENIX ATC). 2014: 305–319.

[9] Cao W, Liu Z, Wang P, Chen S, Zhu C, Zheng S, Wang Y, Ma G. PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database. Proceedings of the VLDB Endowment, 2018, 11(12): 1849–1862.

[10] Raft-java.(2017). https://github.com/wenweihu86/raft-java.

[11] Xraft.(2018). https://github.com/xnnyygn/xraft.

[12] Lamport L. The tla+ home page. (1999). https://lamport.azurewebsites.net/tla/tla.html.

[13] TLA+ specification for the raft consensus algorithm. (2014). https://github.com/ongardie/raft.tla.

[14] Newcombe C, Rath T, Zhang F, Munteanu B, Brooker M, Deardeuff M. How amazon web services uses formal methods. 2015, 58(4).

[15] Foundations of azure cosmosdb (multi-master) with dr. leslie lamport. (2018). https://www.youtube.com/watch?v=kYX6UrY_ooA.

[16] 谷晓松,魏恒峰,乔磊,黄宇.支持乱序执行的Raft协议.软件学报,2021,32(6):1748-1778.http://www.jos.org.cn/1000-9825/6248.htm

[17] Fonseca P, Zhang K, Wang X, Krishnamurthy A. An empirical study on the correctness of formally verified distributed systems. Proceedings of the 12th European Conference on Computer Systems (EuroSys). 2017: 328–343.

[18] Hawblitzel C, Howell J, Kapritsos M, Lorch JR, Parno B, Roberts ML, Setty S, Zill B. IronFleet: proving practical distributed systems correct. Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP). 2015: 1–17.

[19] Wilcox JR, Woos D, Panchekha P, Tatlock Z, Wang X, Ernst MD, Anderson T. Verdi: a framework for implementing and formally verifying distributed systems. Proceedings of the 36th Annual ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 2015, 50(6): 357–368.

[20] Michael E, Woos D, Anderson T, Ernst MD, Tatlock Z. Teaching rigorous distributed systems with efficient model checking. Proceedings of European Conference on Computer Systems (EuroSys), 2019.

[21] Yang J, Chen T, Wu M, Xu Z, Liu X, Lin H, Yang M, Long F, Zhang L, Zhou L. MODIST: transparent model checking of unmodified distributed systems. Proceedings of the 6th USENIX symposium on Networked systems design and implementation (NSDI). 2009: 213–228.

[22] Simsa J, Bryant R, Gibson G. DBug?: systematic evaluation of distributed systems. Proceedings of the 5th international conference on Systems software verification (SSV). 2010: 3–3.

[23] Yabandeh M, Kne?evi? N, Kosti? D, Kuncak V. Crystalball: predicting and preventing inconsistencies in deployed distributed systems. Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI). 2009: 229–244.

[24] Leesatapornwongsa T, Hao M, Joshi P, Lukman JF, Gunawi HS. SAMC: semantic-aware model checking for fast discovery of deep bugs in cloud systems. Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 2014: 399–414.

[25] Lukman JF, Ke H, Stuardo CA, Suminto RO, Kurniawan DH, Simon D, Priambada S, Tian C, Ye F, Leesatapornwongsa T, Gupta A, Lu S, Gunawi HS. FlyMC: highly scalable testing of complex interleavings in distributed systems. Proceedings of the 14th European Conference on Computer Systems (EuroSys). 2019.

[26] The raft consensus algorithm. (2014). https://raft.github.io.

[27] Hazelcast.(2008). https://github.com/hazelcast/hazelcast.

[28] SOFAJRaft.(2019). https://github.com/sofastack/sofa-jraft.

[29] Zhao C. The practice on developing the distributed consensus algorithm. Peking University Press, 2020.

[30] The tla+ toolbox. (2010). http://lamport.azurewebsites.net/tla/toolbox.html.

[31] ASM. . https://asm.ow2.io.

[32] Python networkx. . https://networkx.org.

[33] Xraft issue: duplicate vote response can make illegal leader without a quorum. (2022). https://github.com/xnnyygn/xraft/issues/27.

[34] Xraft commit: handle with canceled votes. (2022). https://github.com/xnnyygn/xraft/pull/28/commits/a48000080b6590402fbf45dd1a06af001d558830.

[35] Xraft issue: votedfor is not stored when a node is candidate and receives an appendentriesrpc. (2022). https://github.com/xnnyygn/xraft/issues/29.

[36] Raft-java issue#3.(2017). https://github.com/wenweihu86/raft-java/issues/3.

[37] Raft-java issue#19.(2019). https://github.com/wenweihu86/raft-java/issues/19.

[38] The coq proof assistant. (1984). https://coq.inria.fr/.

[39] Gomes VBF, Kleppmann M, Mulligan DP, Beresford AR. Verifying strong eventual consistency in distributed systems. Proceedings of ACM International Conference on Object Oriented Programming Systems Languages Applications (OOPSLA). 2017(OOPSLA): 1–28.

[40] Lesani M, Bell CJ, Chlipala A. Chapar: certified causally consistent distributed key-value stores. Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). 2016, 51(1): 357–370.

[41] Sergey I, Wilcox JR, Tatlock Z. Programming and proving with distributed protocols. Proceedings of the ACM on Programming Languages (POPL), 2018(POPL): 1–30.

[42] Deligiannis P, Donaldson AF, Ketema J, Lal A, Thomson P. Asynchronous programming, analysis and testing with state machines. Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 2015, 2015-June: 154–164.

[43] 纪业, 魏恒峰, 黄宇, 吕建. CRDT协议的TLA+描述与验证.软件学报, 2020, 31(5): 1332–1352.http://www.jos.org.cn/1000-9825/5956.htm

[44] Anand S, Burke EK, Chen TY, Clark J, Cohen MB, Grieskamp W, Harman M, Harrold MJ, McMinn P. An orchestrated survey of methodologies for automated software test case generation. Journal of Systems and Software (JSS)), 2013, 86(8): 1978–2001.

[45] Bishop S, Fairbairn M, Mehnert H, Norrish M, Ridge T, Sewell P, Smith M, Wansbrough K. Engineering with logic: rigorous test-oracle specification and validation for tcp/ip and the sockets api. Journal of the ACM (JACM), 2018, 66(1).

[46] Li Y, Pierce BC, Zdancewic S. Model-based testing of networked applications. Proceedings of ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). 2021: 529–539.

[47] Kim BH, Kim T, Lie D. Modulo?: finding convergence failure bugs in distributed systems with divergence resync models. Proceedings of USENIX Annual Technical Conference (USENIX ATC). 2022: 383–398.

[48] Liu H, Li G, Lukman JF, Li J, Lu S, Gunawi HS, Tian C. DCatch?: automatically detecting distributed concurrency bugs in cloud systems cloud systems. Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 2017: 677–691.

[49] Liu H, Wang X, Li G, Lu S, Ye F, Tian C. FCatch?: automatically detecting time-of-fault bugs in cloud systems. Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 2018: 419–431.

[50] Lu J, Liu C, Li L, Feng X, Tan F, Yang J, You L. Crashtuner: detecting crash-recovery bugs in cloud systems via meta-info analysis. Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP). 2019: 114–130.

[51] Chen H, Dou W, Wang D, Qin F. CoFI: consistency-guided fault injection for cloud systems. Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). 2020: 536–547.

[52] ChaosMonkey.. https://github.com/Netflix/chaosmonkey.

[53] Jepsen. . https://jepsen.io/.

[54] Gunawi HS, Do T, Joshi P, Alvaro P, Hellerstein JM. Fate and destini: a framework for cloud recovery testing. Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI). 2011: 249–262.

[55] Joshi P, Gunawi HS, Sen K. PreFail: a programmable tool for multiple-failure injection. Proceedings of the ACM international conference on Object oriented progamming systems languages and applications (OOPLSA). 2011: 171–188.

[56] Alquraan A, Takruri H, Alfatafta M, Al-Kiswany S. An analysis of network-partitioning failures in cloud systems. Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 2018: 51–68.

[57] Alagappan R, Ganesan A, Patel Y, Pillai TS, Arpaci-Dusseau AC, Arpaci-Dusseau RH. Correlated crash vulnerabilities. Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI). 2016: 151–167.

引用本文

复制

文章指标

点击次数:103
下载次数: 0
HTML阅读次数: 0
引用次数: 0

历史

收稿日期:2022-12-12
最后修改日期:2023-06-28
录用日期:2023-10-19
在线发布日期:
出版日期:

微信服务号

微信订阅号

引用本文

相关视频

分享

文章指标

历史

文章二维码

微信服务号

微信订阅号

引用本文

相关视频

分享

微信扫一扫：分享

文章指标

历史

文章二维码