Common Data Guided Crash Recovery Bug Detection for Distributed Systems

doi:10.13328/j.cnki.jos.006755


Volume 34, Issue 12, 2023, pp. 5578-5596


Author:
GAO Yu
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
WANG Dong
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
DAI Qian-Wang
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
DOU Wen-Sheng
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Nanjing Institute of Software Technology, Nanjing 210000, China; University of Chinese Academy of Sciences, Nanjing, Nanjing 211135, China
WEI Jun
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Nanjing Institute of Software Technology, Nanjing 210000, China; University of Chinese Academy of Sciences, Nanjing, Nanjing 211135, China

                    
CLC Number: TP311


Abstract:

Crash recovery bugs, which are caused by incorrect crash recovery mechanisms and their implementations, threaten the reliability and availability of distributed systems. Detecting these bugs is extremely challenging because they manifest only when a node crashes under specific timing conditions. This study presents Deminer, a novel approach that automatically detects crash recovery bugs in distributed systems. Observations of large-scale distributed systems show that node crashes are more likely to trigger crash recovery bugs when they interrupt the execution of related I/O write operations, i.e., operations that store the same piece of data (common data) in different places, such as different storage paths or nodes. Deminer therefore detects crash recovery bugs by automatically identifying and injecting such error-prone node crashes, guided by the usage of common data. Deminer first tracks the usage of critical data in a correct run. It then identifies pairs of I/O write operations that use common data and predicts error-prone injection points for a node crash on the basis of the execution trace. Finally, Deminer tests the predicted injection points and checks failure symptoms to expose and confirm crash recovery bugs. A prototype of Deminer is implemented and evaluated on the latest versions of four widely used distributed systems: ZooKeeper, HBase, YARN, and HDFS. The experimental results show that Deminer is effective in finding crash recovery bugs: it detected six of them.
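
The pairing and prediction steps described in the abstract can be illustrated with a minimal sketch. This is not Deminer's implementation. The trace record `WriteEvent`, the functions `find_common_data_pairs` and `predict_injection_points`, and the policy of crashing the node that issued the first write of a pair are all assumptions made for illustration. The sketch only shows the general idea: group the writes recorded in a correct run by the data they store, pair writes that put the same data in different places, and propose a crash between the two writes of each pair.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class WriteEvent:
    """One I/O write observed in a correct run (hypothetical trace record)."""
    seq: int      # position of the write in the execution trace
    node: str     # node that performed the write
    target: str   # storage path or destination of the write
    data_id: str  # identifier of the piece of data being written


def find_common_data_pairs(trace):
    """Return ordered pairs of writes that store the same data in different places."""
    by_data = defaultdict(list)
    for event in trace:
        by_data[event.data_id].append(event)

    pairs = []
    for events in by_data.values():
        ordered = sorted(events, key=lambda e: e.seq)
        for first, second in combinations(ordered, 2):
            # "Different places" here means a different node or a different path.
            if (first.node, first.target) != (second.node, second.target):
                pairs.append((first, second))
    return pairs


def predict_injection_points(pairs):
    """Propose (node_to_crash, after_seq, before_seq) for each write pair.

    Assumed policy: crash the node that issued the first write, after that
    write completes and before the second write of the pair happens.
    """
    return [(first.node, first.seq, second.seq) for first, second in pairs]


if __name__ == "__main__":
    example_trace = [
        WriteEvent(1, "n1", "/log/tx", "tx42"),
        WriteEvent(2, "n1", "/tmp/x", "other"),
        WriteEvent(3, "n1", "/snapshot/tx", "tx42"),
        WriteEvent(4, "n2", "/log/tx", "tx42"),
    ]
    for point in predict_injection_points(find_common_data_pairs(example_trace)):
        print(point)
```

In a real tool each predicted point would then be exercised by re-running the workload, crashing the chosen node at that point, and checking for failure symptoms, which this sketch leaves out.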

Key words: crash recovery bug; bug detection; fault injection; crash recovery; distributed system

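Citation:

GAO Yu, WANG Dong, DAI Qian-Wang, DOU Wen-Sheng, WEI Jun. Common Data Guided Crash Recovery Bug Detection for Distributed Systems. Journal of Software, 2023, 34(12): 5578-5596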


History

Received: January 07, 2022
Revised: April 21, 2022
Online: October 26, 2022
Published: December 06, 2023
