Scalable Time-Based Multi-Cycle Checkpointing

2010, Vol. 21, No. 2, pp. 218-230

Authors:
Ci Yiwei (慈轶为), School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001
Zhang Zhan (张展)
Zuo Decheng (左德承)
Wu Zhibo (吴智博)
Yang Xiaozong (杨孝宗)
Funding: Supported by the National High-Tech Research and Development Plan of China (863 Program) under Grant Nos. 2008AA01A204 and 2009AA01A404, and by the Open Fund of the State Key Laboratory of High-End Server & Storage Technology of China under Grant No. 2009HSSA07.

Abstract:

This paper proposes a time-based multi-cycle checkpointing approach that allows each process to take checkpoints with its own checkpoint cycle. To ensure that the consistent global checkpoint keeps advancing, checkpoint cycles can be adjusted according to a "P-pattern". Processes are divided into zones, so that the dependency tracking required for checkpoint cycle adjustment is confined to a single zone, which makes time-based multi-cycle checkpointing more scalable.

Key words: fault tolerance; checkpoint; dependency tracking
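
The two ideas in the abstract, per-process checkpoint cycles and zone-scoped dependency tracking, can be illustrated with a small sketch. This is not the authors' algorithm: the `Process` class, its fields, the abstract time units, and the rule "record a dependency only for a sender in the same zone" are all assumptions made for illustration, and the P-pattern-based cycle adjustment is not modeled because the abstract does not define it.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process that checkpoints on its own timer."""
    pid: int
    zone: int    # zone the process belongs to (assumed fixed)
    cycle: int   # checkpoint cycle in abstract time units (assumed constant)
    deps: set = field(default_factory=set)  # pids tracked as dependencies

    def checkpoint_times(self, horizon: int) -> list:
        """Times at which the local timer fires, from `cycle` up to `horizon`."""
        return list(range(self.cycle, horizon + 1, self.cycle))

    def receive(self, sender: "Process") -> None:
        """Assumed rule: track a dependency only for same-zone senders."""
        if sender.zone == self.zone:
            self.deps.add(sender.pid)


# Three processes with different cycles; processes 0 and 1 share zone 0.
procs = [
    Process(0, zone=0, cycle=2),
    Process(1, zone=0, cycle=3),
    Process(2, zone=1, cycle=5),
]

procs[0].receive(procs[1])  # intra-zone message: dependency is recorded
procs[0].receive(procs[2])  # inter-zone message: ignored in this sketch

# Each process follows its own checkpoint schedule over 10 time units.
schedule = {p.pid: p.checkpoint_times(10) for p in procs}
```

In this sketch the per-process dependency set grows with zone size rather than with the total number of processes, which is one plausible reading of why confining tracking to a zone helps scalability.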

Cite this article

Ci YW, Zhang Z, Zuo DC, Wu ZB, Yang XZ. Scalable time-based multi-cycle checkpointing. Journal of Software (软件学报), 2010,21(2):218-230.

History

Received: 2009-04-16
Last revised: 2009-12-07
