Scalable Multi-Cycle Checkpointing
Funding:

Supported by the National High-Tech Research and Development Plan of China (863 Program) under Grant Nos. 2008AA01A204 and 2009AA01A404, and by the Open Fund of the State Key Laboratory of High-End Server & Storage Technology of China under Grant No. 2009HSSA07.

Abstract:

This paper proposes a time-based multi-cycle checkpointing approach that allows each process to take checkpoints with its own checkpoint cycle. To ensure that the consistent global checkpoint keeps advancing, checkpoint cycles can be adjusted according to a “P-pattern”. Processes can also be divided into zones, so that the dependency tracking required for checkpoint cycle adjustment is restricted to the zone scope, which makes time-based multi-cycle checkpointing more scalable.
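The two structural ideas in the abstract, per-process checkpoint cycles and zone-scoped dependency tracking, can be illustrated with a minimal sketch. Everything named here (`Process`, `zone`, `cycle`, `deps`, `checkpoint_times`, `on_receive`) is hypothetical and chosen for illustration only. The abstract does not define the P-pattern or the cycle adjustment rule, so neither is modeled; the sketch only shows that each process runs its own timer and records dependencies solely on senders in its own zone.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    zone: int      # hypothetical zone identifier (assumption, not from the paper)
    cycle: float   # this process's own checkpoint cycle, in seconds
    deps: set = field(default_factory=set)  # pids of in-zone senders seen so far

    def checkpoint_times(self, horizon: float) -> list:
        """Return the timer-driven checkpoint times in (0, horizon]."""
        count = int(horizon // self.cycle)
        return [self.cycle * k for k in range(1, count + 1)]

    def on_receive(self, sender: "Process") -> None:
        """Record a dependency only when the sender is in the same zone."""
        if sender.zone == self.zone:
            self.deps.add(sender.pid)


# Three processes with different cycles; p0 and p1 share a zone, p2 does not.
p0 = Process(pid=0, zone=0, cycle=2.0)
p1 = Process(pid=1, zone=0, cycle=3.0)
p2 = Process(pid=2, zone=1, cycle=5.0)

p0.on_receive(p1)  # same zone: tracked
p0.on_receive(p2)  # different zone: not tracked

print(p0.checkpoint_times(6.0))  # [2.0, 4.0, 6.0]
print(p1.checkpoint_times(6.0))  # [3.0, 6.0]
print(p0.deps)                   # {1}
```

Because `deps` never grows beyond the members of one zone, the bookkeeping per process is bounded by the zone size rather than by the total number of processes, which is the scalability argument the abstract makes.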

Cite this article

Ci YW, Zhang Z, Zuo DC, Wu ZB, Yang XZ. Scalable multi-cycle checkpointing. Journal of Software (软件学报), 2010,21(2):218-230.

History
  • Received: 2009-04-16
  • Last revised: 2009-12-07