Fault Tolerance Scheme Using Parallel Recomputing for OpenMP Programs
Author: Fu Hongyi, Ding Yan, Song Wei, Yang Xuejun
Affiliation:

    Abstract:

    This paper proposes PR-OMP, a fault tolerance approach for OpenMP programs built on a novel fault recovery scheme called parallel recomputing. By redistributing the workload of the failed thread across all surviving threads, PR-OMP substantially reduces fault recovery overhead. The paper discusses the key issues in PR-OMP, including program division, computational state saving, workload redistribution, and fault detection, along with implementation details. It also presents an extended data flow analysis for OpenMP that reduces the amount of data saved as computational state. Experimental evaluation shows that the approach incurs only a minor overhead for fault recovery.
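
    The workload redistribution idea can be illustrated with a minimal sketch. It is not the PR-OMP implementation: the function names (work, chunk_bounds, recompute_failed_chunk, run_with_failure), the block partitioning, and the loop body are all assumptions made for illustration. The failure is only simulated by skipping one thread's chunk, and state saving and fault detection are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-iteration work; stands in for a real loop body. */
static double work(size_t i) { return (double)i * (double)i; }

/* Assumed block partition: thread t of nthreads owns iterations [lo, hi). */
static void chunk_bounds(size_t n, int nthreads, int t, size_t *lo, size_t *hi) {
    *lo = n * (size_t)t / (size_t)nthreads;
    *hi = n * (size_t)(t + 1) / (size_t)nthreads;
}

/* Recovery: the failed thread's chunk is split statically among the
   nthreads - 1 survivors instead of being redone by a single thread. */
void recompute_failed_chunk(double *out, size_t n, int nthreads, int failed) {
    assert(nthreads > 1 && failed >= 0 && failed < nthreads);
    size_t lo, hi;
    chunk_bounds(n, nthreads, failed, &lo, &hi);
    #pragma omp parallel for num_threads(nthreads - 1) schedule(static)
    for (long i = (long)lo; i < (long)hi; i++) {
        out[i] = work((size_t)i);
    }
}

/* Normal phase with a simulated failure: every thread fills its own chunk
   except `failed`, whose results are treated as lost and then recomputed. */
void run_with_failure(double *out, size_t n, int nthreads, int failed) {
    #pragma omp parallel for num_threads(nthreads) schedule(static)
    for (int t = 0; t < nthreads; t++) {
        if (t == failed) continue;          /* simulated loss of this chunk */
        size_t lo, hi;
        chunk_bounds(n, nthreads, t, &lo, &hi);
        for (size_t i = lo; i < hi; i++) out[i] = work(i);
    }
    recompute_failed_chunk(out, n, nthreads, failed);
}
```

    Calling run_with_failure(out, 100, 4, 2) fills a 100-element array as if thread 2 had failed: its chunk, iterations 50 to 74 under the assumed partition, is recomputed by three threads when the code is built with OpenMP enabled (for example, -fopenmp with GCC). Without OpenMP the pragmas are ignored and the code runs serially with the same result.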

Citation

Fu HY, Ding Y, Song W, Yang XJ. Fault tolerance scheme using parallel recomputing for OpenMP programs. Journal of Software, 2012,23(2):411-427
History
  • Received: January 5, 2010
  • Revised: March 30, 2010
  • Online: February 7, 2012