Scalable Triple Modular Redundancy Fault Tolerance Mechanism for MPI-Oriented Large Scale Parallel Computing

doi:10.3724/SP.J.1001.2012.04011


Volume 23, Issue 4, 2012, pp. 1022-1035


Abstract:

Scaling up a system improves performance but degrades reliability, so a fault tolerance mechanism is needed to tolerate hardware failures or recover data. The fault tolerance mechanisms in common use today, such as Checkpoint/Restart and N-modular redundancy, all incur additional overhead, which limits the scalability of parallel computing to some extent. Developing scalable fault tolerance mechanisms is therefore very important for increasingly high-performance supercomputing. Taking triple modular redundancy (TMR) as an example, this paper describes the implementation of TMR for large-scale MPI parallel computing and argues that the traditional TMR fault tolerance mechanism limits the scalability of parallel computing. To solve this problem, the paper proposes scalable triple modular redundancy (STMR) and verifies its validity and scalability. STMR can handle not only the fail-stop failures traditionally handled by Checkpoint/Restart, but also most data errors that the hardware does not detect directly. Finally, a simulation using the system parameters of BlueGene/L shows how the scalability of parallel computing changes under TMR and under STMR as the system size increases. The results further validate STMR's position as a scalable fault tolerance mechanism.
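The core idea of triple modular redundancy, running a computation three times and accepting the value that at least two replicas agree on, can be illustrated with a short sketch. This is generic majority voting and not the paper's STMR mechanism; the function name `tmr_vote` and the convention that `None` stands for a fail-stop replica are assumptions made for illustration only.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def tmr_vote(results: Sequence[Optional[Hashable]]) -> Hashable:
    """Return the value reported by at least two of three replicas.

    A result of None models a replica that stopped (fail-stop);
    this convention is hypothetical and chosen for the sketch.
    """
    if len(results) != 3:
        raise ValueError("TMR expects exactly three replica results")

    # Replicas that stopped do not get a vote.
    live = [r for r in results if r is not None]
    if not live:
        raise RuntimeError("all replicas failed")

    # Accept the most common value only if two or more replicas agree.
    value, count = Counter(live).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among replica results")
    return value
```

With this sketch, `tmr_vote([42, 7, 42])` masks a single silent data error and `tmr_vote([None, 42, 42])` masks a single fail-stop failure, while three mutually different results raise an error because no majority exists.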


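Citation:

Wang Zhiyuan, Yang Xuejun, Zhou Yun. Scalable Triple Modular Redundancy Fault Tolerance Mechanism for MPI-Oriented Large Scale Parallel Computing. Journal of Software, 2012, 23(4): 1022-1035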


History

Received: October 8, 2010
Revised: January 20, 2011
Online: March 28, 2012

Copyright: Institute of Software, Chinese Academy of Sciences