Scalable Triple Modular Redundancy Fault Tolerance Mechanism for MPI-Oriented Large Scale Parallel Computing

doi:10.3724/SP.J.1001.2012.04011

Journal of Software, 2012, Vol. 23, No. 4, pp. 1022-1035

Authors:
WANG Zhi-Yuan, YANG Xue-Jun, ZHOU Yun

Affiliation:
College of Computer, National University of Defense Technology, Changsha 410073, China

Fund Project:
National Natural Science Foundation of China (61003081, 61003087, 60921062)
Abstract:

As systems scale up, the performance of parallel computing keeps improving, but its reliability keeps declining, so a fault tolerance mechanism is needed to tolerate or recover from hardware failures and data errors. The fault tolerance mechanisms in common use today, Checkpoint/Restart and N-modular redundancy, both introduce additional overhead, which to some extent limits the scalability of parallel computing. With the demand for high-performance computing growing, the design of scalable fault tolerance mechanisms is therefore urgent and important. Taking triple modular redundancy (TMR) as a typical case, this paper describes how traditional TMR is implemented for large-scale MPI parallel computing, analyzes the practical problems the mechanism faces, and shows that traditional TMR limits the scalability of parallel computing. To address these problems, the paper designs scalable triple modular redundancy (STMR) and verifies its validity and scalability. STMR not only handles the fail-stop failures targeted by Checkpoint/Restart, but also resolves most data errors that the hardware cannot perceive directly. Finally, a simulation using the system parameters of BlueGene/L predicts how the scalability of parallel computing changes as the system size grows under TMR and under STMR; the results further confirm that STMR is a scalable fault tolerance mechanism.

Key words: fault tolerance mechanism; scalability; triple modular redundancy; large scale parallel computing; MPI
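
The majority-voting idea that underlies TMR can be illustrated with a minimal sketch. This is generic textbook TMR voting, not the paper's STMR protocol or its MPI implementation, neither of which is detailed in the abstract. The names `tmr_vote` and `NoMajorityError` are hypothetical, and the sketch assumes that a replica lost to a fail-stop failure is represented by `None`, while a replica hit by a silent data error simply reports a wrong value.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


class NoMajorityError(Exception):
    """Raised when fewer than two replicas agree on a result (hypothetical name)."""


def tmr_vote(results: Sequence[Optional[Hashable]]) -> Hashable:
    """Return the value reported by at least two of the three replicas.

    Assumption for this sketch: None marks a replica that crashed
    (fail-stop), so None can never be a legitimate result value.
    """
    if len(results) != 3:
        raise ValueError("TMR expects exactly three replica results")
    # Ignore crashed replicas, then count how often each value was reported.
    counts = Counter(r for r in results if r is not None)
    if counts:
        value, votes = counts.most_common(1)[0]
        if votes >= 2:
            return value
    raise NoMajorityError(f"no two replicas agree: {list(results)}")


if __name__ == "__main__":
    print(tmr_vote([42, 42, 42]))    # all three replicas healthy
    print(tmr_vote([42, 7, 42]))     # one silent data error is outvoted
    print(tmr_vote([42, None, 42]))  # one fail-stop replica is tolerated
```

Each of the three calls prints 42. A single faulty replica of either kind is masked, whereas two simultaneous faults leave no majority and raise `NoMajorityError`, at which point some other recovery mechanism would have to take over.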

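Cite this article

WANG Zhi-Yuan, YANG Xue-Jun, ZHOU Yun. Scalable triple modular redundancy fault tolerance mechanism for MPI-oriented large scale parallel computing. Journal of Software, 2012, 23(4): 1022-1035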

History

Received: 2010-10-08
Last revised: 2011-01-20
Published online: 2012-03-28