Static Analysis for the Placement of Application-Level Checkpoints on Heterogeneous System

doi:10.3724/SP.J.1001.2013.04325


Journal of Software, 2013, 24(6): 1361-1375


Authors:

JIA Jia (贾佳)
National Laboratory for Parallel and Distributed Processing, College of Computer, National University of Defense Technology, Changsha 410073, China; Beijing Institute of System Engineering, Beijing 100101, China

YANG Xue-Jun (杨学军)
National Laboratory for Parallel and Distributed Processing, College of Computer, National University of Defense Technology, Changsha 410073, China

MA Ya-Qing (马亚青)
China North Vehicle Research Institution, Beijing 100072, China

Fund Project: National Natural Science Foundation of China (60921062, 61003087)


Abstract:

Application-level checkpointing is a fault-tolerance technique that has attracted considerable attention in large-scale scientific computing. It lets the application programmer choose suitable points in the program at which to save critical data, which reduces the fault-tolerance overhead. Choosing appropriate checkpoint locations and reducing the amount of data saved in each global checkpoint are the key problems in optimizing application-level checkpointing. The same problems arise for application-level checkpointing on the heterogeneous systems with general-purpose GPUs that have appeared in recent years. Based on the architecture of heterogeneous systems and the characteristics of their programs, this paper performs a static analysis of checkpoint placement for application-level checkpointing on heterogeneous systems. It proposes two placement methods built on different mechanisms, synchronous checkpoint placement and asynchronous checkpoint placement, and for each of them formulates the optimal checkpoint placement problem as a mathematical model and solves it. Finally, experiments validate the two proposed methods and evaluate their performance.

Key words: application-level checkpointing; heterogeneous system; general purpose computation on GPU; synchronous checkpoint placement; asynchronous checkpoint placement
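As a rough illustration of the general idea of application-level checkpointing described in the abstract (the programmer picks the points at which to save, and saves only the critical data), the following minimal Python sketch may help. It is not the paper's method: it is single-process and CPU-only, it does not model GPU kernels or the synchronous and asynchronous placement schemes, and every name in it (`save_checkpoint`, `load_checkpoint`, `run`, `checkpoint_every`, `fail_at`) is hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the critical state to a temp file, then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    # The rename keeps a half-written file from replacing a good checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the squares of 0..total_steps-1, resuming from a checkpoint if any.

    Only two values (the loop index and the accumulator) are saved, which is
    the sense in which the checkpoint is "application-level": the programmer
    decides what is critical instead of dumping the whole process image.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    step, acc = state["step"], state["acc"]
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated failure")  # hypothetical fault injection
        acc += step * step
        step += 1
        # Programmer-chosen checkpoint location: the end of every
        # `checkpoint_every`-th iteration, where little state is live.
        if step % checkpoint_every == 0:
            save_checkpoint(path, {"step": step, "acc": acc})
    return acc


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "ckpt.pkl")
        try:
            run(ckpt, total_steps=10, checkpoint_every=3, fail_at=7)
        except RuntimeError:
            pass  # the work done after the last checkpoint is lost
        print(run(ckpt, total_steps=10, checkpoint_every=3))
```

In this toy setting a smaller `checkpoint_every` means less recomputation after a failure but more time spent saving; that trade-off is what a checkpoint placement analysis has to balance.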


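Cite this article

JIA Jia, YANG Xue-Jun, MA Ya-Qing. Static Analysis for the Placement of Application-Level Checkpoints on Heterogeneous System. Journal of Software, 2013, 24(6): 1361-1375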


History

Received: 2011-08-19
Last revised: 2012-01-15
Published online: 2013-06-07
