Static Analysis for the Placement of Application-Level Checkpoints on Heterogeneous System

    Abstract:

    Application-level checkpointing is a widely studied fault-tolerance technique in large-scale scientific computing. It lets programmers choose appropriate places in a program to save crucial data, which reduces fault-tolerance overhead. Adopting the technique raises two key issues: finding proper places for the checkpoints and reducing the amount of data saved in global checkpoints. The same issues arise on emerging heterogeneous systems that use general-purpose computation on GPUs. Based on the architecture of heterogeneous systems and the characteristics of applications, this paper performs static analysis of checkpoint configuration and placement, and proposes two novel approaches: ‘synchronous checkpoint placement’ and ‘asynchronous checkpoint placement’. The checkpoint placement problem is modeled mathematically and solved. Finally, the performance of both approaches is evaluated experimentally.

Citation:

Jia J, Yang XJ, Ma YQ. Static analysis for the placement of application-level checkpoints on heterogeneous system. Journal of Software, 2013,24(6):1361-1375

History
  • Received: August 19, 2011
  • Revised: January 15, 2012
  • Online: June 7, 2013