Fault-Tolerance Method for CPU-GPU Heterogeneous Systems
Author: Xu Xinhai, Yang Xuejun, Lin Yufei, Lin Yisong, Tang Tao
Affiliation:

    Abstract:

    In recent years, heterogeneous parallel architectures have become an important trend in supercomputer design because they mitigate the problem of ever-increasing power consumption. As a high-performance, power-efficient accelerator, the GPU (graphics processing unit) is now widely used in HPC (high performance computing). However, the inherent unreliability of GPU hardware degrades the reliability of the supercomputer as a whole. At present, most research on FT (fault-tolerance) techniques for CPU-GPU heterogeneous systems treats the GPU in isolation from the rest of the system and applies FT at the granularity of a single GPU invocation. This paper proposes a new method, Lazy FT, for CPU-GPU heterogeneous systems, introduces a directive-based FT framework together with its constraints, and demonstrates the validity of the Lazy FT method. The experimental results show that Lazy FT incurs very low overhead compared with existing FT methods.
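To make "FT at the granularity of a single GPU invocation" concrete, the following is a minimal Python sketch, not the paper's implementation. It uses duplicate-and-compare as the detection technique, which is an assumption (the abstract does not name one). Plain Python functions stand in for GPU kernels. The names `check_per_invocation`, `check_per_region`, and `FlakyOnce` are hypothetical. The second function checks only once per sequence of invocations; it is one possible coarser-grained alternative shown for contrast and should not be read as the Lazy FT method itself, whose mechanism the abstract does not describe.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Kernel = Callable[[Vector], Vector]  # stand-in for a GPU kernel launch


def check_per_invocation(kernels: Sequence[Kernel], data: Vector,
                         max_retries: int = 3) -> Tuple[Vector, int]:
    """Run every kernel twice and compare the results after each invocation."""
    comparisons = 0
    for kernel in kernels:
        for _ in range(max_retries):
            first, second = kernel(list(data)), kernel(list(data))
            comparisons += 1
            if first == second:  # results agree: accept and move on
                data = first
                break
        else:
            raise RuntimeError("results never agreed for one invocation")
    return data, comparisons


def check_per_region(kernels: Sequence[Kernel], data: Vector,
                     max_retries: int = 3) -> Tuple[Vector, int]:
    """Hypothetical coarser check: run the whole sequence twice, compare once."""
    def run_all(xs: Vector) -> Vector:
        for kernel in kernels:
            xs = kernel(xs)
        return xs

    comparisons = 0
    for _ in range(max_retries):
        first, second = run_all(list(data)), run_all(list(data))
        comparisons += 1
        if first == second:
            return first, comparisons
    raise RuntimeError("results never agreed for the region")


class FlakyOnce:
    """Wrap a kernel and corrupt the output of its first call only
    (a simulated transient fault; assumes a non-empty output)."""

    def __init__(self, kernel: Kernel) -> None:
        self.kernel = kernel
        self.faulted = False

    def __call__(self, xs: Vector) -> Vector:
        out = self.kernel(xs)
        if not self.faulted:
            self.faulted = True
            out[0] += 1.0  # inject a single wrong value
        return out


def scale(xs: Vector) -> Vector:
    return [2.0 * x for x in xs]


def shift(xs: Vector) -> Vector:
    return [x + 1.0 for x in xs]


if __name__ == "__main__":
    fine, n_fine = check_per_invocation([scale, FlakyOnce(shift), scale],
                                        [1.0, 2.0, 3.0])
    coarse, n_coarse = check_per_region([scale, FlakyOnce(shift), scale],
                                        [1.0, 2.0, 3.0])
    print(fine, n_fine)
    print(coarse, n_coarse)
```

Both variants recover from the injected fault and return the same result. They differ in how often results are compared, which on a real system would also determine how often data must be brought back to the host for checking, and in how much work is repeated when a mismatch is found.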

Get Citation

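Xu Xinhai, Yang Xuejun, Lin Yufei, Lin Yisong, Tang Tao. Fault-tolerance method for CPU-GPU heterogeneous systems. Journal of Software, 2011,22(10):2538-2552 (in Chinese).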

History
  • Received: April 28, 2010
  • Revised: May 18, 2011