Abstract:The three layers supercomputing environment of Chinese Academy of Sciences is built to integrate the computing resources of the head center in Beijing, eight regional centers and several campus-level centers. To enhance the reliability of the supercomputing environment and provide stable and reliable computing services, the fault-tolerant mechanism research has become a research priority of the supercomputing environment. In this paper, the fault-tolerant basic concepts and computer fault-tolerant technologies are introduced at first. Next, a fault-tolerant framework of the supercomputing environment is proposed. Then the fault-tolerant solutions of different levels based on the framework and the performance test results in Deepcomp 7000 are presented. Finally, the fault-tolerant overheads of different levels are compared and analyzed to verify the impact on the application.