Abstract: As one of the core components of Apache Hadoop, the Hadoop Distributed File System (HDFS) is widely used in industry. HDFS adopts a multi-replica mechanism to ensure data reliability, which may introduce inconsistency in the presence of node failures, network partitions, and write failures. HDFS is considered to provide weaker data consistency than traditional file systems, and it is difficult for users to understand when data will become inconsistent. To date, no work has formally verified its consistency mechanism, and inconsistent data increases the uncertainty faced by upper-layer applications. Thus, research on the data consistency model is required. The large scale of HDFS makes such analysis difficult, so code reading, abstraction, colored Petri net (CPN) modeling, and state-space analysis are conducted to comprehend the system. The contributions are as follows. (1) Colored Petri nets are used to model HDFS's file reading and writing processes; the model describes the functions of the inner components and their cooperation mechanisms in detail. (2) The data-layer consistency and operation-layer consistency of HDFS are analyzed with state-space tools based on the CPN model, identifying the data consistency guarantees provided by the system. (3) A time-point repeatable-read method is proposed to verify operation-layer consistency, and a serial repeatable strategy is used to reduce state-space complexity. Based on these contributions, guidelines for HDFS application development are proposed to help improve data consistency. The CPN modeling method and techniques can also be applied to the analysis of other distributed information systems.