Abstract:As an effective programming model for large-scale data-intensive applications, MapReduce has been widely and successfully applied in the field of parallel and distributed computing, and has the characteristics of good fault-tolerance and easy to implement and extend. Because MapReduce extends computing to the nodes of large-scale cluster system, reasonable placement of processing data has become one of the key factors affecting the performance of MapReduce cluster system, including energy efficiency, resource utilization, communications and I/O throughput, response time, and reliability. This study first analyzes characteristics of the default data placement strategy of Hadoop, which is a typical implementation of MapReduce programming model. Next, it investigates popular data placement strategies for MapReduce cluster computing environments. Finally, it presents future research directions in the area of data placement strategies for MapReduce-based cluster computing systems.