Abstract: Virtualization technology is becoming increasingly prevalent with the rise of cloud computing, and the physical machines used for service hosting are gradually being replaced by virtual ones. Driven by reliability and flexibility considerations, the number of virtual machine images has increased sharply, and managing them efficiently and economically has become a major challenge. Because a large number of duplicated data blocks exist across different virtual machine images, an efficient deduplication method is vital to virtual machine image management. Existing deduplication approaches are poorly suited to cloud environments because they employ time-consuming algorithms that can cause serious performance interference to neighboring virtual machines. This paper proposes a local deduplication method that greatly optimizes the deduplication process for virtual machine images. The main idea is to convert global deduplication into local deduplication, thereby considerably reducing space and time complexity. In this method, images are classified into groups according to their similarity by an improved k-means clustering algorithm. When a new image arrives, a sampling method is used to choose an appropriate group, and deduplication is performed only within that group. Experiments show that this approach is robust and effective: it reduces the performance interference to hosting virtual machines by more than 50%, with an acceptable increase of about 1% in disk space usage.
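The group-selection and local deduplication steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the fixed 4 KiB block size, the SHA-1 fingerprints, the every-n-th-block sampling rule, and all function names are assumptions introduced here for illustration, and the groups are taken as given rather than produced by the improved k-means clustering.

```python
from __future__ import annotations

import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; the abstract does not specify one


def fingerprints(image: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Split an image into fixed-size blocks and hash each block."""
    return [
        hashlib.sha1(image[i:i + block_size]).hexdigest()
        for i in range(0, len(image), block_size)
    ]


def choose_group(fps: list[str], groups: dict[str, set[str]],
                 sample_every: int = 4) -> str:
    """Pick the group whose fingerprint index matches the most sampled blocks."""
    sample = fps[::sample_every]  # hypothetical sampling rule: every n-th block
    return max(groups, key=lambda name: sum(fp in groups[name] for fp in sample))


def deduplicate_locally(image: bytes,
                        groups: dict[str, set[str]]) -> tuple[str, int]:
    """Deduplicate against one group only; return (group name, new blocks stored)."""
    fps = fingerprints(image)
    name = choose_group(fps, groups)
    index = groups[name]
    new_blocks = 0
    for fp in fps:
        if fp not in index:  # block not yet stored in the chosen group
            index.add(fp)
            new_blocks += 1
    return name, new_blocks
```

Only the chosen group's fingerprint index is consulted during deduplication, which is where the reduction in time and memory would come from. The cost is that a block already stored in another group is stored again, which is consistent with the small increase in disk usage reported above.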