Virtual Machine Image Deduplication Method Based on Clustering
Author:
Affiliation:

Fund Project:

National Natural Science Foundation of China (61402450); National Key Technology Research and Development Program of China (2013BAH45F01); National High-Tech R&D Program of China (863) (2013AA041301); Beijing Natural Science Foundation of China (4154088)

  • Article
  • | |
  • Metrics
  • |
  • Reference [24]
  • |
  • Related [20]
  • |
  • Cited by [1]
  • | |
  • Comments
    Abstract:

    Virtualization technology is becoming more and more prevalence with the rise of cloud computing. The physical machines for service hosting are gradually being replaced by virtual ones. Driven by reliability and flexibility considerations, virtual machine images increase sharply, and how to manage them efficiently and economically has become a big challenge. Since large amount of duplicated data blocks exist in different virtual machine images, an efficient deduplication method is vital to the virtual machine image management. The existing deduplication works are not very suitable for cloud environments as they employ time-consuming algorithms which can cause serious performance interference to the neighboring virtual machines. This paper proposes a local deduplication method which can greatly optimize the deduplication process of virtual machine. The main idea of the method is to convert the global deduplication to a local one, thus considerably reducing the space and time complexity. In this method, the images are classified into different groups through an improved k-means clustering algorithm according to image similarities. When a new image is entered, a sampling method is used to choose an appropriate group to perform the deduplication operation. Experiments show that this approach is robust and effective. It can significantly reduce (more than 50%) the performance interference to hosting virtual machine with an acceptable increase (about 1%) in disk space usage.

    Reference
    [1] Goldberg RP. Survey of virtual machine research. Computer, 1974,7(6):34-45. [doi: 10.1109/MC.1974.6323581]
    [2] Jin K, Miller EL. The effectiveness of deduplication on virtual machine disk images. In: Proc. of the Israeli Experimental Systems Conf. ACM Press, 2009. 66-72. [doi: 10.1145/1534530.1534540]
    [3] Ao L, Shu JW, Li MQ. Data deduplication techniques. Ruan Jian Xue Bao/Journal of Software, 2010,21(5):916-929 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/3761.htm [doi: 10.3724/SP.J.1001.2010.03761]
    [4] Hartigan JA, Wong MA. Algorithm AS 136: A K-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 1979,28(1):100-108. [doi: 10.2307/2346830]
    [5] Institute of Software Chinese Academic of Sciences. OnceCloud: A cloud service platform. 2014. http://www.once.com.cn/OncePortal/oncecloud
    [6] Bolosky WJ, Corbin S, Goebel D, Douceur JR. Single instance storage in Windows 2000. In: Proc. of the 4th USENIX Windows Systems Symp. Washington: USENIX, 2000. 13-24.
    [7] Hunt JJ, Vo KP, Tichy WF. Delta algorithms: An empirical analysis. ACM Trans. on Software Engineering and Methodology, 1998, 7(2):192-214. [doi: 10.1145/279310.279321]
    [8] Quinlan S, Dorward S. Venti: A new approach to archival storage. In: Proc. of the 2002 Conf. on File and Storage Technologies. Monterey: USENIX Association, 2002. 89-101.
    [9] Kulkarni P, Douglis F, LaVoie JD, Tracey JM. Redundancy elimination within large collections of files. In: Proc. of the USENIX Annual Technical Conf. Boston: USENIX Association, 2004. 59-72.
    [10] Rhea S, Cox R, Pesterev A. Fast, inexpensive content-addressed storage in foundation. In: Proc. of the USENIX 2008 Annual Technical Conf. on Annual Technical Conf. Boston: USENIX Association, USENIX Association, 2008. 143-156.
    [11] Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezis G, Camble P. Sparse indexing: Large scale, inline deduplication using sampling and locality. In: Proc. of the 7th Conf. on File and Storage Technologies. San Francisco: USENIX Association, 2009. 111-123.
    [12] Dubnicki C, Gryz L, Heldt L, Heldt L, Kaczmarczyk M., Kilian W, Strzelczak P, Szczepkowski J, Ungureanu C, Welnicki M. HYDRAstor: A scalable secondary storage. In: Proc. of the 7th Conf. on File and Storage Technologies. San Francisco: USENIX Association, 2009. 197-210.
    [13] Ungureanu C, Atkin B, Aranya A, Gokhale S, Rago S, Calkowski G, Dubnicki C, Bohra A. HydraFS: A high-throughput file system for the HYDRAstor content-addressable storage system. In: Proc. of the 8th Conf. on File and Storage Technologies. San Jose: USENIX Association, 2010. 225-238.
    [14] Zhu B, Li K, Patterson RH. Avoiding the disk bottleneck in the data domain deduplication file system. In: Proc. of the 6th Conf. on File and Storage Technologies. San Jose: USENIX Association, 2009. 1-14.
    [15] Bobbarjung DR, Jagannathan S, Dubnicki C. Improving duplicate elimination in storage systems. ACM Trans. on Storage (TOS), 2006,2(4):424-448. [doi: 10.1145/1210596.1210599]
    [16] Tridgell A. Efficient algorithms for sorting and synchronization [Ph.D. Thesis]. Canberra: Australian National University, 1999.
    [17] Muthitacharoen A, Chen B, Mazieres D. A low-bandwidth network file system. ACM SIGOPS Operating Systems Review, 2001, 35(5):174-187. [doi: 10.1145/502059.502052]
    [18] Rabin MO. Fingerprinting by random polynomials. Technical Report, TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
    [19] Rivest R. The MD5 message-digest algorithm. 1992. http://tools.ietf.org/html/rfc1321
    [20] Eastlake D, Jones P. US secure hash algorithm 1 (SHA1). 2001. http://www.rfc-editor.org/info/rfc3174
    [21] Policroniades C, Pratt I. Alternatives for detecting redundancy in storage systems data. In: Proc. of the USENIX 2004 Annual Technical Conf. Boston: USENIX Association, 2004. 73-86.
    [22] Bloom BH. Space/Time trade-offs in hash coding with allowable errors. Communications of the ACM, 1970,13(7):422-426. [doi: 10.1145/362686.362692]
    [23] Clements AT, Ahmad I, Vilayannur M, Li J. Decentralized deduplication in SAN cluster file systems. In: Proc. of the 2009 Conf. on USENIX Annual Technical. San Diego: USENIX Association, 2009. 101-114.
    [24] Zhang W, Tang H, Jiang H, Yang T, Li X, Zeng Y. Multi-Level selective deduplication for VM snapshots in cloud storage. In: Proc. of the 5th IEEE Int'l Conf. on Cloud Computing (CLOUD). IEEE, 2012. 550-557. [doi: 10.1109/CLOUD.2012.78]
    Comments
    Comments
    分享到微博
    Submit
Get Citation

徐继伟,张文博,魏峻,钟华,黄涛.一种基于聚类分组的虚拟机镜像去冗余方法.软件学报,2016,27(2):466-480

Copy
Share
Article Metrics
  • Abstract:3477
  • PDF: 5224
  • HTML: 1283
  • Cited by: 0
History
  • Received:April 23,2014
  • Revised:December 31,2014
  • Online: November 17,2015
You are the first2033290Visitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063