重复数据删除技术
作者:
基金项目:

Supported by the National Natural Science Foundation of China under Grant No.60873066 (国家自然科学基金); the NationalHigh-Tech Research and Development Plan of China under Grant No.2009AA01A403 (国家高技术研究发展计划(863)); the SpecializedResearch Fund for the Doctoral Program of Higher Education of China under Grant No.200800030027 (高等学校博士学科点专项科研基金)

  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [42]
  • | |
  • 引证文献
  • | |
  • 文章评论
    摘要:

    重复数据删除技术主要分为两类:相同数据的检测技术和相似数据的检测与编码技术,系统地总结了 这两类技术,并分析了其优缺点.此外,由于重复数据删除技术会影响存储系统的可靠性和性能,又总结了针对这 两方面的问题提出的各种技术.通过对重复数据删除技术当前研究现状的分析,得出如下结论:a) 重复数据删除 中的数据特性挖掘问题还未得到完全解决,如何利用数据特征信息有效地消除重复数据还需要更深入的研 究;b) 从存储系统设计的角度,如何引入恰当的机制打破重复数据删除技术的可靠性局限并减少重复数据删除技术带来的额外系统开销也是一个需要深入研究的方面.

    Abstract:

    Data deduplication technologies can be divided into two categories: a) identical data detection techniques, and b) similar data detection and encoding techniques. This paper presents a systematic survey on these two categories of data deduplication technologies and analyzes their advantages and disadvantages. Besides, since data deduplication technologies can affect the reliability and performance of storage systems, this paper also surveys various kinds of technologies proposed to cope with these two aspects of problems. Based on the analysis of the current state of research on data deduplication technologies, this paper makes several conclusions as follows: a) How to mine data characteristic information in data deduplication has not been completely solved, and how to use data characteristic information to effectively eliminate duplicate data also needs further study; b) From the perspective of storage system design, it still needs further study how to introduce proper mechanisms to overcome the reliability limitations of data deduplication techniques and reduce the additional system overheads caused by data deduplication techniques.

    参考文献
    [1] McKnight J, Asaro T, Babineau B. Digital archiving: End-User survey and market forecast 2006-2010. 2006. http://www. enterprisestrategygroup.com/ESGPublications/ReportDetail.asp?ReportID=591
    [2] Muthitacharoen A, Chen B, Maziéres D. A low-bandwidth network file system. In: Proc. of the 18th ACM Symp. on Operating System Principles (SOSP 2001). New York: ACM Press, 2001. 174?187.
    [3] Bolosky WJ, Corbin S, Goebel D, Douceur JR. Single instance storage in Windows 2000. In: Proc. of the 4th Usenix Windows System Symp. Berkeley: USENIX Association, 2000. 13?24.
    [4] Bobbarjung DR, Jagannathan S, Dubnicki C. Improving duplicate elimination in storage systems. ACM Trans. on Storage, 2006, 2(4):424?448.
    [5] Policroniades C, Pratt I. Alternatives for detecting redundancy in storage systems data. In: Proc. of the 2004 USENIX Annual Technical Conf. (USENIX 2004). Berkeley: USENIX Association, 2004. 73?86.
    [6] Jain N, Dahlin M, Tewari R. Taper: Tiered approach for eliminating redundancy in replica synchronization. In: Proc. of the 4th Usenix Conf. on File and Storage Technologies (FAST 2005). Berkeley: USENIX Association, 2005. 281?294.
    [7] Denehy TE, Hsu WW. Duplicate management for reference data. IBM Research Report, RJ 10305 (A0310-017), IBM Research Division, 2003.
    [8] Broder AZ. Identifying and filtering near-duplicate documents. In: Giancarlo R, Sankoff D, eds. Proc. of the 11th Annual Symp. on Combinatorial Pattern Matching. London: Springer-Verlag, 2000. 1?10.
    [9] Broder AZ. On the resemblance and containment of documents. In: Carpentieri B, De Santis A, Vaccaro U, Storer JA, eds. Proc. of the Compression and Complexity of SEQUENCES 1997. Washington: IEEE Computer Society Press, 1997. 21?29.
    [10] Douglis F, Iyengar A. Application-Specific delta encoding via resemblance detection. In: Proc. of the 2003 USENIX Annual Technical Conf. (USENIX 2003). Berkeley: USENIX Association, 2003. 113?126.
    [11] Kulkarni P, Douglis F, Lavoie JD, Tracey JM. Redundancy elimination within large collections of files. In: Proc. of the 2004 Usenix Annual Technical Conf. (USENIX 2004). Berkeley: USENIX Association, 2004. 59?72.
    [12] Han B, Keleher P. Implementation and performance evaluation of fuzzy file block matching. In: Proc. of the 2007 USENIX Annual Technical Conf. (USENIX 2007). Berkeley: USENIX Association, 2007. 199?204.
    [13] Bloom BH. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 1970,13(7):422?426. [doi: 10.1145/362686.362692]
    [14] Broder AZ, Mitzenmacher M. Network applications of bloom filters: A survey. Internet Mathematics, 2003,1(4):485?509.
    [15] Mitzenmacher M. Compressed bloom filters. IEEE/ACM Trans. on Networking (TON), 2002,10(5):604–612. [doi: 10.1109/ TNET.2002.803864]
    [16] Manber U. Finding similar files in a large file system. In: Proc. of the USENIX Winter 1994 Technical Conf. Berkeley: USENIX Association, 1994. 1–10.
    [17] Wu S, Manber U. Agrep?A fast approximate pattern-matching tool. In: Proc. of the USENIX Winter 1992 Technical Conf. Berkeley: USENIX Association, 1992. 153–162.
    [18] Hunt JJ, Vo KP, Tichy WF. Delta algorithms an empiriecal analysis. ACM Trans. on Software Engineering and Methodology, 1998, 7(4):192–214. [doi: 10.1145/279310.279321]
    [19] Ouyang Z, Memon N, Suel T, Trendafilov D. Cluster-Based delta compression of a collection of files. In: Proc. of the 3rd Int’l Conf. on Web Information Systems Engineering. Washington: IEEE Computer Society Press, 2006. 257–266.
    [20] Bhagwat D, Pollack K, Long DDE, Schwarz T, Miller EL, Paris JF. Providing high reliability in a minimum redundancy archival storage system. In: Proc. of the 14th Int’l Symp. on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2006). Washington: IEEE Computer Society Press, 2006. 413–421.
    [21] Zhu B, Li K. Avoiding the disk bottleneck in the data domain deduplication file system. In: Proc. of the 6th Usenix Conf. on File and Storage Technologies (FAST 2008). Berkeley: USENIX Association, 2008. 269–282.
    [22] Bhagwat D, Eshghi K, Mehra P. Content-Based document routing and index partitioning for scalable similarity-based searches in a large corpus. In: Berkhin P, Caruana R, Wu XD, Gaffney S, eds. Proc. of the 13th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining (KDD 2007). New York: ACM Press, 2007. 105–112.
    [23] You LL, Pollack KT, Long DDE. Deep store: An archival storage system architecture. In: Proc. of the 21st Int’l Conf. on Data Engineering (ICDE 2005). Washington: IEEE Computer Society Press, 2005. 804–815.
    [24] Quinlan S, Dorward S. Venti: A new approach to archival storage. In: Proc. of the 1st Usenix Conf. on File and Storage Technologies (FAST 2002). Berkeley: USENIX Association, 2002. 89–102.
    [25] Sapuntzakis CP, Chandra R, Pfaff B, Chow J, Lam MS, Rosenblum M. Optimizing the migration of virtual computers. In: Proc. of the 5th Symp. on Operating Systems Design and Implementation (OSDI 2002). New York: ACM Press, 2002. 377–390.
    [26] Rabin MO. Fingerprinting by random polynomials. Technical Report, CRCT TR-15-81, Harvard University, 1981.
    [27] Rivest R. The MD5 message-digest algorithm. 1992. http://www.python.org/doc/current/lib/module-md5.html
    [28] U.S. National Institute of Standards and Technology (NIST). Federal Information Processing Standards (FIPS) Publication 180-1: Secure Hash Standard. 1995. http://www.itl.nist.gov/fipspubs/fip180-1.htm
    [29] U.S. National Institute of Standards and Technology (NIST). Federal Information Processing Standards (FIPS) Publication 180-2: Secure Hash Standard. 2002. http://csrc.nist.gov/publications/fips/fips180-2/fips180-2.pdf
    [30] Moreton TD, Pratt IA, Harris TL. Storage, mutability and naming in Pasta. In: Proc. of the Int’l Workshop on Peer-to-Peer Computing at Networking 2002 Workshops. Berlin, Heidelberg: Springer-Verlag, 2002. 215–219.
    [31] Cox LP, Murray CD, Noble BD. Pastiche: Making backup cheap and easy. In: Proc. of the 5th Symp. on Operating Systems Design and Implementation (OSDI 2002). Berkeley: USENIX Association, 2002. 285–298.
    [32] Rhea SC, Liang K, Brewer E. Value-Based Web caching. In: Proc. of the 12th Int’l Conf. on World Wide Web (WWW 2003). New York: ACM Press, 2003. 619–628.
    [33] Tridgell A. Rolling checksum. 1998. http://rsync.samba.org./tech_report/node3.html
    [34] Hunt JW, McIllroy MC. An algorithm for differential file comparison. Technical Report, No.41, Computing Science, AT&T Bell Laboratories, 1976.
    [35] Hunt JW, Szymanske TG. A fast algorithm for computing longest common subsequences. Communications of the ACM, 1977, 20(5):350–353. [doi: 10.1145/359581.359603]
    [36] Rochkind MJ. The source code control system. IEEE Trans. on Software Engineering, 1975,SE-1(4):364–370.
    [37] Tichy WF. RCS?A system for version control. Software-Practice and Experience, 1985,15(7):637–654. [doi: 10.1002/ spe.4380150703]
    [38] Tichy WF. The string-to-string correction problem with block moves. ACM Trans. on Computer Systems, 1984,2(4):309–321. [doi: 10.1145/357401.357404]
    [39] Chen PM, Lee EK, Gibson GA, Katz RH, Patterson DA. RAID: High-Performance, reliable secondary storage. ACM Computing Surveys, 1994,26(2):145–185. [doi: 10.1145/176979.176981]
    [40] Kubiatowicz J, Bindel D, Chen Y, Czerwinski S, Eaton P, Geels D, Gummadi R, Rhea S, Weatherspoon H, Weimer W, Wells C, Zhao B. Oceanstore: An architecture for global-scale persistent storage. In: Proc. of the 9th Int’l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2000). New York: ACM Press, 2000. 190–201.
    [41] Adya A, Bolosky WJ, Castro M, Chaiken R, Cermak G, Douceur JR, Howell J, Lorch JR, Theimer M, Wattenhofer R. FARSITE: Federated, available, and reliable storage for an incompletely trusted environment. In: Proc. of the 5th Symp. on Operating Systems Design and Implementation (OSDI 2002). New York: ACM Press, 2002. 1–14.
    [42] Rizzo L. Effective erasure codes for reliable computer communication protocols. ACM SIGCOMM Computer Communication Review, 1997,27(2):24–36. [doi: 10.1145/263876.263881]
    相似文献
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

敖 莉,舒继武,李明强.重复数据删除技术.软件学报,2010,21(5):916-929

复制
分享
文章指标
  • 点击次数:12608
  • 下载次数: 20900
  • HTML阅读次数: 0
  • 引用次数: 0
历史
  • 收稿日期:2008-06-04
  • 最后修改日期:2009-10-09
文章二维码
您是第19894609位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号