Data Model for Dirty Databases

doi:10.3724/SP.J.1001.2012.04042


Volume 23, Issue 3, 2012, pp. 539-549

Author:
WANG Hong-Zhi, LI Jian-Zhong, GAO Hong
Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

                    
Abstract:

Dirty data brings new challenges for data management. Current approaches to managing dirty data rely mainly on data cleaning, which has limitations in some applications. In some systems, dirty data has to be tolerated, so managing databases that contain dirty data becomes an important issue. The crucial problem is to obtain, from a database with dirty data, query results whose clean degree satisfies the cleanliness requirement of the application. From the perspective of dirty data management, this paper presents a data model for dirty databases. It proposes a representation of dirty data, data operators for dirty data, and a method for computing the clean degree of tuples under these data operations. The equivalent transformation rules for query expressions on dirty data and a preliminary implementation of the data model are also discussed.

Key words: data quality; dirty data; data model; query processing
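
The abstract does not give the paper's formal definitions, so the following is only an illustrative sketch of the general idea: each tuple carries a clean degree, operators propagate it, and the query result is filtered against the application's requirement. The names `DirtyTuple`, `join`, and `with_min_clean_degree` are hypothetical, and multiplying clean degrees in a join is an assumption made here for illustration (it treats the two degrees as independent), not the computation method proposed in the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DirtyTuple:
    """A tuple annotated with a clean degree in [0, 1] (hypothetical representation)."""
    values: Dict[str, object]
    clean_degree: float


def join(left: List[DirtyTuple], right: List[DirtyTuple], key: str) -> List[DirtyTuple]:
    """Equi-join on `key`.

    Assumption: the clean degree of a joined tuple is the product of the
    input clean degrees. The paper may define this differently.
    """
    result = []
    for l in left:
        for r in right:
            if l.values[key] == r.values[key]:
                merged = {**l.values, **r.values}
                result.append(DirtyTuple(merged, l.clean_degree * r.clean_degree))
    return result


def with_min_clean_degree(relation: List[DirtyTuple], threshold: float) -> List[DirtyTuple]:
    """Keep only tuples whose clean degree meets the application's requirement."""
    return [t for t in relation if t.clean_degree >= threshold]


# Made-up sample data, for illustration only.
customers = [
    DirtyTuple({"id": 1, "city": "Harbin"}, 0.9),
    DirtyTuple({"id": 2, "city": "Harbn"}, 0.4),
]
orders = [
    DirtyTuple({"id": 1, "item": "book"}, 0.8),
    DirtyTuple({"id": 2, "item": "pen"}, 1.0),
]

joined = join(customers, orders, "id")
answers = with_min_clean_degree(joined, 0.5)
```

Under these assumptions, only the joined tuple for `id` 1 survives a requirement of 0.5, because the tuple for `id` 2 inherits the low clean degree of its misspelled customer record.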


Citation:

WANG Hong-Zhi, LI Jian-Zhong, GAO Hong. Data model for dirty databases. Journal of Software, 2012, 23(3): 539-549.


History

Received: May 21, 2010
Revised: April 28, 2011
Online: March 05, 2012
