基于函数依赖与条件约束的数据修复方法
作者:
基金项目:

国家重点基础研究发展计划(973)(2012CB316203);国家自然科学基金(61370101,U1501252,61532021);上海市教委科研创新重点项目(14ZZ045)


Functional Dependency and Conditional Constraint Based Data Repair
Author:
Fund Project:

National Basic Research Program of China (973) (2012CB316203); National Natural Science Foundation of China (61370101, U1501252, 61532021); Innovation Program of Shanghai Municipal Education Commission (14ZZ045)

  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [27]
  • |
  • 相似文献 [20]
  • | | |
  • 文章评论
    摘要:

    随着经济与信息技术的发展,在许多应用中均产生大量数据.然而,受硬件设备、人工操作、多源数据集成等诸多因素的影响,在这些应用之中往往存在较为严重的数据质量问题,特别是不一致性问题,从而无法有效管理数据.因此,首要的任务就是开发新型数据清洗技术来提升数据质量,以支持后续的数据管理与分析.现有工作主要研究基于函数依赖的数据修复技术,即以函数依赖来描述数据一致性约束,通过变更数据库中部分元组的属性值(而非增加/删除元组)来使得整个数据库遵循函数依赖集合.从一致性约束描述的角度来看,函数依赖并非是唯一的表达方式,还存在其他表达方式,例如硬约束、数量约束、等值约束、非等值约束等.然而,随着一致性约束种类的增加,其处理难度也远比仅有函数依赖的场景要困难.考虑以函数依赖与其他一致性约束共同表述数据库的一致性约束,并在此基础上设计数据修复算法,从而提升数据质量.实验结果表明,所提方法的执行效率较高.

    Abstract:

    Along with the development of economy and information technology, a large amount of data are produced in many applications. However, due to the influence of some factors, such as hardware equipments, manual operations, and multi-source data integration, serious data quality issues sunch as data inconsistency arise, which makes it more challenging to manage data effectively. Hence, it is crucial to develop new data cleaning technology to improve data quality to better support data management and analysis. Existing work in this area mainly focuses on the situation where functional dependencies are used to describe data inconsistency. Once some violations are detected, some tuples must be changed to suit for the functional dependency set via update (instead of insert or delete). Besides functional dependency, there also exist other types of constraints, such as the hard constraint, quantity constraint, equivalent constraint, and non-equivalent constraint. However, it becomes more difficult when more inconsistent conditions are involved. This paper addresses the general scenario where functional dependencies and other constraints co-exist. Corresponding data repair algorithm is designed to improve the data quality effectively. Experimental results show that the proposed method performs effectively and efficiently.

    参考文献
    [1] President's Council of Advisors on Science and Technology. Designing a digital future:Federally funded research and development in networking and information technology. 2010. http://www.whitehouse.gov/sites/default/files/microsites/ostp/pcast-nitrd-report-2010.pdf
    [2] Big data:Science in the peta-byte era. Nature, 2008,455:1-136. http://www.nature.com/nature/journal/v455/n7209/edsumm/e080904-01.html[doi:10.1038/455001a]
    [3] Dealing with the data. Science, 2011,331(6018):639-806. http://www.sciencemag.org/site/special/data/
    [4] Gong XQ, Jin CQ, Wang XL, Zhang R, Zhou AY. Data-Intensive science and engineering:Requirements and challenges. Chinese Journal of Computers, 2012,35(8):1-16(in Chinese with English abstract).
    [5] Ao L, Shu JW, Li MQ. Data deduplication techniques. Ruan Jian Xue Bao/Journal of Software, 2010,21(5):916-929(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/3761.htm[doi:10.3724/SP.J.1001.2010.03761]
    [6] Jeffery S, Garofalakis M, Franklin M. Adaptive cleaning for RFID data streams. In:Proc. of the VLDB. Seoul:VLDB Endowment, 2006.163-174.
    [7] Redman TC. The impact of poor data quality on the typical enterprise. Communications of the ACM, 1998,41(2):79-82.
    [8] Bohannon P, Flaster M, Fan WF, Rastogi R. A cost-based model and effective heuristic for repairing constraints by value modification. In:Proc. of the SIGMOD. New York:ACM Press, 2005.143-154.[doi:10.1145/1066157.1066175]
    [9] Kolahi S, Lakshmanan LVS. On approximating optimum repairs for functional dependency violations. In:Proc. of the ICDT. New York:ACM Press, 2009.53-62.[doi:10.1145/1514894.1514901]
    [10] Arenas M, Bertossi LE, Chomicki J. Consistent query answers in inconsistent databases. In:Proc. of the PODS. New York:ACM, 1999.68-79.[doi:10.1145/303976.303983]
    [11] Lopatenko A, Bertossi LE. Complexity of consistent query answering in databases under cardinality-based and incremental repair semantics. In:Proc. of the ICDT. Barcelona:Springer-Verlag, 2007.179-193.[doi:10.1007/11965893_13]
    [12] Bekales G, Ilyas IF, Golab L. Sampling the repairs of functional dependency violations under hard constraints. In:Proc. of the VLDB. Singapore:VLDB Endowment, 2010,3(1):197-207.[doi:10.14778/1920841.1920870]
    [13] He C, Tan ZJ, Chen Q, Sha CF, Wang ZH, Wang W. Repair diversification for functional dependency violation. In:Proc. of the DASFAA. Heidelberg:Springer-Verlag, 2014.468-482.[doi:10.1007/978-3-319-05813-9_31]
    [14] Chen Q, Tan ZJ, He C, Sha CF, Wang W. Repairing functional dependency violations in distributed data. In:Proc. of the DASFAA. Heidelberg:Springer-Verlag, 2015.441-457.[doi:10.1007/978-3-319-18120-2_26]
    [15] Li J, Liu X. An important aspect of big data:Data usability. Journal of Computer Research and Development, 2013, 50(6):1147-1162(in Chinese with English abstract).
    [16] Bleiholder J, Szott S, Herschel M, Kaufer F, Naumann F. Subsumption and complementation as data fusion operators. In:Proc. of the EDBT. New York:ACM Press, 2010.513-524.[doi:10.1145/1739041.1739103]
    [17] Liu HP, Jin CQ, Zhou AY. A pattern-based entity resolution algorithm. Chinese Journal of Computers, 2015,38(9):1796-1808(in Chinese with English abstract).
    [18] Xie JY, Yang J, Chen YG, Wang HX, Yu PS. A sampling-based approach to information recovery. In:Proc. of the ICDE. Iscataway, NJ:IEEE Computer Society, 2008.476-485.[doi:10.1109/ICDE.2008.4497456]
    [19] Chen HQ, Ku WS, Wang HX, Sun MT. Leveraging spatio-temporal redundancy for RFID data cleansing. In:Proc. of the SIGMOD. New York:ACM Press, 2010.51-62.[doi:10.1145/1807167.1807176]
    [20] Zhuang YZ, Chen L. In-Network outlier cleaning for data collection in sensor networks. In:Proc. of the CleanDB. New York:VLDB Endowment, 2006.41-48.
    [21] Chiang F, Miller RJ. A unified model for data and constraint repair. In:Proc. of the ICDE. Iscataway, NJ:IEEE Computer Society, 2011.[doi:10.1109/ICDE.2011.5767833]
    [22] Silberschatz A, Korth H, Sudarshan S. Database System Concepts. 6th ed., McGraw-Hill Education, 2010.
    附中文参考文献:
    [4] 宫学庆,金澈清,王晓玲,张蓉,周傲英.数据密集型科学与工程:需求和挑战.计算机学报,2012,35(8):1-16.
    [5] 敖莉,舒继武,李明强.重复数据删除技术.软件学报,2010,21(5):916-929. http://www.jos.org.cn/1000-9825/3761.htm[doi:10.3724/SP.J.1001.2010.03761]
    [15] 李建中,刘显敏.大数据的一个重要方面:数据可用性.计算机研究与发展,2013,50(6):1147-1162.
    [17] 刘辉平,金澈清,周傲英.一种基于模式的实体解析算法.计算机学报,2015,38(9):1796-1808.
    引证文献
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

金澈清,刘辉平,周傲英.基于函数依赖与条件约束的数据修复方法.软件学报,2016,27(7):1671-1684

复制
分享
文章指标
  • 点击次数:5878
  • 下载次数: 7518
  • HTML阅读次数: 2973
  • 引用次数: 0
历史
  • 收稿日期:2015-10-09
  • 最后修改日期:2016-01-12
  • 在线发布日期: 2016-03-24
文章二维码
您是第20055153位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号