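Comprehensive Error Detection Method for Multiple Error Types Based on Multiple Views

About the authors:

Peng Jinfeng (彭锦峰, 1992-), male, Ph.D. candidate, CCF student member; research interests: data quality, artificial intelligence. Kou Yue (寇月, 1980-), female, Ph.D., associate professor, CCF senior member; research interests: recommender systems, entity recognition. Shen Derong (申德荣, 1964-), female, Ph.D., professor, doctoral supervisor, CCF senior member; research interests: Web data processing, distributed databases. Nie Tiezheng (聂铁铮, 1980-), male, Ph.D., associate professor, CCF senior member; research interests: data quality, data integration.

Corresponding author:

Peng Jinfeng, pengjinfeng11@163.com

Funding:

National Natural Science Foundation of China (62172082, 62072084, 62072086); Fundamental Research Funds for the Central Universities (N2116008)
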
    Abstract:

    As the information society develops, data keep growing in scale and variety. Data have become an important strategic resource for countries and enterprises and an essential basis for scientific management. However, as more data are generated, large amounts of dirty data arrive with them, and data quality problems follow. Detecting the errors in a dataset accurately and comprehensively has long been a pain point in data science. Many traditional methods, such as those based on constraints or statistics, are widely used, but they usually require rich prior knowledge and costly human effort and time, which limits how accurately and comprehensively they can detect errors. In recent years, many newer methods have used deep learning, for example time-series inference or text parsing, to achieve better detection results, but they are typically applicable only to specific domains or specific error types and do not generalize well to complex real-world cases. Based on these observations, this study combines the strengths of traditional methods and deep learning and proposes CEDM, a comprehensive error detection model for multiple error types based on multiple views. First, from the pattern view, it combines existing constraints with multi-dimensional statistical analysis at the attribute, cell, and tuple levels to construct basic detection rules. Next, from the semantic view, it captures data semantics through word embedding and analyzes attribute relevance, cell dependency, and tuple similarity; based on these semantic relations, the basic rules are updated and extended along multiple dimensions. Finally, the views are combined to detect errors of multiple types comprehensively. Experiments on multiple real and synthetic datasets show that the proposed method outperforms existing error detection methods and applies to multiple error types and multiple domains, giving it better generality.
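
The pattern-view step in the abstract (statistical analysis at the cell level to derive basic detection rules) can be illustrated with a small sketch. This is not the CEDM implementation, which the abstract does not specify: the function names `to_pattern` and `detect_pattern_errors`, the character-class generalization, the `min_support` threshold, and the sample rows are all assumptions made for illustration. The sketch generalizes each cell to a character-class pattern and flags cells whose pattern is rare within its attribute.

```python
from collections import Counter


def to_pattern(value: str) -> str:
    """Generalize a cell value: digits -> 'D', letters -> 'A', other characters kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def detect_pattern_errors(rows, min_support=0.3):
    """Return the set of (row_index, attribute) cells whose pattern is rare in its column.

    min_support is a hypothetical threshold: a pattern covering a smaller
    fraction of the column than this is treated as suspicious.
    """
    flagged = set()
    if not rows:
        return flagged
    for attr in rows[0]:
        patterns = [to_pattern(str(row[attr])) for row in rows]
        counts = Counter(patterns)
        for i, pattern in enumerate(patterns):
            if counts[pattern] / len(rows) < min_support:
                flagged.add((i, attr))
    return flagged


# Made-up sample table with one malformed zip code and one malformed phone number.
rows = [
    {"zip": "10001", "phone": "555-0101"},
    {"zip": "10002", "phone": "555-0102"},
    {"zip": "1000A", "phone": "555-0103"},
    {"zip": "10004", "phone": "5550104"},
    {"zip": "10005", "phone": "555-0105"},
]
errors = detect_pattern_errors(rows)
print(sorted(errors))
```

A frequency rule like this only catches format deviations. According to the abstract, CEDM additionally uses existing constraints and embedding-based semantic relations to update and extend such basic rules, which this sketch does not attempt.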

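Cite this article

Peng Jinfeng, Shen Derong, Kou Yue, Nie Tiezheng. Comprehensive error detection method for multiple error types based on multiple views. Journal of Software (软件学报), 2023, 34(3): 1049-1064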

History
  • Received: 2022-05-15
  • Last revised: 2022-07-29
  • Published online: 2022-10-26