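Time Series Data Cleaning under Multi-speed Constraints

About the authors:

Gao Fei (高菲, 1993-), female, Ph.D.; main research area: data cleaning.
Song Shaoxu (宋韶旭, 1981-), male, Ph.D., associate professor, doctoral supervisor, CCF professional member; main research areas: databases, data quality, time series data cleaning, and big data integration.
Wang Jianmin (王建民, 1968-), male, Ph.D., professor, doctoral supervisor, CCF senior member; main research areas: databases, workflow, and big data and knowledge engineering (unstructured data management, business process and product lifecycle management, digital rights and system security technology, database testing technology).

Corresponding author:

Song Shaoxu, E-mail: sxsong@tsinghua.edu.cn

Fund Project:

National Key Research and Development Plan (2019YFB1705301); National Natural Science Foundation of China (62072265, 61572272, 71690231)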

    Abstract:

    Data quality underpins data management and analysis, and improving it helps big data and artificial intelligence techniques work better in practice, so it has become an active research topic in related fields. Physical failures or technical defects in data collection and recording devices often introduce errors into the collected data. These anomalies can significantly distort subsequent data analysis and artificial intelligence processes, so data should be cleaned and repaired before it is used. Existing smoothing-based repair methods tend to over-repair: they change a large number of originally correct data points into wrong values. Constraint-based methods, such as sequential dependencies and SCREEN, rely on relatively simple constraints and therefore cannot accurately repair data under complex conditions. Following the minimum repair principle, a time series data repairing method under multi-interval speed constraints is proposed, and dynamic programming is used to find the optimal repair. Specifically, multiple speed intervals are used to constrain the time series, a set of repair candidates is generated for each data point according to these constraints, and the optimal repair is then selected from the candidates by dynamic programming. To verify the feasibility and effectiveness of the method, experiments are conducted on one synthetic data set, two real data sets, and one further real data set containing real errors, under different anomaly rates and data sizes. The results show that, compared with existing smoothing-based and constraint-based methods, the proposed method performs better in terms of RMS error and time cost. In addition, clustering and classification accuracy measured on several data sets shows that data quality strongly affects subsequent data analysis and artificial intelligence, and that the proposed method improves the quality of their results.
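    The pipeline described in the abstract (speed intervals, per-point repair candidates, dynamic programming over the candidates) can be illustrated with a small sketch. This is not the authors' algorithm: it assumes that constraints are checked only between consecutive points, that candidates are the observed value plus values lying exactly on an interval bound relative to an observed neighbour, and that repair cost is the total absolute change. The names `speed_ok`, `candidates`, `repair`, and `rms_error` are hypothetical.

```python
import math

def speed_ok(v, intervals, eps=1e-9):
    """Return True if speed v lies inside at least one allowed interval."""
    return any(lo - eps <= v <= hi + eps for lo, hi in intervals)

def candidates(t, x, i, intervals):
    """Assumed candidate rule: the observation itself, plus values that sit
    exactly on an interval bound relative to the observed neighbours."""
    cands = {x[i]}
    bounds = [b for interval in intervals for b in interval]
    if i > 0:
        dt = t[i] - t[i - 1]
        cands.update(x[i - 1] + b * dt for b in bounds)
    if i + 1 < len(x):
        dt = t[i + 1] - t[i]
        cands.update(x[i + 1] - b * dt for b in bounds)
    return sorted(cands)

def repair(t, x, intervals):
    """Pick one candidate per point so that every consecutive speed falls in
    some interval and the total absolute change is minimal (DP over candidates).
    Timestamps are assumed to be strictly increasing."""
    n = len(x)
    if n == 0:
        return []
    cands = [candidates(t, x, i, intervals) for i in range(n)]
    cost = [[abs(c - x[0]) for c in cands[0]]]
    back = [[None] * len(cands[0])]
    for i in range(1, n):
        dt = t[i] - t[i - 1]
        row_cost, row_back = [], []
        for c in cands[i]:
            best, arg = math.inf, None
            for j, p in enumerate(cands[i - 1]):
                if cost[i - 1][j] < best and speed_ok((c - p) / dt, intervals):
                    best, arg = cost[i - 1][j], j
            row_cost.append(best + abs(c - x[i]))
            row_back.append(arg)
        cost.append(row_cost)
        back.append(row_back)
    j = min(range(len(cands[-1])), key=cost[-1].__getitem__)
    if math.isinf(cost[-1][j]):
        raise ValueError("no feasible repair within this candidate set")
    out = []
    for i in range(n - 1, -1, -1):  # follow back-pointers from the last point
        out.append(cands[i][j])
        j = back[i][j]
    return out[::-1]

def rms_error(repaired, truth):
    """Root-mean-square error between a repaired series and the ground truth."""
    return math.sqrt(sum((r - g) ** 2 for r, g in zip(repaired, truth)) / len(truth))

# Illustrative data only: one spike at index 2, two allowed speed intervals.
t = [0, 1, 2, 3, 4]
x = [0, 1, 10, 3, 4]
intervals = [(-3.0, -2.0), (0.5, 1.5)]
repaired = repair(t, x, intervals)
print(repaired)
```

    In this toy run only the spike is moved, and it is moved to the nearest value that satisfies the speed constraints on both sides, which is the behaviour the minimum repair principle asks for. Because the candidate set is restricted, the sketch can fail to find a feasible repair on harder inputs; it then raises `ValueError` rather than guessing.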

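Cite this article

Gao Fei, Song Shaoxu, Wang Jianmin. Time Series Data Cleaning under Multi-speed Constraints. Journal of Software (软件学报), 2021, 32(3): 689-711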

History
  • Received: 2020-07-19
  • Last revised: 2020-09-03
  • Published online: 2021-01-21
  • Publication date: 2021-03-06