Time Series Data Quality Rules Discovery with Both Row and Column Dependencies
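Authors: Ding Xiaoou, Li Yingze, Wang Chen, Wang Hongzhi, Li Haoxuan
Author biographies:

Ding Xiaoou (丁小欧, 1993-), female, Ph.D., assistant professor, CCF professional member; research interests: data quality, data cleaning, and time series data management. Wang Hongzhi (王宏志, 1978-), male, Ph.D., professor, doctoral supervisor, CCF distinguished member; research interests: database management systems, big data analysis and governance. Li Yingze (李映泽, 2001-), male, undergraduate student; research interests: time series data quality management and databases. Li Haoxuan (李昊轩, 2001-), male, undergraduate student; research interests: data cleaning, anomaly detection, and time series data mining. Wang Chen (王晨, 1981-), male, associate research fellow, CCF professional member; research interests: databases, industrial big data, and the industrial Internet.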

Corresponding author:

Wang Hongzhi, wangzh@hit.edu.cn

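Funding:

National Natural Science Foundation of China (62232005, 62202126); National Key Research and Development Program of China (2021YFB3300502); CCF-Huawei Populus Grove Fund, Database Special Project (CCF-HuaweiDB202204); Heilongjiang Postdoctoral Fund (LBH-Z21137)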


Time Series Data Quality Rules Discovery with Both Row and Column Dependencies
    Abstract:

    Time series data generated by intelligent devices are growing rapidly and suffer from serious data quality problems, so the demand for quality management of low-quality time series data and for data quality improvement techniques is increasingly urgent. Time series data are characterized by ordered time windows and by associations across both rows and columns, which makes it challenging to express their data quality semantics. This study proposes a data quality rule that takes into account the dependency information of time series data on both rows and columns, namely the time series denial constraint (TDC). It studies the definition and construction of TDC, extends the expressive power of existing data quality rule systems in two respects, time windows and multi-order expression operations, and proposes a discovery method for time series data quality rules that cover both rows and columns. Extensive experiments on real time series datasets verify that the proposed method can effectively and efficiently discover data quality rules hidden in time series data. Comparative experiments show that the method constructs predicates effectively from the associated information on rows and columns, and that its rule discovery quality is better than that of methods that mine constraints on rows only or on columns only.
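
    To make the idea of a rule that spans both rows and columns more concrete, the following is a minimal Python sketch of checking one denial-constraint-style rule inside a time window. It is an illustration only, not the TDC definition or the discovery method of the paper: the row layout `(timestamp, a, b)`, the function `find_violations`, the names `RULE` and `ROWS`, and the rule itself are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical row layout: (timestamp, a, b) for a two-column series.
Row = Tuple[float, float, float]
Predicate = Callable[[Row, Row], bool]


def find_violations(rows: Sequence[Row], window: float,
                    predicates: Sequence[Predicate]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, at most `window` time units apart,
    for which every predicate holds, i.e. the pairs that the denial
    constraint NOT(p1 AND p2 AND ...) forbids. Rows must be sorted by time."""
    found = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if rows[j][0] - rows[i][0] > window:
                break  # later rows are even further away in time
            if all(p(rows[i], rows[j]) for p in predicates):
                found.append((i, j))
    return found


# Illustrative rule (an assumption, not taken from the paper):
# within 2 time units, `a` must not jump by more than 5 while `b` is
# still below 2 * a on the later row.
RULE: List[Predicate] = [
    lambda r1, r2: r2[1] - r1[1] > 5,   # row-wise: change of a across rows
    lambda r1, r2: r2[2] < 2 * r2[1],   # column-wise: expression over a and b
]

ROWS: List[Row] = [(0, 1.0, 2.0), (1, 2.0, 4.0), (2, 8.0, 10.0), (3, 9.0, 18.0)]

print(find_violations(ROWS, window=2, predicates=RULE))  # [(0, 2), (1, 2)]
```

    The first predicate compares the same column across two rows, the second evaluates an arithmetic expression over two columns of one row, and the window bounds which row pairs are compared. A discovery method would have to search for such predicate sets automatically, which this sketch does not attempt.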

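Cite this article

Ding XO, Li YZ, Wang C, Wang HZ, Li HX. Time series data quality rules discovery with both row and column dependencies. Ruan Jian Xue Bao/Journal of Software, 2023, 34(3):1065-1086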

History
  • Received: 2022-05-16
  • Last revised: 2022-07-29
  • Published online: 2022-10-28
  • Publication date: 2023-03-06