Subset Repair Method Combining Rules and Probabilities for Inconsistent Data

doi:10.13328/j.cnki.jos.006972


Journal of Software (软件学报), 2024, Vol. 35, No. 9, pp. 4448-4468

Authors: ZHANG An-Zhen, SI Jia-Yu, LIANG Tian-Yu, ZHU Rui, QIU Tao
Affiliation: School of Computer Science, Shenyang Aerospace University, Shenyang 110136, Liaoning, China
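Author biographies: ZHANG An-Zhen (1990-), female, Ph.D., lecturer, CCF professional member; research interests: big data quality management, approximate query processing. SI Jia-Yu (1997-), female, master's degree; research interest: data quality management. LIANG Tian-Yu (2001-), male, bachelor's degree; research interest: data quality management. ZHU Rui (1982-), male, Ph.D., associate professor, CCF senior member; research interests: stream data management, query processing and optimization. QIU Tao (1989-), male, Ph.D., lecturer, CCF professional member; research interests: text data management, query optimization and processing.
Corresponding author: ZHU Rui, E-mail: zhurui@sau.edu.cn
CLC number: TP311
Funding: National Natural Science Foundation of China, Young Scientists Fund (62102271, 62002245); Basic Research Project of the Education Department of Liaoning Province (JYT2020027)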



Abstract:

Subset repair for inconsistent data is an important research problem in data cleaning. Most existing methods rely on integrity constraint rules and repair the data by deleting the minimum number of tuples. However, these methods ignore the quality of the deleted tuples, which leads to low repair accuracy. This study therefore proposes a subset repair method that combines rules and probabilities. The probability of each inconsistent tuple is modeled so that the average probability of correct tuples is greater than that of erroneous tuples, and the method computes the subset repair that minimizes the sum of the probabilities of the deleted tuples. In addition, to reduce the time overhead of computing these probabilities, this study proposes an efficient error detection method that reduces the number of inconsistent tuples. Experimental results on real and synthetic data show that the proposed method is more accurate than the state-of-the-art subset repair method.

Key words: inconsistent data; functional dependency; subset repair; probabilistic graph network
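
The contrast drawn in the abstract, between deleting the fewest tuples and deleting the tuples with the smallest total probability, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy relation `ROWS`, the functional dependency zip -> city, the per-tuple probabilities of being correct, and the helper names `conflicts` and `best_repair`. The brute-force search is only practical for tiny inputs and is not the algorithm proposed in the paper, and the probabilities are supplied by hand rather than produced by the paper's probabilistic model.

```python
from itertools import combinations

# Hypothetical toy relation: (tuple id, zip, city, probability of being correct).
# The probabilities are invented inputs, not the output of the paper's model.
ROWS = [
    ("t1", "z1", "A", 0.9),
    ("t2", "z1", "B", 0.2),
    ("t3", "z1", "B", 0.3),
    ("t4", "z2", "C", 0.7),
]


def conflicts(rows):
    """Return pairs of tuple ids that violate the assumed FD zip -> city."""
    return [
        (a[0], b[0])
        for a, b in combinations(rows, 2)
        if a[1] == b[1] and a[2] != b[2]
    ]


def best_repair(rows, cost):
    """Brute-force the cheapest set of tuples whose deletion removes every conflict."""
    ids = [r[0] for r in rows]
    pairs = conflicts(rows)
    best_cost, best_set = None, None
    for k in range(len(ids) + 1):
        for deleted in combinations(ids, k):
            d = set(deleted)
            # A valid subset repair deletes at least one tuple of each conflicting pair.
            if all(a in d or b in d for a, b in pairs):
                c = sum(cost[t] for t in d)
                if best_cost is None or c < best_cost:
                    best_cost, best_set = c, d
    return best_set


# Rule-only baseline: every deletion costs 1, so the fewest deletions win.
min_count = best_repair(ROWS, {r[0]: 1 for r in ROWS})
# Rules plus probabilities: each deletion costs the tuple's probability of being correct.
min_prob = best_repair(ROWS, {r[0]: r[3] for r in ROWS})

print(sorted(min_count))  # ['t1']
print(sorted(min_prob))   # ['t2', 't3']
```

In this toy instance the minimum-cardinality repair deletes `t1`, the tuple most likely to be correct, while the minimum-probability-sum repair deletes `t2` and `t3` instead. That is the kind of case in which ignoring the quality of deleted tuples lowers repair accuracy.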

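Cite this article

ZHANG An-Zhen, SI Jia-Yu, LIANG Tian-Yu, ZHU Rui, QIU Tao. Subset Repair Method Combining Rules and Probabilities for Inconsistent Data. Journal of Software, 2024, 35(9): 4448-4468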


History

Received: 2022-03-27
Revised: 2022-06-28
Published online: 2023-09-27
Published: 2024-09-06
