国家自然科学基金青年基金(62102271, 62002245); 辽宁省教育厅基础研究项目(JYT2020027)
不一致数据子集修复问题是数据清洗领域的重要研究问题, 现有方法大多是基于完整性约束规则的, 采用最小删除元组数量原则进行子集修复. 然而, 这种方法没有考虑删除元组的质量, 导致修复准确性较低. 为此, 提出规则与概率相结合的子集修复方法, 建模不一致元组概率使得正确元组的平均概率大于错误元组的平均概率, 求解删除元组概率和最小的子集修复方案. 此外, 为了减小不一致元组概率计算的时间开销, 提出一种高效的错误检测方法, 减小不一致元组规模. 真实数据和合成数据上的实验结果验证所提方法的准确性优于现有最好方法.
Subset repair for inconsistent data is an important research problem in the field of data cleaning. Most of the existing methods are based on integrity constraint rules and adopt the principle of the minimum number of deleted tuples for subset repair. However, these methods take no account of the quality of deleted tuples, and the repair accuracy is low. Therefore, this study proposes a subset repair method combining rules and probabilities. The probability of inconsistent tuples is modeled so that the average probability of correct tuples is greater than that of wrong tuples, and the optimal subset repair with the smallest sum of the probability of deleted tuples is calculated. In addition, in order to reduce the time overhead of calculating the probability of inconsistent tuples, this study proposes an efficient error detection method to reduce the size of inconsistent tuples. Experimental results on real data and synthetic data verify that the proposed method outperforms the state-of-the-art subset repair method in terms of accuracy.