[关键词]
[摘要]
随着海量数据的涌现和不断积累,数据治理成为提高数据质量、最大化数据价值的重要手段.其中,数据错误检测是提高数据质量的关键步骤,近年来引起了学术界及工业界的广泛关注.目前,绝大多数错误检测方法只适用于单数据源场景.然而在现实场景中,数据往往不集中存储与管理.不同来源且高度相关的数据能够提升错误检测的精度.但由于数据隐私安全问题,跨源数据往往不允许集中共享.鉴于此,提出了一种基于联邦学习的跨源数据错误检测方法FeLeDetect,以在数据隐私保证的前提下,利用跨源数据信息提高错误检测精度.为了充分捕获每一个数据源的数据特征,首先提出一种基于图的错误检测模型GEDM,并在此基础上设计了一种联邦协同训练算法FCTA,以支持在各方数据不出本地的前提下,利用跨源数据协同训练GEDM.此外,为了降低联邦训练的通信开销和人工标注成本,还提出了一系列优化方法.最后,在3个真实数据集上进行了大量的实验.实验结果表明:(1)相较于5种现有最先进的错误检测方法,GEDM在本地场景和集中场景下,错误检测结果的F1分数平均提高了10.3%和25.2%;(2) FeLeDetect错误检测结果的F1分数较本地场景下GEDM的结果平均提升了23.2%.
[Keywords]
[Abstract]
With the emergence and continuous accumulation of massive data, data governance has become an important means of improving data quality and maximizing data value. Error detection is a crucial step in improving data quality and has attracted considerable interest from both industry and academia. Most existing error detection methods are tailored to a single data source. In many real-world scenarios, however, data is not centrally stored or managed. Highly correlated data from different sources can be used to improve the accuracy of error detection, but because of privacy and security concerns, cross-source data often cannot be shared or integrated centrally. To this end, this study proposes FeLeDetect, a cross-source data error detection method based on federated learning, which exploits cross-source information to improve detection accuracy while preserving data privacy. First, a graph-based error detection model (GEDM) is presented to fully capture the data features of each data source. On this basis, a federated co-training algorithm (FCTA) is designed to collaboratively train GEDM over cross-source data while each party's data stays local. Furthermore, a series of optimization methods is proposed to reduce the communication cost of federated training and the manual labeling effort. Extensive experiments on three real-world datasets show that (1) compared with five existing state-of-the-art error detection methods, GEDM improves the F1-score by 10.3% on average in the local scenario and by 25.2% on average in the centralized scenario; and (2) FeLeDetect improves the F1-score by 23.2% on average over GEDM in the local scenario.
[CLC Number]
[Funding]
National Key Research and Development Program of China (2021YFC3300303); National Natural Science Foundation of China (62025206, 61972338, 62102351)