Research on Record Pair Ranking for Entity Resolution with Time Constraint

Fund Project:

National Key Research and Development Program of China (2018YFB1003404); National Natural Science Foundation of China (61672142, 61472070, 61602103); Natural Science Foundation of Tianjin of China (17JCYBJC15200)

    Abstract:

    Entity resolution (ER) is an important part of data integration and data cleaning, and a necessary preprocessing step for big data analytics and mining. Traditional batch-based ER has a high overall runtime and cannot meet the requirements of current (near) real-time data applications. This work therefore focuses on time-constrained entity resolution (TC-ER), whose core problem is ranking record pairs by match probability. Information within blocks and information across blocks, both obtained from multi-pass blocking, are analyzed separately, and two basic methods for estimating record match probability are proposed. These basic methods are then improved by an advanced record match probability method based on similarity flowing over a bipartite graph. The bipartite graph is constructed from record pairs, blocks, and the relations between them. Similarities flow iteratively between pair nodes and block nodes over the bipartite graph until convergence, and the convergence result is computed with fixpoint iterations. An approximate convergence computation method is proposed to reduce cost, which improves real-time recall in TC-ER. Finally, the proposed methods are evaluated on two datasets; the experiments show their effectiveness and test different aspects of the proposed methods.
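
The abstract does not give the update rules, so the following is only a minimal sketch of the general idea of similarity flowing between pair nodes and block nodes, not the authors' method. The function name `rank_pairs`, the damping factor `alpha`, the initial score (each shared block contributes the reciprocal of its pair count), and the averaging updates are all assumptions made for illustration. Under these assumptions the pair update is a contraction for `alpha < 1`, so the fixpoint iteration converges.

```python
from itertools import combinations


def rank_pairs(blocks, alpha=0.5, tol=1e-9, max_iter=100):
    """Rank record pairs by a score propagated over a pair/block bipartite graph.

    blocks: dict mapping a block id to the record ids it contains
            (the output of multi-pass blocking).
    alpha:  hypothetical damping factor, 0 <= alpha < 1.
    """
    # Pair node -> block nodes that contain both records of the pair.
    pair_blocks = {}
    for block_id, records in blocks.items():
        for a, b in combinations(sorted(set(records)), 2):
            pair_blocks.setdefault((a, b), []).append(block_id)

    # Block node -> pair nodes inside the block.
    block_pairs = {}
    for pair, block_ids in pair_blocks.items():
        for block_id in block_ids:
            block_pairs.setdefault(block_id, []).append(pair)

    # Assumed initial score: small blocks are stronger evidence than large ones.
    init = {
        pair: sum(1.0 / len(block_pairs[b]) for b in block_ids)
        for pair, block_ids in pair_blocks.items()
    }
    top = max(init.values(), default=1.0)
    init = {pair: s / top for pair, s in init.items()}

    score = dict(init)
    for _ in range(max_iter):
        # Flow from pair nodes to block nodes: average of the pair scores.
        block_score = {
            b: sum(score[p] for p in pairs) / len(pairs)
            for b, pairs in block_pairs.items()
        }
        # Flow back to pair nodes, damped towards the initial score.
        new_score = {
            p: (1 - alpha) * init[p]
            + alpha * sum(block_score[b] for b in block_ids) / len(block_ids)
            for p, block_ids in pair_blocks.items()
        }
        delta = max((abs(new_score[p] - score[p]) for p in new_score), default=0.0)
        score = new_score
        if delta < tol:  # fixpoint reached up to the tolerance
            break

    # Highest score first; ties broken by pair id for a stable order.
    return sorted(score, key=lambda p: (-score[p], p))


# Hypothetical blocks produced by three blocking passes.
blocks = {
    "name": ["r1", "r2", "r3"],
    "zip": ["r1", "r2"],
    "phone": ["r1", "r2", "r4", "r5"],
}
ranked = rank_pairs(blocks)
```

In a time-constrained setting, pairs would be compared in the order of `ranked` until the time budget runs out, so that likely matches are resolved first. Lowering `max_iter` or loosening `tol` gives a cheaper approximate result, in the spirit of the approximate convergence computation mentioned in the abstract.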

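Citation:

Sun Chenchen, Shen Derong, Li Yukun, Xiao Yingyuan, Ma Jianhong. Research on record pair ranking for entity resolution with time constraint. Journal of Software (Ruan Jian Xue Bao), 2020, 31(3): 695-709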

History
  • Received: July 15, 2019
  • Revised: September 10, 2019
  • Adopted:
  • Online: January 10, 2020
  • Published: March 06, 2020