Research on Record Pair Ranking for Entity Resolution with Time Constraint
Author:
Affiliation:

Fund Project:

National Key Research and Development Program of China (2018YFB1003404); National Natural Science Foundation of China (61672142, 61472070, 61602103); Natural Science Foundation of Tianjin of China (17JCYBJC15200)

  • Article
  • | |
  • Metrics
  • |
  • Reference [36]
  • |
  • Related
  • | | |
  • Comments
    Abstract:

    Entity resolution (ER) is an important aspect of data integration and data cleaning, and is also a necessary pre-process step of big data analytics and mining. Traditional batch based ER's overall runtime is costly, and cannot satisfy current (nearly) real-time data applications' requirements. Therefore, time constraint entity resolution (TC-ER) is focused on, while core problem is record pair ranking according to match probability both information inner blocks and information across blocks are analyzed from multi-pass blocking respectively, and two basic recordsmatch probability methods are proposed. The basic methods are improved by proposing an advanced record match probability method based on similarity flowing over a biparitite graph.A bipartite graph is constructed according to record pairs, blocks, and relations between them. Similarities iteratively flow between pair nodes and block nodes over the bipartite graph until convergence. The convergence result is computed with fixpoint iterations. An approximate convergence computation mehod is proposed to reduce cost, and it improves real-time recall in TC-ER. Finally, the proposed methods are evaluated on two datasets, which shows their effectiveness and also tests different aspects of the proposed methods.

    Reference
    [1] Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection:A survey. IEEE Trans. on Knowledge and Data Engineering, 2007,19(1):1-16.[doi:10.1109/TKDE.2007.250581]
    [2] Hassanzadeh O, Chiang F, Lee HC, Miller RJ. Framework for evaluating clustering algorithms in duplicate detection. Proc. of the VLDB Endowment, 2009,2(1):1282-1293.[doi:10.14778/1687627.1687771]
    [3] Köpcke H, Thor A, Rahm E. Evaluation of entity resolution approaches on real-world match problems. Proc. of the VLDB Endowment, 2010,3(1-2):484-493.[doi:10.14778/1920841.1920904]
    [4] Dong X, Halevy A, Madhavan J. Reference reconciliation in complex information spaces. In:Proc. of the 2005 ACM SIGMOD Int'l Conf. on Management of Data. ACM, 2005. 85-96.
    [5] Ramadan B, Christen P, Liang H, Gayler RW. Dynamic sorted neighborhood indexing for real-time entity resolution. Journal of Data and Information Quality (JDIQ), 2015,6(4):15.
    [6] Whang SE, Marmaros D, Garcia-Molina H. Pay-as-You-Go entity resolution. IEEE Trans. on Knowledge and Data Engineering, 2013,25(5):1111-1124.
    [7] Papenbrock T, Heise A, Naumann F. Progressive duplicate detection. IEEE Trans. on Knowledge and Data Engineering, 2015,27(5):1316-1329.
    [8] Kushagra S, Saxena H, Ilyas IF, Ben-David S. A semi-supervised framework of clustering selection for de-duplication. In:Proc. of the 35th Int'l Conf. on Data Engineering (ICDE). IEEE, 2019. 208-219.
    [9] Nie H, Han X, He B, Sun L, Chen B, Zhang W, Wu S, Kong H. Deep sequence-to-sequence entity matching for heterogeneous entity resolution. In:Proc. of the 28th ACM Int'l Conf. on Information and Knowledge Management. ACM, 2019. 629-638.
    [10] Tauer G, Date K, Nagi R, Suditc M. An incremental graph-partitioning algorithm for entity resolution. Information Fusion, 2019,46:171-183.
    [11] Kwashie S, Liu L, Liu J, Liu L, Stumptner M, Yang L. Certus:An effective entity resolution approach with graph differential dependencies (GDDs). Proc. of the VLDB Endowment, 2019,12(6):653-666.
    [12] Reas R, Ash S, Barton R, Borthwick A. Superpart:Supervised graph partitioning for record linkage. In:Proc. of the 2018 IEEE Int'l Conf. on Data Mining (ICDM). IEEE, 2018. 387-396.
    [13] Ebraheem M, Thirumuruganathan S, Joty S, Ouzzani M, Tang N. Distributed representations of tuples for entity resolution. Proc. of the VLDB Endowment, 2018,11(11):1454-1467.
    [14] Mudgal S, Li H, Rekatsinas T, Doan A, Park Y, Krishnan G, Deep R, Arcaute E, Raghavendra V. Deep learning for entity matching:A design space exploration. In:Proc. of the 2018 Int'l Conf. on Management of Data. ACM, 2018. 19-34.
    [15] Li JZ, Wang HZ, Gao H. State-of-the-Art of research on big data usability. Ruan Jian Xue Bao/Journal of Software, 2016,27(7):1605-1625(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5038.htm[doi:10.13328/j.cnki.jos.005038]
    [16] Christen P. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. on Knowledge and Data Engineering, 2012,24(9):1537-1555.[doi:10.1109/TKDE.2011.127]
    [17] Shao J, Wang Q, Lin Y. Skyblocking for entity resolution. Information Systems, 2019,85:30-43.
    [18] Nascimento DC, Pires CES, Mestre DG. Exploiting block co-occurrence to control block sizes for entity resolution. Knowledge and Information Systems, 2019, 1-42.
    [19] Allam A, Skiadopoulos S, Kalnis P. Improved suffix blocking for record linkage and entity resolution. Data & Knowledge Engineering, 2018,117:98-113.
    [20] Fisher J, Christen P, Wang Q, Rahm E. A clustering-based framework to control block sizes for entity resolution. In:Proc. of the 21st ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. 2015. 279-288.
    [21] Hernández MA, Stolfo SJ. Real-World data is dirty:Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 1998,2(1):9-37.
    [22] Draisbach U, Naumann F, Szott S, Wonneberg O. Adaptive windows for duplicate detection. In:Proc. of the 2012 IEEE 28th Int'l Conf. on Data Engineering. IEEE, 2012. 1073-1083.
    [23] Papadakis G, Ioannou E, Palpanas T, Niederee C, Nejdl W. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Trans. on Knowledge and Data Engineering, 2013,25(12):2665-2682.
    [24] Papadakis G, Koutrika G, Palpanas T, Nejdl W. Meta-Blocking:Taking entity resolution to the next level. IEEE Trans. on Knowledge and Data Engineering, 2014,26(8):1946-1960.
    [25] Gentile AL, Ristoski P, Eckel S, Ritze D, Paulheim H. Entity matching on Web tables:A table embeddings approach for blocking. In:Proc. of the 22nd Int'l Conf. on Extending Database Technology. 2017. 510-513.
    [26] Kenig B, Gal A. Mfiblocks:An effective blocking algorithm forentity resolution. Information Systems, 2013,38(6):908-926.
    [27] Cohen W, Ravikumar P, Fienberg S. A comparison of string metrics for matching names and records. In:Getoor L, Senator TE, Domingos PM, Faloutsos C, eds. Proc. of the ACM KDD Workshop on Data Cleaning and Object Consolidation. New York:ACM, 2003. 73-78.
    [28] Sun CC, Shen DR, Kou Y, Nie TZ, Yu G. Entity resolution oriented clustering algorithm. Ruan Jian Xue Bao/Journal of Software, 2016,27(9):2303-2319(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5043.htm[doi:10.13328/j.cnki.jos. 005043]
    [29] Melnik S, Garcia-Molina H, Rahm E. Similarity flooding:A versatile graph matching algorithm and its application to schema matching. In:Proc. of the 18th Int'l Conf. on Data Engineering. IEEE, 2002. 117-128.
    [30] Jeh G, Widom J. SimRank:A measure of structural-context similarity. In:Proc. of the 8th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. ACM, 2002. 538-543.
    [31] Motwani R, Raghavan P. Randomized Algorithms. Cambridge:Cambridge University Press, 1995.
    [32] Christen P, Pudjijono A. Accurate synthetic generation of realistic personal information. In:Proc. of the Pacific-Asia Conf. on Knowledge Discovery and Data Mining. Berlin, Heidelberg:Springer-Verlag, 2009. 507-514.
    [33] Christen P. Automatic record linkage using seeded nearest neighbour and support vector machine classification. In:Proc. of the 14th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. ACM, 2008. 151-159.
    附中文参考文献:
    [15] 李建中,王宏志,高宏.大数据可用性的研究进展.软件学报,2016,27(7):1605-1625. http://www.jos.org.cn/1000-9825/5038.htm[doi:10.13328/j.cnki.jos.005038]
    [28] 孙琛琛,申德荣,寇月,聂铁铮,于戈.面向实体识别的聚类算法.软件学报,2016,27(9):2303-2319. http://www.jos.org.cn/1000-9825/5043.htm[doi:10.13328/j.cnki.jos.005043]
    Related
    Cited by
    Comments
    Comments
    分享到微博
    Submit
Get Citation

孙琛琛,申德荣,李玉坤,肖迎元,马建红.时间约束的实体解析中记录对排序研究.软件学报,2020,31(3):695-709

Copy
Share
Article Metrics
  • Abstract:3074
  • PDF: 5861
  • HTML: 2952
  • Cited by: 0
History
  • Received:July 15,2019
  • Revised:September 10,2019
  • Online: January 10,2020
  • Published: March 06,2020
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063