Entity Resolution Oriented Clustering Algorithm

doi:10.13328/j.cnki.jos.005043


Volume 27, Issue 9, 2016, pp. 2303-2319. DOI: 10.13328/j.cnki.jos.005043


Authors:
SUN Chen-Chen, SHEN De-Rong, KOU Yue, NIE Tie-Zheng, YU Ge
College of Computer Science and Engineering, Northeastern University, Shenyang 110819, China

Fund Project: National Natural Science Foundation of China (61472070, 61402213); National Basic Research Program of China (973) (2012CB316201); Fundamental Research Funds for the Central Universities (N110404010)


Abstract:

Entity resolution (ER) is a key aspect of data quality and a necessary step in big data processing. Existing ER research focuses on similarity algorithms for data objects, blocking, and supervised ER techniques, but pays little attention to the matching decision problem in unsupervised ER. This paper proposes a clustering algorithm for ER to complement existing work. The algorithm builds a weighted similarity graph from the data objects and their pairwise similarities. During clustering, the similarity between a cluster and a vertex is computed dynamically via random walk with restarts on the similarity graph. The basic logic of the clustering is that a cluster iteratively absorbs its nearest neighbor vertex. A data object ordering method is also proposed to optimize the clustering order and thereby improve clustering accuracy. Further, an improved method for computing the stationary probability distribution of the random walk is proposed to reduce the cost of the clustering algorithm. Evaluation on real and synthetic datasets validates the effectiveness of the proposed algorithm.

Key words: entity resolution; clustering; random walk model; cluster-vertex similarity; data object ordering
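
The abstract describes the approach only in outline, so the following is a minimal Python sketch of the general idea rather than the paper's algorithm. It assumes that cluster-vertex similarity is the random-walk-with-restarts probability at a vertex when the walk restarts uniformly from the current cluster members, that a cluster stops growing when the best outside vertex falls below a fixed threshold, and that seeds are taken in index order. The function names, the restart probability of 0.15, and the threshold are all hypothetical choices made for illustration. The data object ordering method and the improved stationary distribution computation mentioned in the abstract are not reproduced here.

```python
import numpy as np


def rwr(W, seeds, restart=0.15, tol=1e-10, max_iter=1000):
    """Stationary distribution of a random walk with restarts.

    W is a symmetric, non-negative similarity matrix with a zero diagonal.
    The walk restarts uniformly at the vertices listed in `seeds`.
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    col_sums = W.sum(axis=0)
    col_sums[col_sums == 0] = 1.0      # avoid division by zero for isolated vertices
    P = W / col_sums                   # column-stochastic transition matrix
    e = np.zeros(n)
    e[list(seeds)] = 1.0 / len(seeds)  # restart vector
    p = e.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (P @ p) + restart * e
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


def grow_cluster(W, seed, candidates, threshold, restart=0.15):
    """Grow one cluster by repeatedly absorbing the nearest outside vertex."""
    cluster = [seed]
    remaining = set(candidates) - {seed}
    while remaining:
        p = rwr(W, cluster, restart)   # recomputed after every absorption
        best = max(remaining, key=lambda v: p[v])
        if p[best] < threshold:        # assumed stopping rule
            break
        cluster.append(best)
        remaining.remove(best)
    return cluster


def cluster_all(W, threshold, restart=0.15):
    """Partition all vertices; seeds are taken in index order (an assumption)."""
    W = np.asarray(W, dtype=float)
    unassigned = set(range(W.shape[0]))
    clusters = []
    while unassigned:
        seed = min(unassigned)
        cluster = grow_cluster(W, seed, unassigned, threshold, restart)
        clusters.append(sorted(cluster))
        unassigned -= set(cluster)
    return clusters


# Toy similarity graph: records 0-2 are mutually similar, records 3-4 are
# similar to each other, and a single weak edge (2, 3) links the two groups.
W_demo = np.array([
    [0.0, 0.9, 0.8, 0.0, 0.0],
    [0.9, 0.0, 0.9, 0.0, 0.0],
    [0.8, 0.9, 0.0, 0.05, 0.0],
    [0.0, 0.0, 0.05, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.0],
])
print(cluster_all(W_demo, threshold=0.1))
```

Recomputing the walk after each absorption is what makes the cluster-vertex similarity dynamic in this sketch. It is also the expensive step, which is presumably why the abstract mentions a cheaper way to obtain the stationary distribution.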

Citation:

SUN Chen-Chen, SHEN De-Rong, KOU Yue, NIE Tie-Zheng, YU Ge. Entity Resolution Oriented Clustering Algorithm. Journal of Software (软件学报), 2016, 27(9): 2303-2319


History

Received: September 24, 2015
Revised: January 12, 2016
Online: September 02, 2016

Copyright: Institute of Software, Chinese Academy of Sciences