Multi-attribute Data Indexing for Query Based Entity Resolution
Author:
Affiliation:

Clc Number:

TP311

Fund Project:

National Natural Science Foundation of China (62002262, 61672142, 61602103, 62072086, 62072084); National Key Research and Development Program of China (2018YFB1003404)

  • Article
  • |
  • Figures
  • |
  • Metrics
  • |
  • Reference
  • |
  • Related
  • |
  • Cited by
  • |
  • Materials
  • |
  • Comments
    Abstract:

    Entity resolution is a key aspect of data integration, and also is a necessary preprocessing step of big data analytics and mining. In big data era, more and more query-driven data analytics applications come out, and query-based entity resolution becomes a hot topic. This work studies multi-attribute data indexing technology for entity cache in order to promote query-resolution efficiency. There are two core problems. One is how to design the multi-attributeindex. An R-tree based multi-attributeindex is designed. Entity cache is produced online, so an online index construction method is proposed based on spatial clustering. A filter-verify based multi-dimensional query method is proposed. It filters impossible records by the multi-attributeindex, and then verifies each candidate record with similarity functions or distance functions. The other ishow to insert different string attributes into the tree index. The basic solution is mapping strings into integer spaces. For Jaccard similarity and edit similarity, a q-gram based mapping method is proposed, and is improved by vector dimension reduction and z-order, which achieves high mapping qualities. Finally, the proposed hybrid index is experimentally evaluated on two datasets. Its effectiveness is validated, and moreover, different aspects of the multi-attribute index are also tested.

    Reference
    Related
    Cited by
Get Citation

孙琛琛,申德荣,肖迎元,李玉坤.面向查询式实体解析的多属性数据索引技术.软件学报,2022,33(6):2331-2347

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:August 02,2020
  • Revised:November 20,2020
  • Adopted:
  • Online: April 21,2021
  • Published: June 06,2022
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063