Similarity Join Algorithm Based on Entity

doi:10.13328/j.cnki.jos.004610

微信服务号

微信订阅号

2025-4-9- 10

Home > Archive>Volume 26, Issue 6, 2015 >1421-1437. DOI:10.13328/j.cnki.jos.004610

PDF HTML XML Export Cite reminder

Similarity Join Algorithm Based on Entity
DOI:
                        10.13328/j.cnki.jos.004610
                    
Author:
                        LIU Xue-LiLIU Xue-Li
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Find this author on CNKI
Find this author on BaiDu
Search for this author on this site
WANG Hong-ZhiWANG Hong-Zhi
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Find this author on CNKI
Find this author on BaiDu
Search for this author on this site
LI Jian-ZhongLI Jian-Zhong
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Find this author on CNKI
Find this author on BaiDu
Search for this author on this site
GAO HongGAO Hong
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Find this author on CNKI
Find this author on BaiDu
Search for this author on this site

                    
Affiliation:
Clc Number:
Fund Project:

Article

Figures

Metrics

Reference [22]

Cited by

Materials

Comments

Abstract:

Taking entity as the basic unit to organize the tuples in query processing is an effective way of managing low-quality data. As many descriptions of attribute value in an entity, join operator must support similarity join over multiple values. Entity similarity join is more effective than traditional similarity join in data cleaning, information integration, fuzzy keyword search, fraud detection, and text aggregation. In this paper, an entity similarity join algorithm, ES-JOIN, is designed by adopting the structure of the double layer prefix index. The presented method is suitable for solving set similarity join problem based on fuzzy elements matching, and thus is a better choice than the traditional set similarity join which only considers exact element match. In order to accelerate the join process, a new filtering measures is proposed to optimize the algorithm, and an optimization algorithm, OPT_ES-join, is also obtained. Experiments demonstrate that the ES-JOIN algorithm has good efficiency and scalability, and the filter measures is very effective.

Key words:entity;similarity join;poor-quality data

Reference

[1] Bertossi L, Kolahi S, Lakshmanan L. Data cleaning and query answering with matching dependencies and matching functions. In: Abiteboul S, Böhm K, Koch C, Tan KL, eds. Proc. of the 27th Int'l Conf. on Data Engineering. Hannover: IEEE Computer Society, 2011. 268-279. [doi: 10.1145/1938551.1938585]

[2] Dong X, Halevy AY, Yu C. Data integration with uncertainty. In: Koch C, Gehrke J, Garofalakis MN, Srivastava D, Aberer K, Deshpande A, Florescu D, Chan CY, Ganti V, Kanne CC, Klas WJ, Neuhold E, eds. Proc. of the 33rd Int'l Conf. on Very Large Data Bases. Vienna: ACM Press, 2007. 687-698.

[3] Ji S, Li G, Li C, Feng JH. Efficient interactive fuzzy keyword search. In: Proc. of the 18th Int'l Conf. on World Wide Web. Madrid: ACM Press, 2009. 371-380. [doi: 10.1145/1526709.1526760]

[4] Timothy C, Justin Z. Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology, 2003,54(3):203-215. [doi: 10.1002/asi.10170]

[5] Broder AZ, Glassman SC, Manasse MS, Zweig G. Syntactic clustering of the Web. Computer Networks and ISDN Systems, 1997, 29(8):1157-1166. [doi: 10.1016/S0169-7552(97)00031-7]

[6] Li G, Deng D, Wang J, Feng JH. Pass-Join: A partition-based method for similarity joins. VLDB Endowment, 2011,5(3):253-264. [doi: 10.14778/2078331.2078340]

[7] Wang J, Feng J, Li G. Trie-Join: Efficient trie-based string similarity joins with edit-distance constraints. VLDB Endowment, 2010, 3(1-2):1219-1230. [doi: 10.14778/1920841.1920992]

[8] Xiao C, Wang W, Lin X. Ed-Join: An efficient algorithm for similarity joins with edit distance constraints. VLDB Endowment, 2008,1(1):933-944. [doi: 10.14778/1453856.1453957]

[9] Bayardo J, Ma Y, Srikant R. Scaling up all pairs similarity search. In: Proc. of the 16th Int'l Conf. on World Wide Web. Banff: ACM Press, 2007. 131-140. [doi: 10.1145/1242572.1242591]

[10] Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In: Liu L, Reuter A, Whang KY, Zhang JJ, eds. Proc. of the 22nd Int'l Conf. on Data Engineering. Atlanta: IEEE Computer Society, 2006. 1-5. [doi: 10.1109/ICDE. 2006.9]

[11] Li GL, Deng D, Feng J. Faerie: Efficient filtering algorithms for approximate dictionary-based entity extraction. In: Sellis TK, Miller RJ, Kementsietsidis A, Velegrakis Y, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Athens: ACM Press, 2011. 529-540. [doi: 10.1145/1989323.1989379]

[12] Navarro G. A guided tour to approximate string matching. ACM Computing Surveys, 2011,33(1):31-88. [doi: 10.1145/375360. 375365]

[13] Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In: Dayal U, Whang KY, Lomet DB, Alonso G, Lohman GM, Kersten ML, Cha SK, Kim YK, eds. Proc. of the 32nd Int'l Conf. on Very Large Data Bases. Seoul: ACM Press, 2006. 918-929.

[14] Sarawagi S, Kirpal A. Efficient set joins on similarity predicates. In: Weikum G, König AC, Deßloch S, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Paris: ACM Press, 200. 743-754. [doi: 10.1145/1007568.1007652]

[15] Vernica R, Carey MJ, Li C. Efficient parallel set-similarity joins using MapReduce. In: Elmagarmid AK, Agrawal D, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD 2010). Indianapolis: ACM Press, 2010. 495-506. [doi: 10.1145/ 1807167.1807222]

[16] Li C, Wang B, Yang XC. Vgram: Improving performance of approximate queries on string collections using variable-length grams. In: Koch C, Gehrke J, Garofalakis MN, Srivastava D, Aberer K, Deshpande A, Florescu D, Chan CC, Ganti V, Kanne C, Klas W, Neuhold EJ, eds. Proc. of the 33rd Int'l Conf. on Very Large Data Bases. Vienna: ACM Press, 2007. 303-314.

[17] Wang J, Li G, Feng J. Can we beat the prefix filtering: An adaptive framework for similarity join and search. In: Candan KS, Chen Y, Snodgrass RT, Gravano L, Fuxman A, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Scottsdale: ACM Press, 2012. 85-96. [doi: 10.1145/2213836.2213847]

[18] Wang J, Li G, Feng J. Fast-Join: An efficient method for fuzzy token matching based string similarity join. In: Abiteboul S, Böhm K, Koch C, Tan KL, eds. Proc. of the 27th Int'l Conf. on Data Engineering. Hannover: IEEE Computer Society, 2011. 458-469. [doi: 10.1109/ICDE.2011.5767865]

[19] Green T, Tannen V. Models for incomplete and probabilistic information. Current Trends in Database Technology—EDBT, 2006, 4254:278-296. [doi: 10.1007/11896548_24]

[20] Wang HZ, Li JZ, Gao H. Data model for dirty databases. Ruan Jian Xue Bao/Journal of Software, 2012,23(3):539-549 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/4042.htm [doi: 10.3724/SP.J.1001.2012.04042]

[21] Xiao C, Wang W, Lin XM, Yu JX. Efficient similarity joins for near duplicate detection. In: Huai JP, Chen R, Hon HW, LiuYH, Ma WY, Tomkins A, Zhang XD, eds. Proc. of the 17th Int'l Conf. on World Wide Web. Beijing: ACM Press, 2008. 131-140. [doi: 10.1145/1367497.1367516]

[22] Jestes J, Li FF, Yan ZP, Yi K. Probabilistic string similarity joins. In: Elmagarmid AK, Agrawal D, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Indianapolis: ACM Press, 2010. [doi: 10.1145/1807167.1807204]

Get Citation

刘雪莉,王宏志,李建中,高宏.基于实体的相似性连接算法.软件学报,2015,26(6):1421-1437

Copy

Article Metrics

Abstract:
PDF:
HTML:
Cited by:

History

Received:November 25,2012
Revised:March 28,2014
Adopted:
Online: June 04,2015
Published:

You are the first2034063Visitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address：4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code：100190
Phone：010-62562563 Fax：010-62562533 Email：jos@iscas.ac.cn
Technical Support：Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063

微信服务号

微信订阅号

Get Citation

Share

微信扫一扫：分享

Article Metrics

History