Similarity Join Algorithm Based on Entity
Author:
Affiliation:

  • Article
  • | |
  • Metrics
  • |
  • Reference [22]
  • |
  • Related
  • | | |
  • Comments
    Abstract:

    Taking entity as the basic unit to organize the tuples in query processing is an effective way of managing low-quality data. As many descriptions of attribute value in an entity, join operator must support similarity join over multiple values. Entity similarity join is more effective than traditional similarity join in data cleaning, information integration, fuzzy keyword search, fraud detection, and text aggregation. In this paper, an entity similarity join algorithm, ES-JOIN, is designed by adopting the structure of the double layer prefix index. The presented method is suitable for solving set similarity join problem based on fuzzy elements matching, and thus is a better choice than the traditional set similarity join which only considers exact element match. In order to accelerate the join process, a new filtering measures is proposed to optimize the algorithm, and an optimization algorithm, OPT_ES-join, is also obtained. Experiments demonstrate that the ES-JOIN algorithm has good efficiency and scalability, and the filter measures is very effective.

    Reference
    [1] Bertossi L, Kolahi S, Lakshmanan L. Data cleaning and query answering with matching dependencies and matching functions. In: Abiteboul S, Böhm K, Koch C, Tan KL, eds. Proc. of the 27th Int'l Conf. on Data Engineering. Hannover: IEEE Computer Society, 2011. 268-279. [doi: 10.1145/1938551.1938585]
    [2] Dong X, Halevy AY, Yu C. Data integration with uncertainty. In: Koch C, Gehrke J, Garofalakis MN, Srivastava D, Aberer K, Deshpande A, Florescu D, Chan CY, Ganti V, Kanne CC, Klas WJ, Neuhold E, eds. Proc. of the 33rd Int'l Conf. on Very Large Data Bases. Vienna: ACM Press, 2007. 687-698.
    [3] Ji S, Li G, Li C, Feng JH. Efficient interactive fuzzy keyword search. In: Proc. of the 18th Int'l Conf. on World Wide Web. Madrid: ACM Press, 2009. 371-380. [doi: 10.1145/1526709.1526760]
    [4] Timothy C, Justin Z. Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology, 2003,54(3):203-215. [doi: 10.1002/asi.10170]
    [5] Broder AZ, Glassman SC, Manasse MS, Zweig G. Syntactic clustering of the Web. Computer Networks and ISDN Systems, 1997, 29(8):1157-1166. [doi: 10.1016/S0169-7552(97)00031-7]
    [6] Li G, Deng D, Wang J, Feng JH. Pass-Join: A partition-based method for similarity joins. VLDB Endowment, 2011,5(3):253-264. [doi: 10.14778/2078331.2078340]
    [7] Wang J, Feng J, Li G. Trie-Join: Efficient trie-based string similarity joins with edit-distance constraints. VLDB Endowment, 2010, 3(1-2):1219-1230. [doi: 10.14778/1920841.1920992]
    [8] Xiao C, Wang W, Lin X. Ed-Join: An efficient algorithm for similarity joins with edit distance constraints. VLDB Endowment, 2008,1(1):933-944. [doi: 10.14778/1453856.1453957]
    [9] Bayardo J, Ma Y, Srikant R. Scaling up all pairs similarity search. In: Proc. of the 16th Int'l Conf. on World Wide Web. Banff: ACM Press, 2007. 131-140. [doi: 10.1145/1242572.1242591]
    [10] Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In: Liu L, Reuter A, Whang KY, Zhang JJ, eds. Proc. of the 22nd Int'l Conf. on Data Engineering. Atlanta: IEEE Computer Society, 2006. 1-5. [doi: 10.1109/ICDE. 2006.9]
    [11] Li GL, Deng D, Feng J. Faerie: Efficient filtering algorithms for approximate dictionary-based entity extraction. In: Sellis TK, Miller RJ, Kementsietsidis A, Velegrakis Y, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Athens: ACM Press, 2011. 529-540. [doi: 10.1145/1989323.1989379]
    [12] Navarro G. A guided tour to approximate string matching. ACM Computing Surveys, 2011,33(1):31-88. [doi: 10.1145/375360. 375365]
    [13] Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In: Dayal U, Whang KY, Lomet DB, Alonso G, Lohman GM, Kersten ML, Cha SK, Kim YK, eds. Proc. of the 32nd Int'l Conf. on Very Large Data Bases. Seoul: ACM Press, 2006. 918-929.
    [14] Sarawagi S, Kirpal A. Efficient set joins on similarity predicates. In: Weikum G, König AC, Deßloch S, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Paris: ACM Press, 200. 743-754. [doi: 10.1145/1007568.1007652]
    [15] Vernica R, Carey MJ, Li C. Efficient parallel set-similarity joins using MapReduce. In: Elmagarmid AK, Agrawal D, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD 2010). Indianapolis: ACM Press, 2010. 495-506. [doi: 10.1145/ 1807167.1807222]
    [16] Li C, Wang B, Yang XC. Vgram: Improving performance of approximate queries on string collections using variable-length grams. In: Koch C, Gehrke J, Garofalakis MN, Srivastava D, Aberer K, Deshpande A, Florescu D, Chan CC, Ganti V, Kanne C, Klas W, Neuhold EJ, eds. Proc. of the 33rd Int'l Conf. on Very Large Data Bases. Vienna: ACM Press, 2007. 303-314.
    [17] Wang J, Li G, Feng J. Can we beat the prefix filtering: An adaptive framework for similarity join and search. In: Candan KS, Chen Y, Snodgrass RT, Gravano L, Fuxman A, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Scottsdale: ACM Press, 2012. 85-96. [doi: 10.1145/2213836.2213847]
    [18] Wang J, Li G, Feng J. Fast-Join: An efficient method for fuzzy token matching based string similarity join. In: Abiteboul S, Böhm K, Koch C, Tan KL, eds. Proc. of the 27th Int'l Conf. on Data Engineering. Hannover: IEEE Computer Society, 2011. 458-469. [doi: 10.1109/ICDE.2011.5767865]
    [19] Green T, Tannen V. Models for incomplete and probabilistic information. Current Trends in Database Technology—EDBT, 2006, 4254:278-296. [doi: 10.1007/11896548_24]
    [20] Wang HZ, Li JZ, Gao H. Data model for dirty databases. Ruan Jian Xue Bao/Journal of Software, 2012,23(3):539-549 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/4042.htm [doi: 10.3724/SP.J.1001.2012.04042]
    [21] Xiao C, Wang W, Lin XM, Yu JX. Efficient similarity joins for near duplicate detection. In: Huai JP, Chen R, Hon HW, LiuYH, Ma WY, Tomkins A, Zhang XD, eds. Proc. of the 17th Int'l Conf. on World Wide Web. Beijing: ACM Press, 2008. 131-140. [doi: 10.1145/1367497.1367516]
    [22] Jestes J, Li FF, Yan ZP, Yi K. Probabilistic string similarity joins. In: Elmagarmid AK, Agrawal D, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Indianapolis: ACM Press, 2010. [doi: 10.1145/1807167.1807204]
    Related
    Cited by
    Comments
    Comments
    分享到微博
    Submit
Get Citation

刘雪莉,王宏志,李建中,高宏.基于实体的相似性连接算法.软件学报,2015,26(6):1421-1437

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:November 25,2012
  • Revised:March 28,2014
  • Online: June 04,2015
You are the first2034063Visitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063