On Generalized Bisimilarity Join
Author:
Affiliation:

Fund Project:

National Natural Science Foundation of China (61373023)

  • Article
  • | |
  • Metrics
  • |
  • Reference [22]
  • |
  • Related
  • | | |
  • Comments
    Abstract:

    Similarity join is one of the hottest topics in the field of data management, and it has been widely applied in many fields. However, existing similarity join methods cannot meet the increasing demands in the real world. This paper define generalized bisimilarity join as a new similarity join to expend the applications of the similarity join research by introducing the satisfaction operator on various data types with individual thresholds. Two efficient methods, SJS(sub-join set) and MFV(mapping-filtering-verification), are proposed to solve this problem. A large amount of experiments conducted on both real-world and synthetic datasets demonstrate the correctness and the effectiveness of the proposed methods.

    Reference
    [1] Xiao C, Wang W, Lin XM, Yu JX, Wang GR. Efficient similarity joins for near-duplicate detection. ACM Trans. on Database Systems (TODS), 2011,36(3).[doi:10.1145/2000824.2000825]
    [2] Papapetrou P, Athitsos V, Kollios G, Gunopulos D. Reference-Based alignment in large sequence databases. Proc. of the VLDB Endowment, 2009,2(1):205-216.[doi:10.14778/1687627.1687651]
    [3] Li YN, Patel JM, Terrell A. Wham:A high-throughput sequence alignment method. ACM Trans. on Database Systems (TODS), 2012,37(4):28.[doi:10.1145/2389241.2389247]
    [4] Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In:Proc. of the 22nd Int'l Conf. on Data Engineering (ICDE 2006). 2006.[doi:10.1109/ICDE.2006.9]
    [5] Liu XL, Wang HZ, Li JZ, Gao H. Similarity join algorithm based on entity. Ruan Jian Xue Bao/Journal of Software, 2015,26(6):1421-1437(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/4610.htm[doi:10.13328/j.cnki.jos.004610]
    [6] Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In:Proc. of the 32nd Int'l Conf. on Very Large Data Bases. 2006. 918-929.
    [7] Zhang ZJ, Hadjieleftheriou M, Ooi BC, SrivastavaD. Bed-Tree:An all-purpose index structure for string similarity search based on edit distance. In:Proc. of the 2010 ACM SIGMOD Int'l Conf. on Management of Data. 2010. 915-926.[doi:10. 1145/1807167. 1807266]
    [8] Zhao X, Xiao C, Lin XM, Wang W. Efficient graph similarity joins with edit distance constraints. In:Proc. of the 28th Int'l Conf. on Data Engineering (ICDE 2012). 2012. 834-845.[doi:10.1109/ICDE.2012.91]
    [9] Wang JN, Feng JH, Li GL. Trie-Join:Efficient trie-based string similarity joins with edit-distance constraints. Proc. of the VLDB Endowment, 2010,3(1-2):1219-1230.
    [10] Li GL, Deng D, Wang JN, Feng JH. Pass-Join:A partition-based method for similarity joins. Proc. of the VLDB Endowment, 2011, 5(3):253-264.
    [11] Li R, Ju L, Peng Z, Yu ZW, Wang CK. Batch text similarity search with MapReduce. In:Proc. of the 13th Asia-Pacific Web Conf. Beijing, 2011. 412-423.
    [12] Lu W, Du XY, Hadjieleftheriou MM, Ooi BC. Efficiently supporting edit distance based string similarity search using B+-trees. IEEE Trans. on Knowledge and Data Engineering, 2014,26(12):2983-2996.[doi:10.1109/TKDE.2014.2309131]
    [13] Wang JN, Li GL, Deng D, Zhang Y, Feng JH. Two birds with one stone:An efficient hierarchical framework for top-k and threshold-based string similarity search. In:Proc. of the 31st Int'l Conf. on the Data Engineering (ICDE 2015). 2015. 519-530.[doi:10.1109/ICDE.2015.7113311]
    [14] Wang JN, Li GL, Feng JH. Fast-Join:An efficient method for fuzzy token matching based string similarity join. In:Proc. of the 27th Int'l Conf. on the Data Engineering (ICDE 2011). 2011. 458-469.[doi:10.1109/ICDE.2011.5767865]
    [15] Wang W, Qin JB, Xiao C, Lin XM, Shen HT. VChunkJoin:An efficient algorithm for edit similarity joins. IEEE Trans. on Knowledge and Data Engineering, 2013,25(8):1916-1929.[doi:10.1109/TKDE.2012.79]
    [16] Wang CK, Wang JM, Lin XM, Wang W, Wang HX, Li HS, Tian WP, Xu J, Li R. MapDupReducer:Detecting near duplicates over massive datasets. In:Proc. of the 2010 ACM SIGMOD Int'l Conf. on Management of Data. 2010. 1119-1122.[doi:10.1145/1807167.1807296]
    [17] Deng D, Li GL, Feng JH. A pivotal prefix based filtering algorithm for string similarity search. In:Proc. of the 2014 ACM SIGMOD Int'l Conf. on Management of Data. 2014. 673-684.[doi:10.1145/2588555.2593675]
    [18] Rong CT, Lu W, Wang XL, Du XY, Chen YG, Tung AKH. Efficient and scalable processing of string similarity join. IEEE Trans. on Knowledge and Data Engineering, 2013,25(10):2217-2230.[doi:10.1109/TKDE.2012.195]
    [19] Deng D, Li GL, Hao S, Wang JN, Feng JH. MassJoin:A MapReduce-based method for scalable string similarity joins. In:Proc. of the 30th Int'l Conf. on the Data Engineering (ICDE 2014). 2014. 340-351.[doi:10.1109/ICDE.2014.6816663]
    [20] Brualdi RA. Introductory Combinatorics. 5th ed., Pearson Education, 2009.
    附中文参考文献:
    [5] 刘雪莉,王宏志,李建中,高宏.基于实体的相似性连接算法.软件学报,2015,26(6):1421-1437. http://www.jos.org.cn/1000-9825/4610.htm[doi:10.13328/j.cnki.jos.004610]
    Related
    Cited by
    Comments
    Comments
    分享到微博
    Submit
Get Citation

王昶平,王朝坤,汪浩,王萌,陈俊.泛化双向相似连接.软件学报,2017,28(12):3223-3240

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:August 01,2016
  • Revised:October 26,2016
  • Online: March 27,2017
You are the first2043735Visitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063