Optimizing Top-k Similarity Join Algorithm

doi:10.13328/j.cnki.jos.005012

2016, Vol. 27, No. 12, pp. 3051-3066. DOI:10.13328/j.cnki.jos.005012


Fund Project:

National Natural Science Foundation of China (61370205); Shanghai Municipal Natural Science Foundation (13ZR1400800); The Fundamental Research Funds for the Central Universities


Abstract:

Similarity join is widely used in data cleaning, data integration, and near-duplicate Web page detection. Existing similarity join algorithms fall into two categories: threshold-based similarity join and Top-k similarity join. Top-k similarity join is well suited to applications in which the similarity threshold is not known in advance, and the most efficient Top-k similarity join algorithm to date is Topk-join, proposed by Xiao et al. To address the performance problems of Topk-join, this paper proposes a Top-k similarity join algorithm, Opt-join. Opt-join integrates a token batch processing technique into the existing event-driven framework to reduce the cost of processing prefix events. It also reduces the cost of hash lookups by swapping the execution positions of the hash lookup and filtering operations; the correctness of this swap is proved theoretically. Experimental results show that Opt-join achieves a 1.28x-3.09x speed-up over Topk-join, and that its performance advantage tends to grow as the record length or the value of k increases.
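To make the two problem definitions in the abstract concrete, the sketch below contrasts a threshold-based join with a Top-k join on small token sets. It is a naive quadratic baseline for illustration only, not Topk-join or Opt-join: it uses none of the event-driven framework, prefix events, token batching, or hash-lookup reordering described above. The choice of Jaccard similarity and the names `jaccard`, `threshold_join`, and `naive_topk_join` are assumptions made for this example.

```python
import heapq
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed similarity function)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def threshold_join(records, threshold):
    """Threshold-based join: every pair whose similarity is >= threshold.

    The caller must know a suitable threshold in advance.
    """
    sets = [frozenset(r) for r in records]
    result = []
    for i, j in combinations(range(len(sets)), 2):
        sim = jaccard(sets[i], sets[j])
        if sim >= threshold:
            result.append((i, j, sim))
    return result


def naive_topk_join(records, k):
    """Top-k join: the k most similar pairs, with no threshold required.

    Brute force over all pairs (hypothetical baseline, O(n^2) comparisons).
    A min-heap of size k holds the best pairs seen so far; its root is the
    current k-th best similarity, which acts as a rising implicit threshold.
    """
    sets = [frozenset(r) for r in records]
    heap = []  # entries are (similarity, i, j)
    for i, j in combinations(range(len(sets)), 2):
        sim = jaccard(sets[i], sets[j])
        if len(heap) < k:
            heapq.heappush(heap, (sim, i, j))
        elif sim > heap[0][0]:
            heapq.heapreplace(heap, (sim, i, j))
    # Most similar pairs first.
    return sorted(heap, reverse=True)
```

`naive_topk_join(records, k)` returns up to `k` tuples `(similarity, i, j)` in decreasing order of similarity, where `i` and `j` are record positions. The heap root plays the role that a user-supplied threshold plays in `threshold_join`, which is why Top-k join fits settings where no threshold is known beforehand.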


Cite this article

王洪亚, 杨利宏, 刘晓强. Optimizing Top-k Similarity Join Algorithm. Journal of Software (软件学报), 2016, 27(12): 3051-3066


History

Received: 2015-06-11
Last revised: 2015-09-08
Published online: 2016-01-18
