基于参考集索引的高效序列相似性查找算法
DOI:
作者:
作者单位:

作者简介:

通讯作者:

中图分类号:

基金项目:

Supported by the National Natural Science Foundation of China under Grant No.0573093 (国家自然科学基金); the National High-Tech Research and Development Plan of China under Grant No.2006AA02Z329 (国家高技术研究发展计划(863))


Efficient Algorithm for Sequence Similarity Search Based on Reference Indexing
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    序列数据在文本、Web访问日志文件、生物数据库中普遍存在,对其进行相似性查找是一种重要的获取和分析知识的手段.基于参考集索引技术是一类解决序列相似性查找的有效方法,主要思想是找到序列数据库中的少数序列作为参考集,通过参考集过滤掉数据库中与查询序列不相关的数据,从而高效地回答查询.在现有基于参考集索引技术的基础上,提出一种过滤能力更强的序列相似性查询算法IRI(improved reference indexing).首先,充分利用了先前的查询结果集来加速当前的查询,其次考虑了基于序列特征的上界和下界,使得应用参考集进行过滤的上下界更紧,过滤能力进一步加强.最后,为了避免候选集中费时的编辑距离计算,则只计算前缀序列间的编辑距离,从而进一步加速算法运行.实验采用真实的DNA序列和蛋白质序列数据,结果表明,算法IRI在查询性能上明显优于现有的基于参考集索引方法RI(reference indexing).

    Abstract:

    Sequence data are ubiquitous in many domains such as text, Web access log and biological database. Similarity search in this kind of data is very important for knowledge acquisition and analysis. An indexing technique based on reference is an effective method for sequence similarity search, the main idea of which is to assign some sequences in database as reference sets, then filter out those sequences unrelated to query sequence and finally get the answer efficiently. This paper presents a similarity search algorithm IRI (improved reference indexing) which is based on current indexing technique using reference set and is more powerful in terms of filtration. First, previous query results are used to accelerate the current query. Then, the upper bound and lower bound based on sequence characteristic are proposed to make the bound tighter and improve the filtration capability. Finally, to avoid the time-consuming edit distance computing, only partial edit distance between prefix sequences need to compute, which makes the algorithm run faster. Real data including DNA and protein sequence data are used in the experiment. Comprehensive experimental results show that IRI is more efficient than the current reference-based indexing algorithm RI (reference indexing).

    参考文献
    相似文献
    引证文献
引用本文

戴东波,熊赟,朱扬勇.基于参考集索引的高效序列相似性查找算法.软件学报,2010,21(4):718-731

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2008-09-29
  • 最后修改日期:2009-03-31
  • 录用日期:
  • 在线发布日期:
  • 出版日期:
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号