Abstract:Sequence data are ubiquitous in many domains such as text, Web access log and biological database. Similarity search in this kind of data is very important for knowledge acquisition and analysis. An indexing technique based on reference is an effective method for sequence similarity search, the main idea of which is to assign some sequences in database as reference sets, then filter out those sequences unrelated to query sequence and finally get the answer efficiently. This paper presents a similarity search algorithm IRI (improved reference indexing) which is based on current indexing technique using reference set and is more powerful in terms of filtration. First, previous query results are used to accelerate the current query. Then, the upper bound and lower bound based on sequence characteristic are proposed to make the bound tighter and improve the filtration capability. Finally, to avoid the time-consuming edit distance computing, only partial edit distance between prefix sequences need to compute, which makes the algorithm run faster. Real data including DNA and protein sequence data are used in the experiment. Comprehensive experimental results show that IRI is more efficient than the current reference-based indexing algorithm RI (reference indexing).