Efficient Compressed Genomic Data Oriented Query Approach
Author:
Affiliation:

Fund Project:

National Natural Science Foundation-Outstanding Youth Foundation (61322208); National Basic Research Program of China (973) (2012CB316201); National Natural Science Foundation of China (61272178, 61572122, 61532021)

  • Article
  • | |
  • Metrics
  • |
  • Reference [31]
  • |
  • Related
  • | | |
  • Comments
    Abstract:

    With the rapid development of the third and next generation sequencing techniques, genomic sequences such as DNA grow explosively. Processing such big data efficiently is a great challenge. Research found that although those sequences are very large, they are highly similar to each other. Thus it is feasible to reduce the space cost by storing their differences to a reference sequence. New studies show that it is possible to directly search on the compressed data. The target of this study is to improve the scalability of the indexing and searching techniques to meet the growing demand of big data. Based on the existing method, this work is placed on compressing the reference sequence. Several optimization techniques are proposed to perform efficient exact and approximate search with arbitrary query length on the compressed data. The process is further improved by utilizing parallel computing to crease the query efficiency for big data. Experimental study demonstrates the efficiency of the proposed method.

    Reference
    [1] Mardis ER. The impact of next-generation sequencing technology on genetics. Trends in Genetics, 2008,24(3):133-141.[doi:10.1016/j.tig.2007.12.007]
    [2] Rusk N. Cheap third-generation sequencing. Nature Methods, 2009,6(4):244.[doi:10.1038/nmeth0409-244a]
    [3] Mäkinen V, Navarro G, Sirén J, Välimäki N. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 2010,17(3):281-308.[doi:10.1089/cmb.2009.0169]
    [4] Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X. The complete genome of an individual by massively parallel DNA sequencing. Nature, 2008,452(7189):872-876.[doi:10.1038/nature06884]
    [5] Christley S, Lu Y, Li C, Xie X. Human genomes as email attachments. Bioinformatics, 2009,25(2):274-275.[doi:10.1093/bioinformatics/btn582]
    [6] Ferrada H, Gagie T, Hirvola T, Puglisi SJ. Hybrid indexes for repetitive datasets. Philosophical Trans. of the Royal Society of London A:Mathematical, Physical and Engineering Sciences, 2014,372(2016):130-137.[doi:10.1098/rsta.2013.0137]
    [7] Kuruppu S, Puglisi SJ, Zobel J. Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In:String Processing and Information Retrieval. Berlin, Heidelberg:Springer-Verlag, 2010.201-206.[doi:10.1007/978-3-642-16321-0_20]
    [8] Kreft S, Navarro G. Self-Indexing based on LZ77. Lecture Notes in Computer Science, 2011,6661(1):41-54.[doi:10.1007/978-3-642-21458-5_6]
    [9] Yang XC, Wang B, Li C, Wang JY. Efficient direct search on compressed genomic data. In:Proc. of the Int'l Conf. on Data Engineering. IEEE, 2013.961-972.[doi:10.1109/ICDE.2013.6544889]
    [10] Deorowicz S, Grabowski S. Robust relative compression of genomes with random access. Bioinformatics, 2011,27(21):2979-2986.[doi:10.1093/bioinformatics/btr505]
    [11] Wandelt S, Leser U. FRESCO:Referential compression of highly-similar sequences. IEEE/ACM Trans. on Computational Biology & Bioinformatics, 2013,10(5):1275-1288.[doi:10.1109/TCBB.2013.122]
    [12] Brandon MC, Wallace DC, Baldi P. Data structures and compression algorithms for genomic sequence data. Bioinformatics, 2009,25(14):1731-1738.[doi:10.1093/bioinformatics/btp319]
    [13] Daily K, Rigor P, Christley S. Data structures and compression algorithms for high-throughput sequencing technologies. BMC Bioinformatics, 2010,11(19):514-524.[doi:10.1186/1471-2105-11-514]
    [14] Rasko L, Hideaki S, Martin S. The sequence read archive. Nucleic Acids Research, 2011,39(2):19-21.[doi:10.1093/nar/gkq1019]
    [15] Pinho AJ, Diogo P, Garcia SP. GReEn:A tool for efficient compression of genome resequencing data. Nucleic Acids Research, 2011,40(4):27-53.[doi:10.1093/nar/gkr1124]
    [16] Zhu YY, Xiong Y. DNA sequence data mining technique. Ruan Jian Xue Bao/Journal of Software, 2007,18(11):2766-2781(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/18/2766.htm[doi:10.1360/jos182766]
    [17] Claude F, Farina A, Martínez-Prieto MA, Navarro G. Compressed q-gram indexing for highly repetitive biological sequences. In:Proc. of the BioInformatics and BioEngineering. IEEE, 2010.86-91.[doi:10.1109/BIBE.2010.22]
    [18] Schneeberger K, Hagmann J, Ossowski S, Warthmann N, Gesing S, Kohlbacher O, Weigel D. Simultaneous alignment of short reads against multiple genomes. Genome Biology, 2009,10(9):98-119.[doi:10.1186/gb-2009-10-9-r98]
    [19] Claude F, Fariña A, Martínez-Prieto MA, Navarro G. Indexes for highly repetitive document collections. In:Proc. of the 20th ACM Int'l Conf. on Information and Knowledge Management. ACM, 2011.463-468.[doi:10.1145/2063576.2063646]
    [20] Huang S, Lam TW, Sung WK, Tam SL, Yiu SM. Indexing similar DNA sequences. In:Algorithmic Aspects in Information and Management. Berlin, Heidelberg:Springer-Verlag, 2010.180-190.[doi:10.1007/978-3-642-14355-7_19]
    [21] Navarro G. A guided tour to approximate string matching. ACM Computing Surveys, 2001,33(1):31-88.[doi:10.1145/375360.375365]
    [22] Lin XM, Wang W. Set and string similarity queries:A survey. Chinese Journal of Computers, 2011,34(10):1853-1862(in Chinese with English abstract).[doi:10.3724/SP.J.1016.2011.01853]
    [23] Danek A, Deorowicz S, Grabowski S. Indexing large genome collections on a PC. Eprint Arxiv, 2014,9(10):52-59.
    [24] Ferrada H, Gagie T, Hirvola T, Puglisi SJ. AliBI:An alignment-based index for genomic datasets. ArXiv, 2013,7(4):32-39.
    [25] Wandelt S, Leser U. MRCSI:Compressing and searching string collections with multiple references. Proc. of the VLDB Endowment, 2015,8(5):461-472.[doi:10.14778/2735479.2735480]
    [26] Burrows M. A block-sorting lossless data compression algorithm. Digital SRC Research Report, 1994,57(4):425-449.
    [27] Navarro G, Mäkinen V. Compressed full-text indexes. ACM Computing Surveys, 2007,39(1):2-53.[doi:10.1145/1216370.1216372]
    [28] Lam TW, Li R, Tam A, Wong S, Wu E, Yiu SM. High throughput short read alignment via bi-directional BWT. In:Proc. of the 2009 IEEE Int'l Conf. on Bioinformatics and Biomedicine. IEEE Computer Society, 2009.31-36.[doi:10.1109/BIBM.2009.42]
    附中文参考文献:
    [16] 朱扬勇,熊赟.DNA序列数据挖掘技术.软件学报,2007,18(11):2766-2781. http://www.jos.org.cn/1000-9825/18/2766.htm[doi:10.1360/jos182766]
    [22] 林学民,王炜.集合和字符串的相似度查询.计算机学报,2011,34(10):1853-1862.[doi:10.3724/SP.J.1016.2011.01853]
    Related
    Cited by
Get Citation

王佳英,王斌,杨晓春.面向压缩生物基因数据的高效的查询方法.软件学报,2016,27(7):1715-1728

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:September 25,2015
  • Revised:January 12,2016
  • Online: March 24,2016
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063