Indels Detection Algorithm Based on Optimal Split-Read Matching

Fund Project: National Natural Science Foundation of China (61402132, 61571163, 61532014)

    Abstract:

    The development of next-generation high-throughput DNA sequencing has greatly advanced research on the detection of structural variations (SVs). Current SV detection methods are mainly based on depth of coverage, paired-end mapping clusters, or sequence assembly, and some of them are known to be inaccurate or overly sensitive. Moreover, some methods cannot identify the exact position and sequence of a structural variation. Insertions and deletions (indels) are the most common form of genomic structural variation. This paper proposes an optimal split-read matching algorithm (OSRM) based on dynamic programming. OSRM splits an abnormal read into the smallest possible number of segments. First, a score matrix is built for the abnormal read and the corresponding reference sequence. Then a backtracking path matrix is established. Next, a formula designed according to the characteristics of structural variation is used to select the optimal backtracking path matrix. Finally, the split read and the reference sequence are matched in an optimal arrangement, from which the exact position and sequence of each detected indel are output. Experiments show that the algorithm performs well. In addition, compared with Pindel, regarded as the best of the split-read methods, OSRM compensates for Pindel's weakness in detecting small and medium indels and can also handle more complex cases.
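    The pipeline described in the abstract (score matrix, backtracking matrix, indel read-out) can be illustrated with a small sketch. This is not the authors' OSRM: it is a plain Needleman-Wunsch-style global alignment with a linear gap penalty, and the scores `MATCH`, `MISMATCH`, `GAP`, the function names, and the tie-breaking order (diagonal first) are all assumptions made for illustration. The paper's path-selection formula is not given in the abstract and is not reproduced here.

```python
from typing import List, Tuple

# Assumed scoring scheme for illustration; not taken from the paper.
MATCH, MISMATCH, GAP = 2, -3, -2


def build_matrices(ref: str, read: str):
    """Fill the score matrix and the backtracking (move) matrix."""
    n, m = len(ref), len(read)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: only deletions possible
        score[i][0], move[i][0] = i * GAP, "D"
    for j in range(1, m + 1):          # first row: only insertions possible
        score[0][j], move[0][j] = j * GAP, "I"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = MATCH if ref[i - 1] == read[j - 1] else MISMATCH
            diag = score[i - 1][j - 1] + sub
            dele = score[i - 1][j] + GAP   # reference base missing from read
            ins = score[i][j - 1] + GAP    # read base missing from reference
            best = max(diag, dele, ins)
            score[i][j] = best
            # Assumed tie-break: prefer diagonal, then deletion, then insertion.
            move[i][j] = "M" if best == diag else ("D" if best == dele else "I")
    return score, move


def call_indels(ref: str, read: str) -> List[Tuple[str, int, str]]:
    """Trace back and report (kind, 0-based reference position, sequence)."""
    _, move = build_matrices(ref, read)
    i, j = len(ref), len(read)
    path = []
    while i > 0 or j > 0:
        op = move[i][j]
        if op == "M":
            i, j = i - 1, j - 1
        elif op == "D":
            i -= 1
        else:
            j -= 1
        path.append((op, i, j))
    path.reverse()

    indels: List[Tuple[str, int, str]] = []
    for op, i, j in path:
        if op == "M":
            continue
        kind = "DEL" if op == "D" else "INS"
        base = ref[i] if op == "D" else read[j]
        if indels:
            last_kind, pos, seq = indels[-1]
            # A deletion run advances along the reference; an insertion run
            # stays at the same reference position.
            expected = pos + len(seq) if kind == "DEL" else pos
            if last_kind == kind and expected == i:
                indels[-1] = (kind, pos, seq + base)
                continue
        indels.append((kind, i, base))
    return indels


print(call_indels("AAAACCCCGGGGTTTT", "AAAATTTT"))  # one 8-bp deletion
print(call_indels("AAAATTTT", "AAAACGTTTT"))        # one 2-bp insertion
```

    With a linear gap penalty a long indel is not explicitly favored over several short ones; an affine gap penalty would be one common way to prefer fewer, longer gaps, in the spirit of splitting a read into as few segments as possible.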

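Citation:

Wang Chunyu, Pan Jun, Guo Maozu, Liu Xiaoyan, Liu Yang, Liu Guojun. Indels detection algorithm based on optimal split-read matching. Journal of Software, 2017, 28(10):2640-2653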

History
  • Received: June 13, 2016
  • Revised: September 2, 2016
  • Online: October 19, 2016