一种基于彩色编码技术的基序发现算法
作者:
基金项目:

Supported by the National Natural Science Foundation of China under Grant Nos.30370418, 90209008, 60302016, 60532050, 30500131 (国家自然科学基金); the Project for National Science Fund for Distinguished Young Scholars of China under Grant No.60225008 (国家杰出青年科学基金); the Joint Research Fund for Overseas Chinese Young Scholars under Grant No.30528027 (海外杰出青年研究基金); the Beijing Natural Science Foundation under Grant Nos.4051002, 4042024 (北京市自然科学基金)


A Motif Finding Algorithm Based on Color Coding Technology
  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [23]
  • |
  • 相似文献 [20]
  • |
  • 引证文献
  • | |
  • 文章评论
    摘要:

    从DNA序列中发现基序是生物计算中的一个重要问题,序列条数K=20包含基序用例的序列条数k=16的(l,d)-(K-k)问题(记作(l,d)-(20-16)问题)是目前生物学家十分关注的基序发现问题.针对该问题提出了一种基于彩色编码技术的SDA(sample-driven algorithm)搜索算法--彩色编码基序搜索算法(color coding motif finding algorithm,简称CCMF算法).它利用彩色编码技术将该问题转化为(l,d)-(16-16)问题,再采用分治算法和分支定界法来求解.在解决将(l,d)-(20-16)问题转化为(l,d)-(16-16)问题时,CCMF算法利用彩色编码技术将4 845个组合降低到403个着色,这将极大地提高算法的整体运行效率.使用模拟数据和生物数据进行测试的结果表明,CCMF算法能够快速发现所有(l,d)-(20-16)问题的基序模型和基序用例,具有优于其他算法的综合性能评价,能够用于真实的基序发现问题.同时,通过修改着色方案,CCMF算法可以用于求解一般的(l,d)-(K-k)问题,其中,kK.

    Abstract:

    Finding common pattern, motifs or signals, in a set of DNA sequences is an important problem in computational biology. Recently, some biologists extremely focus on the (l,d)-(K-k) motif finding problem when the number of sequences K is 20 and the number of sequences with instances k is 16, (l,d)-(20-16) problem for short. For solving this problem, this paper introduces a novel sample-driven algorithm (SDA), called color coding motif finding algorithm, CCMF for short. It uses color coding technology to converse a (l,d)-(20-16) problem to some (l,d)-(16-16) problems, then uses divide-and-conquer and branch-and-bound approaches to solve this (l,d)-(16-16) problem. Using the conversion process can reduce 4 845 combinations to 403 colorings, while increasing the running rate enormously. The experimental results on synthetic and real datasets show that the CCMF algorithm can accurately and efficiently find all (l,d)-(20-16) patterns and instances. Its comprehensive performances in finding motifs are superior to those of other existing algorithms. It is applicable for real biological purpose. The color coding technology can also be used to improve the performances of other similar (l,d)-(K-k) problems when k is less than K.

    参考文献
    [1]Helden JV,Andre B,Collado-Vides J.Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies.Journal of Molecular Biology,1998,281:827-842.
    [2]Liu X,Brutlag DL,Liu J.Bioprospector:Discovering conserved DNA motifs in upstream regulatory regions of co-expressed gene.In:Altman RB,eds.Proc.of the 6th Pacific Symposium on Biocomputing.Word Scientific Publish Company,2001.127-138.
    [3]Blanchette M,Schwikowski B,Tompa M.Algorithms for phylogenetic footprinting.Journal of Computational Biology,2002,9(2):211-223.
    [4]Liu XS,Brutlag DL,Liu JS.An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments.Nature Biotechnology,2002,20(8):835-839.
    [5]Pevzner P,Sze S.Combinatorial approaches to finding subtle signals in DNA sequences.In:Bourne PE,Gribskov M,Altman RB,et al.,eds.Proc.of the 8th Int'l Conf.on Intelligent Systems for Molecular Biology (ISMB 2000).San Diego:AAAI Press,2000.269-278.
    [6]Styczynski MP,Jensen KL,Rigoutsos I,Stephanopoulos GN.An extension and novel solution to the (l,d)-motif challenge problem.Genome Informatics,2004,15:63-71.
    [7]Chin FYL,Leung HCM.An efficient algorithm for the extended (l,d)-motif problem with unknown number of binding sites.In:Bourbakis NG,ed.Proc.of IEEE 5th Symp.on Bioinformatics and Bioengineering (BIBE 2005).Minneapolis:IEEE Computer Society,2005.11-18.
    [8]Brazma A,Jonassen I,Eidhammer I,Gilbert D.Approaches to the automatic discovery of patterns in biosequences.Journal of Computational Biology,1998,5(2):279-305.
    [9]Buhler J,Tompa M.Finding motifs using random projections.Journal of Computational Biology,2002,9(2):225-242.
    [10]Sinha S,Tompa M.Discovery of novel transcription factor binding sites by statistical overrepresentation.Nucleic Acids Research,2002,30(24):5549-5560.
    [11]Eskin E,Pevzner PA.Finding composite regulatory patterns in DNA sequences.Bioinformatics,2002,18:354-363.
    [12]Price A,Ramabhadran S,Pevzner PA.Finding subtle motifs by branching from sample strings.Bioinformatics,2003,SⅡ:149-155.
    [13]Sze SH,Lu S,Chen J.Integrating sample-driven and pattern-driven approaches in motif finding.In:LNCS/LN Bioinformatics (WABI 2004),2004.438-449.
    [14]Gross JL,Yellen JL.Handbook of Graph Theory and Applications.CRC Press,2003.12-43.
    [15]From MathWorld-A wolfram Web resource.2006.http://mathworld.wolfram.com/Combination.html
    [16]Alon N,Yuster R,Zwick U.Color-Coding.Journal of the ACM,1995,42(4):844-856.
    [17]Chen J,Lu S,Sze SH,Zhang F.Improved algorithms for path,matching,and packing problems.In:Gabow HN,ed.Proc.of the 18th Annual ACM-SIAM Symp.on Discrete Algorithms (SODA'2007).2007.298-307.
    [18]Scott J,Ideker T,Karp R,Sharan R.Efficient algorithms for detecting signaling pathways in protein interaction networks.Research in Computational Molecular Biology (RECOME 2005).2005.1-13.
    [19]Chen J,Kanj I,Meng J,Xia G,Zhang F.On the effective enumerability of NP problems.In:Bodlaender HL,Langston MA,eds.Proc.of the 2nd Int'l Workshop on Parameterized and Exact Computation (IWPEC 2006).Zürich:Springer-Verlag,2006.215-226.
    [20]Lanctot JK,Li M,Ma B,Wang SJ,Zhang LX.Distinguishing string selection problems.Information and Computation,2003,185(1):41-55.
    [21]Gramm J,Niedermeier R,Rossmanith P.Exact solutions for closest string and related problems.In:LNCS 2223 (ISAAC 2001),2001.441-453.
    [22]Stormo GD,Hartzell III GW.Identifying protein-binding sites from unaligned DNA fragments.In:JSTOR,ed.Proc.of the National Academy of Sciences of the United States of America.National Academy of Sciences,1989,86(4):1183-1187.
    [23]Zhu J,Zhang M.SCPD:A promoter database of the yeast Saccha-romyces cerevisiae.Bioinformatics,1999,15:563-577.http://cgsigma.cshl.org/jian/
    引证文献
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

王建新,黄元南,陈建二.一种基于彩色编码技术的基序发现算法.软件学报,2007,18(6):1298-1307

复制
分享
文章指标
  • 点击次数:4163
  • 下载次数: 5767
  • HTML阅读次数: 0
  • 引用次数: 0
历史
  • 收稿日期:2006-06-26
  • 最后修改日期:2006-08-29
文章二维码
您是第19939651位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号