Subnettrees for Strict Pattern Matching with General Gaps and Length Constraints
Author:
Affiliation:

  • Article
  • | |
  • Metrics
  • |
  • Reference [30]
  • |
  • Related [20]
  • | | |
  • Comments
    Abstract:

    Pattern matching with gap constraints has important applications in many fields such as information retrieval, computational biology, and sequential pattern mining etc. This paper proposes a pattern matching problem named Strict Pattern Matching with General Gaps and Length Constraints (SPANGLO) which has four characteristics: It is strict exact pattern matching; Any position in the given sequence can be used more than once; The pattern can have more than one general gap constraint; Each matching substring has length constraints. An instance of SPANGLO can be transformed into an exponential number of matching instances with non-negative gaps in the worst case. In order to solve the problem effectively, an algorithm named SubnettreeSpanglo (SETS) is proposed based on Subnettree and its special concepts and properties. The correctness and completeness of the algorithm are given, and the space and time complexities of the algorithm are O(m×MaxLen×W) and O(MaxLen×W×m2×n) respectively, where n, m, MaxLen and W are the sequence length, the pattern length, the maximum length of the occurrence and the maximum gap of the pattern respectively. Experimental results validate the correctness of the transforming method of SPANGLO and the efficientiveness and correctness of SETS.

    Reference
    [1] Fischer MJ, Paterson MS. String matching and other products. In: Karp R, ed. Proc. of the 7th SIAM AMS Complexity of Computation. Cambridge: American Mathematical Society, 1974. 113-125.
    [2] Manber U, Baeza-Yates R. An algorithm for string matching with a sequence of don't cares. Information Processing Letters, 1991, 37(3):133-136. [doi: 10.1016/0020-0190(91)90032-D]
    [3] Lewenstein M. Indexing with gaps. In: Grossi R, Sebastiani F, Silvestri F, eds. Proc. of the Int'l Symp. String Processing and Information Retrieval. Pisa: Springer-Verlag, 2011. 135-143. [doi: 10.1007/978-3-642-24583-1_14]
    [4] Navarro G, Raffinot M. Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. Journal of Computational Biology, 2003,10(6):903-923. [doi: 10.1089/106652703322756140]
    [5] Wang H, Xie F, Hu X, Li P, Wu X. Pattern matching with flexible wildcards and recurring characters. In: Hu X, Lin TY, Raghavan VV, Grzymala-Busse JW, Liu Q, Broder AZ, eds. Proc. of the 2010 IEEE Int'l Conf. on Granular Computing. Silicon Valley: IEEE Computer Society, 2010. 782-786. [doi: 10.1109/GrC.2010.156]
    [6] Cole R, Gottlieb LA, Lewenstein M. Dictionary matching and indexing with errors and don't cares. In: Babai L, ed. Proc. of the 36th ACM Symp. on the Theory of Computing. Chicago: ACM Press, 2004. 91-100. [doi: 10.1145/1007352.1007374]
    [7] Crochemore M, Iliopoulos C, Makris C, Rytter W, Tsakalidis A, Trichlas K. Approximate string matching with gaps. Nordic Journal of Computing, 2002,9(1):54-65.
    [8] Cantone D, Cristofaro S, Faro S. New efficient bit-parallel algorithms for the (δ,α)-matching problem with applications in music information retrieval. Int'l Journal of Foundations of Computer Science, 2009,20(6):1087-1108. [doi: 10.1142/S0129054109007054]
    [9] Ji X, Bailey J, Dong G. Mining minimal distinguishing subsequence patterns with gap constraints. Knowledge and Information Systems, 2007,11(3):259-286. [doi: 10.1007/s10115-006-0038-2]
    [10] Ferreira PG, Azevedo PJ. Protein sequence pattern mining with constraints. In: Jorge A, Torgo L, Brazdil P, Camacho R, Gama J, eds. Proc. of the European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD). Porto: Springer-Verlag, 2005. 96-107. [doi: 10.1007/11564126_14]
    [11] Min F, Wu X, Lu Z. Pattern matching with independent wildcard gaps. In: Proc. of the 8th Int'l Conf. on Pervasive Intelligence and Computing. Chengdu: IEEE, 2009. 194-199. [doi: 10.1109/DASC.2009.65]
    [12] Navarro G. A guided tour to approximate string matching. ACM Computing Surveys, 2001,33(1):31-88. [doi: 10.1145/375360.375 365]
    [13] Chen G, Wu X, Zhu X, Arslan AN, He Y. Efficient string matching with wildcards and length constraints. Knowledge and Information Systems, 2006,10(4):399-419. [doi: 10.1007/s10115-006-0016-8]
    [14] Wu YX, Wu XD, Jiang H, Min F. A heuristic algorithm for MPMGOOC. Chinese Journal of Computers, 2011,34(8):1452-1462 (in Chinese with English abstract). [doi: 10.3724/SP.J.1016.2011.01452]
    [15] Li C, Yang Q, Wang J, Li M. Efficient mining of gap-constrained subsequences and its various applications. ACM Trans. on Knowledge Discovery from Data (TKDD), 2012,6(1):Article 2. [doi: 10.1145/2133360.2133362]
    [16] Zhang S, Zhang J, Zhu X, Huang Z. Identifying follow-correlation itemset-pairs. In: Proc.of the Int'l Conf. on Data Mining (ICDM). Hong Kong: IEEE Computer Society, 2006. 765-774. [doi: 10.1109/ICDM.2006.84]
    [17] Zhang M, Kao B, Cheung D, Yip K. Mining periodic patterns with gap requirement from sequences. ACM Trans. on Knowledge Discovery from Data, 2007,1(2):Article 7. [doi: 10.1145/1267066.1267068]
    [18] Tanbeer SK, Ahmed CF, Jeong BS. Mining regular patterns in data streams. In: Kitagawa H, Ishikawa Y, Li Q, Watanabe C, eds. Proc. of the Database Systems for Advanced Applications. Tsukuba: Springer-Verlag, 2010. 399-413. [doi: 10.1007/978-3-642-12026-8_31]
    [19] Li Z, Ding B, Han J, Kays R, Nye P. Mining periodic behaviors for moving objects. In: Rao B, Krishnapuram B, Tomkins A, Yang Q, eds. Proc. of the 16th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD). Washington: ACM Press, 2010. 1099-1108. [doi: 10.1145/1835804.1835942]
    [20] Zhu X, Wu X. Mining complex patterns across sequences with gap requirements. In: Veloso MM, ed. Proc. of the 20th Int'l Joint Conf. on Artificial Intelligence (IJCAI). Hyderabad: Springer-Verlag, 2007. 2934-2940.
    [21] Huang Y, Wu X, Hu X, Xie F, Gao J, Wu G. Mining frequent patterns with gaps and one-off condition. In: Proc. of the IEEE Int'l Conf. on Computational Science and Engineering (CSE 2009). Vancouver: IEEE Computer Society, 2009. 180-186. [doi: 10.1109/CSE.2009.160]
    [22] Ding B, Lo D, Han J, Khoo SC. Efficient mining of closed repetitive gapped subsequences from a sequence database. In: Ioannidis YE, Lee DL, Ng RT, eds. Proc. of the IEEE 25th Int'l Conf. on Data Engineering (ICDE). Shanghai: IEEE, 2009. 1024-1035. [doi: 10.1109/ICDE.2009.104]
    [23] Myers E. Approximate matching of network expressions with spacers. Journal of Computational Biology, 1996,3(1):33-51. [doi: 10.1089/cmb.1996.3.33]
    [24] Fredriksson K, Grabowski S. Efficient algorithms for pattern matching with general gaps and character classes. In: Crestani F, Ferragina P, Sanderson M, eds. Proc. of the Int'l Conf. on String Processing and Information Retrieval. Glasgow: Springer-Verlag, 2006. 267-278. [doi: 10.1007/11880561_22]
    [25] Fredriksson K, Grabowski S. Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance. Information Retrieval, 2008,11(4):335-357. [doi: 10.1007/s10791-008-9054-z]
    [26] Akutsu T. Approximate string matching with variable length don't care characters. IEICE Trans. on Information and Systems, 1996, E79-D(9):1353-1354.
    [27] Bille P, Gørtz I, Vildhøj H, Wind D. String matching with variable length gaps. In: Chávez E, Lonardi S, eds. Proc. of the 17th Int'l Conf. on String Processing and Information Retrieval (SPIRE 2010). Mexico: Springer-Verlag, 2010. 385-394. [doi: 10.1007/978-3-642-16321-0_40]
    [28] Rahman S, Iliopoulos C, Lee I, Mohamed M, Smyth W. Finding patterns with variable length gaps or don't cares. In: Chen DZ, Lee DT, eds. Proc. of the 12th Annual Int'l Conf. on Computing and Combinatorics. Taipei: Springer-Verlag, 2006. 146-155. [doi: 10.1007/11809678_17]
    [29] He D, Wu X, Zhu X. SAIL-APPROX: An efficient on-line algorithm for approximate pattern matching with wildcards and length constraints. In: Proc. of the 2007 IEEE Int'l Conf. on Bioinformatics and Biomedicine (BIBM 2007). Silicon Valley: IEEE Computer Society, 2007. 151-158. [doi: 10.1109/BIBM.2007.48]
    [30] Wu Y, Wu X, Min F, Li Y. A Nettree for pattern matching with flexible wildcard constraints. In: Proc. of the 2010 IEEE Int'l Conf. on Information Reuse and Integration (IRI 2010). Las Vegas: IEEE Systems, Man, and Cybernetics Society, 2010. 109-114. [doi: 10.1109/IRI.2010.5558954]
    Cited by
    Comments
    Comments
    分享到微博
    Submit
Get Citation

武优西,刘亚伟,郭磊,吴信东.子网树求解一般间隙和长度约束严格模式匹配.软件学报,2013,24(5):915-932

Copy
Share
Article Metrics
  • Abstract:5515
  • PDF: 4354
  • HTML: 0
  • Cited by: 0
History
  • Received:September 13,2012
  • Revised:February 05,2013
  • Online: May 07,2013
You are the first2044932Visitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063