带通配符和One-Off条件的序列模式挖掘
作者:
基金项目:

国家自然科学基金(61229301, 60828005, 61273292); 美国国家科学基金(CCF-0905337, CCF-0514819); 国家高技术研究发展计划(863)(2012AA011005); 国家重点基础研究发展计划(973)(2013CB329604)


Mining Sequential Patterns with Wildcards and the One-Off Condition
Author:
  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [17]
  • |
  • 相似文献 [20]
  • | | |
  • 文章评论
    摘要:

    很多应用领域产生大量的序列数据.如何从这些序列数据中挖掘具有重要价值的模式,已成为序列模式挖掘研究的主要任务.研究这样一个问题:给定序列S、支持度阈值和间隔约束,从序列S中挖掘所有出现次数不小于给定支持度阈值的频繁序列模式,并且要求模式中任意两个相邻元素在序列中的出现位置满足用户定义的间隔约束.设计了一种有效的带有通配符的模式挖掘算法One-Off Mining,模式在序列中的出现满足One-Off 条件,即模式的任意两次出现都不共享序列中同一位置的字符.在生物DNA 序列上的实验结果表明,One-Off Mining 比相关的序列模式挖掘算法具有更好的时间性能和完备性.

    Abstract:

    There is a huge wealth of sequence data available in real-world applications. The task of sequential pattern mining serves to mine important patterns from the sequence data. Given a sequence S, a certain threshold, and gap constraints, this paper aims to discover frequent patterns whose supports in S are no less than the given threshold value. There are flexible wildcards in pattern P, and the number of the wildcards between any two successive elements of P fulfills the user-specified gap constraints. The study designs an efficient mining algorithm: One-Off Mining, whose mining process satisfies the One-Off condition under which each character in the given sequence can be used at most once in all occurrences of a pattern. Experiments on DNA sequences show that this method performs better in time and completeness than the related sequential pattern mining algorithms.

    参考文献
    [1] van Belkum A, Scherer S, van Leeuwen W, Willemse W, van Alphen L, Verbrugh H. Variable number of tandem repeats inclinical strains of haemophilus influenzae. Infection and Immunity, 1997,65(12):5017-5027.
    [2] Agrawal R, Srikant R. Mining sequential patterns. In: Yu PS, Chen ALP, eds. Proc. of the Int’l Conf. on Data Engineering(ICDE’95). Los Alamitos: IEEE Computer Society, 1995. 3-14. [doi: 10.1109/ICDE.1995.380415]
    [3] Chen G, Wu X, Zhu X, Arslan A, He Y. Efficient string matching with wildcards and length constraints. Knowledge andInformation Systems, 2006,10(4):399-419. [doi: 10.1007/s10115-006-0016-8]
    [4] Srikant R, Agrawal R. Mining sequential patterns: Generalizations and performance improvements. In: Apers P, Bouzeghoub M,Gardarin G, eds. Proc. of the Int’l Conf. on Extending Database Technology (EDBT’96). LNCS 1057, Heidelberg: Springer-Verlag,1996. 13-17. [doi: 10.1007/BFb0014140]
    [5] Pei J, Han J, Mortazavi-Asl B, Chen Q, Dayal U, Hsu MC. PrefixSpan: Mining sequential patterns efficiently by prefix-projectedpattern growth. In: Georgakopoulos D, Buchmann A, eds. Proc. of the Int’l Conf. on Data Engineering (ICDE 2001). Los Alamitos:IEEE Computer Society, 2001. 215-224. [doi: 10.1109/ICDE.2001.914830]
    [6] Zaki MJ. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 2001,42(1):31-60. [doi: 10.1023/A:1007652502315]
    [7] Ayres J, Gehrke J, Yiu T, Flannick J. Sequential pattern mining using a bitmap representation. In: Simoff SJ, Za-ane OR, eds. Proc.of the ACM SIGKDD. New York: ACM Press, 2002. 429-435. [doi: 10.1145/775047.775109]
    [8] Zou X, Zhang W, Liu Y, Cai QS. Study on distributed sequential pattern discovery algorithm. Ruan Jian Xue Bao/Journal ofSoftware, 2005,16(7):1262-1269 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/16/1262.htm [doi: 10.1360/jos161262]
    [9] Ji XN, Bailey J, Dong GZ. Mining minimal distinguishing subsequence patterns with gap constraints. In: Han JW, Wah BW,Raghavan V, Wu XD, Rastogi R, eds. Proc. of the IEEE Int’l Conf. on Data Mining (ICDM 2005). Los Alamitos: IEEE ComputerSociety, 2005. 194-201. [doi: 10.1109/ICDM.2005.96]
    [10] Li C, Wang J. Efficiently mining closed subsequences with gap constraints. In: Proc. of the SIAM Int’l Conf. on Data Mining.Philadelphia: Society for Industrial Mathematics, 2008. 313-322.
    [11] Zhang M, Kao B, Cheung D, Yip K. Mining periodic patterns with gap requirement from sequences. In: Ozcan F, ed. Proc. of theACM SIGMOD. New York: ACM Press, 2005. 623-633.
    [12] Zhu XQ, Wu XD. Mining complex patterns across sequences with gap requirements. In: Veloso MM, ed. Proc. of the Int’l JointConf. on Artificial Intelligence (IJCAI 2007). Menlo Park: AAAI Press, 2007. 726-735.
    [13] He Y, Wu X, Zhu X, Arslan AN. Mining frequent patterns with wildcards from biological sequences. In: Chang, W, Joshi, JBD,eds. Proc. of the IEEE Int’l Conf. on Information Reuse and Integration (IRI 2007). Los Alamitos: IEEE Computer Society, 2007.329-334. [doi: 10.1109/IRI.2007.4296642]
    [14] Xie F, Wu X, Hu X, Gao J, Guo D, Fei Y, Hua E. Sequential pattern mining with wildcards. In: Gregoire E, ed. Proc. of the 22ndInt’l Conf. on Tools with Artificial Intelligence (ICTAI 2010). Los Alamitos: IEEE Computer Society, 2010. 241-247. [doi: 10.1109/ICTAI.2010.42]
    [15] Huang Y, Wu X, Hu X, Xie F, Gao J, Wu G. Mining frequent patterns with gaps and One-Off condition. In: Muzio JC, Brent RP,eds. Proc. of the 12th IEEE Int’l Conf. on Computational Science and Engineering (CSE 2009). New York: IEEE Press, 2009.180-186. [doi: 10.1109/CSE.2009.160]
    [16] Ding B, Lo D, Han J, Khoo S C. Efficient mining of closed repetitive gapped subsequences from a sequence database. In: IoannidisYE, Lee DL, Ng RT, eds. Proc. of the 25th Int’l Conf. on Data Engineering (ICDE 2009). Los Alamitos: IEEE Computer Society,2009. 1024-1035. [doi: 10.1109/ICDE.2009.104]
    [17] NCBI. http://www.ncbi.nlm.nih.gov
    引证文献
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

吴信东,谢飞,黄咏明,胡学钢,高隽.带通配符和One-Off条件的序列模式挖掘.软件学报,2013,24(8):1804-1815

复制
分享
文章指标
  • 点击次数:6393
  • 下载次数: 8448
  • HTML阅读次数: 0
  • 引用次数: 0
历史
  • 收稿日期:2011-08-05
  • 最后修改日期:2012-09-12
  • 在线发布日期: 2013-07-26
文章二维码
您是第19921160位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号