[Keywords]
[Abstract]
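The pattern-based Bayesian classification model is an effective approach to classification problems in data mining. However, most pattern-based Bayesian classifiers consider only the support of a pattern in the target class dataset and ignore its support in the opposing class dataset. In addition, pattern-based Bayesian classifiers built for static datasets are not applicable to high-speed, dynamically changing, unbounded data streams. To address these problems, the data stream Bayesian classification model EPDS (Bayesian classifier algorithm based on emerging pattern for data stream), which is based on emerging patterns, is proposed. The model uses a simple hybrid forest structure to maintain the itemsets of in-memory transactions, and adopts a fast pattern extraction mechanism to speed up the algorithm. EPDS uses a semi-lazy learning strategy to update emerging patterns continuously, and builds a local classification model under each class for each transaction to be classified. Extensive experimental results show that the algorithm achieves higher accuracy than other data stream classification models.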
[Keywords]
[Abstract]
Pattern-based Bayesian models are an effective approach to the classification problem in data mining. However, most pattern-based Bayesian classifiers consider only the support of a pattern in the dataset of the home class and ignore its support in the counterpart class. In addition, pattern-based Bayesian classifiers designed for static datasets cannot be applied to high-speed, dynamically changing, unbounded data streams. To overcome these problems, EPDS (Bayesian classifier algorithm based on emerging pattern for data stream) is proposed. EPDS is a Bayesian classification model based on emerging patterns discovered over data streams. It uses a simple hybrid forest (HYF) data structure to maintain the itemsets of transactions in memory, and a fast pattern extraction mechanism to speed up the algorithm. EPDS adopts a partially lazy learning strategy to update emerging patterns continuously, and builds a local classification model under each class for each test transaction. Experimental results on real and synthetic data streams show that EPDS achieves higher classification accuracy than other classic classifiers.
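To make concrete why the support in the counterpart class matters, the following minimal sketch computes the support of a pattern in two class datasets and the resulting growth rate, which is the usual criterion for calling a pattern "emerging". It is an illustration under stated assumptions, not the EPDS implementation: the function names, the representation of transactions as frozensets, and the threshold `min_growth` are all hypothetical, and the abstract does not describe the hybrid forest or the stream update logic in enough detail to reproduce them here.

```python
from typing import FrozenSet, List

# Assumption: a transaction is modelled as a frozenset of item labels.
Transaction = FrozenSet[str]


def support(pattern: Transaction, transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of the pattern."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if pattern <= t)
    return hits / len(transactions)


def growth_rate(pattern: Transaction,
                home: List[Transaction],
                counterpart: List[Transaction]) -> float:
    """Ratio of the pattern's support in the home class to its support
    in the counterpart class."""
    s_home = support(pattern, home)
    s_other = support(pattern, counterpart)
    if s_other == 0.0:
        # Present only in the home class: unbounded growth rate.
        return float("inf") if s_home > 0.0 else 0.0
    return s_home / s_other


def is_emerging(pattern: Transaction,
                home: List[Transaction],
                counterpart: List[Transaction],
                min_growth: float = 2.0) -> bool:
    """Hypothetical threshold test: the pattern counts as emerging when
    its growth rate reaches min_growth."""
    return growth_rate(pattern, home, counterpart) >= min_growth


# Toy data, invented purely for illustration.
home_class = [frozenset("ab"), frozenset("abc"), frozenset("a"), frozenset("bc")]
other_class = [frozenset("ab"), frozenset("c"), frozenset("b"), frozenset("ac")]
candidate = frozenset("ab")
```

With the toy data, `candidate` appears in half of the home-class transactions and a quarter of the counterpart-class transactions. A classifier that looked only at the home-class support would miss that the pattern is twice as frequent there as in the other class.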
[CLC number]
[Funding]
National Natural Science Foundation of China (61672086)