[Keywords]
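[Abstract (translated from Chinese)]
In open environments, data streams are generated at high speed, are unbounded in volume, and are subject to concept drift. In data stream classification tasks, producing large amounts of training data through manual labeling is expensive and impractical. Data streams that contain a small number of labeled samples and a large number of unlabeled samples, and that also exhibit concept drift, pose a great challenge to machine learning. However, existing research focuses mainly on supervised data stream classification, and semi-supervised classification of data streams with concept drift has not yet received sufficient attention. Therefore, on the basis of a comprehensive collection of research on semi-supervised data stream classification, this paper categorizes the existing semi-supervised classification algorithms for data streams with concept drift from multiple perspectives; taking the type of classifier an algorithm uses as the organizing thread, it introduces and summarizes a number of existing algorithms, including the concept drift detection methods adopted in existing semi-supervised data stream classification; on several widely used real-world and synthetic datasets, it compares and analyzes some representative semi-supervised data stream classification algorithms in several respects; finally, it raises some problems in semi-supervised classification of concept-drifting data streams that deserve further investigation. The experimental results show that the classification accuracy of semi-supervised data stream classification algorithms depends on many factors, but most strongly on changes in the data distribution. This survey will help interested researchers quickly get started in the field of semi-supervised data stream classification.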
[Keywords]
[Abstract]
In open environments, data streams are characterized by high-speed generation, unbounded volume, and concept drift. In data stream classification, producing large amounts of training data by manual annotation is expensive and impractical. A data stream that contains few labeled samples, many unlabeled samples, and concept drift poses a great challenge to machine learning. However, existing research focuses mainly on supervised classification of data streams, while semi-supervised classification of data streams with concept drift has not yet received sufficient attention. Therefore, based on a comprehensive collection of work on semi-supervised classification of data streams, this study categorizes the existing semi-supervised data stream classification algorithms from several perspectives, and describes and summarizes a number of existing algorithms according to the type of classifier they use, together with the concept drift detection methods they employ. On several widely used real-world and synthetic datasets, a selection of representative semi-supervised classification algorithms for data streams is compared and analyzed in several respects. Finally, this study raises some issues in semi-supervised classification of data streams with concept drift that deserve further investigation. The experimental results show that the classification accuracy of semi-supervised data stream classification algorithms depends on many factors, but most strongly on changes in the data distribution. This review will help interested researchers quickly get started in the field of semi-supervised classification of data streams.
[Chinese Library Classification number]
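[Funding]
Natural Science Foundation of Guangxi (2018GXNSFDA138006); National Natural Science Foundation of China (61866007); Humanities and Social Sciences Research Project of the Ministry of Education (17JDGC022); Guangxi Key Laboratory of Image and Graphic Intelligent Processing (GIIP2005, GIIP201505, GIIP201706)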