Concept Drift Detection Method Based on Online Performance Test
Authors: Guo Husheng, Zhang Aijuan, Wang Wenjian
Affiliation:

About the authors:

Guo Husheng (郭虎升, b. 1986), male, Ph.D., associate professor, CCF professional member; research interests: data mining, machine learning, computational intelligence. Wang Wenjian (王文剑, b. 1968), female, Ph.D., professor, doctoral supervisor, CCF senior member; research interests: machine learning, data mining, computational intelligence. Zhang Aijuan (张爱娟, b. 1993), female, master's student; research interests: streaming data mining, machine learning.

Corresponding author:

Wang Wenjian, E-mail: wjwang@sxu.edu.cn

CLC number:

TP181

Fund Project:

National Natural Science Foundation of China (61503229, 61673249, U1805263); Natural Science Foundation of Shanxi Province of China (201901D111033); Key R&D Program of Shanxi Province (International Cooperation) (201903D421050)

    Abstract:

    Concept drift is a common problem in mining dynamic streaming data. However, pseudo concept drift caused by noise or by overly small training samples produces the same symptom as real concept drift, namely unstable fluctuation of the model's online test performance, so the two are easily confused and false alarms of concept drift are raised. To address this confusion between real and pseudo concept drift in streaming data, a concept drift detection method based on online performance test (CDPT) is proposed. CDPT evenly divides the most recently acquired data into groups and performs online learning on each sub-dataset. It records the classification accuracy vector obtained by training and testing on each sub-dataset, computes the accuracy drop between adjacent learning time units, and identifies effective fluctuation points according to a threshold on the drop in test accuracy. The effective fluctuation points of the different groups are then integrated by cross-checking, which eliminates the detection interference caused by model instability due to small training samples during online learning, and consistent fluctuation points are obtained according to the consistency of the accuracy fluctuations. Finally, by tracking the classification accuracy of online learning, the change in test accuracy at the neighborhood reference points of each consistent fluctuation point is obtained, and the magnitude of decline and the convergence of the test accuracy at these reference points are compared to detect the true concept drift points among the consistent fluctuation points. Experimental results show that CDPT can effectively identify the true concept drift that occurs during online learning on streaming data, avoid the negative impact of overly small training samples or noise in the stream on the detection results, and improve the generalization performance of the model.
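    The three stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the parameters `drop_threshold`, `min_groups`, `window`, and `recover_tol`, the accuracy values, and in particular the rule used in the last stage (a dip counts as real drift only if the mean accuracy over the next `window` time units stays below the pre-dip accuracy) are all assumptions made for illustration, since the abstract does not give the exact criteria.

```python
from typing import Dict, List, Sequence, Set


def effective_fluctuation_points(acc: Sequence[float], drop_threshold: float) -> Set[int]:
    """Time units t whose accuracy fell by more than drop_threshold relative to t-1."""
    return {t for t in range(1, len(acc)) if acc[t - 1] - acc[t] > drop_threshold}


def consistent_fluctuation_points(group_points: List[Set[int]], min_groups: int) -> List[int]:
    """Cross-check groups: keep time units flagged in at least min_groups groups."""
    counts: Dict[int, int] = {}
    for points in group_points:
        for t in points:
            counts[t] = counts.get(t, 0) + 1
    return sorted(t for t, c in counts.items() if c >= min_groups)


def is_true_drift(acc: Sequence[float], t: int, window: int, recover_tol: float) -> bool:
    """Assumed rule (not from the paper): the dip at t is a real drift if the
    mean accuracy over the next `window` time units is still more than
    recover_tol below the accuracy just before the dip."""
    after = acc[t + 1 : t + 1 + window]
    if not after:  # no reference points after t, so the dip cannot be confirmed
        return False
    return acc[t - 1] - sum(after) / len(after) > recover_tol


def detect_drifts(group_accs: List[Sequence[float]], drop_threshold: float,
                  min_groups: int, window: int, recover_tol: float) -> List[int]:
    """Run the three stages on per-group accuracy vectors of equal length."""
    points = [effective_fluctuation_points(a, drop_threshold) for a in group_accs]
    candidates = consistent_fluctuation_points(points, min_groups)
    # Average the groups to get one accuracy curve for the neighborhood check.
    mean_acc = [sum(col) / len(col) for col in zip(*group_accs)]
    return [t for t in candidates if is_true_drift(mean_acc, t, window, recover_tol)]


# Made-up accuracy vectors for three groups over ten time units:
#   t=2: dip in every group that recovers at once (noise-like)
#   t=5: dip in every group that persists (drift-like)
#   t=8: dip in group 0 only (small-sample instability)
GROUP_ACCS = [
    [0.90, 0.91, 0.70, 0.90, 0.91, 0.60, 0.63, 0.68, 0.50, 0.74],
    [0.89, 0.90, 0.72, 0.91, 0.90, 0.58, 0.64, 0.70, 0.73, 0.76],
    [0.91, 0.90, 0.71, 0.89, 0.92, 0.61, 0.62, 0.69, 0.72, 0.75],
]

detected = detect_drifts(GROUP_ACCS, drop_threshold=0.1, min_groups=3,
                         window=2, recover_tol=0.05)
print(detected)  # -> [5]
```

    In this toy run the dip at t=8 is discarded by the cross-check because only one group shows it, and the dip at t=2 is discarded by the neighborhood check because accuracy returns to its previous level immediately. Only t=5 survives both filters.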

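Cite this article

Guo HS, Zhang AJ, Wang WJ. Concept drift detection method based on online performance test. Journal of Software (软件学报), 2020, 31(4): 932-947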

History
  • Received: 2019-03-08
  • Last revised: 2019-07-11
  • Accepted:
  • Published online: 2020-01-14
  • Publication date: 2020-04-06