一种大数据环境中分布式辅助关联分类算法
作者:
作者单位:

作者简介:

通讯作者:

中图分类号:

基金项目:

国家自然科学基金(61100027, 61374178, 61202085, 61572117, 61572116); 中央高校基本科研业务费专项资金(N13 0417003); 高等学校博士学科点专项科研基金(20120042120010)


Distributed Assistant Associative Classification Algorithm in Big Data Environment
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    在很多现实的分类应用中,新数据的类标需要由领域专家最终确定,而分类器的分类结果仅起辅助作用.另外,随着大数据所隐含价值越发被人们重视,分类器的训练会从面向单一数据集逐渐过渡到面向分布式空间数据集,大数据环境下辅助分类也将成为未来分类应用的重要分支.然而,现有的分类研究缺乏对此类应用的关注.大数据环境中的辅助分类面临以下3个问题:1) 训练集是分布式大数据集;2) 在空间上,训练集所包含的各局部数据源的类别分布不尽相同;3) 在时间上,训练集是动态变化的,会发生类别迁移现象.在考虑以上问题的基础上,提出一种大数据环境中分布式辅助关联分类方法.该方法首先给出一种大数据环境中分布式关联分类器构建算法,在该算法中,通过横向加权考虑分类数据集在空间上的类别分布差异,并给出"前件空间支持度-相关系数"的度量框架,改进关联分类算法面对不平衡数据的性能缺陷;然后,给出一种基于适应因子的辅助关联分类器动态调整方法,能够在分类器应用过程中充分利用领域专家实时反馈的结果对分类器进行动态调整,以提升其面向动态数据集的分类性能,减缓分类器的退化和重新训练的频率.实验结果表明,该方法能够面向分布式数据集较快地训练出有较高分类准确率的关联分类器,并在数据集不断扩充变化时提升分类性能,是一种有效的大数据环境中辅助分类应用方法.

    Abstract:

    For many practical classification applications, the class label of new data needs to be confirmed eventually by domain expert, and the result of classifier only plays an assistant role. In addition, with the implicit values of big data calling more people's attention, classifier training is going through a transition from single dataset to distributed space dataset, and assistant classification in big data environment will also become an important branch of future classification applications. However, existing classification research lacks attention to this kind of application. Assistant classification in big data environment faces with the following three problems: 1) the training set is distributed big dataset, 2) in space, the class distributions of local datasets contained in the training set are commonly different, and 3) in time, the training set is dynamic and its class distribution may change. To address the above problems, this paper proposes a distributed assistant associative classification approach in big data environment. Firstly, a distributed associative classifier constructing algorithm in big data environment is constructed. With the new algorithm, the class distribution difference in space of the classification dataset is considered by horizontal weighting, and the performance deficiency of associative classification algorithms to imbalanced class distribution datasets is improved by giving a measure framework of "antecedent space support-correlation coefficient". Next, an adaptive factor based dynamic adjustment method for assistant associative classifier is proposed. This method can make full use of domain experts' real-time feedback to adjust classifier dynamically in the applying process of the used classifier, to improve its performance facing dynamic datasets, and to slow down its retraining frequency. Experimental results demonstrate that the presented approach can relative quickly train associative classifiers with higher classification accuracy for distributed datasets, and can improve their performance when datasets are continually expanding and changing. Thus it's an effective approach for assistant classification applications in big data environment.

    参考文献
    相似文献
    引证文献
引用本文

张明卫,朱志良,刘莹,张斌.一种大数据环境中分布式辅助关联分类算法.软件学报,2015,26(11):2795-2810

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2015-05-27
  • 最后修改日期:2015-08-26
  • 录用日期:
  • 在线发布日期: 2015-11-04
  • 出版日期:
文章二维码
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号