基于Tri-Training和数据剪辑的半监督聚类算法
DOI:
作者:
作者单位:

作者简介:

通讯作者:

中图分类号:

基金项目:

Supported by the National Natural Science Foundation of China under Grant Nos.60702033,60772076(国家自然科学基金);the National High-Tech Researth and Development Plan of China under Grant No.2007AA012171(国家高技术研究发展计划(863));the Science Fund for Distinguished Young Scholars of Heilongjiang Province of China under Grant No.JC200611(黑龙江省杰出青年科学基金);the Natural Science Foundation of Heilongjiaag Province of China under Grant No.ZJG0705(黑龙江省自然科学重点基金);the Foundation of Harbin Institute of Technology of China under Grant No.HIT.2003.53(哈尔滨工业大学校基金)


Tri-Training and Data Editing Based Semi-Supervised Clustering Algorithm
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    提出一种半监督聚类算法,该算法在用seeds集初始化聚类中心前,利用半监督分类方法Tri-training的迭代训练过程对无标记数据进行标记,并加入seeds集以扩大规模;同时,在Tri-training训练过程中结合基于最近邻规则的Depuration数据剪辑技术对seeds集扩大过程中产生的误标记噪声数据进行修正、净化,以提高seeds集质量.实验结果表明,所提出的基于Tri-training和数据剪辑的DE-Tri-training半监督聚类新算法能够有效改善seeds集对聚类中心的初始化效果,提高聚类性能.

    Abstract:

    In this paper, a algorithm named DE-Tri-training semi-supervised K-means is proposed, which could get a seeds set of larger scale and less noise. In detail, prior to using the seeds set to initialize cluster centroids, the training process of a semi-supervised classification approach named Tri-training is used to label unlabeled data and add them into the initial seeds set to enlarge the scale. Meanwhile, to improve the quality of the enlarged seeds set, a nearest neighbor rule based data editing technique named Depuration is introduced into Tri-training process to eliminate and correct the mislabeled noise data in the enlarged seeds. Experimental results show that the novel semi-supervised clustering algorithm could effectively improve the cluster centroids initialization and enhance clustering performance.

    参考文献
    相似文献
    引证文献
引用本文

邓 超,郭茂祖.基于Tri-Training和数据剪辑的半监督聚类算法.软件学报,2008,19(3):663-673

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2006-06-21
  • 最后修改日期:2007-03-07
  • 录用日期:
  • 在线发布日期:
  • 出版日期:
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号