Abstract:In this paper, a algorithm named DE-Tri-training semi-supervised K-means is proposed, which could get a seeds set of larger scale and less noise. In detail, prior to using the seeds set to initialize cluster centroids, the training process of a semi-supervised classification approach named Tri-training is used to label unlabeled data and add them into the initial seeds set to enlarge the scale. Meanwhile, to improve the quality of the enlarged seeds set, a nearest neighbor rule based data editing technique named Depuration is introduced into Tri-training process to eliminate and correct the mislabeled noise data in the enlarged seeds. Experimental results show that the novel semi-supervised clustering algorithm could effectively improve the cluster centroids initialization and enhance clustering performance.