[Keywords]
[Abstract]
The density peak clustering (DPC) algorithm is a simple and effective method for cluster analysis. In practice, however, DPC struggles to select the correct cluster centers for datasets in which density differs greatly among clusters or in which a cluster contains multiple density peaks. Moreover, the point allocation method of DPC suffers from a domino effect. To address these issues, a density peak clustering algorithm based on K-nearest neighbors (KNN) and an optimized allocation strategy is proposed. First, candidate cluster centers are determined using the KNN, the local densities of points, and the boundary points. A path distance is defined to reflect the similarity between candidate cluster centers; based on it, a density factor and a distance factor are proposed to quantify how likely each candidate is to be a cluster center, and the cluster centers are then determined. Second, to improve the accuracy of point allocation, a similarity measure is constructed from the shared nearest neighbors, the high-density nearest neighbor, the density difference, and the distance between KNN, and the concepts of neighborhood, similarity set, and similarity domain are introduced to assist in the allocation of points. The initial clustering results are determined from the similarity domains and boundary points, and the intermediate clustering results are then obtained based on the cluster centers. Finally, according to the intermediate clustering results and the similarity sets, each cluster is divided into multiple layers from its center to its boundary, and a separate allocation strategy is designed for each layer. For the points in a given layer, a positive value is defined from the similarity domain and the positive domain to determine the allocation order, and each point is allocated to the dominant cluster in its positive domain, which yields the final clustering results. Experiments on 11 synthetic datasets and 27 real datasets show that, compared with state-of-the-art DPC-based algorithms, the proposed algorithm achieves good clustering performance in terms of purity, F-measure, accuracy, Rand index, adjusted Rand index, and normalized mutual information.
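For orientation, the baseline DPC procedure that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of standard DPC as it is commonly described, not the proposed KNN-based algorithm; the function name `dpc`, the Gaussian-kernel density, and the parameters `d_c` (cutoff distance) and `n_clusters` are assumptions made for this sketch.

```python
import numpy as np

def dpc(points, d_c, n_clusters):
    """Baseline density peak clustering (illustrative sketch only)."""
    X = np.asarray(points, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Local density: Gaussian kernel sum, excluding each point's self term.
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0
    # Point indices sorted by descending density.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)
    parent = np.full(n, -1)
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]  # points that precede i in density order
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]  # distance to the nearest denser point
        parent[i] = j
    # The densest point has no denser neighbor; use its largest distance.
    delta[order[0]] = dist[order[0]].max()
    # Decision value: centers combine high density with large delta.
    gamma = rho * delta
    # Rank in density order so ties favor denser points.
    centers = order[np.argsort(-gamma[order], kind="stable")[:n_clusters]]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    # Each remaining point inherits the label of its nearest denser point.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers
```

The final loop is where the domino effect arises: because every non-center point simply inherits the label of its nearest denser neighbor, one wrong assignment propagates to all points that depend on it.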
[CLC number]
[Funding]
National Natural Science Foundation of China (62076089, 61976082, 61772176); Science and Technology Research Project of Henan Province (212102210136)