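Density Peak Clustering Algorithm Based on K-nearest Neighbors and Optimized Allocation Strategy

Author biographies:

Sun Lin (孙林, born 1979), male, Ph.D., associate professor, CCF professional member. Main research interests: granular computing, data mining, machine learning, and bioinformatics.
Xu Jiucheng (徐久成, born 1964), male, Ph.D., professor, CCF senior member. Main research interests: granular computing, data mining, and machine learning.
Qin Xiaoying (秦小营, born 1995), female, master's student. Main research interests: data mining and machine learning.
Xue Zhan'ao (薛占熬, born 1963), male, Ph.D., professor, CCF senior member. Main research interests: fundamental theory of artificial intelligence and data mining.

Corresponding author:

Xu Jiucheng, E-mail: xjc@htu.edu.cn

Funding:

National Natural Science Foundation of China (62076089, 61976082, 61772176); Key Scientific and Technological Project of Henan Province (212102210136)
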
    Abstract:

    Density peak clustering (DPC) is a simple and effective clustering method. In practice, however, DPC struggles to select the correct cluster centers on datasets whose clusters differ greatly in density or contain multiple density peaks. In addition, the point allocation method of DPC suffers from a domino effect. To address these issues, a density peak clustering algorithm based on K-nearest neighbors (KNN) and an optimized allocation strategy is proposed. First, candidate cluster centers are determined from the KNN, the local densities of points, and the boundary points. A path distance is defined to reflect the similarity between candidate cluster centers; based on it, a density factor and a distance factor are proposed to quantify how likely each candidate is to be a cluster center, and the cluster centers are then determined. Second, to improve the accuracy of point allocation, a similarity measure is constructed from the shared nearest neighbors, the high-density nearest neighbor, the density difference, and the distances between KNN, and the concepts of neighborhood, similarity set, and similarity domain are introduced to assist the allocation of points. The initial clustering results are determined from the similarity domains and boundary points, and the intermediate clustering results are then obtained based on the cluster centers. Finally, according to the intermediate clustering results and the similarity sets, each cluster is divided into multiple layers from its center to its boundary, and a point allocation strategy is designed for each layer. For the points in a given layer, a positive value based on the similarity domain and the positive domain determines the allocation order, and each point is allocated to the dominant cluster in its positive domain, which yields the final clustering results. Experiments on 11 synthetic datasets and 27 real datasets show that, compared with state-of-the-art DPC-based algorithms, the proposed algorithm achieves good clustering performance in terms of purity, F-measure, accuracy, Rand index, adjusted Rand index, and normalized mutual information.
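
    For orientation, the following is a minimal sketch of the baseline DPC procedure that the abstract sets out to improve. It is not the proposed algorithm. It assumes the standard DPC recipe: each point gets a local density rho and a distance delta to its nearest point of higher density, the points with the largest rho * delta become cluster centers, and every remaining point inherits the label of its nearest higher-density point. The KNN-based density formula (inverse of the mean distance to the k nearest neighbors) and the names `knn_density` and `baseline_dpc` are illustrative assumptions, not taken from the paper. The final loop is where the domino effect arises: a point copies the label of a single parent, so one wrong label propagates to every point that depends on it.

```python
import math


def knn_density(dist, k):
    """Assumed density: inverse of the mean distance to the k nearest neighbors.

    Requires more than k points so that every point has k neighbors.
    """
    rho = []
    for i, row in enumerate(dist):
        nearest = sorted(d for j, d in enumerate(row) if j != i)[:k]
        rho.append(1.0 / (sum(nearest) / k + 1e-12))
    return rho


def baseline_dpc(points, n_clusters, k=3):
    """Baseline DPC sketch; returns (labels, centers) as lists of indices."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = knn_density(dist, k)

    # Visit points from highest to lowest density (ties keep index order).
    order = sorted(range(n), key=lambda i: -rho[i])
    rank = {p: r for r, p in enumerate(order)}

    delta = [0.0] * n   # distance to the nearest higher-density point
    parent = [-1] * n   # index of that point (-1 for the densest point)
    top = order[0]
    delta[top] = max(dist[top])  # convention for the densest point
    for r in range(1, n):
        i = order[r]
        j = min(order[:r], key=lambda h: dist[i][h])
        delta[i], parent[i] = dist[i][j], j

    # Centers: largest rho * delta; ties go to the denser point, so the
    # densest point is always selected first.
    centers = sorted(range(n), key=lambda i: (-rho[i] * delta[i], rank[i]))
    centers = centers[:n_clusters]

    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Each remaining point copies its parent's label. The parent is always
    # processed earlier, so a wrong label there is passed on (domino effect).
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers


# Two well-separated square blobs, each with one point in the middle.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels, centers = baseline_dpc(points, n_clusters=2, k=3)
```

    On this toy input the two middle points are selected as centers and each blob receives a single label. On data with uneven densities or several peaks per cluster, the same procedure can pick the wrong centers, which is the failure mode the abstract describes.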

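Cite this article

Sun Lin, Qin Xiaoying, Xu Jiucheng, Xue Zhan'ao. Density peak clustering algorithm based on K-nearest neighbors and optimized allocation strategy. Journal of Software (软件学报), 2022, 33(4): 1390-1411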

History
  • Received: 2021-01-10
  • Last revised: 2021-07-16
  • Published online: 2021-10-26
  • Published: 2022-04-06