Density Peak Clustering Algorithm Based on K-Nearest Neighbors and Optimized Allocation Strategy
Authors:
Author biographies:

Sun Lin (1979-), male, PhD, associate professor, CCF professional member; research interests: granular computing, data mining, machine learning, and bioinformatics.
Xu Jiucheng (1964-), male, PhD, professor, CCF senior member; research interests: granular computing, data mining, and machine learning.
Qin Xiaoying (1995-), female, master's student; research interests: data mining and machine learning.
Xue Zhan'ao (1963-), male, PhD, professor, CCF senior member; research interests: fundamental theory of artificial intelligence and data mining.

Corresponding author:

Xu Jiucheng, E-mail: xjc@htu.edu.cn

Funding:

National Natural Science Foundation of China (62076089, 61976082, 61772176); Science and Technology Research Project of Henan Province (212102210136)



    Abstract:

    The density peak clustering (DPC) algorithm is a simple and effective clustering analysis method. In practice, however, DPC struggles to select the correct cluster centers for datasets whose clusters differ greatly in density or contain multiple density peaks, and its point-allocation method suffers from a domino effect. To address these issues, a density peak clustering algorithm based on K-nearest neighbors (KNN) and an optimized allocation strategy is proposed. First, candidate cluster centers are determined using KNN, the local densities of points, and boundary points. A path distance is defined to reflect the similarity between candidate cluster centers; based on it, a density factor and a distance factor are proposed to quantify how likely each candidate is to be a true cluster center, and the cluster centers are then determined. Second, to improve the allocation accuracy of points, a similarity measure is constructed from shared nearest neighbors, the high-density nearest neighbor, density differences, and the distances between KNN, and the concepts of neighborhood, similarity set, and similarity domain are introduced to assist the allocation of points. Initial clustering results are determined from the similarity domains and boundary points, and intermediate clustering results are then obtained based on the cluster centers. Finally, according to the intermediate clustering results and similarity sets, each cluster is divided into multiple layers from the cluster center to the cluster boundary, and an allocation strategy is designed for each layer. For points within a given layer, a positive value based on the similarity domain and positive domain determines the allocation order, and each point is allocated to the dominant cluster in its positive domain, yielding the final clustering results. Experiments on 11 synthetic datasets and 27 real datasets show that, compared with state-of-the-art density-peak-based clustering algorithms, the proposed algorithm performs well in purity, F-measure, accuracy, Rand index, adjusted Rand index, and normalized mutual information.
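For context, the baseline center-selection rule that the proposed algorithm refines can be sketched as follows. This is a minimal illustration of Rodriguez and Laio's original DPC decision rule (Gaussian-kernel density with a cutoff distance `dc`), not the paper's KNN-based candidate selection; the dataset and parameter values are illustrative assumptions only:

```python
import numpy as np

def dpc_centers(X, dc, n_centers):
    """Classic DPC center selection (Rodriguez & Laio, 2014):
    a center combines a high local density rho with a large
    distance delta to the nearest point of higher density."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    rho = np.exp(-((d / dc) ** 2)).sum(axis=1) - 1.0           # Gaussian-kernel density (self term removed)
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.flatnonzero(rho > rho[i])                  # points denser than point i
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    gamma = rho * delta                                        # decision value from the decision graph
    return np.argsort(gamma)[-n_centers:]                      # indices with the largest gamma

# Two small, well-separated groups; the densest point of each group is picked.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.05, 0.05],
              [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]])
centers = dpc_centers(X, dc=1.0, n_centers=2)
print(sorted(int(c) for c in centers))  # [3, 4]
```

The abstract's first contribution addresses exactly the weakness visible here: when clusters differ greatly in density, the single global `gamma` ranking favors several peaks from the dense cluster, which motivates the KNN- and boundary-point-based candidate selection described above.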

Cite this article:

Sun L, Qin XY, Xu JC, Xue ZA. Density peak clustering algorithm based on K-nearest neighbors and optimized allocation strategy. Ruan Jian Xue Bao/Journal of Software, 2022, 33(4): 1390-1411 (in Chinese).
History:
  • Received: 2021-01-10
  • Revised: 2021-07-16
  • Online: 2021-10-26
  • Published: 2022-04-06