基于密度的多度量空间数据聚类算法
作者:
中图分类号:

TP311

基金项目:

国家重点研发计划(2021YFC3300303); 国家自然科学基金(62025206, 61972338, 62102351); 杭州市人工智能重大科技创新项目(2022AIZD0116)


Density-based Data Clustering Algorithm in Multi-metric Spaces
Author:
  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [50]
  • |
  • 相似文献
  • |
  • 引证文献
  • | |
  • 文章评论
    摘要:

    具有噪声的基于密度的数据聚类(DBSCAN)算法是数据挖掘领域中的经典方法之一, 其不仅能发现数据中潜藏的复杂关系, 还能过滤其中的数据噪声, 从而获得高质量的数据聚类. 然而, 现有的基于密度的数据聚类算法仅支持单模态(类型)数据的聚类, 难以应对多模态(类型)数据并存的应用场景. 随着信息技术的快速发展, 数据呈现多模态化的发展态势, 现实生活中的数据不再是单一的数据类型, 而是多种数据模态(类型)的组合, 如文本、图像、地理坐标、数据特征等. 因此, 现有的数据聚类方法难以对复杂的多模态数据进行有效的数据建模, 更无法进行高效的多模态数据聚类. 基于此, 提出一种基于密度的多度量空间聚类算法. 首先, 为了刻画多模态数据间的复杂关系, 利用多度量空间表征数据之间的相似性关系, 并且利用聚合多度量图索引(AMG)实现多模态数据建模. 接着, 利用差分化的相似性关系优化聚合多度量图的图结构, 并且结合最优策略优先的搜索策略进行剪枝, 以实现高效的多模态数据聚类. 最后, 在真实与合成数据集上针对多种参数设置进行实验. 实验结果验证了所提方法运行效率提升了至少1个数量级, 并具有较高的聚类精度与良好的可扩展性.

    Abstract:

    The density-based spatial clustering of applications with noise (DBSCAN) algorithm is one of the clustering analysis methods in the field of data mining. It has a strong capability of discovering complex relationships between objects and is insensitive to noise data. However, existing DBSCAN methods only support the clustering of unimodal objects, struggling with applications involving multi-model data. With the rapid development of information technology, data has become increasingly diverse in real-life applications and contains a huge variety of models, such as text, images, geographical coordinates, and data features. Thus, existing clustering methods fail to effectively model complex multi-model data and cannot support efficient multi-model data clustering. To address these issues, in this study, a density-based clustering algorithm in multi-metric spaces is proposed. Firstly, to characterize the complex relationships within multi-model data, this study uses a multi-metric space to quantify the similarity between objects and employs aggregated multi-metric graph (AMG) to model multi-model data. Next, this study employs differential distances to balance the graph structure and leverages a best-first search strategy combined with pruning techniques to achieve efficient multi-model data clustering. The experimental evaluation on real and synthetic datasets, using various experimental settings, demonstrates that the proposed method achieves at least one order of magnitude improvement in efficiency with high clustering accuracy, and exhibits good scalability.

    参考文献
    [1] Ester M, Kriegel HP, Sander J, Xu XW. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. of the 2nd Int’l Conf. on Knowledge Discovery and Data Mining. Portland: AAAI Press, 1996. 226-231.
    [2] 于彦伟, 贾召飞, 曹磊, 赵金东, 刘兆伟, 刘惊雷. 面向位置大数据的快速密度聚类算法. 软件学报, 2018, 29(8): 2470–2484. http://www.jos.org.cn/1000-9825/5289.htm
    Yu YW, Jia ZF, Cao L, Zhao JD, Liu ZW, Liu JL. Fast density-based clustering algorithm for location big data. Ruan Jian Xue Bao/Journal of Software, 2018, 29(8): 2470–2484 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5289.htm
    [3] Gunasekaran YD, Rahman MF, Hasani S, Zhang N, Das G. DBLOC: Density based clustering over location based services. In: Proc. of the 2018 Int’l Conf. on Management of Data. Houston: ACM, 2018. 1697-1700. [doi: 10.1145/3183713.3193561]
    [4] Kim B, Koo K, Kim J, Moon B. DISC: Density-based incremental clustering by striding over streaming data. In: Proc. of the 37th IEEE Int’l Conf. on Data Engineering. Chania: IEEE, 2021. 828-839. [doi: 10.1109/ICDE51399.2021.00077]
    [5] Yang D, Rundensteiner EA, Ward MO. Summarization and matching of density-based clusters in streaming environments. Proc. of the VLDB Endowment, 2011, 5(2): 121–132.
    [6] Zhang BX, Zhang LF, Wang Z, Cui ZQ, Sun Y, Hua HC. Image reconstruction of planar electrical capacitance tomography based on DBSCAN and self-adaptive ADMM algorithm. IEEE Trans. on Instrumentation and Measurement, 2023, 72: 4504711.
    [7] Ren YZ, Wang N, Li MX, Xu ZL. Deep density-based image clustering. Knowledge-based Systems, 2020, 197: 105841.
    [8] Al-Shammari A, Zhou R, Naseriparsaa M, Liu CF. An effective density-based clustering and dynamic maintenance framework for evolving medical data streams. Int’l Journal of Medical Informatics, 2019, 126: 176–186.
    [9] 褚玉伟, 罗晓博, 屈珂, 陶煜波, 林军, 林海. DBSCAN和K-means混合聚类的牙齿特征自动识别. 计算机辅助设计与图形学学报, 2018, 30(7): 1276–1283.
    Chu YW, Luo XB, Qu K, Tao YB, Lin J, Lin H. DBSCAN and K-means hybrid clustering based automatic dental feature detection. Journal of Computer-Aided Design & Computer Graphics, 2018, 30(7): 1276–1283 (in Chinese with English abstract).
    [10] Yang KY, Gao YJ, Ma R, Chen L, Wu S, Chen G. DBSCAN-MS: Distributed density-based clustering in metric spaces. In: Proc. of the 35th IEEE Int’l Conf. on Data Engineering. Macao: IEEE, 2019. 1346-1357. [doi: 10.1109/ICDE.2019.00122]
    [11] Chen L, Gao YJ, Zhang YL, Jensen CS, Zheng BL. Efficient and incremental clustering algorithms on star-schema heterogeneous graphs. In: Proc. of the 35th IEEE Int’l Conf. on Data Engineering. Macao: IEEE, 2019. 256-267. [doi: 10.1109/ICDE.2019.00031]
    [12] Chen L, Gao YJ, Huang XR, Jensen CS, Zheng BL. Efficient distributed clustering algorithms on star-schema heterogeneous graphs. IEEE Trans. on Knowledge and Data Engineering, 2020, 34(10): 4781–4796.
    [13] Gunawan A. A faster algorithm for DBSCAN [MS. Thesis]. Eindhoven: Eindhoven University of Technology, 2013.
    [14] Gan JH, Tao YF. DBSCAN revisited: Mis-claim, un-fixability, and approximation. In: Proc. of the 2015 ACM SIGMOD Int’l Conf. on Management of Data. Melbourne: ACM, 2015. 519-530. [doi: 10.1145/2723372.2737792]
    [15] Lulli A, Dell'Amico M, Michiardi P, Ricci L. NG-DBSCAN: Scalable density-based clustering for arbitrary data. Proc. of the VLDB Endowment, 2016, 10(3): 157–168.
    [16] Moseley B, Vassilvitskii S, Wang YY. Hierarchical clustering in general metric spaces using approximate nearest neighbors. In: Proc. of the 24th Int’l Conf. on Artificial Intelligence and Statistics. PMLR, 2021. 2440-2448.
    [17] Li H, Liu XJ, Li T, Gan RD. A novel density-based clustering algorithm using nearest neighbor graph. Pattern Recognition, 2020, 102: 107206.
    [18] Chen YW, Zhou LD, Pei SW, Yu ZW, Chen Y, Liu X, Du JX, Xiong NX. KNN-BLOCK DBSCAN: Fast clustering for large-scale data. IEEE Trans. on Systems, Man, and Cybernetics: Systems, 2021, 51(6): 3939–3953.
    [19] 于亚新, 王国仁, 林利增, 李淼, 朱歆华. M2+-树: 一种支持医学病例多度量空间检索的高效索引. 计算机研究与发展, 2010, 47(4): 671–678.
    Yu YX, Wang GR, Lin LZ, Li M, Zhu XH. M2+-Tree: Processing multiple metric space queries of medical cases efficiently with just one index. Journal of Computer Research and Development, 2010, 47(4): 671–678 (in Chinese with English abstract).
    [20] Bustos B, Skopal T. Dynamic similarity search in multi-metric spaces. In: Proc. of the 8th ACM Int’l Workshop on Multimedia Information Retrieval. Santa Barbara: ACM, 2006. 137-146. [doi: 10.1145/1178677.1178698]
    [21] Franzke M, Emrich T, Züfle A, Renz M. Indexing multi-metric data. In: Proc. of the 32nd IEEE Int’l Conf. on Data Engineering. Helsinki: IEEE, 2016. 1122-1133. [doi: 10.1109/ICDE.2016.7498318]
    [22] Zhu YF, Chen L, Gao YJ, Zheng BH, Wang PF. DESIRE: An efficient dynamic cluster-based forest indexing for similarity search in multi-metric spaces. Proc. of the VLDB Endowment, 2022, 15(10): 2121–2133.
    [23] Zabot GF, Cazzolato MT, Scabora LC, Traina AJM, Traina C. Efficient indexing of multiple metric spaces with spectra. In: Proc. of the 2019 IEEE Int’l Symp. on Multimedia. San Diego: IEEE, 2019. 169-1697. [doi: 10.1109/ISM46123.2019.00038]
    [24] 刘俊岭, 孙焕良. 多维度量空间中发现相互kNN. 计算机科学与探索, 2010, 4(10): 881–889.
    Liu JL, Sun HL. Finding mutual k-nearest neighbors in multi-metric space. Journal of Frontiers of Computer Science & Technology, 2010, 4(10): 881–889 (in Chinese with English abstract).
    [25] Chen L, Gao YJ, Song X, Li Z, Zhu YF, Miao XY, Jensen CS. Indexing metric spaces for exact similarity search. ACM Computing Surveys, 2022, 55(6): 128.
    [26] Zhu YF, Chen L, Gao YJ, Jensen CS. Pivot selection algorithms in metric spaces: A survey and experimental study. The VLDB Journal, 2022, 31(1): 23–47.
    [27] Micó ML, Oncina J, Vidal E. A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recognition Letters, 1994, 15(1): 9–17.
    [28] Bozkaya T, Ozsoyoglu ZM. Distance-based indexing for high-dimensional metric spaces. ACM SIGMOD Record, 1997, 26(2): 357–368.
    [29] Chávez E, Navarro G. A compact space decomposition for effective metric indexing. Pattern Recognition Letters, 2005, 26(9): 1363–1376.
    [30] Ciaccia P, Patella M, Zezula P. M-tree: An efficient access method for similarity search in metric spaces. In: Proc. of the 23rd Int’l Conf. on Very Large Data Bases. Athens: Morgan Kaufmann Publishers Inc., 1997. 426-435.
    [31] Brin S. Near neighbor search in large metric spaces. In: Proc. of the 21st Int’l Conf. on Very Large Data Bases. Zurich: Morgan Kaufmann Publishers Inc., 1995. 574-584.
    [32] Novak D, Batko M, Zezula P. Metric index: An efficient and scalable solution for precise and approximate similarity search. Information Systems, 2011, 36(4): 721–733.
    [33] Skopal T, Pokorný J, Snásel V. PM-tree: Pivoting metric tree for similarity search in multimedia databases. In: Proc. of the 8th Advances in Databases and Information Systems. Budapest: ADBIS, 2004. 803-815.
    [34] Bentley JL. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975, 18(9): 509–517.
    [35] Xu XW, Jäger J, Kriegel HP. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery, 1999, 3(3): 263–290.
    [36] Wang MZ, Xu XL, Yue Q, Wang YX. A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search. Proc. of the VLDB Endowment, 2021, 14(11): 1964–1978.
    [37] Malkov Y, Ponomarenko A, Logvinov A, Krylov V. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 2014, 45: 61–68.
    [38] Open food facts. 2023. https://world.openfoodfacts.org/data
    [39] Popani. Air quality data in India (2015–2020). 2020. https://www.kaggle.com/datasets/rohanrao/air-quality-data-in-india
    [40] Two sigma connect: Rental listing inquiries. 2023. https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries
    [41] Content-based photo image retrieval test-collection. 2023. http://cophir.isti.cnr.it
    [42] Geographical locations in los angeles. 2023. https://www.dbs.ifi.lmu.de/cms
    [43] Moby project. 2023. https://mobyproject.org
    [44] Word to vectors. 2023. https://code.google.com/archive/p/word2vec
    [45] Fritz M, Behringer M, Tschechlov D, Schwarz H. Efficient exploratory clustering analyses in large-scale exploration processes. The VLDB Journal, 2022, 31(4): 711–732.
    [46] Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987, 20: 53–65.
    相似文献
    引证文献
引用本文

朱轶凡,罗程阳,马瑞遥,陈璐,毛玉仁,高云君.基于密度的多度量空间数据聚类算法.软件学报,2025,36(2):851-873

复制
相关视频

分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2023-11-09
  • 最后修改日期:2024-02-02
  • 在线发布日期: 2024-11-06
文章二维码
您是第19921089位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号