GPU加速的高维向量聚类算法
CSTR:
作者:
作者单位:

作者简介:

通讯作者:

中图分类号:

基金项目:

国家自然科学基金(62025206, U23A20296, 62302444); 浙江省尖兵领雁项目(2024C01259, 2025C01195)


GPU-accelerated Clustering Algorithm for High-dimensional Vectors
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    聚类是大规模高维向量数据分析的关键技术之一. 近年来, 基于密度的聚类算法DBSCAN (density-based spatial clustering of applications with noise)因其无须预先指定聚类数量、能够发现复杂聚类结构并有效识别噪声点的特性, 在数据分析领域得到了广泛应用. 然而, 现有的基于密度的聚类算法在处理高维向量数据时将产生极高的时间代价且面临维度灾难等问题, 难以在实际场景中部署应用. 此外, 随着信息技术的发展, 高维向量数据规模急剧增加, 使用CPU进行高维向量聚类在时间代价和可扩展性等方面将面临更大的挑战. 为此, 提出一种GPU加速的高维向量聚类算法, 通过引入K近邻 (K-nearest neighbor, KNN) 图索引加速DBSCAN的计算. 首先, 设计了GPU加速的并行K近邻图构建算法, 显著降低了K近邻图索引的构建开销. 其次, 提出了基于层间并行的K-means树分区算法及基于广度优先搜索和核心近邻图的并行聚类算法, 改进了DBSCAN算法的计算流程, 实现了高并发向量聚类. 最后, 在真实向量数据集上进行了大量实验, 并将所提出的方法与现有方法进行了性能对比. 实验结果表明, 所提方法在保证聚类精度的前提下, 将大规模向量聚类的效率提高了5.7–2822.5倍.

    Abstract:

    Clustering serves as one of the critical technologies for large-scale, high-dimensional vector data analysis. Recently, a density-based clustering algorithm DBSCAN (density-based spatial clustering of applications with noise) have been widely adopted in data analysis due to their advantages of not requiring pre-specified cluster numbers, discovering complex cluster structures, and identifying noise points. However, existing density-based clustering algorithms suffer from high computational costs when processing high-dimensional vectors. Meanwhile, these methods also face challenges like the “curse of dimensionality”, restricting their practical applications. With the rapid growth of high-dimensional vector data in the era of information technology, CPU-based clustering approaches encounter increasing challenges in time efficiency and scalability. To address these issues, this study proposes a GPU-accelerated clustering algorithm for high-dimensional vector data, introducing the K-nearest neighbor (KNN) graph index to accelerate DBSCAN. First, a GPU-accelerated parallel KNN graph construction algorithm is developed, significantly reducing the index construction overhead. Furthermore, to enhance the pipeline of DBSCAN and achieve highly concurrent vector clustering, a K-means tree partitioning algorithm with inter-layer parallelism and a parallel clustering algorithm based on breadth-first search and a core KNN graph are designed. Finally, extensive experiments are conducted on real-world datasets, and the proposed method is compared against existing approaches. Experimental results show that the proposed algorithm improves the efficiency of large-scale vector clustering by 5.7–2822.5 times while maintaining clustering accuracy.

    参考文献
    相似文献
    引证文献
引用本文

李忠根,龚盛豪,于浩然,朱轶凡,柳晴,高云君. GPU加速的高维向量聚类算法.软件学报,2026,37(3):1037-1057

复制
相关视频

分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2025-05-01
  • 最后修改日期:2025-06-30
  • 录用日期:
  • 在线发布日期: 2025-09-02
  • 出版日期:
文章二维码
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号