面向批量更新的向量索引召回率优化
CSTR:
作者:
作者单位:

作者简介:

通讯作者:

中图分类号:

基金项目:

国家重点研发计划 (2023YFC3341200); 中兴通讯产学研合作基金 (IA20250625030)


Vector Index Recall Optimization for Batch Updates
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    近似最近邻搜索 (approximate nearest neighbor search, ANNS)是支撑向量数据库、推荐系统及大语言模型等上层应用的关键技术. 其中, 分层可导航小世界 (hierarchical navigable small world, HNSW)图索引通过构建层级化结构, 迅速定位结果至目标区域, 从而以较低的计算成本实现较高的检索召回率. 然而, 现有HNSW算法主要面向静态数据检索场景而设计, 而忽略了数据更新对检索性能的影响. 通过对现实数据集的研究发现, 向量数据库中的数据通常以批量方式进行更新, 其相似特性会削弱HNSW算法中启发式剪枝的有效性, 并诱发相似向量连接的稀疏化问题, 共同造成查询召回率的显著下降. 针对上述问题, 提出一种基于图结构局部调整的自适应细粒度剪枝策略, 构建了融合识别与修复机制的优化方案. 首先, 在识别阶段, 通过计算区域邻居距离量化局部拓扑密度, 从而精准定位待干预的致密区域. 其次, 在修复阶段, 针对处于致密区域的枢纽节点, 采用双重剪枝的邻居选择策略: 协同应用原生的与修正的启发式剪枝规则, 合并两种规则的结果集以在保证检索精度的同时提升邻居连接的多样性, 有效缓解过度剪枝与连接稀疏化问题. 在多个公开数据集上的实验结果表明, 所提方法对数据更新频繁的场景具备良好的适应性, 在维持查询延迟和吞吐量稳定的前提下, 实现了1%–4%的召回率提升.

    Abstract:

    Approximate nearest neighbor search (ANNS) is a foundational technology supporting applications such as vector databases, recommendation systems, and large language models (LLMs). Among these, the hierarchical navigable small world (HNSW) graph indexing technique constructs a hierarchical structure to quickly locate results within the target region, thus achieving high retrieval recall at low computational cost. However, existing HNSW algorithms are primarily designed for static data retrieval scenarios and fail to account for the impact of data updates on retrieval performance. Through research on real-world datasets, it is found that data in vector databases is typically updated in batches, and their similar characteristics weaken the effectiveness of heuristic pruning in HNSW algorithms and lead to sparsification issues in the connections among similar vectors, collectively causing a significant decline in retrieval recall. To address these issues, this study proposes an adaptive fine-grained pruning strategy based on local adjustments to the graph structure and constructs a comprehensive optimization scheme that integrates an identification and repair mechanism. First, in the identification phase, the regional neighbor distance is calculated to quantify local topological density, thereby precisely locating the dense regions requiring intervention. Second, in the repair phase, for hub nodes in dense regions, a dual pruning neighbor selection strategy is adopted: native and modified heuristic pruning rules are applied synergistically, and the results of both rules are merged to enhance neighbor connection diversity while maintaining retrieval accuracy, effectively alleviating over-pruning and connection sparsification issues. Experimental results on multiple public datasets show that the proposed method demonstrates good adaptability in scenarios with frequent data updates, achieving a 1%–4% improvement in recall while maintaining stable query latency and throughput.

    参考文献
    相似文献
    引证文献
引用本文

王可,胡思劼,胡卉芪,赵明昊,魏星,屠要峰,周烜.面向批量更新的向量索引召回率优化.软件学报,2026,37(3):1084-1103

复制
相关视频

分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2025-05-07
  • 最后修改日期:2025-06-30
  • 录用日期:
  • 在线发布日期: 2025-09-02
  • 出版日期:
文章二维码
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号