[关键词]
[摘要]
传统索引方法对高维数据存在"维数灾难"的困难.而对数据分布的精确描述及对数据空间的有效划分是高维索引机制中的关键问题.提出一种基于矢量量化的索引方法.该方法使用高斯混合模型描述数据的整体分布,并训练优化的矢量量化器划分数据空间.高斯混合模型能更好地描述真实图像库的数据分布;而矢量量化的划分方法可以充分利用维之间的统计相关性,能够对数据向量构造出更加精确的近似表示,从而提高索引结构的过滤效率并减少需要访问的数据向量.在大容量真实图像库上的实验表明,该方法显著减少了支配检索时间的I/O开销,提高了索引性能.
[Key word]
[Abstract]
Traditional indexing methods face the difficulty of 慶urse of dimensionality?at high dimensionality. Accurate estimate of data distribution and efficient partition of data space are the key problems in high-dimensional indexing schemes. In this paper, a novel indexing method using vector quantization is proposed. It assumes a Gaussian mixture distribution which fits real-world image data reasonably well. After estimating this distribution through EM (expectation-maximization) method, this approach trains the optimized vector quantizers to partition the data space, which will gain from the dependency of dimensions and achieve more accurate vector approximation and less quantization distortion. Experiments on a large real-world dataset show a remarkable reduction of I/O overhead of the vector accesses which dominate the query time in the exact NN (nearest neighbor) searches. They also show an improvement on the indexing performance compared with the existing indexing schemes.
[中图分类号]
[基金项目]
Supported by the National Natural Science Foundation of China under Grant No.60273005(国家自然科学基金)