High Dimensional Hybrid Index Based on Query Sampling

Authors:
ZHANG Jun-Qi (张军旗)
Department of Machine Intelligence, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871; Key Laboratory of Machine Perception (Ministry of Education), School of Electronics Engineering and Computer Science, Peking University, Beijing 100871; Department of Computing and Information Technology, Fudan University, Shanghai 200433
ZHOU Xiang-Dong (周向东)
Department of Computing and Information Technology, Fudan University, Shanghai 200433
SHI Bai-Le (施伯乐)
Department of Computing and Information Technology, Fudan University, Shanghai 200433

Funding: Supported by the National Natural Science Foundation of China under Grant No. 60403018; the National Basic Research Program of China (973 Program) under Grant No. 2005CB321905; the Natural Science Foundation of Shanghai, China under Grant No. 04ZR14011; and the College Cooperation Plan of AMD.



Abstract:

Choosing an appropriate indexing strategy for a high-dimensional database usually requires knowledge of the data distribution. However, classical distribution models cannot accurately estimate the complex distributions of real high-dimensional multimedia data such as images and video. This paper presents a method for estimating the data distribution by query sampling and, based on it, proposes a hybrid index that supports K-nearest neighbor (KNN) queries. To cope with the non-uniform distribution of multimedia data, the hybrid index adaptively applies different index structures to data with different distributions, all within a single unified index. The index is built constructively. First, the data are partitioned by cluster decomposition and a tree index is built. Next, a query sampling algorithm estimates the actual data distribution. Finally, based on the estimated distribution, sparse data are pruned from the tree index and indexed for sequential scan, while densely distributed data remain in the tree. Extensive experiments on four real image data sets show that the proposed index clearly outperforms metric-space indexes such as iDistance and M-Tree, scales better with dimensionality, and is still faster than sequential scan when the dimensionality reaches 336. The experiments also show that the query sampling algorithm obtains a distribution estimate adequate for indexing when the sample size is below [...] (N is the size of the data set).

Key words: nearest neighbor query; sampling; high dimensional index; marginal data; cluster partitioning
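
The three construction steps named in the abstract (partition and build a tree-like index, estimate the distribution by query sampling, move sparse branches to a scan list) can be illustrated with a small sketch. This is not the authors' implementation. Random pivots stand in for the clustering step, a flat list of ball-shaped clusters stands in for the tree, and the sparsity test (a cluster is sparse when its mean sampled k-th neighbor distance exceeds `factor` times the overall median) is an assumption made purely for illustration. All function and parameter names are hypothetical.

```python
import heapq
import math
import random
from statistics import mean, median


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_knn(data, q, k):
    """Exact kNN by sequential scan; returns sorted (distance, index) pairs."""
    return heapq.nsmallest(k, ((dist(p, q), i) for i, p in enumerate(data)))


def build_hybrid(data, n_clusters=8, n_samples=30, k=5, factor=1.5, seed=0):
    rng = random.Random(seed)
    # Step 1 (assumed stand-in for cluster decomposition): assign every
    # point to the nearest of a few randomly chosen pivots.
    pivots = [data[i] for i in rng.sample(range(len(data)), n_clusters)]
    assign = [min(range(n_clusters), key=lambda j: dist(p, pivots[j]))
              for p in data]
    # Step 2 (query sampling): use sampled data points as queries and record
    # each query's k-th neighbor distance under the cluster it belongs to.
    radii = [[] for _ in pivots]
    for i in rng.sample(range(len(data)), n_samples):
        kth = brute_knn(data, data[i], k + 1)[-1][0]  # k + 1 skips the query
        radii[assign[i]].append(kth)
    typical = median(r for rs in radii for r in rs)
    # Step 3: clusters that look sparse go to the sequential-scan list;
    # the rest stay in the cluster structure with a bounding radius.
    clusters, scan = [], []
    for c, pivot in enumerate(pivots):
        idxs = [i for i, a in enumerate(assign) if a == c]
        if not idxs:
            continue
        if radii[c] and mean(radii[c]) > factor * typical:
            scan.extend(idxs)
        else:
            radius = max(dist(data[i], pivot) for i in idxs)
            clusters.append((pivot, radius, idxs))
    return {"clusters": clusters, "scan": scan}


def hybrid_knn(data, index, q, k):
    """Exact kNN: scan the sparse list, then visit clusters with pruning."""
    best = []  # max-heap emulated with negated distances: (-distance, index)

    def offer(i):
        d = dist(data[i], q)
        if len(best) < k:
            heapq.heappush(best, (-d, i))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, i))

    for i in index["scan"]:
        offer(i)
    # The triangle inequality gives a lower bound on the distance from q to
    # any point of a cluster: max(0, dist(q, pivot) - radius).
    bounds = sorted(((max(0.0, dist(q, p) - r), idxs)
                     for p, r, idxs in index["clusters"]),
                    key=lambda t: t[0])
    for lower, idxs in bounds:
        if len(best) == k and lower > -best[0][0]:
            break  # no remaining cluster can contain a closer point
        for i in idxs:
            offer(i)
    return sorted((-nd, i) for nd, i in best)
```

Comparing `hybrid_knn` with `brute_knn` on the same queries is a quick way to check that the pruning never drops a true neighbor. How much work the pruning saves depends on the data and on the assumed threshold `factor`.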

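Cite this article

ZHANG Jun-Qi, ZHOU Xiang-Dong, SHI Bai-Le. High dimensional hybrid index based on query sampling. Journal of Software (软件学报), 2008, 19(8): 2054-2065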


History

Received: 2007-03-26
Last revised: 2007-08-03
