[关键词]
[摘要]
基于局部敏感哈希的检索方法能够较好地解决高维大规模数据的近似近邻检索问题.但在开放环境下针对多种分布特性时,迄今尚未有令人满意的解决方案.利用Laplacian算子对数据分布剧烈变化敏感的特性,提出一种具有全局性、适用于开放环境下多种分布特性的基于Laplacian算子的局部敏感哈希搜索方法(LPLSH).该方法把Laplacian算子应用于数据投影的概率密度分布,找到数据投影分布的剧烈变化位置作为超平面的偏移量.从理论上证明了精简维度的哈希函数能够保持局部敏感性及低投影密度区间分割的有效性,分析了利用Laplacian算子计算的二阶导数对超平面偏移量设置的指导意义.与其他8种方法对比,LPLSH算法的F1值是其他方法最优值的0.8倍-5倍,耗费时间也大幅减少.通过对具有多种分布特性数据集上的实验验证,结果表明:LPLSH方法能够同时兼顾效率、精度和召回率,可满足开放环境下多分布特性的大规模高维检索的鲁棒性需求.
[Key word]
[Abstract]
The retrieval methods based-on locality-sensitive hashing (LSH) provide a feasible solution to the problem of approximate nearest neighbor (ANN) search on high-dimensional, multiple distributed characteristics, and massive data. However, there are still some unresolved problems in open environment, such as poor adaptability to the data with multiple distribution characteristics. Based on the fact that Laplacian operator is sensitive to sharp changes in data, an LSH retrieval method based on Laplacian operator (LPLSH) is proposed, which is suitable for data in open environment with a variety of distributed characteristics, and can segment data on global view. By applying Laplacian operator to the probability density distribution of data projection, the position of the sharp change of distribution will found as the offset of the hyperplane. This study proves theoretically that the reduced dimension can keep the local sensitivity characteristics of the hash function, and the global low projection density interval segmentation is helpful to improve the precision. The guiding significance of using Laplacian operator to obtain the second derivative to set the hyperplane offset is also analyzed. Compared with the other 8 methods based on LSH, the F1 value of LPLSH is 0.8-5 times of the optimal value of other methods, and it takes less time. Through the analysis of the distribution characteristics of experimental datasets, the experimental results show that LPLSH can take into account the efficiency, accuracy, and recall rate at the same time, can meet the robustness requirements of large-scale high- dimensional retrieval with multi-distribution characteristics in open environment.
[中图分类号]
[基金项目]
国家自然科学基金(61772004);福建省科技重大项目(2020H6011);福建省自然科学基金(2020J01161)