Abstract:Under the filter-and-refine framework and based on the learning techniques, a data-aware method for c-approximate nearest neighbor query for high-dimensional data is proposed in this paper. The study claims that data after random projection satisfies the entropy-maximizing criterion which is needed by the semantic hashing. The binary codes after random projection are treated as the labels, and a group of classifiers are trained, which are used for predicting the binary code for the query. The data objects are selected who’s Hamming distances between the query satisfying the threshold as the candidates. The real distances are evaluated on the candidate subset and the smallest one is returned. Experimental results on the synthetic datasets and the real datasets show that this method outperforms the existing work with shorter binary code, in addition, the performance and the result quality can be easily tuned.