Traditional indexing methods face the difficulty of 慶urse of dimensionality?at high dimensionality. Accurate estimate of data distribution and efficient partition of data space are the key problems in high-dimensional indexing schemes. In this paper, a novel indexing method using vector quantization is proposed. It assumes a Gaussian mixture distribution which fits real-world image data reasonably well. After estimating this distribution through EM (expectation-maximization) method, this approach trains the optimized vector quantizers to partition the data space, which will gain from the dependency of dimensions and achieve more accurate vector approximation and less quantization distortion. Experiments on a large real-world dataset show a remarkable reduction of I/O overhead of the vector accesses which dominate the query time in the exact NN (nearest neighbor) searches. They also show an improvement on the indexing performance compared with the existing indexing schemes.
[1]Rui Y, Huang TS, Chang SF. Image retrieval: Current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, 1999,10(4):39~62.
[2]Rui Y, Huang TS, Ortega M, Mehrotra S. Relevance feedback: A power tool for interactive content-based image retrieval. IEEE Trans. on Circuits and Systems for Video Technology, 1998,8(5):644~655.
[3]Nievergelt J, Hinterberger H, Sevcik K. The gridfile: An adaptable symmetric multikey file structure. ACM Trans. on Database Systems, 1984,9(1):38~71.
[4]Robinson J. The k-d-b-tree: A search structure for large multidimensional dynamic indexes. In: Edmund YL, ed. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. New York: ACM Press, 1981.10~18.
[5]Beckmann N, Kriegel HP, Schneider R, Seeger B. The R*-tree: An efficient and robust access method for points and rectangles. In:Garcia-Molina H, Jagadish HV, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. New York: ACM Press, 1990.322~331.
[6]Katayama N, Satoh S. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In: Peckham J, ed. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. New York: ACM Press, 1997. 369~380.
[7]Weber R, Schek H, Blott S. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Gupta A, Shmueli O, Widom J, eds. Proc. of the 24th ACM Int'l Conf. on Very Large Data Bases (VLDB'98). New York: Morgan Kaufmann Publishers, 1998. 194~205.
[8]Beyer K, Goldstein J, Ramakrishnan R. When is ‘nearest neighbor' meaningful? In: Beeri C, Buneman P, eds. Proc. of the 7th ACM Int'l Conf. on Database Theory (ICDT'99). Lecture Notes in Computer Science 1540, Berlin: Springer-Verlag, 1999.217~235.
[9]Scott DW. Multivariate Density Estimation. New York: John Wiley and Sons, 1992.
[10]Gersho A, Gray RM. Vector Quantization and Signal Compression. Boston: Kluwer Academic Press, 1992.
[11]Sun SH, Lu ZM. Vector Quantization Technology and Applications. Beijing: Science Press, 2002 (in Chinese).
[12]Lookabaugh TD, Gray RM. High-Resolution theory and the vector quantizer advantage. IEEE Trans. on Information Theory,1989,35(5):1020~1033.
[13]Wu P, Manjunath B, Chandrasekaran S. An adaptive index structure for high-dimensional similarity search. In: Shnm HY, Liao M,Chang SF, eds. Proc. of Advances in Multimedia Information Processing-PCM 2001, the 2nd IEEE Pacific Rim Conf. on Multimedia. Lecture Notes in Computer Science 2195, Berlin: Springer-Verlag, 2001.71~77.
[14]Ferhatosmanoglu H, Tuncel E, Agrawal D. Vector approximation based indexing for non-uniform high dimensional data sets. In:Proc. of the ACM Int'l Conf. on Information and Knowledge Management (CIKM 2000). New York: ACM Press, 2000. 202~209.
[15]Forgy E. Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 1965,21(3):768.
[16]Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 1977,39(1):1~38.
[18]Manjunath BS, Ma WY. Texture features for browsing and retrieval of image data. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1996,18(8):837~842.