Time and Space Efficiencies Analysis of Full-Text Index Techniques
Affiliation:

  • Article
  • | |
  • Metrics
  • |
  • Reference [69]
  • |
  • Related [20]
  • |
  • Cited by [3]
  • | |
  • Comments
    Abstract:

    As one of the efficient methods of improving time and space efficiencies, the full-text index technique has been well studied in recent years. According to the implementation ways, it can be classified into three categories: Index technique, hybrid technique of index and compression, self-index technique. This paper reviews the recent researches on this topic, which include the main techniques such as inverted files, signature files, suffix trees, suffix arrays, compressed indices based on the indices mentioned above, self-index based on inverted files, and self-index based on suffix arrays. This paper also introduces the basic theories, problems, progress as well as space and time efficiencies of these techniques. Through a detailed efficiency analysis and comparison, the pros and cons of the techniques are given. Finally, the important features of these techniques are summarized, and the future research strategies and trends directions are pointed out as well.

    Reference
    [1] The UC Berkeley report. How much information? Report 1, UC Berkeley, 2003. 1?30.
    [2] Sievert MC. Full-Text information retrieval: Introduction. Journal of the American Society for Information Science,1996,47(4):261?262.
    [3] Liu XZ, Sun S, Zeng C, Peng ZY. An inverted index mechanisms based on buffers. Journal of Computer Research and Development, 2007,44(S):153?158 (in Chinese with English abstract).
    [4] Fox E, Harman D, Baeza-Yates R, Lee W. Inverted files. In: Frakes B, Baeza-Yates RA, eds. Information Retrieval: DataStructures and Algorithms. Englewood Cliffs: Prentice-Hall, 1992.
    [5] Zobel J, Moffat A. Inverted files for text search engines. ACM Computing Surveys, 2006,38(2):1?56.
    [6] Carterette B, Can F. Comparing inverted files and signature files for searching a large lexicon. Information Processing and Management, 2005,41(8):613?633.
    [7] Faloutsos C. Access methods for text. ACM Computing Surveys, 1985,17(1):49?74.
    [8] Faloutsos C, Christodoulakis S. Signature files: An access method for documents and its analytical performance evaluation. ACM Trans. on Office Information Systems, 1984,2(4):267?288.
    [9] Mccreight EM. A space-economical suffix tree construction algorithm. Journal of the ACM, 1976,23(2):262?272.
    [10] Manber U, Myers E. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 1993,22(5):935?948.
    [11] Gonnet GH, Baeza Y, Snider RA. New indices for text: PAT trees and PAT arrays. In: Frakes B, Baeza-Yates RA, eds. Information Retrieval: Data Structures and Algorithms. Englewood Cliffs: Prentice-Hall, 1992.
    [12] Shieh W Y, Chung C P. A statistics-based approach to incrementally update inverted files. Information Processing and Management, 2005,41(5):275?288.
    [13] Bell TC, Moffat A, Craig GN, Witten IH, Zobel J. Data compression in full-text retrieval systems. Journal of the American Society for Information Science, 1993,44(9):508?531.
    [14] Faloutsos C. Signature files. In: Frakes B, Baeza-Yates RA, eds. Information Retrieval: Data Structures and Algorithms.Englewood Cliffs: Prentice-Hall, 1992.
    [15] Sacks-Davis R, Kent A, Ramamohanarao K. Multikey access methods based on superimposed coding techniques. ACM Trans. on Database System, 1987,12(4):655?696.
    [16] Sacks-Davis R, Kent A, Ramamohanarao K, Thom J, Zobel J. Atlas: A nested relational database system for text applications. IEEE Trans. on Knowledge and Data Engineering, 1995,7(3):454?470.
    [17] Bookstein A, Klein ST. Using bitmaps for medium sized information retrieval systems. Information Processing and Management,1990,26(5):525?533.
    [18] Weiner P. Linear pattern matching algorithm. In: Pery NH, eds. Proc. of the 14th Annual IEEE Symp. on Switching and Automata Theory. IEEE Press, 1973. 1?11.
    [19] Ukkonen E. On-Line construction of suffix trees. Algorithmica, 1995,14(3):249?260.
    [20] Kurtz S. Reducing the space requirements of suffix trees. Software—Practice and Experience, 1999,29(13):1149?1171.
    [21] Abouelhoda M, Kurtz S, Ohlebusch E. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms,2004,2(1):53?86.
    [22] Zobel J, Moffat A, Kotagiri R. Inverted files versus signature files for text indexing. ACM Trans. on Database Systems,1998,23(4):453?490.
    [23] Zhou SG, Hu YF, Guan JH. Adjacency matrix based full-text indexing models. Journal of Software, 2002,13(10):1933?1942 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/13/1933.htm
    [24] Liu XW, Tao XP, Yu Y, Hu YF. A new full-text index model-subsequence array model. Journal of Software, 2002,13(1):150?158(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/13/150.htm
    [25] Elias P. Universal codeword sets and representation of the integers. IEEE Trans. on Information Theory, 1975,21(2):194?203.
    [26] Bentley JL, Yao A. An almost optimal algorithm for unbounded searching. Information Processing Letter, 1976,5(3):82?87.
    [27] Golomb SW. Run-Length encodings. IEEE Trans. on Information Theory, 1966,12(3):399?401.
    [28] Gallager R, Voorhis DV. Optimal source codes for geometrically distributed integer alphabets. IEEE Trans. on Information Theory,1975,21(2):228?230.
    [29] Croft WB, Savino P, Dor O. Implementing ranking strategies using text signatures. ACM Trans. on Office Information Systems,1988,6(1):42?62.
    [30] Whitte IH, Moffat A, Bell TC. Compression and full-text indexing for digital libraries. SIGOIS Bulletin, 1994,15(1):11?13.
    [31] Moffat A, Turpin A, Katajainen J. Space-Efficient construction of optimal prefix codes. In: Hec CC, ed. Proc. of the 5th IEEE Data Compression Conf. Snowbird: IEEE Computer Society Press, 1995. 192?201.
    [32] Bookstein A, Klein ST, Raita T. Simple Bayesian model for bitmap compression. Information Retrieval, 2000,1(4):315?328.
    [33] Witten IH, Moffat A, Bell TC. Managing Gigabytes: Compressing and Indexing Documents and Images. 2nd ed., San Francisco:Morgan Kaufmann Publishers, 1999. 50?73.
    [34] Moffat A, Zobel J. Parameterised compression for sparse bitmaps. In: Seyit K, ed. Proc. of the ACM Int’l Conf. on Research and Development in Information Retrieval. New York: ACM Press, 1992. 274?285.
    [35] Chiuen T, Varadarajan S. SASE: Implementation of a compressed text search engine. In: Mary CH, ed. Proc. of the 1st USENIXSymp. on Internet Technologies and Systems. USENIX Association, 1997. 47?56.
    [36] Navarro G, Moura ESD, Neubert M. Adding compression to block addressing inverted indexes. Information Retrieval,2000,3(1):49?77.
    [37] Ciaccia P, Tiberio P, Zezula P. Declustering of key-based partitioned signature files. ACM Trans. on Database System,1996,21(3):295?338.
    [38] Kocberbera S, Can F. Vertical framing of superimposed signature files using partial evaluation of queries. Information Processing and Management, 1997,33(3):353?376.
    [39] Lee DL, Kim YM, Patel G. Efficient signature file methods for text retrieval. IEEE Trans. on Knowledge and Data Engineering,1995,7(3):423?435.
    [40] Faloutsos C. Signature files: Design and performance comparison of some signature extraction methods. In: Ciaccia P, ed. Proc. of the 8th ACM-SIGIR Conf. on Information Retrieval. New York: ACM Press, 1985. 63?82.
    [41] Ciaccia P, Zfzula P. Estimating accesses in partitioned signature file organizations. ACM Trans. on Information System,1993,11(2):133?142.
    [42] Kazutaka F, Kazushige A, Atsushi I. Implementation and performance evaluation of compressed bit-sliced signature files. In: Can F,d. Proc. of the Conf. on Information Systems and Management of Data. New York: ACM Press, 1999. 164?177.
    [43] Seyit K, Fazh C. Compressed muti-framed signature files: An index structure for fast information retrieval. In: Kurtz S, ed. Proc. of the System Algorithm Conf. New York: ACM Press, 1999. 221?226.
    [44] Ferragina P, Manzini G. Opportunistic data structures with applications. In: Atsushi I, ed. Proc. of the 41st IEEE Symp. on Foundations of Computer Science (FOCS). New Jersey: IEEE Press, 2000. 390?398.
    [45] Burrows M, Wheeler D. A block sorting lossless data compression algorithm. Technical Report, 124, Digital Equipment Corporation, 1994. 1?24.
    [46] Moffat A, Zobel J. Self-Indexing inverted files for fast text retrieval. ACM Trans. on Information Systems, 1996,14(4):349?379.
    [47] Makinen V. Compact suffix array—A space-efficient full-text index. Fundamenta Informaticae, 2003,56(1-2):191?210.
    [48] Grossi R, Vitter JS. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journalon Computing, 2005,35(2):378?407.
    [49] Couvreur TR, Benzel RN, Miller SF, Zeitler DN, Lee D L. An analysis of performance and cost factors in searching large text databases using parallel search systems. Journal of the American Society for Information Science, 1994,45(7):443?464.
    [50] Gonzalo N, Veli M. Compressed full-text indexes. ACM Computing Surveys, 2007,39(1):1?61.
    [51] Grossi R, Gupta A, Vitter J. High-Order entropy-compressed text indexes. In: Rauhe T, ed. Proc. of the 14th Annual ACM-SIAM Symp. on Discrete Algorithms (SODA). New York: ACM Press, 2003. 841?850.
    [52] Lempel A, Ziv J. On the complexity of finite sequences. IEEE Trans. on Information Theory, 1976,22(1):75?81.
    [53] Bentley J, Sleator D, Tarjan R, Wei V. A locally adaptive compression scheme. Communications of the ACM, 1986,29(4):320?330.
    [54] Makinen V, Navarro G. Succinct suffix arrays based on run-length encoding. Lecture Notes in Computer Science, 2005,3537:45?56.
    [55] Sadakane K. Succinct representations of LCP information and improvements in the compressed suffix arrays. In: Tarjan R, ed. Proc.of the 13th Annual ACM-SIAM Symp. on Discrete Algorithms (SODA). New York: ACM Press, 2002. 225?232.
    [56] Makinen V, Navarro G. Run-Length FM-index. In: Rauhe T, ed. Proc. of the Burrows-Wheeler Transform: Ten Years Later. NewYork: ACM Press, 2004. 17?19.
    [57] Manzini G. An analysis of the Burrows-Wheeler transform. Journal of the ACM, 2001,48(3):407?430.
    [58] Ferragina P, Manzini G, Makinen V, Navarro G. An alphabet-friendly FM-index. In: Apostolico A, ed. Proc. of the 11th Int’l Symp.on String Processing and Information Retrieval (SPIRE). Berlin, Heidelberg: Springer-Verlag, 2004. 150?160.
    [59] Ferragina P, Manzini G, Makinen V, Navarro G. Compressed representation of sequences and full-text indexes. ACM Trans. on Algorithms, 2007,3(2):1?24.
    [60] Sadakane K. Compressed text databases with efficient query algorithms based on the compressed suffix array. In: Apostolico A, ed.Proc. of the 11th Int’l Symp. on Algorithms and Computation (ISAAC). LNCS 1969, Berlin, Heidelberg: Springer-Verlag, 2000. 410?421.
    [61] Sadakane K. New text indexing functionalities of the compressed suffix arrays. Journal of the Algorithms, 2003,48(2):294?313.
    [62] Grossi R, Gupta A, Vitter J. When indexing equals compression: Experiments with compressing suffix arrays and applications. ACM Trans. on Algorithms, 2006,2(4):611?639.
    [63] Karkkainen J, Ukkonen E. Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Gupta A, ed. Proc. of the 3rd South American Workshop on String Processing (WSP). Carleton University Press, 1996. 141?155.
    [64] Alstrup S, Brodal G, Rauhe T. New data structures for orthogonal range searching. In: Vitter J, ed. Proc. of the 41st IEEE Symp. on Foundations of Computer Science (FOCS). IEEE Press, 2000. 198?207.
    [65] Ferragina P, Manzini G. Indexing compressed texts. Journal of the ACM, 2005,52(4):552?581.
    [66] PizzaChili project. 2005. http://pizzachili.dcc.uchile.cl 附中文参考文献:
    [3] 刘小珠,孙莎,曾承,彭智勇.基于缓存的倒排索引机制研究.计算机研究与发展,2007,44(S):153?158.
    [23] 周水庚,胡运发,关佶红.基于邻接矩阵的全文索引模型,软件学报,2002,13(10):1933?1942. http://www.jos.org.cn/1000-9825/ 13/1933.htm
    [24] 刘学文,陶晓鹏,于玉,胡运发.一种全新的全文索引模型—后继数组模型,软件学报,2002,13(1):150?158. http://www.jos.org.cn/ 1000-9825/13/150.htm
    Comments
    Comments
    分享到微博
    Submit
Get Citation

刘小珠,彭智勇.全文索引技术时空效率分析.软件学报,2009,20(7):1768-1784

Copy
Share
Article Metrics
  • Abstract:8060
  • PDF: 12163
  • HTML: 0
  • Cited by: 0
History
  • Received:March 21,2008
  • Revised:October 07,2008
You are the first2044946Visitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063