流数据分析与管理综述
作者:
基金项目:

Supported by the National High-Tech Research and Development Plan of China under Grant No.2002AA413310 (国家高技术研究发展计划(863))

  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [52]
  • |
  • 相似文献 [20]
  • |
  • 引证文献
  • | |
  • 文章评论
    摘要:

    有关流数据分析与管理的研究是目前国际数据库研究领域的一个热点.在过去30多年中,尽管传统数据库技术发展迅速且得到了广泛应用,但是它不能够处理在诸如网络路由、传感器网络、股票分析等应用中所生成的一种新型数据,即流数据.流数据的特点是数据持续到达,且速度快、规模宏大;其研究核心是设计高效的单遍数据集扫描算法,在一个远小于数据规模的内存空间里不断更新一个代表数据集的结构--概要数据结构,使得在任何时候都能够根据这个结构迅速获得近似查询结果.综述国际上关于流数据的概要数据结构生成与维护的研究成果,并通过列举解决流数据上两个重要问题的各种方案来比较各种算法的特点以及优劣.

    Abstract:

    The study on streaming data is one of the hot topics among the database circle all over the world recently. During the past three decades, conventional database technologies have been well developed and widely applied. Unfortunately, they could not be adopted to handle a new kind of data, named streaming data, which is generated from applications such as network routing, sensor networking, stock analysis, etc. Because of the rapid data arriving speed and huge size of data set in stream model, novel algorithms that only require seeing the whole data set once are devised to support aggregation queries on demand. In addition, this kind of algorithms usually owns a data structure far smaller than the size of the whole data set. The ways to devise such synopsis data structures are introduced. These different approaches are also compared by listing historical works upon two classical problems over stream.

    参考文献
    [1]Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data streams. In: Popa L, ed. Proc. of the 21st ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems. Madison: ACM Press, 2002. 1~16.
    [2]Terry D, Goldberg D, Nichols D, Oki B. Continuous queries over append-only databases. SIGMOD Record, 1992,21(2):321~330.
    [3]Avnur R, Hellerstein J. Eddies: Continuously adaptive query processing. In: Chen W, Naughton JF, Bernstein PA, eds. Proc. of the 2000 ACM SIGMOD Int'l Conf. on Management of Data. Dallas: ACM Press, 2000. 261~272.
    [4]Hellerstein J, Franklin M, Chandrasekaran S, Deshpande A, Hildrum K, Madden S, Raman V, Shah MA. Adaptive query processing: Technology in evolution. IEEE Data Engineering Bulletin, 2000,23(2):7~18.
    [5]Carney D, Cetinternel U, Cherniack M, Convey C, Lee S, Seidman G, Stonebraker M, Tatbul N, Zdonik S. Monitoring streams?A new class of DBMS applications. Technical Report, CS-02-01, Providence: Department of Computer Science, Brown University, 2002.
    [6]Guha S, Mishra N, Motwani R, O'Callaghan L. Clustering data streams. In: Blum A, ed. The 41st Annual Symp. on Foundations of Computer Science, FOCS 2000. Redondo Beach: IEEE Computer Society, 2000. 359~366.
    [7]Domingos P, Hulten G. Mining high-speed data streams. In: Ramakrishnan R, Stolfo S, Pregibon D, eds. Proc. of the 6th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. Boston: ACM Press, 2000. 71~80.
    [8]Domingos P, Hulten G, Spencer L. Mining time-changing data streams. In: Provost F, Srikant R, eds. Proc. of the 7th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. San Francisco: ACM Press, 2001. 97~106.
    [9]Zhou A, Cai Z, Wei L, Qian W. M-Kernel merging: Towards density estimation over data streams. In: Cha SK, Yoshikawa M, eds. The 8th Int'l Conf. on Database Systems for Advanced Applications (DASFAA 2003). Kyoto: IEEE Computer Society, 2003. 285~292.
    [10]Gibbons PB, Matias Y. Synopsis data structures for massive data sets. In: Tarjan RE, Warnow T, eds. Proc. of the 10th Annual ACM-SIAM Symp. on Discrete Algorithms. Baltimore: ACM/SIAM, 1999. 909~910.
    [11]Garofalakis M, Gehrke J, Rastogi R. Querying and mining data stream: you only get one look?A tutorial. In: Franklin MJ, Moon B, Ailamaki A, eds. Proc. of the 2002 ACM SIGMOD Int'l Conf. on Management of Data. Madison: ACM Press, 2002. 635.
    [12]Kooi RP. The optimization of queries in relational databases [Ph.D. Thesis]. Cleveland: Case Western Reserve University, 1980.
    [13]Piatetsky-Shapiro G, Connell C. Accurate estimation of the number of tuples satisfying a condition. SIGMOD Record, 1984,14(2): 256~276.
    [14]Gibbons PB, Matias Y, Poosala V. Fast incremental maintenance of approximate histograms. In: Jarke M, Carey MJ, Dittrich KR, Lochovsky FH, Loucopoulos P, Jeusfeld MA, eds. VLDB'97, Proc. of the 23rd Int'l Conf. on Very Large Data Bases. Athens: Morgan Kaufmann, 1997. 466~475.
    [15]Greenwald M, Khanna S. Space-Efficient online computation of quantile summaries. In: Proc. of the 2001 ACM SIGMOD Int'l Conf. on Management of Data. Santa Barbara: ACM Press, 2001. 58~66.
    [16]Poosala V, Ioannidis Y, Haas P, Shekita E. Improved histograms for selectivity estimation of range predicates. SIGMOD Record, 1996,25(2):294~305.
    [17]Ioannidis Y, Poosala V. Balancing histogram optimality and practicality for query result size estimation. SIGMOD Record, 1995, 24(2):233~244.
    [18]Jagadish HV, Koudas N, Muthukrishnan S, Poosala V, Sevcik K, Suel T. Optimal histograms with quality guarantees. In: Gupta A, Shmueli O, Widom J, eds. VLDB'98, Proc. of the 24th Int'l Conf. on Very Large Data Bases. New York: Morgan Kaufmann, 1998. 275~286.
    [19]Guha S, Koudas N, Shim K. Data-Streams and histograms. In: Yannakakis M, ed. Proc. of the 33rd Annual ACM Symp. on Theory of Computing. Heraklion: ACM Press, 2001. 471~475.
    [20]Gilbert A, Guha S, Indyk P, Kotidis Y, Muthukrishnan S, Strauss M. Fast, small-space algorithms for approximate histogram maintenance. In: Reif JH, ed. Proc. of the 34th Annual ACM Symp. on Theory of Computing. Montréal: ACM Press, 2002. 389~398.
    [21]Vitter JS. Random sampling with a reservoir. ACM Trans. on Mathematical Software, 1985,11(1):37~57.
    [22]Gibbons PB, Matias Y. New sampling-based summary statistics for improving approximate query answers. In: Haas LM, Tiwary A, eds. SIGMOD 1998, Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Seattle: ACM Press, 1998. 331~342.
    [23]Jawerth B, Sweldens W. An overview of wavelet based multiresolution analyses. SIAM Review, 1994,36(3):377~412.
    [24]Matias Y, Vitter JS, Wang M. Wavelet-Based histograms for selectivity estimation. In: Haas LM, Tiwary A, eds. SIGMOD 1998, Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Seattle: ACM Press, 1998. 448~459.
    [25]Matias Y, Vitter JS, Wang M. Dynamic maintenance of wavelet-based histograms. In: Abbadi AE, Brodie ML, Chakravarthy S, Dayal U, Kamel N, Schlageter G, Whang KY, eds. VLDB 2000, Proc. of the 26th Int'l Conf. on Very Large Data Bases. Cairo: Morgan Kaufmann, 2000. 101~110.
    [26]Gilbert AC, Kotidis Y, Muthukrishnan S, Strauss MJ. Surfing wavelets on streams: One-Pass summaries for approximate aggregate queries. In: Apers PMG, Atzeni P, Ceri S, Paraboschi S, Ramamohanarao K, Snodgrass RT, eds. VLDB 2001, Proc. of the 27th Int'l Conf. on Very Large Data Bases. Roma: Morgan Kaufmann, 2001. 79~88.
    [27]Garofalakis M, Gibbons PB. Wavelet synopses with error guarantees. In: Franklin MJ, Moon B, Ailamaki A, eds. Proc. of the 2002 ACM SIGMOD Int'l Conf. on Management of Data. Madison: ACM Press, 2002. 476~487.
    [28]Bloom B. Space/time tradeoffs in hash coding with allowable errors. Communications of the ACM, 1970,13(7):422~426.
    [29]Fan L, Cao P, Almeida J, Broder AZ. Summary cache: A scalable wide-area Web cache sharing protocol. IEEE/ACM Trans. on Networking, 2000,8(3):281~293.
    [30]Mullin JK. Optimal semijoins for distributed database systems. IEEE Trans. on Software Engineering, 1990,16(5):558~560.
    [31]Marais H, Bharat K. Supporting cooperative and personal surfing with a desktop assistant. In: UIST'97. Proc. of the 10th Annual ACM Symp. on User Interface Software and Technology. Banff: ACM Press, 1997. 129~138.
    [32]Cohen S, Matias Y. Spectral bloom filters. In: Halevy AY, Ives ZG, Doan AH, eds. Proc. of the 2003 ACM SIGMOD Int'l Conf. on Management of Data. San Diego: ACM Press, 2003. 241~252.
    [33]Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. In: Miller G, ed. Proc. of the 28th Annual ACM Symp. on the Theory of Computing. Philadelphia: ACM Press, 1996. 20~29.
    [34]Alon N, Gibbons PB, Matias Y, Szegedy M. Tracking join and self-join sizes in limited storage. In: Proc. of the 18th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems. Philadelphia: ACM Press, 1999. 10~20.
    [35]Charikar M, Chen K, Farach-Colton M. Finding frequent items in data streams. Theoretical Computer Science, 2004,312(1):3~15.
    [36]Indyk P. Stable distributions, pseudorandom generators, embeddings and data stream computation. In: Blum A, ed. The 41st Annual Symp. on Foundations of Computer Science, FOCS 2000. Redondo Beach: IEEE Computer Society, 2000. 189~197.
    [37]Cormode G. Stable distributions for stream computations: It's as easy as 0, 1, 2. 2003. http://www.research.att.com/conf/mpds2003/ schedule/cormode.ps
    [38]Flajolet P, Martin GN. Probabilistic counting algorithms for data base applications. Journal of Computer and Systems Sciences, 1992,31(2):182~209.
    [39]Ganguly S, Garofalakis M, Rastogi R. Processing set expressions over continuous update streams. In: Halevy AY, Ives ZG, Doan AH, eds. Proc. of the 2003 ACM SIGMOD Int'l Conf. on Management of Data. San Diego: ACM Press, 2003. 265~276.
    [40]Datar M, Gionis A, Indyk P, Motwani R. Maintaining stream statistics over sliding windows. In: Eppstein D, ed. Proc. of the 13th Annual ACM-SIAM Symp. on Discrete Algorithms. San Francisco: ACM/SIAM, 2002. 635~644.
    [41]Babcock B, Datar M, Motwani R, O'Callaghan L. Maintaining variance and k-Medians over data stream windows. In: Neven F, ed. Proc. of the 22nd ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems. San Diego: ACM Press, 2003. 234~243.
    [42]Gibbons PB, Tirthapura S. Distributed streams algorithms for sliding windows. In: SPAA 2002: Proc. of the 14th Annual ACM Symp. on Parallel Algorithms and Architectures. Winnipeg: ACM Press, 2002. 63~72.
    [43]Zhu Y, Shasha D. StatStream: Statistical monitoring of thousands of data streams in real time. In: Bernstein P, Ioannidis Y, Ramakrishnan R, eds. Proc. of the 28th Int'l Conf. on Very Large Data Bases. Hong Kong: Morgan Kaufmann, 2002. 358~369.
    [44]DeHaan D, Demaine ED, Golab L, Lopez-Ortiz L, Munro JI. Towards identifying frequent items in sliding windows. Technical Report, CS-2003-06, Waterloo: University of Waterloo, 2003.
    [45]Babcock B, Datar M, Motwani R. Sampling from a moving window over streaming data. In: Eppstein D, ed. Proc. of the 13th Annual ACM-SIAM Symp. on Discrete Algorithms. San Francisco: ACM/SIAM, 2002. 633~634.
    [46]Demaine E, L′opez-Ortiz A, Munro JI. Frequency estimation of Internet packet streams with limited space. In: Mohring RH, Raman R, eds. Algorithms--ESA 2002, Proc. of the 10th Annual European Symp. Rome: Springer-Verlag, 2002. 348~360.
    [47]Manku GS, Motwani R. Approximate frequency counts over data streams. In: Bernstein P, Ioannidis Y, Ramakrishnan R, eds. Proc. of the 28th Int'l Conf. on Very Large Data Bases. Hong Kong: Morgan Kaufmann, 2002. 346~357.
    [48]Cormode G, Muthukrishnan S. What's hot and what's not: Tracking most frequent items dynamically. In: Proc. of the 22nd ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems. San Diego: ACM Press, 2003. 296~306.
    [49]Jin C, Qian W, Sha C, Yu JX, Zhou A. Dynamically maintaining frequent items over a data stream. In: Carbonell J, ed. Proc. of the 2003 ACM CIKM Int'l Conf. on Information and Knowledge Management. New Orleans: ACM Press, 2003. 287~294.
    [50]Manku GS, Rajagopalan S, Lindsay BG. Approximate medians and other quantiles in one pass and with limited memory. In: Haas LM, Tiwary A, eds. SIGMOD 1998, Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Seattle: ACM Press, 1998. 426~435.
    [51]Manku GS, Rajagopalan S, Lindsay BG. Random sampling techniques for space efficient online computation of order statistics of large datasets. In: Delis A, Faloutsos C, Ghandeharizadeh S, eds. SIGMOD 1999, Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Philadelphia: ACM Press, 1999. 251~262.
    [52]Gilbert AC, Kotidis Y, Muthukrishnan S, Strauss MJ. How to summarize the universe: Dynamic maintenance of quantiles. In: Bernstein P, Ioannidis Y, Ramakrishnan R, eds. Proc. of the 28th Int'l Conf. on Very Large Data Bases. Hong Kong: Morgan Kaufmann, 2002. 454~465.
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

金澈清,钱卫宁,周傲英.流数据分析与管理综述.软件学报,2004,15(8):1172-1181

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2003-09-01
  • 最后修改日期:2003-09-01
文章二维码
您是第19728452位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号