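Big Data Stream Computing: Technologies and Instances
Authors: Sun Dawei, Zhang Guangyan, Zheng Weimin
Funding:

National Natural Science Foundation of China (61170008, 61272055); National Basic Research Program of China (973 Program) (2014CB340402); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172012K12)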
    Abstract:

    Big data computing takes two main forms: batch computing and stream computing. Batch computing systems for big data have been studied and discussed fairly thoroughly, whereas research results and practical experience remain comparatively scarce on how to build stream computing systems that deliver low latency and high throughput while running continuously and reliably, which makes this a pressing open problem. This paper first summarizes three application scenarios of stream computing (business intelligence, marketing, and public service) and the characteristics that streaming big data exhibits in such typical application domains: real-time arrival, volatility, burstiness, irregularity (unordered arrival), and infinity. It then sets out the key technical features that an ideal big data stream computing system should have in system structure, data transmission, application interfaces, and high availability. Next, it analyzes and compares five typical open-source big data stream computing systems. Finally, it discusses the technical challenges that such systems face in scalability, fault tolerance, state consistency, load balancing, and data throughput.

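Cite this article

Sun DW, Zhang GY, Zheng WM. Big data stream computing: Technologies and instances. Ruan Jian Xue Bao/Journal of Software, 2014,25(4):839-862 (in Chinese with English abstract).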

History
  • Received: 2013-09-07
  • Last revised: 2013-12-16
  • Published online: 2014-01-24