跳跃滤波:一种面向大数据治理的动态数据摘要设计
作者:
作者简介:

符鹏涛(1998-),男,硕士,CCF学生会员,主要研究领域为数据摘要技术,网络测量;赵翔(1986-),男,博士,教授,CCF高级会员,主要研究领域为知识图谱,先进数据分析;罗来龙(1991-),男,博士,副研究员,CCF高级会员,主要研究领域为数据摘要技术,分布式网络系统;李尚森(1997-),男,博士生,主要研究领域为网络测量,软件定义网络,数据摘要技术;郭得科(1980-),男,博士,教授,博士生导师,CCF杰出会员,主要研究领域为网络计算与系统,分布式计算与系统,网络空间安全,边缘计算,软件定义网络,移动计算,网络大数据;王怀民(1962-),男,博士,教授,博士生导师,中国科学院院士,CCF会士,主要研究领域为分布计算,软件技术,云际计算,群体智能.

通讯作者:

罗来龙,luolailong09@nudt.edu.cn;郭得科,dekeguo@nudt.edu.cn

基金项目:

国家自然科学基金(U19B2024,62002378,61772544);国防科技大学科研基金(ZK20-30)


Jump Filter: Dynamic Sketch Design for Big Data Governance
Author:
  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [41]
  • |
  • 相似文献 [20]
  • | | |
  • 文章评论
    摘要:

    随着信息技术的迅速发展,数据体量维持指数增长,数据价值挖掘困难,这为数据采集、清洗、存储、共享等数据生命周期中各环节的高效管控带来极大的挑战.数据摘要技术利用哈希表/矩阵/位向量对数据的频数、基数、成员关系等核心基础特性进行追踪,使得数据摘要自身成为元数据,并在共享、传输、更新等场景得到广泛应用.大数据的快速流转特性更是催生了动态数据摘要技术.现有的动态数据摘要技术通过动态维护链状或树状结构的概率数据结构列表,具有其容量随数据流大小而扩增或缩减的优势,然而也存在空间开销过大以及时间开销随数据基数增加而增长的缺陷.基于先进的跳跃一致性哈希理论,设计了一种面向大数据治理的动态数据摘要技术.该方法可以同时实现随数据基数线性增长的空间开销以及数据处理分析常数级别的时间开销,能够有效地支撑要求苛刻的多种大数据处理分析任务.在多种合成和真实数据集上,通过与传统方法实验对比,验证了所提方法的有效性和高效性.

    Abstract:

    With the rapid development of information technology, the volume of data maintains an exponential growth, and the value of data is hard to mine. It brings significant challenges to the efficient management and control of each link in the data life cycle, such as data collection, cleaning, storage, and sharing. Sketch uses a hash table/matrix/bit vector to track the core characteristics of data, such as frequency, cardinality, membership, etc. This mechanism makes sketch itself metadata which has been widely used in the sharing, transmission, update and other scenarios. The rapid flow characteristic of big data has spawned the dynamic sketches. The existing dynamic sketches have the advantage of expanding or shrinking in capacity with the size of the data stream by dynamically maintaining a list of probabilistic data structures in a chain or tree structure. However, there are defects of excessive space overhead and time overhead increasing with the increase of the dataset cardinality. This study designs a dynamic sketch for big data governance based on the advanced jump consistent hash. This method can simultaneously realize the space overhead that grows linearly with the dataset cardinality and the constant time overhead of data processing and analysis, effectively supporting the demanding big data processing and analysis tasks for big data governance. The validity and efficiency of the proposed method are verified by comparing it with traditional methods on various datasets, including synthetic and natural datasets.

    参考文献
    [1] Soares S.Big Data Governance:An Emerging Imperative.Boise:MC Press, 2012.3-286.
    [2] Lü WF, Zheng ZM, Tong YX, Zhang RS, Wei SY, Li WH.Intelligent system for distributed social governance based on big data.Ruan Jian Xue Bao/Journal of Software, 2022, 33(3):931-949(in Chinese with English abstract).http://www.jos.org.cn/1000-9825/6455.htm[doi:10.13328/j.cnki.jos.006455]
    [3] Zhang MW, Huang JJ, Han L.Range-based multi-keyword searchable scheme with privacy protection in e-healthcare cloud systems.Ruan Jian Xue Bao/Journal of Software, 2021, 32(10):3266-3282(in Chinese with English abstract).http://www.jos.org.cn/1000-9825/6086.htm[doi:10.13328/j.cnki.jos.006086]
    [4] Bloom BH.Space/Time trade-offs in Hash coding with allowable errors.Communications of the ACM, 1970, 13(7):422-426.
    [5] Bender MA, Farach-Colton M, Johnson R, Kraner R, Kuszmaul BC, Medjedovic D, Montes P, Shetty P, Spillane RP, Zadok Z.Don't thrash:How to cache your hash on flash.Proc.of the VLDB Endowment, 2012, 5(11):1627-1637.
    [6] Fan B, Andersen D, Kaminsky M, Mitzenmacher M.Cuckoo filter:Practically better than bloom.In:Proc.of the CoNEXT.2014.75-88.
    [7] Ozisik AP, Andresen G, Levine BN, Tapp D, Bissias G, Katkuri S.Graphene:Efficient interactive set reconciliation applied to blockchain propagation.In:Proc.of the SIGCOMM.2019.303-317.
    [8] Maggs BM, Sitaraman RK.Algorithmic nuggets in content delivery.ACM SIGCOMM Computer Communication Review, 2015, 45(3):52-66.
    [9] Li SS, Luo LL, Guo DK.Sketch for traffic measurement:design optimization application and implementation.arXiv:2012.07214, 2020.
    [10] Yang T, Jiang J, Liu P, Huang Q, Gong JZ, Zhou Y, Miao R, Li XM, Uhlig S.Elastic sketch:Adaptive and fast network-wide measurements.In:Proc.of the SIGCOMM.2018.561-575.
    [11] Mun JH, Lim H.New approach for efficient IP address lookup using a bloom filter in trie-based algorithms.IEEE Trans.on Computers, 2015, 65(5):1558-1565.
    [12] Druschel P, Rowstron A.Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility.In:Proc.of the SOSP.2001.188-201.
    [13] Luo LL, Guo DK, Zhao YW, Rottenstreich O, Ma RTB, Luo XS.MCFsyn:A multi-party set reconciliation protocol with the marked cuckoo filter.IEEE Trans.on Parallel and Distributed Systems, 2021, 32(11):2705-2718.
    [14] Liu LT, Shen YL, Yan YB, Yang T, Shahzad M, Cui B, Xie GG.Sf-Sketch:A two-stage sketch for data streams.IEEE Trans.on Parallel and Distributed Systems, 2020, 31(10):2263-2276.
    [15] Tong D, Prasanna VK.Sketch acceleration on FPGA and its applications in network anomaly detection.IEEE Trans.on Parallel and Distributed Systems, 2017, 29(4):929-942.
    [16] Zhong Z, Yan S, Li ZK, Tan DC, Yang T, Cui B.BurstSketch:Finding bursts in data streams.In:Proc.of the SIGMOD.2021.2375-2383.
    [17] Paul D, Peng YQ, Li FF.Bursty event detection throughout histories.In:Proc.of the ICDE.2019.1370-1381.
    [18] Yang T, Zhang HW, Li JY, Gong JZ, Uhlig S, Chen SG, Li XM.HeavyKeeper:An accurate algorithm for finding top-k elephant flows.IEEE/ACM Trans.on Networking, 2019, 27(5):1845-1858.
    [19] Guo DK, Wu J, Chen HH, Yuan Y, Luo XS.The dynamic Bloom filters.IEEE Trans.on Knowledge and Data Engineering, 2010, 22(1):120-133.
    [20] Guo DK, Li M.Set reconciliation via counting Bloom filters.IEEE Trans.on Knowledge and Data Engineering, 2013, 25(10):2367-2380.
    [21] Chen HH, LiaoY L, Jin H, Wu J.The dynamic cuckoo filter.In:Proc.of the ICNP.2017.1-10.
    [22] Luo LL, Guo DK, Rottenstreich O, Ma RTB, Luo XS, Ren BB.The consistent cuckoo filter.In:Proc.of the INFOCOM.2019.712-720.
    [23] Karger D, Lehman E, Leighton F, Panigrahy R, Levine M, Lewin D.Consistent hashing and random trees:Distributed caching protocols for relieving hot spots on the World Wide Web.In:Proc.of the STOC.1997.654-663.
    [24] Zhang F, Chen HH, Jin H, Reviriego P.The logarithmic dynamic Cuckoo filter.In:Proc.of the ICDE.2021.948-959.
    [25] Lamping J, Veach E.A fast, minimal memory, consistent hash algorithm.arXiv:1406.2294, 2014.
    [26] Luo LL, Guo DK, Ma RTB, Rottenstreich O, Luo XS.Optimizing Bloom filter:Challenges, solutions, and comparisons.IEEE Communications Surveys & Tutorials, 2018, 21(2):1912-1949.
    [27] Tarkoma S, Rothenberg CE, Lagerspetz E.Theory and practice of Bloom filters for distributed systems.IEEE Communications Surveys & Tutorials, 2011, 14(1):131-155.
    [28] Geravand S, Ahmadi M.Bloom filter applications in network security:A state-of-the-art survey.Computer Networks, 2013, 57(18):4047-4064.
    [29] Broder A, Mitzenmacher M.Network applications of Bloom filters:A survey.Internet Mathematics, 2004, 1(4):485-509.
    [30] Fan B, Andersen DG, Kaminsky M.MemC3:Compact and concurrent MemCache with dumber caching and smarter Hashing.In:Proc.of the USENIX NSDI.2013.371-384.
    [31] Reviriego P, Pontarelli S.Perfect Cuckoo filters.In:Proc.of the CoNEXT.2021.205-211.
    [32] Mitzenmacher M, Pontarelli S, Reviriego P.Adaptive Cuckoo filters.In:Proc.of the ALENEX.2018.36-47.
    [33] Eppstein D.Cuckoo filter:Simplification and analysis.In:Proc.of the SWAT, Vol.53.2016.8:1-8:12.
    [34] Fu PT, Luo LL, Li SS, Guo DK, Cheng GY, Zhou Y.The vertical Cuckoo filters:A family of insertion-friendly sketches for online applications.In:Proc.of the ICDCS.2021.57-67.
    [35] Wang MM, Zhou MX, Shi SQ, Qian C.Vacuum filters:More space-efficient and faster replacement for Bloom and Cuckoo filters.Proc.of the VLDB Endowment, 2019, 13(2):197-210.
    [36] https://github.com/fptjy/JumpFilter
    [37] https://mawi.wide.ad.jp/
    [38] http://snap.stanford.edu/data/
    附中文参考文献
    [2] 吕卫锋, 郑志明, 童咏昕, 张瑞升, 魏淑越, 李卫华.基于大数据的分布式社会治理智能系统.软件学报, 2022, 33(3):931-949.http://www.jos.org.cn/1000-9825/6455.htm[doi:10.13328/j.cnki.jos.006455]
    [3] 张明武, 黄嘉骏, 韩亮.医疗大数据隐私保护多关键词范围搜索方案.软件学报, 2021, 32(10):3266-3282.http://www.jos.org.cn/1000-9825/6086.htm[doi:10.13328/j.cnki.jos.006086]
    引证文献
引用本文

符鹏涛,罗来龙,郭得科,赵翔,李尚森,王怀民.跳跃滤波:一种面向大数据治理的动态数据摘要设计.软件学报,2023,34(3):1193-1212

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2022-05-14
  • 最后修改日期:2022-07-29
  • 在线发布日期: 2022-10-26
  • 出版日期: 2023-03-06
文章二维码
您是第19728008位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号