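Multimodal Data Encoding and Compression in Apache IoTDB
Authors: He Wendi, Xia Tianrui, Song Shaoxu, Huang Xiangdong, Wang Jianmin
Funding:

National Key Research and Development Program of China (2021YFB3300500); National Natural Science Foundation of China (62232005, 62021002, 62072265, 92267203); Beijing National Research Center for Information Science and Technology Young Innovation Fund (BNR2022RC01011)
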
    Abstract:

    Time-series data are widely used in industrial manufacturing, meteorology, shipping, electric power, vehicles, finance, and other fields, which has driven the rapid development of time-series database management systems. As data volumes grow and data modalities diversify, efficient storage and management become critical, and data encoding and compression become increasingly important. Existing encoding methods and systems either do not fully account for the characteristics of data in different modalities or do not apply established time-series processing methods to the encoding problem. This study comprehensively describes the multimodal data encoding and compression methods in the Apache IoTDB time-series database system and their system implementation, particularly for application scenarios such as the industrial Internet of Things. The encoding methods cover multiple data modalities, including timestamp data, numerical data, Boolean data, frequency-domain data, and text data, and exploit the characteristics of each modality, such as the approximately equal intervals between consecutive timestamps in the timestamp modality, to design targeted encoding schemes. Data quality issues that may arise in practical applications are also taken into account in the encoding algorithms. Experimental evaluation and analysis on multiple datasets, at both the encoding-algorithm level and the system level, validate the effectiveness of the proposed encoding and compression methods and their system implementation.
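
    The abstract notes that consecutive timestamps tend to be spaced at approximately equal intervals. The sketch below illustrates, under stated assumptions, why that property helps compression: it applies generic second-order differencing (delta-of-delta) to a sequence and compares the bit width needed before and after. This is not claimed to be the paper's algorithm or the Apache IoTDB implementation; the function names, the zigzag mapping, and the sample timestamps are all hypothetical choices made for illustration.

```python
from typing import List


def delta_of_delta(timestamps: List[int]) -> List[int]:
    """Return the second-order differences of a timestamp sequence."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def restore(first: int, first_delta: int, residuals: List[int]) -> List[int]:
    """Rebuild the timestamps from the first value, first delta, and residuals."""
    out = [first, first + first_delta]
    delta = first_delta
    for r in residuals:
        delta += r
        out.append(out[-1] + delta)
    return out


def zigzag(value: int) -> int:
    """Map a signed integer to a non-negative one so small magnitudes stay small."""
    return 2 * value if value >= 0 else -2 * value - 1


def max_bit_width(values: List[int]) -> int:
    """Bits needed for the largest zigzag-mapped value (at least 1)."""
    return max(1, max(zigzag(v) for v in values).bit_length())


# Hypothetical sensor timestamps in milliseconds: nominally one reading
# every 1000 ms, with a few milliseconds of jitter.
timestamps = [
    1_700_000_000_000, 1_700_000_001_000, 1_700_000_002_001,
    1_700_000_003_000, 1_700_000_004_000, 1_700_000_005_002,
]

residuals = delta_of_delta(timestamps)
raw_bits = max_bit_width(timestamps)        # width if values were stored as-is
residual_bits = max_bit_width(residuals)    # width of the second-order residuals
restored = restore(timestamps[0], timestamps[1] - timestamps[0], residuals)

print(residuals, raw_bits, residual_bits, restored == timestamps)
```

    Because the intervals are nearly constant, the residuals stay close to zero and fit in a few bits each, whereas the raw millisecond timestamps need dozens of bits. A real encoder would also have to store the first timestamp and first delta, and handle irregular gaps, which this sketch leaves out.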

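Cite this article

He Wendi, Xia Tianrui, Song Shaoxu, Huang Xiangdong, Wang Jianmin. Multimodal Data Encoding and Compression in Apache IoTDB. Journal of Software (软件学报), 2024, 35(3): 1173-1193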

History
  • Received: 2023-07-17
  • Last revised: 2023-09-05
  • Published online: 2023-11-08
  • Publication date: 2024-03-06