Abstract:
Time series data are widely used in industrial manufacturing, meteorology, electricity, vehicles, and other fields, and this has driven the development of time series database management systems. As more database systems migrate to the cloud and end-cloud collaborative architectures become more common, the volume of data to be processed keeps growing. In scenarios such as end-cloud collaboration and massive time series workloads, short synchronization cycles, frequent data flushing, and similar factors generate a large number of short time series, which poses new challenges for database systems. Effective data management and compression methods can significantly improve storage performance and enable database systems to store massive numbers of time series. Apache TsFile is a columnar storage file format designed for time series workloads, and it plays an important role in database management systems such as Apache IoTDB. This paper describes the group compression and merging methods used in Apache TsFile to handle large numbers of short time series, particularly in applications with very many series such as the industrial Internet of Things. The group compression method takes the data characteristics of the short time series scenario into account: by grouping devices, it improves metadata utilization, reduces file index size, reduces the number of short time series, and significantly improves compression efficiency. Validation on real-world datasets shows that the grouping method significantly improves compression efficiency, read and write performance, and file merging, among other aspects, enabling better management of TsFiles in scenarios with short time series.