Distributed Time Series Similarity Search Method Based on Key-value Data Stores
Author:
Affiliation:

  • Article
  • | |
  • Metrics
  • |
  • Reference [36]
  • |
  • Related
  • |
  • Cited by
  • | |
  • Comments
    Abstract:

    Time series similarity search is one of the most basic operations for temporal data analysis, which has various application scenarios. Existing distributed methods face the problems of dimension explosion, too large scan range, and time-consuming similarity calculation. To this end, this study proposes a distributed time series similarity search algorithm KV-Search. First, time series are segmented into blocks and stored in the key-value database, which solves the problem of high and growing dimension. Second, the lower bound is calculated based on Chebyshev distance, and the invalid data is filtered out in advance using key value range scanni ng, which reduce the data transmission and calculation overhead. Third, a block-based time series representation is used to calculate the lower bound of distance, which avoids the calculation of higher dimensional real data. KV-Search is implemented based on HBase, and a set of extensive experiments are conducted using both real and synthetic time series data. The results show that the proposed KV-Search is superior to benchmark experiment in efficiency and scalability.

    Reference
    [1] Zheng Y, Capra L, Wolfson O, Yang H. Urban computing:Concepts, methodologies, and applications. ACM TIST, 2014, 5(3): 1-55.
    [2] Yuan JD, Wang ZH, Sun YG, Zhang W. K-nearest neighbor classifier for complex time series. Ruan Jian Xue Bao/Journal of Software, 2017, 28(11):3002-3017(in Chinese with English abstract).[doi:10.13328/j.cnki.jos.005331]
    [3] Feng YC, Jiang T, Li GH, Zhu H. Underlying techniques of efficient similarity search on time series. Chinese Journal of Computers, 2009, 32(11):2107-2122(in Chinese with English abstract).
    [4] Ma RQ, Song BY, Ding LL, Wang JL. Dynamic matrix clustering method for time series events. Journal of Frontiers of Computer Science and Technology, 2020, 15(3):468-477(in Chinese with English abstract).
    [5] Wang MX, Ding XO, Wang HZ, Li JZ. Correlation-based method for tracing multi-dimensional time series data anomalies. Journal of Frontiers of Computer Science and Technology, 2020, 15(11):2142-2150(in Chinese with English abstract).
    [6] Ding XO, Yu SJ, Wang MX, Wang HZ, Gao H, Yang DH. Anomaly detection on industrial time series based on correlation analysis. Ruan Jian Xue Bao/Journal of Software, 2020, 31. 3):726-747(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5907.htm[doi:10.13328/j.cnki.jos.005907]
    [7] Li R, He H, Wang R, Huang Y, Liu J, Ruan S, He T, Bao J, Zheng Y. Just:JD urban spatio-temporal data engine. In:Proc. of the IEEE 36th Int'l Conf. on Data Engineering. IEEE, 2020. 1558-1569.
    [8] Li R, He H, Wang R, Ruan S, Sui Y, Bao J, Zheng Y. Trajmesa:A distributed NoSQL storage engine for big trajectory data. In: Proc. of the IEEE 36th Int'l Conf. on Data Engineering. IEEE, 2020. 2002-2005.
    [9] Li R, He H, Wang R, Ruan S, He T, Bao J, Zhang J, Hong L, Zheng Y. TrajMesa:A distributed NoSQL-based trajectory data management system. IEEE Trans. on Knowledge and Data Engineering, 2021. https://ieeexplore.ieee.org/abstract/document/9430714
    [10] Yi XW, Zheng Y, Zhang JB, Li TR. ST-MVL:Filling missing values in geo-sensory time series data. In:Proc. of the 15th Int'l Joint Conf. on Artificial Intelligence (IJCAI 2016). AAAI, 2016. 2704-2710.
    [11] Cha SH. Comprehensive survey on distance/similarity measures between probability density functions. Int'l Journal of Mathematical Models and Methods in Applied Sciences, 2007, 1. 4):300-307.
    [12] Baskar SS, Arockiam L. C-LAS relief An improved feature selection technique in data mining. Int'l Journal of Computer Applications, 2013, 83(13):33-36.
    [13] Yagoubi DE, Akbarinia R, Masseglia F, Palanas T. Massively distributed time series indexing and querying. IEEE Trans. on Knowledge and Data Engineering, 2018, 32(1):108-120.
    [14] Yang K, Ding X, Zhang Y, Chen L, Zheng B, Gao Y. Distributed similarity queries in metric spaces. Data Science and Engineering, 2019, 4(2):93-108.
    [15] Yang ZX, Tang N, Tang Y, Pan MM, Li DD, Ye XP. Temporal index and query based on timing partition. Ruan Jian Xue Bao/Journal of Software, 2020, 31. 11):3519-3539(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5826.htm[doi: 10.13328/j.cnki.jos.005826]
    [16] Morse MD, Patel JM. An efficient and accurate method for evaluating time series similarity. In:Proc. of the SIGMOD Conf. ACM, 2007. 569-580.
    [17] Wu WCH, Yeh MY, Pei J. Random error reduction in similarity search on time series:A statistical approach. In:Proc. of the ICDE. 2012. 858-869.
    [18] Jin S, Keogh E. iSAX:Indexing and mining terabyte sized time series. In:Proc. of the 14th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. 2008. 623-631.
    [19] Zoumpatianos K, Idreos S, Palpanas T. Indexing for interactive exploration of big data series. In:Proc. of the 2014 ACM SIGMOD Int'l Conf. on Management of Data. ACM, 2014. 1555-1566.
    [20] Peng B, Fatourou P, Paris PT. The next destination for fast data series indexing and query answering. In:Proc. of the 2018 IEEE Int'l Conf. on Big Data (Big Data). 2018. 791-800.
    [21] Michele L, Palpanas T. Scalable, variable-length similarity search in data series:The ULISSE approach. Proc. of the VLDB Endowment, 2018, 11. 13):2236-2248.
    [22] Huijse P, Estevez PA, Protopapapas P, Principe J, Zegers P. Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE Computational Intelligence Magazine, 2014, 9(3):27-39.
    [23] Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via hashing. In:Proc. of the 25th Int'l Conf. on Very Large Data Bases (VLDB). Morgan Kaufmann Publishers Inc., 1999. 518-529.
    [24] Alghamdi N, Zhang L, Zhang H, Rundensteiner E, Eltabakh M. ChainLink:Indexing big time series data for long subsequence matching. In:Proc. of the 2020 IEEE ICDE. IEEE, 2020. 529-540.
    [25] Ur RMH, Liew CS, Abbas A, Jayaraman P, Wah T, Khan S. Big data reduction methods:A survey. Data Science and Engineering, 2016, 1. 4):265-284.
    [26] Popivanov I, Miller RJ. Similarity search over time-series data using wavelets. In:Proc. of the 18th Int'l Conf. on Data Engineering. IEEE, 2002. 212-221.
    [27] Peng JL, Wang HZ, Li JZ, Gao H. Set-based similarity search for time series. In:Proc. of the SIGMOD Conf. ACM, 2016. 2039-2052.
    [28] Vora MN. Hadoop-HBase for large-scale data. In:Proc. of the 2011 ICCS. IEEE, 2011. 601-605.
    [29] Borthakur D. HDFS architecture guide. 2008. https://www.researchgate.net/publication/242787758_HDFS_architecture_guide
    附中文参考文献:
    [2] 原继东, 王志海, 孙艳歌, 张伟. 面向复杂时间序列的k近邻分类器. 软件学报, 2017, 28(11):3002-3017. http://www.jos.org. cn/1000-9825/5331.htm[doi:10.13328/j.cnki.jos.005331]
    [3] 冯玉才, 蒋涛, 李国徽, 朱虹. 高效时序相似搜索技术. 计算机学报, 2009, 32(11):2107-2122.
    [4] 马瑞强, 宋宝燕, 丁琳琳, 王俊陆. 面向时间序列事件的动态矩阵聚类方法. 计算机科学与探索, 2020, 15(3):468-477.
    [5] 王沐贤, 丁小欧, 王宏志, 李建中. 基于相关性的多维时序数据异常溯源方法. 计算机科学与探索, 2020, 15(11):2142-2150.
    [6] 丁小欧, 于晟健, 王沐贤, 王宏志, 高宏, 杨东华. 基于相关性分析的工业时序数据异常检测. 软件学报, 2020, 31. 3): 726-747. http://www.jos.org.cn/1000-9825/5907.htm[doi:10.13328/j.cnki.jos.005907]
    [15] 杨佐希, 汤娜, 汤庸, 潘明明, 李丁丁, 叶小平. 基于时序分区的时态索引与查询. 软件学报, 2020, 31. 11):3519-3539. http://www.jos.org.cn/1000-9825/5826.htm[doi:10.13328/j.cnki.jos.005826]
    Related
    Cited by
Get Citation

俞自生,李瑞远,郭阳,蒋忠元,鲍捷,郑宇.基于键值存储的分布式时序相似性搜索方法.软件学报,2022,33(3):950-967

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:June 30,2021
  • Revised:July 31,2021
  • Online: October 21,2021
  • Published: March 06,2022
You are the first2033174Visitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063