Distributed Database Diagnosis Method for Compound Anomalies
Authors: 向清风, 邵蓥侠, 徐泉清, 杨传辉
Corresponding author:

邵蓥侠, E-mail: shaoyx@bupt.edu.cn

CLC number:

TP311

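Funding:

National Natural Science Foundation of China (62272054, 62192784); National Science and Technology Major Project of New Generation Artificial Intelligence (2022ZD0116315); Beijing Nova Program (20230484319); Xiaomi Young Scholar Program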


    Abstract:

    Databases are a foundational component of computer services, but performance anomalies during operation can degrade the quality of the business services that depend on them. Diagnosing such anomalies has therefore become an active topic in both industry and academia. A series of automated diagnosis methods has been proposed in recent years; they analyze the runtime state of a database and determine the anomaly type of the database as a whole. As data volumes grow, however, distributed databases, in which multiple server nodes jointly form one database, are becoming an increasingly popular solution. In this setting, existing methods struggle to locate node-level anomalies, cannot identify compound anomalies that occur on multiple nodes, and do not capture the complex performance influence relationships between nodes, so their diagnostic capability is limited. To address these problems, this study proposes DistDiagnosis, a diagnosis method for compound anomalies in distributed databases. DistDiagnosis models the anomalous state of a distributed database with a Compound Anomaly Graph, which represents the anomalies of each node while capturing the correlations between nodes. It then applies a node correlation-aware root cause anomaly ranking method that locates root cause anomalies according to each node's influence on the database as a whole. Anomaly test cases covering different scenarios are constructed on OceanBase, a domestically developed distributed database. Experimental results show that DistDiagnosis outperforms other advanced baselines: its AC@1, AC@3, and AC@5 reach up to 0.97, 0.98, and 0.98, and across the diagnostic scenarios it improves on the second-best method by up to 5.20%, 5.45%, and 4.46%, respectively.
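The abstract reports AC@1, AC@3, and AC@5 without defining them. As a minimal sketch, assuming the definition commonly used in root cause localization work (for each test case, the share of true root causes that appear among the top-k ranked candidates, normalized by min(k, number of true root causes), then averaged over cases), the metric could be computed as follows. The function name `ac_at_k`, the case layout, and the anomaly labels such as `node2:cpu` are hypothetical and are not taken from the paper.

```python
def ac_at_k(cases, k):
    """Average top-k accuracy over diagnosis cases.

    Each case is a pair (ranking, truth): a list of candidate root causes
    ordered from most to least suspicious, and the set of true root causes.
    Assumes every case has at least one true root cause.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not cases:
        raise ValueError("at least one case is required")
    total = 0.0
    for ranking, truth in cases:
        truth = set(truth)
        # Count true root causes that appear among the first k candidates.
        hits = len(set(ranking[:k]) & truth)
        # A compound anomaly may have several root causes; normalize so
        # that a perfect top-k list scores exactly 1.
        total += hits / min(k, len(truth))
    return total / len(cases)


# Hypothetical diagnoses on a three-node cluster (labels are illustrative).
cases = [
    (["node2:cpu", "node1:io", "node3:lock"], {"node2:cpu"}),
    (["node1:io", "node3:lock", "node2:cpu"], {"node3:lock", "node2:cpu"}),
    (["node3:net", "node1:io", "node2:cpu"], {"node1:io"}),
    (["node1:mem", "node2:cpu", "node3:io"], {"node3:io"}),
]

for k in (1, 2, 3):
    print(f"AC@{k} = {ac_at_k(cases, k):.3f}")
```

Under this assumed definition, the second case illustrates why the normalization matters for compound anomalies: with two true root causes, finding one of them in the top 2 scores 0.5 rather than 1.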

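Cite this article

向清风, 邵蓥侠, 徐泉清, 杨传辉. Distributed database diagnosis method for compound anomalies. Ruan Jian Xue Bao/Journal of Software, 2025, 36(3): 1022–1039 (in Chinese with English abstract).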

History
  • Received: 2024-05-27
  • Last revised: 2024-07-16
  • Published online: 2024-09-13