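AmazeMap: Microservices Fault Localization Method Based on Multi-level Impact Graph
Authors: Li Yaxiao (李亚晓), Li Qingshan (李青山), Wang Lu (王璐), Jiang Yuxuan (姜宇轩)
Author biographies:

Li Yaxiao (born 2000), male, PhD student, CCF student member; research interests: intelligent operations and maintenance, microservices. Li Qingshan (born 1973), male, PhD, professor, doctoral supervisor, CCF distinguished member; research interests: domestic open-source software, software architecture, self-adaptive software evolution, intelligent operations and maintenance, intelligent software engineering. Wang Lu (born 1991), female, PhD, associate professor, CCF senior member; research interests: intelligent operations and maintenance, microservices and cloud native, software evolution. Jiang Yuxuan (born 1999), male, master's student; research interests: microservice fault diagnosis.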

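Corresponding author:

Wang Lu, E-mail: wanglu@xidian.edu.cn

Funding:

National Natural Science Foundation of China (62372351, U21B2015); Young Talent Support Program of the Shaanxi Association for Science and Technology (20220113)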


    Abstract:

    A microservice system contains many complex service dependencies and componentized modules, so a failure in one service often propagates to one or more related services, which makes fault localization increasingly difficult. Detecting system faults effectively and locating their root causes quickly and accurately is therefore a key research topic in the microservices field. Existing studies generally build a fault relationship model by analyzing how faults affect services and metrics, but they suffer from insufficient use of operations and maintenance data, incomplete modeling of fault information, and coarse-grained root cause localization. To address these problems, this study proposes AmazeMap, which comprises a multi-level fault impact graph modeling method and a microservice fault localization method built on that graph. The modeling method mines the metric time series and trace data collected while the system is running and takes the relationships between different levels into account, which allows it to model fault information more comprehensively. The localization method narrows the scope of fault impact, searches for the root cause at both the service instance level and the metric level, and outputs the most probable root cause nodes together with a sequence of metrics. Experiments on an open-source benchmark microservice system and the AIOps challenge dataset evaluate AmazeMap in terms of effectiveness and efficiency and compare it with existing methods; the results confirm its effectiveness, accuracy, and efficiency.
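
    The abstract describes the localization pipeline only at a high level, so the following is a minimal illustrative sketch of the general idea, not the AmazeMap algorithm itself. It assumes a two-level structure (service instances connected by call edges taken from traces, each instance carrying its own metric time series), a simple z-score as the anomaly measure, and a heuristic that an anomalous instance is a root cause candidate when none of the instances it calls is anomalous. The names `zscore_anomaly`, `localize`, the threshold, and all sample data are hypothetical.

```python
from statistics import mean, pstdev


def zscore_anomaly(baseline, recent):
    """Absolute z-score of the recent mean against the baseline window."""
    sigma = pstdev(baseline)
    if sigma == 0:
        return 0.0 if mean(recent) == mean(baseline) else float("inf")
    return abs(mean(recent) - mean(baseline)) / sigma


def localize(calls, metrics, threshold=3.0):
    """Rank root cause candidates on a two-level (instance, metric) graph.

    calls:   list of (caller, callee) edges extracted from traces.
    metrics: {instance: {metric_name: (baseline_values, recent_values)}}.
    Returns (ranked_instances, {instance: metrics sorted by anomaly score}).
    """
    # Metric level: score every metric of every instance.
    scores = {
        inst: {name: zscore_anomaly(b, r) for name, (b, r) in ms.items()}
        for inst, ms in metrics.items()
    }
    # Narrow the scope: keep instances with at least one anomalous metric.
    anomalous = {
        inst for inst, s in scores.items()
        if max(s.values(), default=0.0) >= threshold
    }
    # Service level: collect the callees of each instance.
    downstream = {}
    for caller, callee in calls:
        downstream.setdefault(caller, set()).add(callee)
    # Assumed heuristic: a candidate has no anomalous callee to blame.
    candidates = [
        inst for inst in anomalous
        if not (downstream.get(inst, set()) & anomalous)
    ]
    # Highest anomaly score first; instance name breaks ties.
    ranked = sorted(candidates, key=lambda i: (-max(scores[i].values()), i))
    metric_ranking = {
        inst: sorted(scores[inst], key=scores[inst].get, reverse=True)
        for inst in ranked
    }
    return ranked, metric_ranking


# Hypothetical sample data: payment degrades and its callers slow down.
calls = [("frontend", "orders"), ("orders", "payment"), ("frontend", "catalogue")]
metrics = {
    "frontend": {"latency": ([100, 102, 98, 101, 99], [180, 175, 190])},
    "orders": {"latency": ([50, 52, 48, 51, 49], [120, 118, 125])},
    "payment": {
        "cpu": ([20, 21, 19, 20, 20], [85, 90, 88]),
        "latency": ([30, 31, 29, 30, 30], [95, 97, 99]),
    },
    "catalogue": {"latency": ([40, 41, 39, 40, 40], [40, 41, 40])},
}
ranked, metric_ranking = localize(calls, metrics)
print(ranked, metric_ranking)
```

    In this sample, `frontend` and `orders` look anomalous only because they depend on `payment`, so the sketch reports `payment` as the single candidate and lists its metrics in descending order of anomaly score. A real system would need a more robust anomaly detector and a propagation model that also covers faults spreading between co-located instances.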

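Cite this article

Li YX, Li QS, Wang L, Jiang YX. AmazeMap: Microservices fault localization method based on multi-level impact graph. Ruan Jian Xue Bao/Journal of Software, 2024, 35(7): 3115-3140.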

History
  • Received: 2023-09-08
  • Last revised: 2023-10-30
  • Published online: 2024-01-05
  • Publication date: 2024-07-06