Microservice Resilience Risk Identification and Analysis Based on Chaos Engineering
Authors:
Author biographies:

YIN Kang-Lin (1992-), male, Ph.D., CCF student member. His research interests include software engineering and intelligent IT operations (AIOps).
DU Qing-Feng (1968-), male, Ph.D., professor, doctoral supervisor. His research interests include software engineering and quality control, machine learning, and intelligent IT operations (AIOps).

Corresponding author:

DU Qing-Feng, E-mail: du_cloud@tongji.edu.cn

Fund projects:

National Natural Science Foundation of China (U1934212); National Key Research and Development Program of China (2020YFB2103300)



Abstract:

Microservice architecture has become the mainstream architectural pattern for Internet applications in recent years. However, compared with traditional software architectures, the more complex deployment structure of microservice architecture exposes a system to more potential threats that can cause failures, and the failure symptoms of microservice systems are also more diverse. Since traditional software measurements such as reliability can no longer fully capture a microservice system's ability to cope with failures, microservice developers have begun to use the term "resilience" to describe this ability. To improve the resilience of a microservice system, developers usually need to design coping mechanisms for specific system environment disturbances. How to determine whether a given environment disturbance is a risk factor affecting microservice resilience, and how to find as many of these potential resilience risks as possible before the system is released, are open questions in microservice development. Building on the microservice resilience measurement model proposed in the authors' previous research and combining it with chaos engineering, this study proposes resilience risk identification and analysis approaches for microservice systems. The risk identification approach continuously injects random environment disturbances into the microservice system and observes the resulting changes in service performance to discover potential resilience risks, which greatly reduces the human effort of software risk identification. For each identified resilience risk, by collecting system performance monitoring data during the chaos engineering experiments, the risk analysis approach applies a causality search algorithm to construct influence chains among system performance metrics and presents the most likely chains to operators as a reference for further analysis. Finally, a case study on a microservice system demonstrates the effectiveness of the proposed risk identification and analysis approaches.
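The identification loop summarized in the abstract (inject a random disturbance, watch service performance, flag significant deviations) can be sketched in a few lines. The sketch below is a hypothetical illustration only: the disruption catalogue, the latency-based steady-state check, the 3-sigma tolerance, and the stub monitoring function are all assumptions for the example, not details taken from the paper.

```python
import random
import statistics

# Illustrative disruption catalogue; real chaos tooling injects faults such
# as CPU stress, added network latency, or container kills.
DISRUPTIONS = ["cpu_stress", "memory_stress", "network_delay", "pod_kill"]

def measure_latency_ms(disruption=None):
    """Stub monitor: baseline latency around 100 ms, with some disruptions
    degrading it. A real setup would query a monitoring backend instead."""
    latency = random.gauss(100, 5)
    if disruption in ("network_delay", "pod_kill"):
        latency += random.gauss(80, 10)  # simulated performance degradation
    return latency

def identify_risks(rounds=40, tolerance=3.0):
    """Flag disruptions whose observed impact on service performance
    repeatedly deviates from the steady-state baseline."""
    baseline = [measure_latency_ms() for _ in range(50)]
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    hits = {}
    for _ in range(rounds):
        d = random.choice(DISRUPTIONS)  # one random chaos experiment
        if abs(measure_latency_ms(d) - mu) > tolerance * sigma:
            hits[d] = hits.get(d, 0) + 1  # significant deviation observed
    # Require repeated evidence before declaring a resilience risk.
    return {d for d, n in hits.items() if n >= 2}

print(identify_risks())
```

In the paper's approach the steady-state check is based on the authors' resilience measurement model rather than a single latency threshold; this sketch only conveys the overall identify-by-perturbation idea.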


Cite this article:

YIN Kang-Lin, DU Qing-Feng. Microservice resilience risk identification and analysis based on chaos engineering. Journal of Software, 2021,32(5):1231-1255 (in Chinese).

History
  • Received: 2020-07-10
  • Revised: 2020-12-15
  • Published online: 2021-02-07
  • Published in print: 2021-05-06