Abstract: A large microservice system usually contains many services with complex dependencies among them. A failure in one component may propagate widely and cause large-scale service anomalies. To ensure system quality, it is critical to identify abnormalities and locate their root causes effectively. Invocation-chain analysis is a commonly used method for service performance modeling and anomaly detection. Existing techniques are mostly data-driven and face many challenges of big data analysis, such as diversified chain structures, a vast number of instances, and imbalanced datasets in which many structures have only a small number of samples. To address these problems, this study proposes a model-based approach that builds high-level abstractions of method invocation models based on control-flow analysis. Instances of the various invocation-chain structures are clustered into method invocation models, which greatly reduces the number of chain structures to be analyzed. A performance model is built for each method invocation model, and thresholds are defined based on the execution time predicted by that performance model. Outliers in the trace logs are thus identified as anomaly candidates. Experiments were conducted on real industry logs from the Baidu PhoenixNest Ads system, using a one-day log with over 1.7 billion records. The experimental results show that, compared with purely data-driven sequence analysis methods, the proposed model-based approach greatly reduces the number of invocation-chain structures while effectively analyzing and detecting microservice performance anomalies and their root causes.
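The pipeline summarized above (cluster chain instances into models, fit a performance model per cluster, flag traces that exceed a threshold) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the method evaluated in this study: `model_key` collapses consecutive repeated calls as a crude stand-in for control-flow abstraction, the "performance model" is simply the median execution time of a cluster, and the threshold is the median plus `k` times the median absolute deviation. The trace tuples, method names, and the parameter `k` are hypothetical.

```python
from collections import defaultdict
from statistics import median


def model_key(chain):
    """Map a raw invocation chain to a model key.

    Assumption: consecutive repeats of the same method (e.g. loop
    iterations) are collapsed into one call.
    """
    key = []
    for method in chain:
        if not key or key[-1] != method:
            key.append(method)
    return tuple(key)


def group_by_model(traces):
    """Cluster (trace_id, chain, duration_ms) records by model key."""
    groups = defaultdict(list)
    for trace_id, chain, duration_ms in traces:
        groups[model_key(chain)].append((trace_id, duration_ms))
    return groups


def detect_outliers(traces, k=3.0):
    """Return ids of traces slower than their model's threshold."""
    flagged = []
    for members in group_by_model(traces).values():
        durations = [d for _, d in members]
        predicted = median(durations)  # toy "performance model"
        mad = median([abs(d - predicted) for d in durations])
        # Floor the spread at 1 ms so tiny clusters do not flag everything.
        threshold = predicted + k * max(mad, 1.0)
        flagged.extend(tid for tid, d in members if d > threshold)
    return flagged


# Hypothetical sample: four raw chain structures that fall into two models.
sample_traces = [
    ("t1", ["a", "b", "b", "c"], 10.0),
    ("t2", ["a", "b", "c"], 11.0),
    ("t3", ["a", "b", "b", "b", "c"], 12.0),
    ("t4", ["a", "b", "c"], 95.0),
    ("t5", ["a", "d"], 5.0),
]

print(len(group_by_model(sample_traces)))  # 2
print(detect_outliers(sample_traces))      # ['t4']
```

In this toy input, clustering halves the number of structures to analyze, and only the 95 ms trace exceeds its cluster's threshold of 14.5 ms. A real system would replace the median with a model that predicts execution time from the structure of each method invocation model.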