Abstract:As distributed computing and distributed systems are being widely applied in various areas, how to improve the efficiency of system operations to guarantee the stability and reliability of the services provided by these distributed systems have gained massive momentum from both academia and industry. However, system operation tasks are confronted with tough challenges due the large scale, the intricate structures and dependency, the continuous updating and concurrent service requests of distributed systems. Previous component-/node-/process-/thread-centric monitoring and tracing methods are not sufficient to support the system operation tasks such as fault diagnosis, performance optimization, and system understanding in a distributed system. To address this issue, distributed tracing is proposed and designed. Distributed tracing identifies all the events belonging to the same request and causally correlates these events. Distributed tracing technology precisely and fine-grainedly depicts the behavior of a distributed system in a service-request or workflow-centric way, which is critical to improve the efficiency of system operations. This paper presents a comprehensive survey of existing research work and application of distributed tracing technology. A research framework is proposed and existing research achievements in this field are compared and analyzed with this framework from four perspectives which are acquiring tracing data, identifying the events from the same request, determining the causal relationships among these events, and representing the request execution path. Then the research work of applying distributed tracing technology to system operation tasks such as fault diagnosis and performance optimization is briefly introduced. Finally, the data dependency issue, the generality issue, and evaluation metrics issue of distributed tracing are discussed and a perspective of the future research direction in distributed tracing technology is presented.