Survey of State-of-the-art Distributed Tracing Technology
Author: Yang Yong, Li Ying, Wu Zhonghai
Affiliation:

Fund Project:

Key R&D Project of Guangdong Province (2020B010164003)

    Abstract:

    As distributed computing and distributed systems are widely applied in various areas, the question of how to improve the efficiency of system operations, and thereby guarantee the stability and reliability of the services these systems provide, has attracted wide attention from both academia and industry. However, system operation tasks face tough challenges due to the large scale, intricate structure and dependencies, continuous updating, and concurrent service requests of distributed systems. Traditional component-, node-, process-, and thread-centric monitoring and tracing methods are not sufficient to support system operation tasks such as fault diagnosis, performance optimization, and system understanding in a distributed system. Distributed tracing was proposed to address this issue. It identifies all the events belonging to the same request and causally correlates them. It thus depicts the behavior of a distributed system precisely and at fine granularity in a request-centric (or workflow-centric) way, which is critical to improving the efficiency of system operations. This paper presents a comprehensive survey of existing research on, and applications of, distributed tracing technology. A research framework is proposed, and existing research in this field is compared and analyzed within this framework from four perspectives: acquiring tracing data, identifying the events that belong to the same request, determining the causal relationships among these events, and representing the request execution path. Research on applying distributed tracing to system operation tasks such as fault diagnosis and performance optimization is then briefly introduced. Finally, the data dependency, generality, and evaluation metrics issues of distributed tracing are discussed, and future research directions in distributed tracing technology are outlined.
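
    The last three of the four perspectives named in the abstract can be illustrated with a small sketch. It assumes the simplest case, in which every event already carries a propagated request identifier and a pointer to its causal parent; approaches that must infer these links from logs or black-box observations are not shown. The Event fields (trace_id, span_id, parent_id, service, timestamp), the function names, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    """One tracing record. All field names are illustrative assumptions."""
    trace_id: str             # identifier shared by all events of one request
    span_id: str              # identifier of this event
    parent_id: Optional[str]  # span_id of the causal parent, None for the root
    service: str              # component that emitted the event
    timestamp: float          # local time at which the event was recorded


def group_by_request(events: List[Event]) -> Dict[str, List[Event]]:
    """Identify the events that belong to the same request."""
    groups: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        groups[event.trace_id].append(event)
    return dict(groups)


def causal_children(request_events: List[Event]) -> Dict[Optional[str], List[Event]]:
    """Determine causal relationships as a map: parent span_id -> child events."""
    children: Dict[Optional[str], List[Event]] = defaultdict(list)
    # Sorting by timestamp only orders siblings; causality comes from parent_id.
    for event in sorted(request_events, key=lambda e: e.timestamp):
        children[event.parent_id].append(event)
    return dict(children)


def render_path(request_events: List[Event]) -> List[str]:
    """Represent the request execution path as an indented call tree."""
    children = causal_children(request_events)
    lines: List[str] = []

    def visit(parent_id: Optional[str], depth: int) -> None:
        for event in children.get(parent_id, []):
            lines.append("  " * depth + event.service)
            visit(event.span_id, depth + 1)

    visit(None, 0)
    return lines


# Interleaved events from two concurrent requests (made-up sample data).
EXAMPLE_EVENTS = [
    Event("r1", "a", None, "frontend", 1.0),
    Event("r2", "x", None, "frontend", 1.1),
    Event("r1", "b", "a", "auth", 1.2),
    Event("r2", "y", "x", "search", 1.3),
    Event("r1", "c", "a", "db", 1.5),
]

if __name__ == "__main__":
    for trace_id, request_events in group_by_request(EXAMPLE_EVENTS).items():
        print(trace_id)
        for line in render_path(request_events):
            print("  " + line)
```

    The sketch represents each execution path as a tree for brevity. A request with fan-in, where one event has several causal predecessors, would need a directed acyclic graph instead.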

Citation:

Yang Yong, Li Ying, Wu Zhonghai. Survey of State-of-the-art Distributed Tracing Technology. Journal of Software (软件学报), 2020, 31(7): 2019-2039

History
  • Received: May 30, 2019
  • Revised: September 04, 2019
  • Online: April 21, 2020
  • Published: July 06, 2020