    Abstract:

    Timeouts are commonly used for failure detection in RPC (remote procedure call) based systems and are typically reported on a per-call basis. Pressure testing on a very large cluster showed that the traditional fixed timeout mechanism causes many unnecessary timeouts, especially when the server is heavily loaded. This paper proposes an Adaptive Scalable RPC Timeout (AST) mechanism that takes network conditions, server load, scalability, and performance into account. Under AST, the timeout value set by the client is adjusted dynamically according to congestion in the network and on the server. Moreover, the server can notify the client to modify the timeout value of an RPC. A series of simulations shows that AST is a more suitable failure detection mechanism for RPC models with timeouts, and that it improves system responsiveness, reliability, and stability without a negative impact on performance, even on large-scale cluster systems.
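
The abstract does not give the AST algorithm itself, so the following is only a minimal sketch of the general idea: a client-side timeout that tracks observed service times and accepts a server notification. It substitutes a well-known smoothed-mean-plus-deviation estimator (the style used for TCP retransmission timers) for the paper's actual method. The class name `AdaptiveTimeout`, the method names, and all default constants are hypothetical.

```python
class AdaptiveTimeout:
    """Illustrative client-side RPC timeout estimator.

    Assumption: this is NOT the paper's AST algorithm. It uses a
    smoothed mean plus k times a smoothed deviation, with an optional
    server-supplied lower bound.
    """

    def __init__(self, initial=1.0, floor=0.1, ceiling=60.0,
                 alpha=0.125, beta=0.25, k=4.0):
        self.initial = initial    # timeout used before any sample (seconds)
        self.floor = floor        # never go below this value
        self.ceiling = ceiling    # never go above this value
        self.alpha = alpha        # gain for the smoothed service time
        self.beta = beta          # gain for the smoothed deviation
        self.k = k                # deviation multiplier
        self.smoothed = None      # smoothed service time, None until first sample
        self.deviation = 0.0      # smoothed absolute deviation
        self.server_hint = 0.0    # minimum timeout requested by the server

    def observe(self, service_time):
        """Record the measured completion time of one successful RPC."""
        if self.smoothed is None:
            self.smoothed = service_time
            self.deviation = service_time / 2
        else:
            err = service_time - self.smoothed
            # Update the deviation first, using the previous smoothed value.
            self.deviation = (1 - self.beta) * self.deviation + self.beta * abs(err)
            self.smoothed = self.smoothed + self.alpha * err

    def on_server_hint(self, min_timeout):
        """Hypothetical server notification: a loaded server asks the
        client to wait at least `min_timeout` seconds."""
        self.server_hint = max(0.0, min_timeout)

    def current(self):
        """Timeout to use for the next RPC, clamped to [floor, ceiling]."""
        if self.smoothed is None:
            base = self.initial
        else:
            base = self.smoothed + self.k * self.deviation
        return min(self.ceiling, max(self.floor, base, self.server_hint))


if __name__ == "__main__":
    timeout = AdaptiveTimeout()
    for sample in (0.2, 0.25, 2.0):   # the last call hit a congested server
        timeout.observe(sample)
        print(f"sample={sample:.2f}s -> next timeout={timeout.current():.3f}s")
    timeout.on_server_hint(10.0)      # server reports heavy load
    print(f"after server hint -> next timeout={timeout.current():.3f}s")
```

In this sketch a slow reply widens the timeout through the deviation term rather than triggering a failure report, and the server hint acts as a lower bound so that a loaded but healthy server is not declared dead. How AST actually combines these signals, and how it keeps the scheme scalable, is described in the paper, not here.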

Citation:

Qian Yingjin, Xiao Nong, Jin Shiyao. An adaptive scalable RPC timeout mechanism for large-scale clusters. Journal of Software, 2010, 21(12): 3199-3210.

History
  • Received: April 28, 2009
  • Revised: August 12, 2009