Adaptive Scalable RPC Timeout Mechanism for Large Scale Clusters

Volume 21, Issue 12, 2010, pp. 3199-3210

Authors: QIAN Ying-Jin, XIAO Nong, JIN Shi-Yao

Abstract:

Timeouts, typically reported on a per-call basis, are commonly used for failure detection in RPC (remote procedure call) based systems. During stress testing on a very large cluster, the traditional fixed timeout mechanism was found to cause many unnecessary timeouts, especially when the server is heavily loaded. This paper proposes an Adaptive Scalable RPC Timeout (AST) mechanism that takes network conditions, server load, scalability, and performance into account. Under this mechanism, the timeout value set by clients is adjusted dynamically according to congestion in the network and on the server. Moreover, the server can notify the client to modify the timeout value of an RPC. A series of simulations demonstrates that AST is a more suitable failure detection mechanism for RPC models with timeouts, and that it improves system responsiveness, reliability, and stability without a negative impact on performance, even on large-scale cluster systems.

Key words: RPC (remote procedure call); failure detection; timeout; large scale; scalability; responsiveness; reliability
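
The abstract gives the idea behind AST but not its formulas, so the following is only a minimal sketch of what a client-side adaptive timeout with server notification might look like. It is not the authors' algorithm. The class name `AdaptiveTimeout`, its fields, the floor and ceiling values, and the smoothing rule (borrowed from the well-known TCP retransmission timer estimator) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveTimeout:
    """Hypothetical per-server timeout estimator (not the paper's AST)."""

    min_timeout: float = 1.0      # assumed floor, in seconds
    max_timeout: float = 600.0    # assumed ceiling, in seconds
    alpha: float = 0.125          # gain for the smoothed mean (TCP-style)
    beta: float = 0.25            # gain for the smoothed deviation
    smoothed: Optional[float] = None  # smoothed observed RPC service time
    deviation: float = 0.0        # smoothed deviation of the service time
    server_hint: float = 0.0      # latest estimate pushed by the server

    def observe(self, service_time: float) -> None:
        """Fold the measured duration of a completed RPC into the estimate."""
        if self.smoothed is None:
            self.smoothed = service_time
            self.deviation = service_time / 2
        else:
            error = abs(self.smoothed - service_time)
            self.deviation = (1 - self.beta) * self.deviation + self.beta * error
            self.smoothed = (1 - self.alpha) * self.smoothed + self.alpha * service_time

    def notify_from_server(self, expected_service_time: float) -> None:
        """Record a hint the server sends when it is congested."""
        self.server_hint = expected_service_time

    def current(self) -> float:
        """Timeout to use for the next RPC, clamped to [min, max]."""
        estimate = 0.0 if self.smoothed is None else self.smoothed + 4 * self.deviation
        wanted = max(self.min_timeout, estimate, self.server_hint)
        return min(self.max_timeout, wanted)


timeout = AdaptiveTimeout()
timeout.observe(2.0)              # one RPC took 2 s
timeout.notify_from_server(30.0)  # a loaded server asks for more patience
print(timeout.current())
```

Taking the maximum of the client's own estimate and the server's hint reflects the two inputs the abstract names: congestion observed by the client and load reported by the server. A fixed timeout corresponds to ignoring both and always returning a constant.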

Citation

QIAN Ying-Jin, XIAO Nong, JIN Shi-Yao. Adaptive Scalable RPC Timeout Mechanism for Large Scale Clusters. Journal of Software, 2010, 21(12): 3199-3210

History

Received: April 28, 2009
Revised: August 12, 2009
