P2PSIP Working Group J. Maenpaa Internet-Draft Ericsson Intended status: Standards Track A. Swaminathan Expires: January 7, 2010 S. Das Qualcomm, Inc. G. Camarillo J. Hautakorpi Ericsson July 6, 2009 A Topology Plug-in for REsource LOcation And Discovery draft-maenpaa-p2psip-topologyplugin-00 Status of this Memo This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on January 7, 2010. Copyright Notice Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents in effect on the date of publication of this document (http://trustee.ietf.org/license-info). Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Maenpaa, et al. Expires January 7, 2010 [Page 1] Internet-Draft Topology Plug-in for RELOAD July 2009 Abstract REsource LOcation And Discovery (RELOAD) is a peer-to-peer signaling protocol that can be used to maintain an overlay network, and to store data in and retrieve data from the overlay. This document defines a new topology plug-in for RELOAD that is more appropriate for real world large scale overlays. 
This topology plug-in implements three important functionalities that allow RELOAD to operate under real world constraints. First, it includes a load balancing algorithm that specifies efficient allocation of load to different nodes in the network. Second, the document describes robust techniques for stabilization of fingers and successors and specifies self tuning mechanisms that allow dynamic and automatic adjustment of parameters needed for these advanced techniques in the topology plug-in. Finally, it specifies a locality aware finger selection algorithm that reduces average lookup latency. Maenpaa, et al. Expires January 7, 2010 [Page 2] Internet-Draft Topology Plug-in for RELOAD July 2009 Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 6 3. Need for Advanced Topology Plug-in . . . . . . . . . . . . . . 8 3.1. Need for Load Balancing . . . . . . . . . . . . . . . . . 8 3.2. Need for Robust Stabilization . . . . . . . . . . . . . . 9 3.3. Need for Locality Awareness . . . . . . . . . . . . . . . 9 3.4. Need for Self-Tuning of System Parameters . . . . . . . . 10 4. Chord . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 5. Load Balancing in the Proposed Topology Plug-in for RELOAD . . 11 5.1. Basic Load Balancing scheme . . . . . . . . . . . . . . . 11 5.2. Recommendations on Virtual Servers . . . . . . . . . . . . 12 5.3. Routing . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.4. Obtaining Virtual Server Identities . . . . . . . . . . . 14 5.4.1. Without Enrollment Server . . . . . . . . . . . . . . 14 5.4.2. With Enrollment Server . . . . . . . . . . . . . . . . 15 5.5. Extensions to Overlays with Heterogeneous Nodes . . . . . 16 6. Stabilizing Fingers, Successors, and Predecessors in the Topology Plug-in . . . . . . . . . . . . . . . . . . . . . . . 16 6.1. Choice of Approach to Stabilization . . . . . . . . . . . 16 6.2. 
Update Messages for Stabilization . . . . . . . . . . . . 18 6.3. Finger Stabilization . . . . . . . . . . . . . . . . . . . 21 6.3.1. Locality-aware Finger Selection . . . . . . . . . . . 22 6.4. Successor Stabilization . . . . . . . . . . . . . . . . . 23 6.5. Predecessor Stabilization . . . . . . . . . . . . . . . . 24 6.6. Joining the Overlay . . . . . . . . . . . . . . . . . . . 24 6.6.1. Contents of the Join Message . . . . . . . . . . . . . 25 6.7. Leaving the Overlay . . . . . . . . . . . . . . . . . . . 26 6.7.1. Contents of the Leave Message . . . . . . . . . . . . 26 6.8. Self Tuning System Parameters . . . . . . . . . . . . . . 27 6.8.1. Self Tuning Load Balancing Algorithm Parameters . . . 28 6.8.2. Self Tuning the Stabilization Interval . . . . . . . . 32 7. Security Considerations . . . . . . . . . . . . . . . . . . . 37 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 37 9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 37 10. Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 10.1. Comparison of the Load Balancing Algorithm with Chord . . 38 10.2. Performance of the Load Balancing Algorithm as Network Grows . . . . . . . . . . . . . . . . . . . . . . . . . . 38 11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 40 11.1. Normative References . . . . . . . . . . . . . . . . . . . 40 11.2. Informative References . . . . . . . . . . . . . . . . . . 40 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 42 Maenpaa, et al. Expires January 7, 2010 [Page 3] Internet-Draft Topology Plug-in for RELOAD July 2009 1. Introduction REsource LOcation And Discovery (RELOAD) [1] is a peer-to-peer signaling protocol that can be used to maintain an overlay network, and to store data in and retrieve data from the overlay. For interoperability reasons, RELOAD specifies one overlay algorithm that is mandatory to implement. 
Additionally, RELOAD supports a variety of other overlay algorithms through the use of topology plug-ins. A topology plug-in implements the topology defined by a specific overlay algorithm. This document defines a new topology plug-in for RELOAD that is more appropriate for real world large scale overlays that have to deal with object storage in a fair manner, provide good lookup performance to a variety of applications, and self-organize to deal with churn. This topology plug-in implements three important functionalities that allow RELOAD to operate under these real world constraints. First, it includes a load balancing algorithm that specifies efficient allocation of load to different nodes in the network. Second, the document describes robust techniques for stabilization of fingers and successors and specifies self tuning mechanisms that allow dynamic and automatic adjustment of parameters needed for these advanced techniques in the topology plug-in; this avoids the need for constants that may only work in specific scenarios. Finally, it specifies a locality aware finger selection algorithm that reduces average lookup latency. Load balancing is essential to effectively manage data and provide services on overlays. Load balancing, as an integral part of the overlay, encourages participation by imposing collective fate sharing on the nodes. Without such a scheme, overlay adoption may be significantly affected. However, the mandatory-to-implement RELOAD DHT protocol based on Chord does not support operating the overlay in a load balanced manner. For instance, for a system with N nodes, it can be shown that the imbalance factor, defined as the maximum load on any node on the network divided by the average load, is of the order of O(log2(N)) in the number of items even if all objects are assumed to be homogeneous.
In the case of heterogeneous networks, where the capabilities of nodes (storage and bandwidth) can differ by multiple orders of magnitude, the problem of load balancing becomes more important because the imbalance in load distribution could potentially create a bottleneck in the system. Thus, to enable scalable, real world deployments of RELOAD, the topology plug-in in this document specifies a scheme for load balancing. Peer-to-peer overlay networks face a fundamental challenge with churn: joining and leaving of nodes. This can affect the integrity of the routing structure, cause a node to become disconnected from the overlay, and cause overhead in maintaining consistency. Maenpaa, et al. Expires January 7, 2010 [Page 4] Internet-Draft Topology Plug-in for RELOAD July 2009 Several research studies have been performed on the type of stabilization that can be used in DHT based peer-to-peer overlays such as RELOAD. This document specifies a method for stabilization to deal with churn to allow RELOAD to be scalable and reliable in real world conditions. We specifically prescribe the use of a periodic stabilization routine to counter the undesirable effects of churn on routing. To enable load balancing and stabilization to deal with churn, some parameters need to be set (described later). The use of specific constants for such parameters leads to deployments that may only work in specific scenarios (where the parameters are evaluated). Thus, this document specifies self tuning mechanisms for system parameters to allow RELOAD to scale naturally as network dynamics dictate. For example, as churn increases, the topology plug-in specified in this document adapts by increasing the frequency of stabilization. Similarly, as network size increases, load balancing parameters and DHT parameters are modified to ensure quick finger and successor updates and to keep the load balancing property over time.
To enable self tuning of system parameters, some characteristics such as churn rate and network size are estimated. These characteristics are then used to dynamically adjust the topology plug-in parameters such as the size of the successor set, size of the routing table, and rate of maintenance messages. The benefit of this approach is that it avoids the problem with static parameters. Using static parameters, it is not possible to achieve both a low failure rate and a low communication overhead at the same time. The topology plug-in specified in this document allows the system to take into account the evolution of network conditions and adapt to them. Even with up-to-date fingers and successors, making progress in the identifier space can be expensive in terms of network latency because small progress in identifier space could result in a significant leap in physical distance. The successor and predecessor lists can be used to optimize network latency by relaxing the requirement for finger selection. Specifically, at each overlay hop, as progress is made in the identifier space, small physical distance hops are used so as to avoid high latencies in overall lookup. More details on the proposed locality aware finger selection are described later in this document. In summary, the topology plug-in proposed in this document has the following advantages: o First, building on the RELOAD framework, this document introduces a load balancing algorithm that provides an imbalance factor of the order of O(1). The solution proposed in the document can also be extended to the case of heterogeneous nodes in such a way that Maenpaa, et al.
Expires January 7, 2010 [Page 5] Internet-Draft Topology Plug-in for RELOAD July 2009 the number of objects assigned to a node is proportional to the amount of load it can handle; o Second, this topology plug-in specifies stabilization techniques that deal effectively with churn; o Third, this document proposes self-tuning of parameters required in these algorithms such that users no longer need to tune every DHT parameter correctly for a given operating environment. By periodically computing the network parameters, the system automatically adapts to changing operating conditions; and o Finally, the locality aware finger selection algorithm incorporated with the DHT algorithm further optimizes the selection of successors and predecessors to reduce network latency. This document is organized as follows. We first describe the algorithms for load balancing, stabilization, and locality aware finger selection. Then we describe the major components required to operate the topology plug-in: joining, leaving, routing, dealing with failures etc. Finally, we describe self tuning of the system parameters to make the topology plug-in adjust itself to changing network conditions. 2. Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [2]. This document uses the terminology and definitions from the Concepts and Terminology for Peer-to-Peer SIP [1] draft. Chord ring: The Chord DHT orders identifiers on an identifier circle of size 2^m (m is the number of bits in peer identifiers). This identifier circle is called the Chord ring. DHT: Distributed Hash Tables (DHTs) are a class of decentralized distributed systems that provide a lookup service similar to a hash table. Given a key, any participating peer can retrieve the value associated with that key. 
The responsibility for maintaining the mapping from keys to values is distributed among the peers. Maenpaa, et al. Expires January 7, 2010 [Page 6] Internet-Draft Topology Plug-in for RELOAD July 2009 Finger table: A data structure with up to m entries maintained by each peer in a Chord-based overlay. The ith entry in the finger table of peer n contains the identity of the first primary virtual server that succeeds n by at least 2^(m-i) on the Chord ring. This peer is called the ith finger of peer n. As an example, the first entry in the finger table of peer n contains a peer half-way around the Chord ring from peer n. The purpose of the finger table is to accelerate lookups. log2(N): Logarithm of N with base 2. N: The number of nodes in the overlay unless otherwise specified. Neighborhood set: Consists of successor and predecessor lists. O(g(n)): Informally, saying that some equation f(n) = O(g(n)) means that f(n) is less than some constant multiple of g(n). Omega(g(n)): Informally, saying that some equation f(n) = Omega(g(n)) means that f(n) is more than some constant multiple of g(n). Predecessor list: A data structure containing the predecessors of a peer. Successor list: A data structure containing the successors of a peer. Virtual server: Each physical peer instantiates one or more virtual servers with random IDs that act as peers in the DHT. Primary virtual server: The primary ID assigned to the peer when it joins the network. This location is chosen uniformly at random among the available IDs in [0, 2^m-1]. Secondary virtual server: List of all identities owned by the physical peer excluding the primary virtual server. The locations of the secondary virtual servers are selected so that the ith virtual server is distributed between [vp-i*delta*2^m, vp-(i+1)*delta*2^m] where vp is the primary virtual server. These locations may change to adapt to network dynamics. alpha: Number of virtual servers per physical ID.
This includes one primary virtual server and (alpha-1) secondary virtual servers. Maenpaa, et al. Expires January 7, 2010 [Page 7] Internet-Draft Topology Plug-in for RELOAD July 2009 delta: The value of delta defines the spacing between the virtual servers. The locations of the secondary virtual servers are selected so that the ith virtual server is distributed between [vp-i*delta*2^m, vp-(i+1)*delta*2^m] where vp is the location of the primary virtual server. Routing table: The set of peers that a node can use to route overlay messages. The routing table consists of the finger table, successor list and predecessor list, all populated with primary virtual server IDs. 3. Need for Advanced Topology Plug-in This section details the need for specifying load balancing mechanisms, robust stabilization and locality awareness for a RELOAD topology plug-in to make it efficient and scalable. 3.1. Need for Load Balancing Most DHTs, such as Chord, Pastry, CAN, Tapestry, and the RELOAD base topology plug-in employ uniform hashing to generate object IDs so that the object IDs are uniformly distributed over the ID space (e.g., 0 to D = 2^128 - 1). Objects are then assigned to the nodes based on their IDs and the exact way that this is done differs in different DHTs. Let us assume that there are N nodes in the network. Suppose the node IDs are generated uniformly at random; then the probability that the ith largest nodeID is equal to k is given by: P(nodeID(i) == k) = (N-1)C(i-1) * (k/D)^(i-1) * (N-i)C(1) * (1/D) * (1 - (k+1)/D)^(N-i+1) For large N and D, this distribution can be approximated as follows: P(nodeID(i) == k) = e^(-k*N/D) * (k*N/D)^i * (1/i!) Therefore, the ith largest node ID follows a Poisson distribution. The inter-node distance then follows an exponential distribution with parameter N/D.
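The load-imbalance claim above can be checked with a small, illustrative simulation. This sketch is not part of the protocol: the function name, parameters, and the use of a 32-bit ring are our own choices for demonstration. It places node and item IDs uniformly at random on a ring and assigns each item to its successor node, as plain Chord does.

```python
import bisect
import math
import random

def chord_imbalance(n_nodes, n_items, m=32):
    """Illustrative simulation: place node and item IDs uniformly at
    random on a 2^m identifier ring, assign each item to its
    successor node, and return (max load) / (average load)."""
    space = 2 ** m
    nodes = sorted(random.randrange(space) for _ in range(n_nodes))
    counts = [0] * n_nodes
    for _ in range(n_items):
        k = random.randrange(space)
        # Successor: first node whose ID equals or follows k,
        # wrapping around the ring.
        counts[bisect.bisect_left(nodes, k) % n_nodes] += 1
    return max(counts) / (n_items / n_nodes)

random.seed(1)
# For N = 1000 the factor comes out well above the ideal value of 1,
# on the order of log(N), matching the O(log2(N)) result above.
print(chord_imbalance(n_nodes=1000, n_items=100_000))
```

Running this for growing N shows the maximum load pulling further away from the average, which is the motivation for the load balancing scheme in Section 5.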
As a consequence of this property, the number of data items assigned to a node is proportional to the inter-node distance and, thus, follows an exponential distribution (c denotes an arbitrary constant): Pr(#items in node == x) = c * (N/D) * e^(-c*(N/D)*x) The goodness of Chord for load balancing can be measured in terms of the imbalance metric defined as: imbalance factor = (Maximum # of data items) / (Average # of data items) Maenpaa, et al. Expires January 7, 2010 [Page 8] Internet-Draft Topology Plug-in for RELOAD July 2009 Here, the maximum value is calculated as the one that holds with probability O(1 - (1/N^l)), where l is any constant. The imbalance factor measures the storage load of a server; the longer the interval is, the more data has to be stored in the server. For the case of Chord, this imbalance factor is of the order of O(log2(N)) and there are nodes in the overlay that manage log2(N) times the average node's load. Further, this result suggests that the imbalance factor for Chord increases with the number of nodes in the network: as the number of nodes increases, the maximum number of data items handled per node grows as O(T*(log2(N)/N)), where T denotes the total number of objects on the overlay. Ideally, we would want the maximum value to be close to the average value of (T/N), giving an imbalance factor close to 1 for a good load balancing algorithm. Therefore, the RELOAD base DHT is not very efficient for load balancing and there is a need for load balancing mechanisms in a topology plug-in that is widely usable. 3.2. Need for Robust Stabilization To ensure correct lookups in the presence of churn and to ensure optimal routing of queries, a node needs to ensure that its fingers, successor list, and predecessor list are up to date. However, stabilization of these features comes at a cost: peers in the overlay network need to consume network bandwidth to maintain routing state.
DHTs use stabilization routines to counter the undesirable effects of churn on routing. The purpose of stabilization is to keep the routing information of each peer in the overlay consistent with the constantly changing overlay topology. The current RELOAD base topology plug-in proposes reactive stabilization. Research studies have shown that this approach may not be the most robust to a wide variety of network conditions. In this document, we revisit the choice of stabilization and recommend techniques likely to work efficiently across deployment types. 3.3. Need for Locality Awareness A major performance issue with structured peer-to-peer based topology plug-ins is the need for multi-overlay-hop routing to look up a piece of data such as the SIP Address of Record (AoR) to IP mapping. Since the identifier space is completely random, a lookup for a data item stored 10ms RTT away can potentially take multiple intercontinental hops before being answered, significantly affecting lookup latency. Maenpaa, et al. Expires January 7, 2010 [Page 9] Internet-Draft Topology Plug-in for RELOAD July 2009 Additionally, the fact that routers in such peer-to-peer networks are expected to be constrained non-infrastructure nodes adds additional delay. Locality aware routing aims to mitigate the routing stretch of the topology plug-in so that a lookup is answered by making progress in the identifier space while minimizing the amount of distance traveled in physical space at each hop. 3.4. Need for Self-Tuning of System Parameters Two main advantages of a self-tuning DHT are that users no longer need to tune every DHT parameter correctly for a given operating environment and that the system adapts to changing operating conditions. 4. Chord This document proposes a new topology plug-in for RELOAD that is based on the Chord DHT algorithm. Topology plug-ins allow RELOAD to support a variety of overlay algorithms.
The proposed topology plug-in uses a load balanced self-tuning version of Chord. It can be used as an alternative to the default DHT specified by RELOAD. Chord [4] is a structured peer-to-peer algorithm that uses consistent hashing to build a DHT out of several independent peers. Consistent hashing assigns each peer and resource an m-bit identifier. Peers MUST use SHA-1 as the base hash function to generate the identifiers. The length of the identifiers MUST be m=128 bits. The identifiers are ordered on an identifier circle of size 2^m. On the identifier circle, key k MUST be assigned to the first peer whose identifier equals or follows the identifier of k in the identifier space. The identifier circle is called the Chord ring. Different DHTs differ significantly in performance when bandwidth is limited. It has been shown that when compared to other DHTs, the advantages of Chord include that it uses bandwidth efficiently and can achieve low lookup latencies at little cost [5]. A simple lookup mechanism could be implemented on a Chord ring by requiring each peer to only know how to contact its current successor on the identifier circle. Queries for a given identifier could then be passed around the circle via the successor pointers until they encounter the first peer whose identifier is equal to or larger than the desired identifier. Such a lookup scheme uses a number of messages that grows linearly with the number of peers. To reduce the cost of lookups, Chord also maintains additional routing information; each peer n MUST maintain a data structure with up to m entries, called the finger table. Maenpaa, et al. Expires January 7, 2010 [Page 10] Internet-Draft Topology Plug-in for RELOAD July 2009 The first entry in the finger table of peer n contains the peer half-way around the ring from peer n. The second entry contains the peer that is 1/4th of the way around, the third entry the peer that is 1/8th of the way around, etc.
In other words, the ith entry in the finger table at peer n SHOULD contain the identity of the first peer s that succeeds n by at least 2^(m-i) on the Chord ring. This peer is called the ith finger of peer n. The interval between two consecutive fingers is called a finger interval. The ith finger interval of peer n covers the range [n.id + 2^(m-i), n.id + 2^(m-i+1)) on the Chord ring. In an N-peer network, each peer maintains information about O(log2(N)) other peers in its finger table. As an example, if N=1000, it is sufficient to maintain 10 fingers. To increase robustness in the event of peer failures, each Chord peer MUST maintain a successor list containing the peer's immediate successors on the Chord ring. The successor list will be further described in Section 5.3. Peers MUST also maintain a predecessor list containing the peer's immediate predecessors on the Chord ring. The recommended value for the size of the predecessor list is 3, as is also specified in [1]. 5. Load Balancing in the Proposed Topology Plug-in for RELOAD 5.1. Basic Load Balancing scheme The basic Chord DHT suffers from drawbacks. The uniform hashing used in Chord to generate object IDs results in an imbalance factor close to O(log2(N)); this implies that there are peers in the Chord ring that would have to manage O(log2(N)) times the load that is to be handled by an average peer. In this section, we build upon the basic Chord DHT and present the virtual servers algorithm that results in an imbalance factor close to 1. Several research papers have proposed schemes to provide load balanced DHTs. Some of the schemes are general techniques while others are targeted towards optimizing a specific DHT. The solution proposed in this document takes into account results and ideas from previous work on load balancing in [6], [7], [8], [4], [9], [10], [11], [12], [13], [14], [15], [16].
The basic scheme for load balancing proposed in this document is an extension of the virtual servers approach [4] over Chord with the benefit of reducing the routing state. At a high level, the virtual servers approach improves load balancing as follows. The work in [4] suggested that load balancing can be performed efficiently if each peer simulates a logarithmic number of "virtual servers". Maenpaa, et al. Expires January 7, 2010 [Page 11] Internet-Draft Topology Plug-in for RELOAD July 2009 The method works by associating keys with virtual servers, and mapping multiple virtual servers (with unrelated identifiers) to each real node. The authors show that this will provide a more uniform coverage of the identifier space. For example, if we allocate O(log2(N)) randomly chosen virtual servers to each real node, with high probability each of the N bins will contain O(log2(N)) nodes. The downside of this approach is the additional routing state and its maintenance cost. This document proposes load balancing based on a proposal in [14], and is an improvement over the basic virtual servers approach introduced in [4]. In our approach, each physical node instantiates one or more virtual servers that act as peers in the DHT. The locations of these virtual servers are chosen close to each other in the nodeID space allowing the node to share a single set of overlay links among the virtual servers. The main system parameters of the solution in this document are how many virtual servers should be chosen and what the spacing between the node identifiers of those virtual servers should be. The next sections describe this document's recommendations on virtual nodes and routing for a load balanced DHT. Since the choices of the system parameters depend on network dynamics, this document further discusses dynamically adjusting the topology plug-in with network dynamics and recommends a periodic stabilization model to keep the system parameters up-to-date. 5.2.
Recommendations on Virtual Servers Selection of virtual servers requires two decisions to be made: (1) how many virtual servers (denoted by alpha) should we choose and (2) what should be the namespace spacing (represented by delta*2^m) in between these virtual servers. This section provides the present document's recommendation on choosing alpha and delta. Each physical peer maintains two sets of virtual server identities, namely, primary virtual server identities and secondary virtual server identities. The location of the primary virtual server (denoted as vp) is chosen using existing techniques in RELOAD. This location remains fixed throughout the entire lifetime of the node. In addition to the primary virtual server, each node maintains secondary virtual servers. The locations of these secondary virtual servers are chosen such that the ith virtual server, vs_i, is uniformly distributed in [vp-i*delta*2^m, vp-(i+1)*delta*2^m]. Maenpaa, et al. Expires January 7, 2010 [Page 12] Internet-Draft Topology Plug-in for RELOAD July 2009 The value of delta * 2^m defines the spacing between the virtual nodes and this document requires that nodes MUST choose the value of delta as 1/N where N is the number of nodes in the network. This value was found to perform well in simulations and is consistent with the value reported in [14]. Each node MUST maintain a total of alpha = 2*log2(N) virtual server identities for load balancing where N is the number of nodes in the overlay. This value of alpha includes one primary virtual server and (alpha-1) secondary virtual server identities. The choice of 2*log2(N) for alpha has been found to work well in simulations [14]. While both these parameters, alpha and delta, depend on N (the size of the network), Section 6.8.1 describes an algorithm used to determine alpha and delta without performing explicit network size estimation.
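The parameter choices above can be sketched as follows. This is an illustrative, non-normative sketch: the function name is ours, N is taken here as a known estimate, and the draw within each interval uses Python's random module purely for demonstration.

```python
import math
import random

M = 128              # identifier length in bits (Section 4)
SPACE = 2 ** M

def virtual_server_ids(vp, n_estimate):
    """Sketch of Section 5.2: a node keeps alpha = 2*log2(N) virtual
    server identities in total, spaced delta = 1/N apart, with the
    ith secondary ID drawn uniformly from
    [vp - (i+1)*delta*2^m, vp - i*delta*2^m] (mod 2^m)."""
    alpha = max(1, round(2 * math.log2(n_estimate)))
    delta = 1.0 / n_estimate
    step = int(delta * SPACE)
    ids = [vp]                           # the primary virtual server
    for i in range(1, alpha):
        lo = vp - (i + 1) * step         # lower edge of the interval
        hi = vp - i * step               # upper edge of the interval
        ids.append(random.randrange(lo, hi) % SPACE)
    return ids

random.seed(7)
ids = virtual_server_ids(vp=2**127, n_estimate=1000)
print(len(ids))   # 2*log2(1000) rounds to 20 identities
```

Because all the identities sit within alpha*delta*2^m of vp, the node can share one set of fingers among them, which is the routing-state saving discussed in Section 5.3.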
Note that explicit network size estimation only needs to be performed when self-tuning the DHT parameters, since in that case a more accurate estimate of the network size is needed. 5.3. Routing Assume an identifier space [0, 2^m-1]. As in the basic Chord approach, the recommended topology plug-in maintains two types of routing information, namely, finger tables and neighbor tables. The finger tables are constructed based on the location of the primary virtual server, i.e., the ith finger at peer v contains the identity of the first peer s that succeeds its primary virtual server, vp, by at least 2^(m-i) on the Chord ring. The neighbor table of peer v contains the entries of all nodes which have ownership of any part of the log2(N)/N space that node v owns. As in the case of Chord, each virtual server belonging to the node obtains the successor list of its immediate successor. If this total length is greater than log2(N), the clockwise-farthest entry is dropped to restrict the number of neighbor table entries to log2(N). Further, it can be shown that the maximum number of node IDs that fall within this space is O(log2(N)), resulting in that many neighbor table entries. The added benefit of having log2(N) successors is that if each peer fails independently with probability p, the probability that all log2(N) successors fail simultaneously is only p^log2(N); therefore, the system is much more resilient to failures. To obtain the entries in the peer's neighbor table, the peer starts with the alpha-th secondary virtual ID, denoted as vs_alpha. The peer then proceeds clockwise from vs_alpha on the Chord ring and identifies the virtual IDs of peers that fall in this region. When it encounters a virtual server vs', it stores the location of both the virtual server, vs', and the location of the corresponding primary virtual server, vp'. On the other hand, when it sees its own Maenpaa, et al.
Expires January 7, 2010 [Page 13] Internet-Draft Topology Plug-in for RELOAD July 2009 virtual server, it is dropped and no information is stored. This process is followed until log2(N) distinct primary virtual servers are obtained to populate the neighbor table. It is to be noted that while the size of the successor list may be greater than log2(N) because it includes both the primary and the secondary virtual servers, the number of successor connections is limited to log2(N). To locate any object in the overlay, the nodes MUST first employ the finger tables to reach the virtual server ID that is closest to the query object ID. At the end of this step, the chosen physical node may or may not have the object. If the physical node does not have the object, then it forwards the object request to its neighbors using the neighbor table entries. Therefore the total number of message hops required for locating an object is of the order of O(log2(N)) + 1. Thus, the solution proposed in this document, based on [14], requires maintaining a lower number of connections than that of a pure virtual servers based approach for load balancing. This is because, while the basic Chord scheme with O(log2(N)) virtual servers per physical node requires maintaining O(log2(N)) sets of fingers (and O(log2(N)) routing tables) for each physical node for efficient routing, this document requires maintaining just one set of fingers (and routing table) per physical node. In Section 6, we discuss approaches to keep the fingers and successors up-to-date, which is required to ensure correct lookups. 5.4. Obtaining Virtual Server Identities This section describes the procedure for obtaining the primary and secondary virtual server identities. 5.4.1.
Without Enrollment Server

When a new node, v, joins the system, the first primary node identity vp MUST be computed by applying the digest specified in the self-signed-permitted element of the overlay configuration document to the DER representation of the node's public key. The subsequent (alpha - 1) virtual node IDs MUST be computed as follows: the ith virtual ID is chosen as

   random(vp - i*delta*2^m, vp - (i+1)*delta*2^m)

This ensures that the chosen virtual servers are close to each other and a maximum of log2(N)/N apart.

A node's public key is tied to vp and can be verified by other nodes. The other virtual server IDs, however, cannot be related to this public key, so they need to be verified based on vp. There are two options for doing this verification:

o a. Use a constant fixed offset for the ith virtual server, i.e., exactly (vp - i*delta*2^m). This removes the random component of the virtual server ID selection. Since all virtual server IDs are at fixed intervals from vp, a node can weakly verify the node ID. However, node ID collisions may make this hard.

o b. Use the technique described in Section 5.2, with random selection in an interval. In this case, a node can verify that the virtual server ID is in a small range near vp.

OPEN ISSUE: At the moment we recommend option (b), but this issue needs further analysis.

5.4.2. With Enrollment Server

When an enrollment server is present, the added security benefit of enrollment-server-certified node identities extends to the virtual server identities. The enrollment server is provided with the values of the system parameters alpha and delta, and based on them, it gives out a set of virtual server identities to a node.
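The derivation of vp and the weak verification of option (b) can be illustrated with the following non-normative Python sketch. The function names are hypothetical, and the digest is fixed to SHA-1 purely for illustration; in a real overlay the digest algorithm comes from the self-signed-permitted element of the configuration document.

```python
import hashlib
import random

M = 128          # identifier space [0, 2^M - 1]
MOD = 2 ** M

def primary_id(public_key_der: bytes) -> int:
    # vp: digest of the DER-encoded public key, truncated to M bits.
    # SHA-1 is assumed here purely for illustration.
    return int.from_bytes(hashlib.sha1(public_key_der).digest()[:16], "big")

def secondary_ids(vp: int, alpha: int, delta: float) -> list:
    # The ith virtual ID is drawn uniformly from
    # (vp - (i+1)*delta*2^M, vp - i*delta*2^M), wrapping on the ring.
    ids = []
    for i in range(1, alpha):
        lo = vp - (i + 1) * delta * MOD
        hi = vp - i * delta * MOD
        ids.append(int(random.uniform(lo, hi)) % MOD)
    return ids

def weakly_verify(vs: int, vp: int, alpha: int, delta: float) -> bool:
    # Option (b): accept vs only if it lies within alpha*delta*2^M
    # of vp, measured counter-clockwise from vp.
    offset = (vp - vs) % MOD
    return offset <= alpha * delta * MOD
```

Verification here is deliberately weak: it only confirms that a claimed secondary ID falls in the small region near the certifiable vp.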
Similar to what a node itself does in a deployment without an enrollment server, the enrollment server MUST use a uniform hash function to randomly choose the primary virtual server ID from the space (0, 2^128 - 1); call this virtual server ID vp. The remaining (alpha - 1) node IDs MUST then be chosen by the enrollment server around vp in such a way that the ith virtual server ID of v is uniformly distributed in (vp - i*delta*2^m, vp - (i+1)*delta*2^m). These node IDs are then passed on to the node.

When a node initially joins, it does not have an estimate of alpha and delta, since that estimate is based on the number of fingers in the routing table. Thus, the enrollment server MUST determine the current values of alpha and delta, either from the requests it receives from nodes already in the overlay or from its own diagnostic messages sent into the overlay, and use them to determine the number of virtual servers, the spacing between the virtual servers, and consequently the node IDs handed out to the joining node. If the overlay has formed recently, the enrollment server MAY bootstrap the values of alpha and delta to 20 and 1/1000, respectively. The enrollment server may also choose to use past overlay size estimates it may possess to bootstrap alpha and delta to 2*log2(N_estimated) and 1/N_estimated. For example, an enrollment server for a venue-based overlay that is later torn down can use past size estimates.

Note that the enrollment procedure in RELOAD already provides for handing multiple node identities to the enrolling node. The change needed is to include the values of alpha and delta in the simple enrollment request.

5.5. Extensions to Overlays with Heterogeneous Nodes

This load balancing solution can be extended to overlays whose nodes have heterogeneous capacities. Let the capacity of node v in the overlay be denoted by Cv.
If an overlay has nodes with heterogeneous capacities, the nodes in the overlay MUST use alpha = 2*Cv*log2(N) virtual servers per physical node. The locations of these 2*Cv*log2(N) virtual nodes MUST be chosen near the primary node location vp such that the location of the ith virtual server ID is uniformly distributed in (vp - i*delta*2^m, vp - (i+1)*delta*2^m). Each physical node then maintains a total of O(Cv*log2(N)) neighbor entries and O(Cv*log2(N)) routing fingers. The imbalance factor for this scenario, defined as

                         Maximum # of data items
     imbalance factor = ----------------------------
                        Cv * Average # of data items

can be shown to be close to 1.

Note that in addition to alpha and delta, the node capacity of the node MUST be passed to the enrollment server in the simple enrollment request. This capacity value is used by the enrollment server to calculate Cv as

                           node_capacity(node n)
     Cv(node n) = ------------------------------------
                   sum_{k=1}^{N} node_capacity(node k)

In deployments with no enrollment server, the node MUST estimate Cv(node n) by obtaining its neighbors' node capacities and building an estimate of the average node capacity.

6. Stabilizing Fingers, Successors, and Predecessors in the Topology Plug-in

6.1. Choice of Approach to Stabilization

There are two alternative approaches to stabilization: periodic and reactive. Periodic stabilization can either use a fixed stabilization rate or calculate the stabilization rate in an adaptive fashion. In reactive stabilization, a peer reacts to the loss of a peer in its neighborhood set, or to the appearance of a new peer that should be added to its neighborhood set, by sending a copy of its neighbor table to all peers in the neighborhood set. Periodic recovery, in contrast, takes place independently of changes in the neighborhood set.
In periodic recovery, a peer periodically shares its neighborhood set with each of the members of that set. The mandatory-to-implement Chord DHT algorithm in RELOAD [1] uses reactive stabilization for the neighborhood set, unlike the original Chord algorithm, which uses periodic stabilization. It has been shown in [17] that reactive stabilization works well for small neighborhood sets (i.e., small overlays) and moderate churn. However, in large-scale (e.g., 1000 peers or more [17]) or high-churn overlays, reactive stabilization runs the risk of creating a positive feedback cycle, which can eventually result in congestion collapse. In [17], it is shown that a 1000-peer overlay under churn uses significantly less bandwidth and has lower latencies when periodic stabilization is used than when reactive stabilization is used. Although in the experiments carried out in [17] reactive recovery performed well when there was no churn, its bandwidth use was observed to jump dramatically under churn. At higher churn rates and in larger overlays, periodic stabilization uses less bandwidth, and the resulting lower contention for the network leads to lower latencies. For this reason, most DHTs, including CAN, Chord, Pastry, and Bamboo, use periodic stabilization. As an example, the first version of Bamboo used reactive recovery, which caused Bamboo to suffer from degraded performance under churn. To fix this problem, Bamboo was modified to use periodic stabilization.

In Chord, periodic stabilization is typically done both for successors and for fingers. An alternative strategy is analyzed in [18]. In this strategy, called the correction-on-change maintenance strategy, a peer periodically stabilizes its successors but does not do so for its fingers. Instead, finger pointers are stabilized in a reactive fashion.
The results obtained in [18] imply that although the correction-on-change strategy works well when churn is low, periodic stabilization outperforms the correction-on-change strategy when churn is high. Based on these insights, this document proposes to use periodic stabilization for fingers, successors, and predecessors.

Each peer MUST maintain a stabilization timer. When the stabilization timer fires, the peer MUST restart the timer and carry out the stabilization operations. The stabilization routine is described next; details on computing the stabilization interval are given in Section 6.8.2.

6.2. Update Messages for Stabilization

The stabilization procedures are implemented using Update requests and answers. To describe the contents of these messages, the syntax defined in [1] is used. A Chord Update request is defined as:

   enum { reserved (0), notify(1), succ_stab(2), pred_stab(3),
          full(4), virtualserver_stab_join(5),
          virtualserver_stab_leave(6), (255) } ChordUpdateType;

   struct {
     ChordUpdateType type;
     NodeId          sender_id;
     select(type) {
       case notify:
         uint32      uptime;
         NodeId      sender_virtual_ids <0..2^16-1>;
       case pred_stab:           /* Empty */ ;
       case succ_stab:           /* Empty */ ;
       case virtualserver_stab_join:
         NodeId      sender_virtual_ids <0..2^16-1>;
       case virtualserver_stab_leave:
         NodeId      sender_virtual_ids <0..2^16-1>;
       case full:
         uint32      uptime;
         uint32      alpha;
         uint32      delta;
         NodeId      sender_virtual_ids <0..2^16-1>;
         NodeId      predecessors <0..2^16-1>;
         NodeId      successors <0..2^16-1>;
         NodeId      fingers <0..2^16-1>;
     };
   } UpdateReq;

The "type" field MUST indicate the reason why the Update was sent:

notify: the sender of the Update wishes to notify the recipient of the sender's existence. Upon receiving the Update, the recipient SHOULD insert the sender into its routing table, if appropriate.
The 'notify' message MUST include the virtual server IDs of the sender in the 'sender_virtual_ids' field.

succ_stab: the Update request is related to the successor stabilization routine.

pred_stab: the Update request is related to the predecessor stabilization routine.

virtualserver_stab_join and virtualserver_stab_leave: the Update request is related to the DHT parameter stabilization routine.

full: the Update request contains the entire routing and neighbor table of the sender.

The sender_id field contains the sender's primary virtual ID. The sender_virtual_ids field contains the list of all secondary virtual server IDs belonging to the sender.

If the type of the Update request is 'pred_stab' or 'succ_stab', the request MUST NOT carry any additional information. If the type of the Update request is 'notify', the request MUST contain the sender's current uptime in seconds and the locations of the sender's current virtual server IDs. If the type of the request is 'virtualserver_stab_join' or 'virtualserver_stab_leave', the contents of the message MUST include the list of primary and secondary virtual server IDs of the sender. If the type of the request is 'full', the contents of the message MUST be:

o uptime: The sender's current uptime in seconds;

o alpha: The sender's current value of the DHT parameter alpha;

o delta: The sender's current value of the DHT parameter delta;

o sender_virtual_ids: The sender's list of current virtual server IDs;

o predecessors: The sender's predecessor list;

o successors: The sender's successor list;

o fingers: The sender's finger table.

In the introduced topology plug-in, each peer decides independently the appropriate size for its successor list, predecessor list, finger table, and system parameters (alpha and delta).
Thus, the 'predecessors', 'successors', and 'fingers' fields are of variable length. The number of virtual IDs assigned to a node, along with the spacing between the virtual IDs, is chosen based on the system parameters and may change as the network changes. In keeping with this, the 'sender_virtual_ids' field MUST also be of variable length and includes the list of virtual server IDs assigned to the sender. As specified in RELOAD [1], variable-length fields are preceded on the wire by length bytes. In the case of the successor list, predecessor list, sender_virtual_ids, and finger table, there are two length bytes (allowing lengths up to 2^16-1). The number of NodeId structures included in each field can be calculated from the length bytes, since the size of a single NodeId structure is 16 bytes. If a peer receives more entries than fit into its successor list, predecessor list, or finger table, the peer SHOULD ignore the extra entries. If the peer is assigned more virtual IDs than fit into its ID list, it SHOULD reject the assignment. If a peer receives fewer entries than it currently has in its own data structure, the peer SHOULD NOT drop the extra entries from its data structure.

If the Update request succeeds, the responding peer sends an UpdateAns message, which is defined as:

   enum { reserved (0), notify(1), succ_stab(2), pred_stab(3),
          full(4), virtualserver_stab_join(5),
          virtualserver_stab_leave(6), (255) } ChordUpdateType;

   struct {
     ChordUpdateType type;
     select(type) {
       case full:                      /* Empty */ ;
       case virtualserver_stab_join:   /* Empty */ ;
       case virtualserver_stab_leave:  /* Empty */ ;
       case notify:
         uint32      uptime;
       case pred_stab:
         NodeId      predecessors <0..2^16-1>;
       case succ_stab:
         NodeId      predecessors <0..2^16-1>;
         NodeId      successors <0..2^16-1>;
     };
   } UpdateAns;

If the type of the Update answer is 'full', 'virtualserver_stab_join', or 'virtualserver_stab_leave', the answer MUST NOT carry any additional information. If the type is 'notify', the answer MUST contain the sender's current uptime in seconds. If the type is 'pred_stab', the answer SHOULD carry the predecessor list of the responding peer. If the type is 'succ_stab', the answer SHOULD include the predecessor and successor lists of the responding peer.

6.3. Finger Stabilization

The purpose of the finger stabilization procedure is to incorporate new peers into the finger table. In the procedure, peer v MUST maintain a counter 'next', which stores the index of the next finger that should be stabilized. The counter MUST be initialized to the value one and MUST be incremented by one after each finger stabilization procedure. When the stabilization timer fires, peer v MUST choose one finger interval i from the set of finger_table_size finger intervals it maintains:

   i = next % (finger_table_size + 1),

and send a Probe request addressed to the first identifier belonging to the chosen finger interval i. The peer f responding to the Probe request SHOULD become the ith finger of v. Peer v SHOULD send an Attach request to peer f to initiate a new connection to it.

This document defines a new ProbeInformationType value, 'uptime'. When this value is present in the requested_info field of a Probe request, it indicates that the receiver MUST include in the Probe response its current uptime in a ProbeInformation structure. A Probe request that is sent as part of the finger stabilization procedure MUST contain the 'uptime' ProbeInformationType in its requested_info field. The extended ProbeInformation structure that is returned in the Probe response is defined as:
   enum { responsible_set(1), num_resources(2), uptime(3),
          (255) } ProbeInformationType;

   struct {
     ProbeInformationType type;
     select (type) {
       case responsible_set:
         uint32      responsible_ppb;
       case num_resources:
         uint32      num_resources;
       case uptime:
         uint32      uptime;
     };
   } ProbeInformation;

The types "responsible_ppb" and "num_resources" have been specified in RELOAD [1]. The "uptime" type is new and contains the sender's current uptime in seconds.

6.3.1. Locality-aware Finger Selection

Making progress in the identifier space can be expensive in terms of network latency. The successor and predecessor lists can be used to optimize network latency by relaxing the requirements on finger selection. Specifically, for each finger table entry, a node (say v) first determines a node n that matches the identifier of the finger. It then retrieves the successors and predecessors of n from n. Node v then pings the successors and predecessors of node n and chooses the topologically closest among them as the finger table entry. The sizes of the successor and predecessor lists have an impact on network latency; the greater the number of successors and predecessors, the higher the probability of finding a topologically close finger table entry. Our simulations of the basic Chord protocol with just three successors and three predecessors show a reduction of close to 41-46% in DHT lookup delay. An independent simulation study reported in [19] confirms our results, showing 31% to 40% lookup stretch reductions with 2^16 nodes in Euclidean and transit-stub models. We expect lookup delay to be reduced further in the proposed topology plug-in, which uses log2(N) successors.

6.4.
Successor Stabilization

Both the primary virtual IDs and the secondary virtual IDs of the nodes are stored as part of the successor and predecessor tables. In the successor stabilization routine, a peer v asks the peer s that is the first entry in its successor table for the virtual ID of s's first predecessor, p. If the successor's first predecessor pointer does not point to v's virtual ID but instead to p (for instance, because p has joined the overlay between v and s), the peer with virtual ID p should become v's first successor instead of s. Thus, v adds p to the front of its successor list and notifies p of v's existence, so that p can change its predecessor to v.

Successor lists are also stabilized as part of the successor stabilization routine. In order to do this, peer v copies the successor list of its successor s, removing the last entry and prepending s to it.

If peer v, as a result of running the successor stabilization routine, notices that its successor has failed, it does the following:

o Using the virtual ID, s, of its successor, it consults its virtual ID to physical ID mapping to identify which physical node, say S, has failed. Here, the term 'physical ID' refers to the primary virtual ID of the peer.

o The peer then uses the same virtual ID to physical ID mapping table to identify the locations of other virtual IDs in its successor list that correspond to the physical node S. These IDs are marked for replacement.

o The peer then replaces the successor with the first live entry in its successor list, say n, and contacts this node for its successor list. The peer synchronizes n's successor list with its own, removing the IDs that are marked for replacement. This step is repeated by contacting the other live entries, one after the other, until all the IDs marked for replacement have been updated.

The successor stabilization routine is executed when the stabilization timer fires.
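The failed-successor handling above can be sketched as follows (non-normative Python; the data structures and the fetch_successor_list callback are hypothetical stand-ins for the peer's local state and its message exchanges):

```python
def handle_successor_failure(successor_list, virtual_to_physical,
                             failed_vs, fetch_successor_list):
    """Replace all successor entries owned by the failed physical node.

    successor_list: virtual IDs, nearest successor first.
    virtual_to_physical: virtual ID -> primary virtual ID of its owner.
    failed_vs: virtual ID of the successor observed to have failed.
    fetch_successor_list: returns the successor list of a live peer.
    """
    # Identify the failed physical node via the mapping, then mark
    # every entry owned by that node for replacement.
    failed_physical = virtual_to_physical[failed_vs]
    marked = {vs for vs in successor_list
              if virtual_to_physical[vs] == failed_physical}

    # Drop the marked entries and refill from the live entries'
    # successor lists, one after the other, until the list is back
    # to its target length.
    target_len = len(successor_list)
    result = [vs for vs in successor_list if vs not in marked]
    for peer in list(result):
        if len(result) >= target_len:
            break
        for vs in fetch_successor_list(peer):
            if len(result) >= target_len:
                break
            if vs not in marked and vs not in result:
                result.append(vs)
    return result
```

Because all virtual IDs of one physical node fail together, purging by physical ID (rather than by the single observed virtual ID) avoids routing to other already-dead entries.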
To begin the successor stabilization routine, peer v MUST send an Update request to its first successor s. The type of the Update request MUST be 'succ_stab'. Upon receiving the Update request, peer s MUST send an Update answer to peer v. The Update answer SHOULD include the successor and predecessor lists of peer s. If v learns from the predecessor and successor lists included in the answer that new peers should be included in its neighborhood set, v MUST send Attach requests to the new peers. Once a direct connection has been established with each new peer as a result of the Attach procedure, peer v MUST send an Update request of type 'notify' to each new peer. This allows the new peers to insert v into their neighborhood sets.

6.5. Predecessor Stabilization

The predecessor stabilization routine is executed when the stabilization timer fires. To begin the predecessor stabilization routine, a peer v MUST send an Update request to its predecessor p. The type of the Update request MUST be 'pred_stab'. Upon receiving the Update request, peer p MUST send an Update answer to peer v. The Update answer SHOULD include the predecessor list of peer p. Peer v SHOULD use the predecessor list carried in the answer to update its own predecessor list. If new peers are inserted into the predecessor list, peer v MUST send Attach requests and Update requests of type 'notify' to the new peers in the same way as during the successor stabilization routine.

6.6. Joining the Overlay

The process of joining an overlay is as follows:

1. The Joining Peer (JP) SHOULD connect to a bootstrap peer.

2. The JP SHOULD send an Attach request to the bootstrap peer, which SHOULD route the request towards the Admitting Peer (AP). Here, the AP is the node that is the successor of the JP's primary virtual server. Once the Attach procedure is finished, there is a direct connection between the JP and the AP.
3. The JP SHOULD send a Join request to the AP. The AP returns a Join answer.

4. The AP MUST send an Update request of type 'full' to the JP. The Update request SHOULD contain the contents of the AP's routing table. The JP SHOULD use the contents of the Update request to initialize its finger table and DHT parameters, i.e., alpha and delta. The JP SHOULD set the sizes of its successor list, predecessor list, and finger table to the same values that the AP uses. The values of alpha and delta are also set to the ones that the AP uses. If an enrollment server is present in the overlay, the choice of the DHT parameters alpha and delta can be obtained from it directly.

5. The JP then chooses the virtual server locations vs_i for i = 2, 3, ..., alpha. This step is not performed if there is an enrollment server in the overlay; in that case, the locations of the virtual servers have been obtained from the enrollment server a priori.

6. The JP SHOULD send an Attach request to the successor of vs_alpha, call it n_alpha. The node returns an Attach answer.

7. The peer n_alpha MUST send an Update request of type 'full' to the JP. The Update request SHOULD contain the contents of the node's successor list. The JP SHOULD use the contents of the Update request to initialize its successor and predecessor lists.

8. The JP MUST send Attach requests to initiate connections to each of the peers in its predecessor list, successor list, and finger table. Since the JP is already connected to the AP and n_alpha, there is no need to send new Attach requests to these nodes.

9. The JP MUST send an Update request of type 'notify' to each of the peers in its predecessor and successor lists (except for the AP and n_alpha, which are already aware of the JP).

10. The JP MUST send a Probe request carrying the 'uptime' ProbeInformationType value in the requested_info field to each of its fingers.
This way the JP learns the uptimes of its fingers (the uptimes of predecessors and successors are learned from the Update responses in the previous step). The uptimes are needed when estimating the join rate of peers in the overlay. It should be noted that these Probe requests are not routed via the overlay but are sent on a direct connection.

11. Now the JP takes ownership of regions in the overlay based on its virtual server locations, vs_i. In this step, the JP MUST send one Update message to the successor of each vs_i. The type of the Update message MUST be set to 'virtualserver_stab_join'. The peer, say n_i, receiving this message makes a note of this change and locally updates its ownership locations. Peer n_i then sends an UpdateAns message of type 'virtualserver_stab_join' to the JP acknowledging the change. Peer n_i SHOULD then issue a series of Store requests to the JP to transfer ownership of the resources. This step is repeated for all i = alpha, alpha-1, ..., 1. At the end of this step, all the new virtual servers corresponding to the JP have joined the overlay, and the JP stores the data for these virtual locations.

6.6.1. Contents of the Join Message

This topology plug-in extends the Join request defined in RELOAD [1]. In addition to the joining_peer_id, a node MAY choose to send its virtual server locations as part of the Join message if they were obtained a priori from the enrollment server as part of the enrollment process. The JoinReq message is defined as:

   struct {
     NodeId          joining_peer_id;
     NodeId          joining_peer_virtual_server_ids <0..2^16-1>;
     opaque          overlay_specific_data <0..2^16-1>;
   } JoinReq;

The JoinReq contains the Node-ID which the sending peer wishes to assume and MAY include other virtual server locations that the joining peer wishes to occupy.
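As described earlier for the Update messages, variable-length NodeId lists such as joining_peer_virtual_server_ids are preceded on the wire by two length bytes, and each NodeId structure occupies 16 bytes. The following non-normative sketch shows this encoding for a list of NodeIds:

```python
import struct

NODE_ID_LEN = 16  # a NodeId structure is 16 bytes

def encode_node_id_list(node_ids):
    # Two length bytes (the total byte count of the list, network
    # byte order), followed by the concatenated 16-byte NodeIds.
    body = b"".join(n.to_bytes(NODE_ID_LEN, "big") for n in node_ids)
    return struct.pack("!H", len(body)) + body

def decode_node_id_list(buf):
    # Returns the decoded NodeId list and any remaining bytes.
    (length,) = struct.unpack("!H", buf[:2])
    body, rest = buf[2:2 + length], buf[2 + length:]
    ids = [int.from_bytes(body[i:i + NODE_ID_LEN], "big")
           for i in range(0, length, NODE_ID_LEN)]
    return ids, rest
```

A receiver can thus derive the number of entries directly from the length bytes, which is how a peer detects that more entries were sent than fit into its own tables.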
If the request succeeds, the responding peer responds with a JoinAns message; this is defined as in RELOAD.

6.7. Leaving the Overlay

The process of leaving the overlay is as follows:

1. If no replication is being performed in the overlay, the leaving peer SHOULD issue a series of Store requests to the successor of each of its virtual servers, vs_i, to transfer ownership of the resource records it is storing. Note that if replication is being used, the successor of the peer is already storing replicas and the amount of data transferred can be minimized.

2. The leaving peer MUST send a Leave request to the predecessor and successor of each virtual server, vs_i. The Leave request that is sent to vs_i's successor SHOULD contain the predecessor list of the leaving peer. The Leave request that is sent to vs_i's predecessor SHOULD contain the successor list of the leaving peer. The first successor SHOULD use the predecessor list carried in the Leave request to update its own predecessor list. The first predecessor SHOULD use the successor list carried in the Leave request to update its own successor list.

6.7.1. Contents of the Leave Message

This topology plug-in extends the Leave request defined in RELOAD [1]. In addition to the leaving_peer_id, a node MUST send its virtual server locations as part of the Leave message, as defined in:

   struct {
     NodeId          leaving_peer_id;
     NodeId          leaving_peer_virtual_server_ids <0..2^16-1>;
     opaque          overlay_specific_data <0..2^16-1>;
   } LeaveReq;

The overlay_specific_data field of the Leave request MUST contain a ChordLeaveData structure:

   enum { reserved (0), from_succ(1), from_pred(2),
          (255) } ChordLeaveType;

   struct {
     ChordLeaveType type;
     select(type) {
       case from_succ:
         NodeId      successors <0..2^16-1>;
       case from_pred:
         NodeId      predecessors <0..2^16-1>;
     };
   } ChordLeaveData;

The 'type' field indicates whether the Leave request was sent by a predecessor or a successor of the recipient:

from_succ: The Leave request was sent by a successor.

from_pred: The Leave request was sent by a predecessor.

If the type of the request is 'from_succ', the contents will include the sender's successor list. If the type of the request is 'from_pred', the contents will include the sender's predecessor list.

6.8. Self Tuning System Parameters

This section describes how the system parameters for load balancing and stabilization adapt to changing network conditions.

6.8.1. Self Tuning Load Balancing Algorithm Parameters

All DHTs require constant updating as the size of the network grows. For example, in the case of Chord, each node starts out with m entries in the finger table (if node IDs take values from 0 to 2^m-1). When the number of nodes in the network is small, some of these m entries collapse to the same node. As the number of nodes in the network increases, the finger table entries come to point to distinct nodes, and the number of fingers grows as O(log2(N)). Thus, the number of fingers that each node has to maintain grows as O(log2(N)) as N increases. A similar update step is required by the solution in this document as well, because the system parameters, alpha and delta, also depend upon the value of N.
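The collapse and growth of finger table entries described above can be seen in a small, non-normative simulation: node IDs are drawn at random from a 2^m space and the distinct successors of the finger targets n + 2^(m-i) are counted (a reduced m is used purely to keep the simulation cheap):

```python
import random
from bisect import bisect_left

M = 20  # reduced identifier space (2^M IDs) for the simulation

def distinct_fingers(node_ids, n, m):
    # Count the distinct successors of the targets n + 2^(m-i),
    # i = 1..m; with few nodes, many targets share one successor.
    ring = sorted(node_ids)
    owners = set()
    for i in range(1, m + 1):
        target = (n + 2 ** (m - i)) % (2 ** m)
        j = bisect_left(ring, target)
        owners.add(ring[j % len(ring)])  # wrap around the ring
    return len(owners)

random.seed(7)
small = random.sample(range(2 ** M), 16)
large = random.sample(range(2 ** M), 4096)
f_small = distinct_fingers(small, small[0], M)
f_large = distinct_fingers(large, large[0], M)
# The counts grow roughly like log2 of the number of nodes.
```

A peer can therefore use a change in its number of distinct fingers as an indication that N has roughly doubled or halved, which is exactly the trigger used for parameter stabilization below.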
This section describes how nodes MUST update the values of alpha, delta, and the locations of the virtual servers as the network changes in size. This document proposes to update these system values each time the network size doubles or halves.

In addition to finger, successor, and predecessor stabilization, each peer also needs to perform DHT parameter stabilization. Unlike the other stabilization routines, which are run periodically when the stabilization timer fires, the parameter stabilization routine is run whenever the number of fingers is observed to change. Performing stabilization in this way is sufficient for updating the load balancing parameters (i.e., alpha and delta) because this is not the highest-priority update and the overlay would function effectively even without it. In the proposed topology plug-in, we thus employ lazy updates for stabilizing the parameters alpha and delta. More specifically, we use a change in the number of fingers as the trigger for performing a DHT parameter update. The parameter stabilization routine is executed when the number of fingers of a node changes. We consider the two possible scenarios that can arise:

1. The number of connections decreases: Let f be the lost connection. The peer v MUST first send a request addressed to the ID of the lost connection to see if there is any response. Let n be the ID of the node responding to the request. If n is the same as f, the peer MUST re-connect to f. If n is not the same as f and does not correspond to any node that is already in the finger table of v, then v MUST send an Attach request to n. In these two cases, the parameter stabilization routine is not performed, since the number of fingers does not change. On the other hand, if n is not the same as f and corresponds to some other node in the finger table of v, then v concludes that the network size has decreased and invokes the parameter stabilization routine.

2. The number of connections increases: In this case, v invokes the parameter stabilization routine.

When the parameter stabilization routine is invoked, the following functions are performed:

o Step 1: The (alpha-1) virtual servers corresponding to v leave the overlay.

o Step 2: The values of alpha and delta are updated and the new values are obtained. The corresponding virtual server locations are also modified to obtain the new vs_i.

o Step 3: The node v joins the overlay at the new virtual server locations, vs_i.

o Step 4: The successors and predecessors of v are updated.

During the parameter stabilization routine, the portion of the data space owned by the corresponding physical node changes, and so data needs to be moved to match this change. The locations of successors and predecessors also change to adjust to the updates in the virtual server locations. However, the fingers do not change, because they are associated with the primary virtual ID and the location of the primary node ID is fixed.

In Step 1 of the stabilization routine (see above), the peer v MUST send one Update message to the successor of each virtual server, vs_i. The type of the Update message MUST be set to 'virtualserver_stab_leave'. The peer n_i receiving this message makes a note of this change and locally updates its ownership locations. Peer n_i then sends an UpdateAns message of type 'virtualserver_stab_leave' to vs_i acknowledging the change. On receiving the UpdateAns message, peer v SHOULD issue a series of Store requests to n_i to transfer ownership of the resources. This step is repeated for all i = 2, ..., alpha. At the end of this step, all the virtual servers corresponding to v have left the overlay, and peer v no longer stores the data corresponding to these virtual locations.

In Step 2, the peer v obtains its new virtual server locations, vs_i.
In the presence of an enrollment server, the peer v MUST request a new set of identities from the enrollment server and stop participating in the overlay network with the old node identities. The enrollment server MAY choose to implement diagnostics using the mechanisms in [3] to ascertain that the node requesting identities is not requesting more or fewer identities than it should, based on the finger table sizes of a random sampling of other nodes in the overlay.

In the absence of the enrollment server, the following procedure is implemented to obtain the new values of alpha, delta, and vs_i.

   UpdateAlphaDelta() {
     If (number_of_fingers increases by 1) {
       // Can be detected as in the case of Chord when finger tables
       // are updated.  This implies that the network size has about
       // doubled.
       delta = delta/2;  // update delta
       // Update existing virtual server locations.
       // vs_1 = vp, as the location of the primary virtual server
       // does not change.
       for i = 2 to alpha
         vs_i = vp - (vp - vs_i)/2;
       // Choose new virtual server locations between
       // (vp - alpha*delta*2^m, vp - (alpha+1)*delta*2^m) and
       // (vp - (alpha+1)*delta*2^m, vp - (alpha+2)*delta*2^m).
       vs_(alpha+1) = random (vp - alpha*delta*2^m,
                              vp - (alpha+1)*delta*2^m);
       vs_(alpha+2) = random (vp - (alpha+1)*delta*2^m,
                              vp - (alpha+2)*delta*2^m);
       alpha = alpha + 2;
     }
     If (number_of_fingers decreases by 1) {
       // Can be detected as in the case of Chord when finger tables
       // are updated.  This implies that the network size has about
       // halved.
       // Leave the network with these virtual IDs.
       Remove virt_server_(alpha-1);
       Remove virt_server_(alpha);
       delta = delta * 2;  // update delta
       alpha = alpha - 2;  // update alpha
       // Update existing virtual server locations.
       // vs_1 = vp, as the location of the primary virtual server
       // does not change.
       for i = 2 to alpha
         vs_i = vp - (vp - vs_i) * 2;
     }
   }

Once the locations of the virtual servers have been obtained, the peer re-joins the overlay at these positions. During this step (Step 3), the peer v MUST send one Update message to the successor of each vs_i. The type of the Update message MUST be set to 'virtualserver_stab_join'. The peer n_i receiving this message makes a note of this change and locally updates its ownership locations. Peer n_i then sends an UpdateAns message of type 'virtualserver_stab_join' to v acknowledging the change. Peer n_i SHOULD then issue a series of Store requests to v to transfer ownership of the resources. In this process, the virtual servers n_i are added to the successor list of v. This step is repeated for all i = alpha, alpha-1, ..., 2. At the end of this step, all the new virtual servers corresponding to v have joined the overlay, and peer v stores the data for these virtual locations. If replication is performed in the overlay, some of the data may already be present at node v, which reduces the number of Store requests.

In Step 4, the successor and predecessor stabilization routines are invoked as described in Section 6.4 and Section 6.5, respectively. This completes the parameter stabilization routine.

6.8.2. Self-Tuning the Stabilization Interval

To ensure that lookups produce correct results as the set of participating peers changes and to ensure that all peers' connections are up to date, each peer MUST run a stabilization protocol periodically in the background. The stabilization protocol uses three operations: finger stabilization, successor stabilization, and predecessor stabilization, as defined earlier. Each peer MUST maintain a stabilization timer. When the stabilization timer fires, the peer MUST restart the timer and carry out the stabilization operations.
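Returning to the UpdateAlphaDelta() procedure above, it can be expressed as runnable Python roughly as follows. This is a sketch under this document's assumptions; the list representation of virtual server locations and the use of real-valued identifier arithmetic (rather than modulo-2^m integer arithmetic) are simplifications of this sketch.

```python
import random

def update_alpha_delta(change, vp, vs, alpha, delta, m):
    """Lazy DHT parameter update triggered by a finger-count change.

    change -- +1 if the number of fingers grew, -1 if it shrank
    vp     -- location of the primary virtual server (vs[0], fixed)
    vs     -- virtual server locations [vs_1, vs_2, ..., vs_alpha]
    Returns the updated (vs, alpha, delta).
    """
    if change == +1:                     # network size about doubled
        delta = delta / 2
        # Pull existing servers halfway toward the primary location;
        # vs_1 = vp does not change.
        vs = [vp] + [vp - (vp - v) / 2 for v in vs[1:]]
        # Two new servers in the next two delta-wide slots below vp.
        vs.append(random.uniform(vp - (alpha + 1) * delta * 2**m,
                                 vp - alpha * delta * 2**m))
        vs.append(random.uniform(vp - (alpha + 2) * delta * 2**m,
                                 vp - (alpha + 1) * delta * 2**m))
        alpha += 2
    elif change == -1:                   # network size about halved
        vs = vs[:-2]                     # drop the two last servers
        delta = delta * 2
        alpha -= 2
        # Push the remaining servers twice as far from vp.
        vs = [vp] + [vp - (vp - v) * 2 for v in vs[1:]]
    return vs, alpha, delta
```

Note that a doubling followed by a halving returns the retained virtual servers to their original locations, so repeated size oscillations do not drift the server placement.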
In this section, we present methods to compute the stabilization timer. When periodic stabilization is used, one faces the problem of selecting an appropriate execution rate for the stabilization procedure. If the execution rate of periodic stabilization is high, changes in the system are detected quickly, but at the cost of increased communication overhead. On the other hand, if the stabilization rate is low and the churn rate is high, routing tables become inaccurate and DHT performance deteriorates. Thus, the problem is to set the parameters so that the overlay achieves the desired reliability and performance even in challenging conditions, such as under heavy churn. This naturally results in high cost during periods when the churn level is lower than expected, or alternatively, in poor performance or even network partitioning in worse-than-expected conditions.

The current approach is to configure overlays statically. This works only in situations where perfect information about the future is available. In situations where the operating conditions of the network are known in advance and remain static throughout the lifetime of the system, it is possible to choose fixed optimal values for parameters such as stabilization rate, neighborhood set size, and routing table size. However, if the operating conditions (e.g., the size of the overlay and its churn rate) do not remain static but evolve with time, it is not possible to achieve both a low lookup failure rate and a low communication overhead by using fixed parameters [20].

As an example, to configure the Chord DHT algorithm, one needs to select appropriate values for the successor list size and the stabilization interval. To select an appropriate value for the stabilization interval, one needs to know the expected churn rate and overlay size.
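Given the expected churn rate and overlay size, the Chord sizing rules can be computed with a back-of-the-envelope calculation. The following illustrative Python sketch combines the Omega(log2^2(N)) maintenance bound from [21] with the O(log2(N)) successor-list rule from [4]; the function names and the representation of churn as seconds per failure are assumptions of this sketch.

```python
import math

def chord_stabilization_interval(n, seconds_per_failure):
    """Interval at which each peer should stabilize so that
    log2^2(N) maintenance rounds occur before N/2 peers fail [21]."""
    time_for_half_to_fail = (n / 2) * seconds_per_failure
    return time_for_half_to_fail / math.log2(n) ** 2

def chord_successor_list_size(n):
    """Successor list on the order of log2(N) [4]."""
    return math.ceil(math.log2(n))
```

For a 500-peer overlay with one failure every 30 seconds, this yields an interval of roughly 93s with nine successors; doubling the churn halves the interval to roughly 46-47s, and with six-fold churn in a 2000-peer overlay the interval drops to roughly 42s with eleven successors, matching the examples below.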
According to [21], a Chord network in a ring-like state remains in a ring-like state as long as peers send Omega(log2^2(N)) messages before N new peers join or N/2 peers fail. Thus, in a 500-peer overlay churning at a rate such that one peer joins and one peer leaves the network every 30 seconds, an appropriate stabilization interval would be on the order of 93s. According to [4], the size of the successor list should be on the order of log2(N). Having a successor list of size O(log2(N)) makes it unlikely that a peer will lose all of its successors, which would cause the Chord ring to become disconnected. Thus, in a 500-peer network, each peer should maintain on the order of nine successors. However, if the churn rate doubles and the network size remains unchanged, the stabilization rate should double as well; that is, the appropriate maintenance interval would now be on the order of 46s. On the other hand, if the churn rate becomes, e.g., six-fold and the size of the network grows to 2000 peers, on the order of eleven successors should be maintained and the stabilization interval should be on the order of 42s. If one continued using the old values, the result could be inaccurate routing tables, network partitioning, and deteriorating performance.

The proposed topology plug-in takes into consideration the continuous evolution of network conditions and adapts to them. Each peer collects statistical data about the network and adaptively adjusts its stabilization rate and successor list size based on an analysis of the data [20]. Reference [22] shows that by using a self-tuning mechanism, it is possible to achieve high reliability and performance even in adverse conditions with low maintenance cost. Adaptive stabilization has been shown to outperform periodic stabilization in terms of both lookup failures and communication overhead [20]. The following sub-sections specify methods to determine the
appropriate stabilization rate in an adaptive fashion. The proposed mechanism is based on [22][21][20].

To calculate an appropriate stabilization rate, the values of three parameters MUST be estimated: overlay size N, failure rate U, and join rate L. Peers in the overlay MUST re-calculate the values of these parameters at the end of each stabilization period, before re-starting the stabilization timer, in order to self-tune the algorithm.

6.8.2.1. Estimating Overlay Size

Techniques for estimating the size of an overlay network have been proposed, for instance, in [22] [23] [24] [25] and [20]. In Chord, the density of peer identifiers in the successor set can be used to produce an estimate of the size of the overlay, N [22]. Since peer identifiers are picked randomly with uniform probability from the m-bit identifier space, the average distance between peer identifiers in the successor set is (2^m)/N. To estimate the overlay network size, a peer MUST compute the average inter-peer distance d between successive peers, starting from the most distant predecessor and ending at the most distant successor in the successor list. The estimated network size MUST be calculated as:

            2^m
       N = -----
             d

This estimate has been found to be accurate within 15% of the real network size [20]. Of course, the size of the neighborhood set affects the accuracy of the estimate. When a peer joins the network, the admitting peer sends the joining peer a copy of its neighborhood set. Thus, a joining peer immediately has enough information at its disposal to calculate an estimate of the network size.

6.8.2.2. Estimating Failure Rate

A typical approach is to assume that peers join the overlay according to a Poisson process with rate L and leave according to a Poisson process with rate parameter U [22]. The value of U can be estimated using peer failures in the finger table and neighborhood set [22].
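The size estimate of Section 6.8.2.1 can be sketched as follows. This is illustrative Python only; the representation of the neighborhood as a ring-ordered list of integer IDs is an assumption of this sketch.

```python
def estimate_overlay_size(neighborhood_ids, m):
    """Estimate N from the density of IDs in the neighborhood set.

    neighborhood_ids -- peer IDs ordered along the ring, from the most
                        distant predecessor to the most distant
                        successor (including the peer itself)
    m                -- number of bits in the identifier space
    """
    # Average inter-peer distance d between successive peers,
    # taken modulo 2^m in case the neighborhood wraps around zero.
    gaps = [(b - a) % 2**m
            for a, b in zip(neighborhood_ids, neighborhood_ids[1:])]
    d = sum(gaps) / len(gaps)
    return 2**m / d          # N = 2^m / d
```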
If peers fail at rate U, a peer with M unique peer identifiers in its routing table should observe K failures in time K/(M*U). Every peer in the overlay MUST maintain a history of the last K failures. The current time MUST be inserted into the history when the peer joins the overlay. The estimate of U MUST be calculated as:

             k
       U = --------,
           M * Tk

where M is the number of unique peer identifiers in the routing table, Tk is the time between the first and the last failure in the history, and k is the number of failures in the history. If k is smaller than K, the estimate is computed as if there were a failure at the current time. It has been shown that an estimate calculated in a similar manner is accurate within 17% of the real value of U [20].

The size of the failure history K affects the accuracy of the estimate of U. One can increase the accuracy by increasing K. However, this has the side effect of decreasing responsiveness to changes in the failure rate. On the other hand, a small history size may cause a peer to overreact each time a new failure occurs. In [20], K is set to 25% of the routing table size.

6.8.2.3. Estimating Join Rate

Reference [20] proposes that a peer can estimate the peer join rate based on the uptimes of the peers in its routing table. An increase in the peer join rate will be reflected by a decrease in the average age of the peers in the routing table. Thus, each peer MUST maintain an array of the ages of the peers in its routing table, sorted in increasing order. Using this information, an estimate of the global peer join rate L MUST be calculated as:

            N          1
       L = --- * ---------------,
            4     Ages[rsize/4]

where Ages is an array containing the ages of the peers in the routing table sorted in increasing order and rsize is the size of the routing table.
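The two estimators above can be sketched together in illustrative Python. The argument names and the exact treatment of a failure history shorter than K are assumptions of this sketch, not specified by the draft.

```python
def estimate_failure_rate(failure_times, K, M, now):
    """U = k / (M * Tk): k observed failures among M unique routing
    table entries over the window Tk (first to last failure)."""
    history = sorted(failure_times)
    if len(history) < K:
        # Fewer than K recorded failures: compute the estimate as if
        # a failure occurred at the current time.
        history.append(now)
    k = len(history)
    t_k = history[-1] - history[0]
    return k / (M * t_k)

def estimate_join_rate(ages, n_estimate):
    """L = (N/4) * 1/Ages[rsize/4], based on the youngest 25% of the
    peers in the routing table."""
    ages = sorted(ages)                # increasing order
    rsize = len(ages)
    return (n_estimate / 4) / ages[rsize // 4]
```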
Only the ages of the youngest 25% of the peers in the routing table SHOULD be used, to reduce the bias that a small number of peers with very old ages can cause [20]. It has been shown that the estimate obtained using this method is accurate within 22% of the real join rate [20]. Of course, the size of the routing table affects the accuracy.

In order for this mechanism to work, peers need to exchange information about the time they have been present in the overlay. Peers learn the uptimes of their successors and predecessors when adding the successors and predecessors to their routing tables, since Update requests and answers of type 'notify' carry uptime values. Peers learn the uptimes of their fingers because the Probe responses sent as part of the finger stabilization routine carry uptime values. A joining peer learns the admitting peer's uptime, since an Update request of type 'full' contains uptime information.

6.8.2.4. Calculating the Stabilization Interval

According to [21], a Chord network in a ring-like state remains in a ring-like state as long as peers send Omega(log2^2(N)) messages before N new peers join or N/2 peers fail. We can use the estimate of the peer failure rate, U, to calculate the time Tf in which N/2 peers fail:

               1
       Tf = -------
             2*U

Based on this estimate, a stabilization interval Tstab-1 is calculated as:

                      Tf
       Tstab-1 = -----------
                  log2^2(N)

Further, the estimated join rate L can be used to calculate the time in which N new peers join the overlay. Based on the estimate of L, a stabilization interval Tstab-2 is calculated as:

                        N
       Tstab-2 = ---------------
                  L * log2^2(N)

Finally, the actual stabilization interval Tstab that SHOULD be used is obtained by taking the minimum of Tstab-1 and Tstab-2.
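Putting the three estimates together, the interval computation can be sketched as follows (illustrative Python; the function and argument names are assumptions of this sketch, and the 15s floor reflects the lower limit proposed below based on [26]):

```python
import math

def stabilization_interval(n_est, failure_rate, join_rate, floor=15.0):
    """Self-tuned Tstab = min(Tstab-1, Tstab-2), clamped at 15s.

    n_est        -- estimated overlay size N
    failure_rate -- estimated peer failure rate U
    join_rate    -- estimated global join rate L
    """
    log2_sq = math.log2(n_est) ** 2
    t_f = 1 / (2 * failure_rate)       # time for N/2 peers to fail
    t_stab_1 = t_f / log2_sq
    t_stab_2 = n_est / (join_rate * log2_sq)
    return max(min(t_stab_1, t_stab_2), floor)
```

A peer would re-run this computation at the end of each stabilization period, using its freshest estimates of N, U, and L, before re-starting the stabilization timer.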
The results obtained in [26] indicate that making the stabilization interval too small has the effect of making the overlay less stable (e.g., in terms of detected loops and path failures). Thus, a lower limit should be used for the stabilization period. Based on the results in [26], a lower limit of 15s is proposed, since using a stabilization period smaller than this will, with high probability, cause too much traffic in the overlay.

7. Security Considerations

One important concern for the virtual servers approach is that it is not easily enforceable. Given a set of virtual server IDs for a physical node, it must be ensured that the physical node actually takes ownership of all the virtual server IDs assigned to it. In the presence of a centralized enrollment server, this server can ensure that each physical node gets its share of the number of virtual servers. In the absence of the enrollment server, the verification can be performed during the join process. For instance, in the case of Chord, when a new node joins the overlay, it contacts its neighbors to obtain neighbor relations and finger table entries. A similar approach can be adopted in the case of the virtual servers approach. In this scenario, each physical node provides its neighbors with a list of 2 log2(N) virtual server IDs. The neighbors first verify that the number of virtual server locations received is close to the number of virtual server locations that they themselves own before sending the finger table entries to the joining node. In this way, it can be ensured that each node has O(log2(N)) virtual locations on the overlay.

8. IANA Considerations

(a) A new overlay algorithm type should be defined for the proposed new topology plug-in.
(b) This document defines one new Probe Information Type value:

       +-----------------+------+---------------+
       | Probe Option    | Code | Specification |
       +-----------------+------+---------------+
       | uptime          | 3    | RFC-AAAA      |
       +-----------------+------+---------------+

(c) Other IANA considerations are TBD.

9. Acknowledgments

This document benefited from design discussions with Vidya Narayanan from Qualcomm Inc.

10. Appendix

This appendix lists a few performance results for the load balancing solution proposed in this document.

10.1. Comparison of the Load Balancing Algorithm with Chord

Compared to the virtual markers approach in [7], the solution proposed in this document does not require O(log2(N)) nodes to change their location upon each node arrival. For instance, in the case of the virtual markers approach, if O(log2(N)) nodes change location, then O(log2(N)/N) of the data objects need to be reassigned and at least O(2 log2(N)) nodes are involved in this step. This is in addition to the messages required for updating routing tables, which is O(log2^3(N)). In contrast to the virtual markers approach, the proposed solution requires only O(1/N) of the data objects to be reassigned, and around O(log2(N)) nodes send data to the joining node. The number of routing messages required is still O(log2^2(N)), as in the case of Chord with no virtual servers; this is because the number of connections is still O(log2(N)) links per node.

We performed simulation studies to compare the performance of the solution in this document with Chord. In order to study the percentage of nodes with significant load imbalance, we look at the q-percentile load imbalance factor, defined as the ratio of the q-percentile load to the average load. For example, a 99-percentile imbalance factor of 5 implies that less than 1 percent of the nodes in the system have a load more than 5 times the average load.
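The q-percentile load imbalance factor used in this comparison can be computed as follows (illustrative Python; the nearest-rank-style percentile index is an assumption of this sketch, as the draft does not specify the interpolation method):

```python
def imbalance_factor(loads, q):
    """Ratio of the q-percentile load to the average load.

    loads -- per-node load values
    q     -- percentile in (0, 100]
    """
    ordered = sorted(loads)
    # Nearest-rank-style index of the q-th percentile.
    idx = min(len(ordered) - 1, int(len(ordered) * q / 100.0))
    avg = sum(ordered) / len(ordered)
    return ordered[idx] / avg
```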
We tested the solution in this document with N = 2^(15) and compared the results with Chord. We found that the 99-percentile imbalance factor for the solution in this document is around 2. In comparison, our results with Chord suggest that the 100-percentile imbalance factor is around 9, the 99-percentile imbalance factor is around 4.5, and so on. This result suggests that in Chord, around 1% of the nodes have an imbalance between 4.5 and 9, 2% have an imbalance greater than 4, and so on, compared to the solution in this document, where less than 1% of the nodes have an imbalance greater than 2. With regard to routing, our simulations compared the average route length for Chord (with one virtual server) with that of this document's solution (with 2 log2(N) virtual servers); the results showed that both DHT implementations require around the same number of hops.

10.2. Performance of the Load Balancing Algorithm as the Network Grows

In this section, we take a closer look at the performance of the solution in this document, in terms of load imbalance factor and routing state maintenance, as the network size grows.

(1) Load Imbalance: In our simulations, we fix the value of the estimated network size at Nest = 2^9 and choose alpha = 18 and delta = 2^(-9). The actual network size is then increased from 2^2 to 2^(14). When N < Nest, the spacing between virtual servers (chosen as 1/Nest) is very small and the solution in this document behaves similarly to Chord. In this case, the load imbalance factor is of the order of O(log2(N)). However, since the numerical value of N is small, the imbalance factor is still low, around 3 for most nodes, as demonstrated by our simulations. As the network size N increases, the spacing delta comes into effect and the virtual servers help balance the load.
Our simulations indicate that the imbalance is around its lowest value of 2 when N = Nest, and increases slowly as N becomes larger than Nest.

(2) Routing State: Here, we examine the performance of this solution with regard to the amount of routing state that it needs to maintain as the network grows. We analyze the solution in this document analytically to determine the number of hops required to reach a destination as a function of N and Nest. We perform the analysis in two steps:

   *  [Step 1] Routing within a distance of 1/N from the destination: If nodes in this solution employ only their primary virtual servers (and the corresponding O(log2(N)) connections) for routing, as in the case of Chord, it can be shown that any discretization built upon the solution in this document would be able to get within a distance of 1/N from the destination node in O(log2(N)) hops with O(log2(N)) fingers per node. This result is unaffected by the relative values of N and Nest.

   *  [Step 2] Last mile: The number of hops required to route messages from within a distance of 1/N to the exact destination (referred to as the last-mile problem) depends upon the exact realization of the DHT and its construction of neighbor tables. In this document, each node has alpha virtual servers chosen uniformly at random, separated by a distance O(delta*2^m). Therefore, the number of nodes within a 1/N distance from a chosen location would be of the order of O(log2(N) + N*log2(1 + s*alpha*delta)), where 's' is a constant independent of N. If alpha and delta are chosen using Nest, then the number of nodes within a 1/N distance can be approximated, for large Nest, to be of the order of O(log2(N) + N*log2(Nest)/Nest). Therefore, a random walk would lead to the final destination within O(log2(N) + N*log2(Nest)/Nest) hops.
This result indicates that when N < Nest, the solution in this document performs similarly to Chord, and messages can be routed to any node in O(log2(N)) hops by maintaining O(log2(N)) fingers and O(log2(N)) successors. On the other hand, when N is much larger than Nest, O(N) hops might be required to reach the destination if the number of fingers per node is O(log2(N)). However, this situation arises only when the network size estimate is much lower than N and the values of alpha and delta are not updated as the network grows.

11. References

11.1. Normative References

[1] Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and H. Schulzrinne, "REsource LOcation And Discovery (RELOAD) Base Protocol", draft-ietf-p2psip-base-02 (work in progress), March 2009.

[2] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[3] Yongchao, S., Jiang, X., Even, R., and D. Bryan, "P2PSIP Overlay Diagnostics", draft-ietf-p2psip-diagnostics-01 (work in progress), June 2009.

11.2. Informative References

[4] Stoica, I., Morris, R., Karger, D., Kaashoek, M., and H. Balakrishnan, "Chord: A scalable peer-to-peer lookup service for internet applications", In Proc. of ACM SIGCOMM, 2001.

[5] Li, J., Stribling, J., Gil, T., and M. Kaashoek, "Comparing the performance of distributed hash tables under churn", In Proc. of the 3rd International Workshop on Peer-to-Peer Systems, 2004.

[6] Bienkowski, M., Korzeniowski, M., and F. Meyer auf der Heide, "Dynamic load balancing in distributed hash tables", In Proc. of IPTPS, 2005.

[7] Karger, D. and M. Ruhl, "Simple efficient load balancing algorithms for peer-to-peer systems", In 3rd International Workshop on Peer-to-Peer Systems (IPTPS), 2004.

[8] Awerbuch, B. and C. Scheideler, "Group spreading: A protocol for provably secure distributed name service", In Proc. of the
31st Int. Colloquium on Automata, Languages, and Programming (ICALP), 2004.

[9] Fraigniaud, P. and C. Gavoille, "The content-addressable network D2B", Technical Report 1349, LRI, Univ. Paris-Sud, France, 2003.

[10] Kaashoek, F. and D. Karger, "Koorde: A simple degree-optimal distributed hash table", In Proc. of the 2nd International Workshop on Peer-to-Peer Systems (IPTPS), 2003.

[11] Naor, M. and U. Wieder, "Novel architectures for peer-to-peer applications: the continuous-discrete approach", In Proc. of the 15th ACM Symp. on Parallel Algorithms and Architectures (SPAA), 2003.

[12] Byers, J., Considine, J., and M. Mitzenmacher, "Simple load balancing for distributed hash tables", In 2nd International Workshop on Peer-to-Peer Systems (IPTPS), 2003.

[13] Manku, G., "Routing Networks for Distributed Hash Tables", In Proc. of the Principles of Distributed Computing (PODC), 2003.

[14] Godfrey, B. and I. Stoica, "Heterogeneity and load balance in Distributed Hash Tables", IEEE INFOCOM, 2005.

[15] Godfrey, B., Lakshminarayanan, K., Surana, S., Karp, R., and I. Stoica, "Load balancing in dynamic structured peer-to-peer systems", In 23rd Conference of the IEEE Communications Society (INFOCOM), 2004.

[16] Rao, A., Lakshminarayanan, K., Surana, S., Karp, R., and I. Stoica, "Load balancing in structured peer-to-peer systems", In 2nd International Workshop on Peer-to-Peer Systems (IPTPS), 2004.

[17] Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz, "Handling churn in a DHT", In Proc. of the USENIX Annual Technical Conference, 2004.

[18] Krishnamurthy, S., El-Ansary, S., Aurell, E., and S. Haridi, "Comparing maintenance strategies for overlays", In Proc. of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2008.

[19] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., Kaashoek, M., Dabek, F., and H.
Balakrishnan, "Chord: A scalable peer-to-peer lookup protocol for internet applications", IEEE/ACM Transactions on Networking, 2003.

[20] Ghinita, G. and Y. Teo, "An adaptive stabilization framework for distributed hash tables", 20th International Parallel and Distributed Processing Symposium, 2006.

[21] Liben-Nowell, D., Balakrishnan, H., and D. Karger, "Observations on the dynamic evolution of peer-to-peer networks", In Proc. of the First International Workshop on Peer-to-Peer Systems, 2002.

[22] Mahajan, R., Castro, M., and A. Rowstron, "Controlling the cost of reliability in peer-to-peer overlays", In Proc. of the 2nd International Workshop on Peer-to-Peer Systems, 2003.

[23] Horowitz, K. and D. Malkhi, "Estimating Network Size from Local Information", Information Processing Letters, Volume 88, Issue 5, pp. 237-243, 2003.

[24] Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and A. Demers, "Decentralized schemes for size estimation in large and dynamic groups", Fourth IEEE International Symposium on Network Computing and Applications, pp. 41-48, 2005.

[25] Binzenhofer, A., Kunzmann, G., and R. Henjes, "A scalable algorithm to monitor chord-based peer-to-peer systems at runtime", 20th International Parallel and Distributed Processing Symposium, 2006.

[26] Maenpaa, J. and G. Camarillo, "A study on maintenance operations in a Chord-based Peer-to-Peer Session Initiation Protocol overlay network", Accepted to the Sixth International Workshop on Hot Topics in P2P Systems (HotP2P 2009), 2009.

[27] Ktari, S., Zoubert, M., Hecker, A., and H. Labiod, "Performance evaluation of replication strategies in DHTs under churn", In Proc. of the 6th International Conference on Mobile and Ubiquitous Multimedia, 2007.
Authors' Addresses

   Jouni Maenpaa
   Ericsson
   Hirsalantie 11
   Jorvas 02420
   Finland

   Email: Jouni.Maenpaa@ericsson.com

   Ashwin Swaminathan
   Qualcomm, Inc.
   5775 Morehouse Dr
   San Diego, CA
   USA

   Phone: +1 858-845-8775
   Email: sashwin@qualcomm.com

   Saumitra M. Das
   Qualcomm, Inc.
   3195 Kifer Road
   Santa Clara, CA
   USA

   Phone: +1 408-533-9529
   Email: saumitra@qualcomm.com

   Gonzalo Camarillo
   Ericsson
   Hirsalantie 11
   Jorvas 02420
   Finland

   Email: Gonzalo.Camarillo@ericsson.com

   Jani Hautakorpi
   Ericsson
   Hirsalantie 11
   Jorvas 02420
   Finland

   Email: Jani.Hautakorpi@ericsson.com