CCAMP | C. Villamizar, Ed. |
Internet-Draft | Infinera Corporation |
Intended status: Informational | March 07, 2011 |
Expires: September 08, 2011 |
Use of Multipath with MPLS-TP and MPLS
draft-villamizar-mpls-tp-multipath-01
Many MPLS implementations have supported multipath techniques and many MPLS deployments have used multipath techniques, particularly in very high bandwidth applications, such as provider IP/MPLS core networks. MPLS-TP has discouraged the use of multipath techniques. Some degradation of MPLS-TP OAM performance cannot be avoided when operating over current high bandwidth multipath implementations.
The tradeoffs involved in using multipath techniques with MPLS and MPLS-TP are described. Requirements are discussed which enable full MPLS-TP compliant LSP including full OAM capability to be carried over MPLS LSP which are traversing multipath links. Other means of supporting MPLS-TP coexisting with MPLS and multipath are discussed.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 08, 2011.
Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Today the requirement to handle large aggregations of traffic can be handled by a number of techniques which we will collectively call multipath. Multipath applied to parallel links between the same set of nodes includes Ethernet Link Aggregation [IEEE-802.1AX], link bundling [RFC4201], or other aggregation techniques some of which may be vendor specific. Multipath applied to diverse paths rather than parallel links includes Equal Cost MultiPath (ECMP) as applied to OSPF, ISIS, or BGP, and equal cost LSP, as described in Section 4. Various multipath techniques have strengths and weaknesses described in Section 4.2.
The term composite link is more general than terms such as link aggregation (which is specific to Ethernet) or ECMP (which implies equal cost paths within a routing protocol). The use of the term composite link here is consistent with the broad definition in [ITU-T.G.800]. Multipath is very similar to composite link, but specifically excludes inverse multiplexing.
Identical load balancing techniques are used for multipath both over parallel links (for example IP/MPLS over Ethernet link aggregation) and over diverse paths (for example, IP ECMP, IP/MPLS ECMP over multiple LSP or link bundling over LSP component links).
Large aggregates of IP traffic do not provide explicit signaling to indicate the expected traffic loads. Large aggregates of MPLS traffic are carried in MPLS tunnels supported by MPLS LSP. LSP which are signaled using RSVP-TE extensions do provide explicit signaling which includes the expected traffic load for the aggregate. LSP which are signaled using LDP do not provide an expected traffic load.
MPLS LSP may contain other MPLS LSP arranged hierarchically. When an MPLS LSR serves as a midpoint LSR in an LSP carrying other LSP as payload, there is no signaling associated with these client (inner) LSP. Therefore even when using RSVP-TE signaling there may be insufficient information provided by signaling to adequately distribute load across a multipath link.
A set of label stack entries that is unique across the ordered set of label numbers can safely be assumed to contain a group of (one or more) flows. The reordering of MPLS traffic (except MPLS-TP) can therefore be considered to be acceptable unless reordering occurs within traffic containing a common unique set of label stack entries. Existing load splitting techniques take advantage of this property in addition to looking beyond the bottom of the label stack and determining if the payload is IPv4 or IPv6 to load balance traffic based on IP addresses.
A large aggregate of IP traffic may be subdivided into groups of flows using a hash on the IP source and destination addresses. IP microflows are described in [RFC2475] and clarified in [RFC3260]. For MPLS traffic that is not carrying IP, a similar hash can be performed on the set of labels in the label stack. These techniques subdivide traffic into groups of flows for the purpose of load balancing traffic across the aggregated capacity of a multipath link.
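The subdivision into groups of flows described above might be sketched as follows. This is a minimal illustration, not a description of any deployed implementation: the function names, the use of SHA-256, and the 1024-entry hash space are all assumptions made for the example, and forwarding hardware would be expected to use a far cheaper hash function.

```python
import hashlib
import ipaddress
from typing import Sequence

NUM_GROUPS = 1024  # assumed size of the hash space (hypothetical value)


def _bucket(key: bytes, num_groups: int) -> int:
    """Reduce an arbitrary byte string to a flow group index."""
    digest = hashlib.sha256(key).digest()  # stand-in for a hardware hash
    return int.from_bytes(digest[:4], "big") % num_groups


def ip_flow_group(src: str, dst: str, num_groups: int = NUM_GROUPS) -> int:
    """Group IP traffic by source and destination address only."""
    key = ipaddress.ip_address(src).packed + ipaddress.ip_address(dst).packed
    return _bucket(key, num_groups)


def label_stack_flow_group(labels: Sequence[int],
                           num_groups: int = NUM_GROUPS) -> int:
    """Group non-IP MPLS traffic by the 20-bit labels in the label stack."""
    key = b"".join(label.to_bytes(3, "big") for label in labels)
    return _bucket(key, num_groups)
```

All packets with the same address pair (or the same label stack) map to the same group, so reordering cannot occur within a group as long as each group is kept on a single component link.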
Attempting to resolve years of discussion as to whether a hash based approach provides a sufficiently even load balance using any particular hashing algorithm or method of distributing traffic across a set of component links is outside of the scope of this document. For the purpose of discussing existing widely deployed implementations, it is sufficient to say that hash based techniques have proven to be at least satisfactory through their widespread deployment (and the continued growth of that deployment for more than two decades).
The current load balancing techniques are referenced in [RFC4385] and [RFC4928], though few specifics are provided in these two RFCs. The use of three hash based approaches is described in [RFC2991] and [RFC2992], though other techniques with very similar outcome are used. A means to identify flows within pseudowires (when flows are present, since not all PW types contain discernible flows) is described in [I-D.ietf-pwe3-fat-pw].
MPLS-TP OAM violates the assumption made in prior multipath implementations that it is safe to reorder traffic within an LSP. This assumption is common (if not universal) in multipath implementations which use hashing techniques for load balancing. The use of multipath can impact CC/CV (connectivity check, connectivity verification) and LM (loss measurement) and DM (delay measurement) [I-D.ietf-mpls-tp-oam-framework].
MPLS-TP CC/CV, DM, and LM OAM packets must take the same path as the payload. If the label stack for the payload contains an LSP and a PW label beneath it (one of one or more additional PW labels), then the payload will be load split over the multipath. The OAM packets will have a GAL label beneath the LSP label [RFC5586]. With no other label beneath the GAL label, the OAM traffic will take only one path and the set of PW will take multiple paths (though any one PW will take one path if a flow label is not used).
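The divergence between OAM and payload paths can be illustrated with a small sketch. Only the GAL label value (13, from [RFC5586]) is taken from a specification; the LSP label, the PW label range, the number of component links, and the use of CRC-32 as the load split hash are arbitrary assumptions chosen for the example.

```python
import zlib

GAL = 13             # Generic Associated Channel Label value [RFC5586]
NUM_COMPONENTS = 4   # assumed number of component links (hypothetical)
LSP_LABEL = 1000     # arbitrary example label
PW_LABELS = range(2000, 2016)  # arbitrary example PW labels


def component_for(labels, num_components):
    """Pick a component link from a hash over the whole label stack."""
    key = b"".join(label.to_bytes(3, "big") for label in labels)
    return zlib.crc32(key) % num_components  # CRC-32 is a stand-in hash


# OAM packets carry the GAL beneath the LSP label, so they hash one way.
oam_component = component_for([LSP_LABEL, GAL], NUM_COMPONENTS)

# Each PW hashes on its own label and may land on any component.
pw_components = {pw: component_for([LSP_LABEL, pw], NUM_COMPONENTS)
                 for pw in PW_LABELS}

# PW whose forwarding path is not exercised by the LSP-level OAM packets.
uncovered = [pw for pw, c in pw_components.items() if c != oam_component]
print(f"OAM on component {oam_component}; "
      f"{len(uncovered)} of {len(pw_components)} PW on other components")
```

With any reasonably uniform hash, a substantial share of the PW can be expected to fall on components other than the one carrying the OAM packets.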
With the current OAM CC/CV definition and current multipath practices, OAM CC/CV functionality may not cover the forwarding path for a particular PW within the LSP at any given multipath along the path. The existing OAM CC/CV will provide a check for the condition where the entire multipath becomes unavailable (goes down or the particular LSP is preempted due to reduced multipath capacity).
There is no assurance that DM OAM is measuring the delay of the forwarding path for a particular PW within the LSP with the current OAM DM definition and current multipath practices. In addition, if packets are reordered, OAM LM accuracy can be (and generally is) affected.
The existing multipath techniques address specific requirements. MPLS-TP requirements are in conflict with multipath, at least as currently implemented.
The underlying requirements that motivated the current use of multipath are not in conflict with the use of MPLS-TP. Section 3 describes these requirements in greater detail. Section 4 describes current practices in greater detail. Section 5 describes means of better supporting both MPLS-TP and multipath requirements.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].
This section enumerates two sets of requirements. The first set includes those requirements imposed by the need for scalability and very large capacity links and very large capacity LSP and are enumerated in Section 3.1. The second set of requirements are those imposed by the needs of MPLS-TP and are enumerated in Section 3.2. Discussion of these requirements is provided in Section 3.3.
Section 4 describes multipath techniques which are implemented and deployed today. Section 5 enumerates derived requirements which focus on means to support the requirements in Section 3.1 and Section 3.2 with minimal modifications to existing multipath techniques. A summary of recommendations is provided in Section 6.
Networks today may support thousands or tens of thousands of nodes in total. This large number of nodes is typically arranged in tiers to improve scalability through aggregation of signaling and aggregation of traffic. The innermost tier, most commonly referred to as the network core, may support interconnection of adjacent sites with hundreds of gigabits or terabits of capacity.
The physical interface of choice today is 10GbE with migration toward 100GbE expected to begin in the near future. SONET and OTN are also in use, but are today also limited to 10Gb/s or 40Gb/s, with 100Gb/s availability (OTN ODU4) expected in the near future. With core link capacities of terabits today and tens of terabits expected in the near future, multipath is needed.
MPLS-TP requirements related to multipath are primarily related to prohibiting out-of-order delivery of traffic for reasons of OAM fate sharing. Specific requirements related to OAM are provided in "MPLS-TP OAM Framework", Section 4.6, Section 5.5.3, and Section 6.2.3 [I-D.ietf-mpls-tp-oam-framework].
The following requirement is currently met with no changes to existing multipath implementations.
The following requirement can only be met with existing multipath techniques using MPLS link bundling [RFC4201] if LSR are configured to place an LSP on only a single component rather than splitting some or all LSP across the set of components. Using link bundling with all LSP constrained to use a single component has well known disadvantages (see Section 4.2.3). Other forms of multipath as currently defined do not meet this requirement (see Section 4.2).
The remaining MPLS-TP requirements are related to the scale of a deployed MPLS-TP network and have the greatest impact on the network core. These are practical requirements mostly related to scalability but specific to MPLS-TP.
LSP which are configured entirely from the management plane rather than through use of a control plane need not use the MPLS PSC portion of the hierarchy as specified in RFC 4206, however hierarchy is still needed in the label stack.
There is a tradeoff between making use of MPLS-TP as a server layer for the benefits of MPLS-TP and the benefits of using MPLS. The benefits of MPLS-TP include the ability to run without the OSPF-TE, ISIS-TE, and RSVP-TE control protocols, and MPLS-TP OAM. The benefits of MPLS include more efficient use of multipath capacity due to removal of MPLS-TP constraints.
A requirement for very large server layer traffic flows within the network core can be accommodated using multiple parallel MPLS-TP LSP. This increases the number of LSP required which itself is a drawback. This also results in a bin packing problem if the service bearing MPLS-TP LSP do not require the same capacity and are not all small multiples of a common capacity increment. For example, if LSP are not all 10Gb/s, or they are not only 10Gb/s and 40Gb/s, then bin packing problems can occur. This use of MPLS-TP can also result in less opportunity for statistical multiplexing with very large aggregates of lower priority non-TP IP/MPLS traffic (see Section 4.2.3 and Section 5.2.2 for further details on bin packing problems and loss of efficiency with MPLS-TP as a server layer).
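The bin packing problem can be made concrete with a small sketch. First-fit decreasing is used here only as a familiar packing heuristic; it is an assumption of the example rather than a method attributed to any implementation, and the capacities are in whole Gb/s for simplicity.

```python
def first_fit_decreasing(demands, capacity):
    """Pack LSP bandwidth demands onto components of equal capacity.

    Each LSP must fit entirely on one component (no load splitting).
    Returns a list of components, each a list of the demands placed on it.
    """
    components = []
    for demand in sorted(demands, reverse=True):
        if demand > capacity:
            raise ValueError("LSP exceeds the capacity of a single component")
        for placed in components:
            if sum(placed) + demand <= capacity:
                placed.append(demand)
                break
        else:
            # No existing component has room, so another one is needed.
            components.append([demand])
    return components


# Five 6Gb/s LSP cannot share 10Gb/s components at all.
awkward = first_fit_decreasing([6, 6, 6, 6, 6], capacity=10)
# Four 5Gb/s LSP are a small multiple of the capacity and pack cleanly.
clean = first_fit_decreasing([5, 5, 5, 5], capacity=10)
print(len(awkward), len(clean))
```

In the first case five components (50Gb/s) are consumed to carry 30Gb/s of reservations, where a multipath that could split the load would need only three; in the second case no capacity is stranded.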
The following subsections provide further detail related to the requirements enumerated in Section 3.1 and Section 3.2.
Midpoint LSR must support a very large number of LSP. This places requirements on the ILM size. If a control plane is used this also places requirements on the speed of processing RSVP-TE messages. As long as RSVP-TE ERO contain only strict hops, the processing is limited to connection admission, label assignment, and forwarding hardware programming of the label swap operation.
The MPLS label entry is 32 bits of which the label itself is 20 bits [RFC3032]. This allows 2^20 or 1,048,576 values minus the 16 reserved label values. The Incoming Label Map (ILM) (see RFC 3031, Section 1.11 [RFC3031]) is generally much smaller. Circa 2000, ILM sizes of 4K-32K were common. Circa 2010, ILM sizes of 64K-256K are more common in core LSR.
Putting a bound on ILM size has two effects. It allows LSR that offer higher power and space density. For deployments which use a control plane and support restoration, speed of restoration is dramatically improved when a smaller number of LSP are supported.
For some architectures, bounding the ILM size allows the ILM to be supported without forwarding memory external to the forwarding IC. This is a practical consideration as the power reduction and board space reduction can allow an LSR to achieve higher power and space density.
Reducing external memories reduces power consumed and therefore reduces cooling problems. In addition there are board space reductions. This results in reduced space as well as power.
In today's networks, which predominantly use MPLS/GMPLS OSPF-TE or ISIS-TE and RSVP-TE signaling, the computational limitations described in Section 3.3.2.2 are the limiting factor. Reduction in space and power due to smaller ILM are then a secondary consequence of the signaling scaling issue.
In a network tier with N nodes, a worst case cutset has N/2 nodes on either side of the cutset. Given that a full mesh of LSP connectivity is needed in the network core, the cutset therefore carries N^2/4 LSP. For example, if N is 400, the cutset carries a minimum of 40,000 LSP to achieve a full mesh. If the core has over 2,000 nodes, then the cutset carries over 1,000,000 LSP. Since the MPLS label space is only 20 bits, a full mesh within an entire provider network with no hierarchy could easily exceed the MPLS label number space. Use of hierarchy can solve this problem.
Typically there is more than one LSP between any pair of LSR in the network core. Protection is one source of additional LSP. More than one LSP may be required to carry traffic with very different requirements. See Section 3.3.1.4.
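The cutset arithmetic above can be checked with a short sketch. The function names are hypothetical, one LSP per node pair is assumed, and the ILM sizes used are the 64K and 256K figures quoted earlier in this section.

```python
LABEL_BITS = 20
RESERVED_LABELS = 16
USABLE_LABELS = 2 ** LABEL_BITS - RESERVED_LABELS  # 1,048,560


def cutset_lsp(n):
    """LSP crossing a worst case cutset that splits n nodes in half.

    Equals N^2/4 for even N, assuming one LSP per pair of nodes.
    """
    half = n // 2
    return half * (n - half)


def max_full_mesh_nodes(ilm_size):
    """Largest N whose worst case cutset still fits in the ILM."""
    n = 2
    while cutset_lsp(n + 1) <= ilm_size:
        n += 1
    return n


print(cutset_lsp(400))                 # 40000
print(cutset_lsp(2000))                # 1000000
print(max_full_mesh_nodes(64 * 1024))  # 512
print(max_full_mesh_nodes(256 * 1024)) # 1024
```

These bounds assume a single LSP per node pair; with additional LSP for protection or differing service requirements, the supportable mesh size is correspondingly smaller.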
The result is that even considering only the ILM size, the number of nodes in a full mesh of LSP must be limited to well under 1,000. If two links in a cutset supporting a large number of LSP incur a fault, then the nodes bordering the remaining links in the cutset must process a very large number of RSVP-TE PATH and RESV messages and the connection admission requests and ILM allocation operations that are required as a result.
A full mesh of N nodes will have N*(N-1) unidirectional LSP or N*(N-1)/2 bidirectional LSP if there is only one LSP with any given pair of nodes as ingress and egress. There may be more than one LSP with any given pair of nodes as ingress and egress to meet protection requirements or to meet certain quality of service requirements.
If GMPLS protection [RFC4426] is used, the number of LSP is doubled with end-to-end (path) protection, but more than doubled with span protection. If MPLS FRR [RFC4090] is used, the number of LSP is increased only slightly with the (more common) facility backup technique, but more than doubled with the one-to-one backup technique.
All services between a pair of core nodes may be carried over a single unsignaled E-LSP [RFC3270] if the eight TC values [RFC5462] are sufficient and the requirements of these services are sufficiently similar. If more than eight PHB are required, more LSP will be required. If services require preemption, or have different protection needs, then multiple LSP per pair of core nodes are required. If services have different delay requirements, this too may require multiple LSP per pair of core nodes.
The total number of LSP at a cutset needs to be constrained for two reasons. First the number of LSP must fit in the 20 bit label field or the smaller number of labels supported by most LSR. Second is a need to reduce the amount of signaling that would be required if restoration was needed to cover a multiple fault (if restoration is not supported multiple faults can result in otherwise avoidable outages which persist until a physical repair or manual intervention is completed).
Where traffic enters a provider network tier such as the core, LSR serve as ingress to PSC LSP if hierarchy is used. If RSVP-TE signaling is used, ingress must perform CSPF if fully dynamic MPLS routing is used. Even when working and protection paths are configured with explicit paths computed offline, when a multiple fault occurs, if restoration is supported, then CSPF must be run. It is this multiple fault scenario which generally dictates scalability.
Dynamic routing is necessary in order to provide restoration which is as robust as possible in the presence of multiple faults while still providing efficient utilization of resources.
Legacy transport networks offer protection which requires dedicated protection resources. If resources are allocated through the management plane, then restoration support is either not provided at all or extremely slow at best. More modern transport equipment which supports fast restoration requires signaling which is generally provided using GMPLS.
IP/MPLS networks typically make use of protection which offers sharing of protection resources or more commonly make use of zero bandwidth allocation on protection paths. The use of zero bandwidth allocation provides robust protection of preferred traffic as long as preferred traffic is given queuing priority and preferred traffic levels are low enough that adequate protection resources are available for preferred traffic regardless of the protection path taken. This assumption is not violated in networks which are dominated by Internet traffic and carry a minority of preferred traffic.
When a single fault occurs, protection should restore traffic flow quickly, with a typical target being 50 msec. Many deployments are configured such that LSR run CSPF after a fault to obtain a new protection path for what is now effectively the working path, or reroute the working LSP and then create a new protection LSP.
Multiple faults which are not accounted for by SRLG are fairly common. In many cases, such as earthquake, bridge collapse, train wreck, flood, it is impractical to account for the specific multiple fault in the SRLG set. When this does occur, fast restoration is often required for a large number of LSP for which both the working and protect paths are affected. In this case, a long convergence time would result in a more lengthy outage for those LSP for which the multiple fault was service affecting.
For core Internet services and for many non-Internet core services, an inability to reach any one point in the network from another for a significant length of time due to a fault which is correctable, even if it is a multiple fault, is unacceptable. These services require restoration at some layer.
For most core networks MPLS/GMPLS signaling is required at some layer for reasons described in Section 3.3.2.1. In order for restoration to occur quickly, scaling issues must be considered and addressed, including network topology impacts on scaling. These scaling issues are dominated by CSPF computations and OSPF or ISIS flooding impact.
For a given ingress in a full mesh of LSR, a fault can result in a very large number of affected LSP. At midpoint LSR the worst case number of connection acceptance decisions can be very large. The computational load per LSP on connection acceptance at midpoint LSR is small but the reflooding of available bandwidth can also contribute significant load.
At LSP ingress, the number of CSPF computations imposes scaling limitations. CSPF computation time is proportional to the number of nodes in a mesh and the total number of links. If the average node degree remains constant, then the total number of links is proportional to the number of nodes. The result is a single CSPF time with order N*log2(N) time complexity (where N is the number of nodes in the mesh). If the worst case number of LSP affected by a fault also grows proportionally to N, then the total amount of computation is order N^2*log2(N). The amount of computation grows at a rate greater than the square of the growth in the number of nodes.
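The growth rate described above can be illustrated numerically. The sketch below uses unit constants throughout, so the values are relative work only and not time estimates; the function names and the assumption that exactly N LSP are affected are made purely for illustration.

```python
import math


def cspf_cost(n):
    """Relative cost of one CSPF run, order N*log2(N)."""
    return n * math.log2(n)


def restoration_cost(n):
    """Relative worst case cost when the number of affected LSP grows as N."""
    affected_lsp = n  # assumed proportionality constant of 1
    return affected_lsp * cspf_cost(n)


def growth_factor(n_small, n_large):
    """How much more computation the larger mesh needs."""
    return restoration_cost(n_large) / restoration_cost(n_small)


# Doubling the mesh from 500 to 1,000 nodes more than quadruples the work.
print(round(growth_factor(500, 1000), 2))
```

Doubling N multiplies the work by 4*log2(2N)/log2(N), which is always somewhat more than four.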
If restoration is not supported, any multiple fault will result in a lengthy outage. If restoration is supported, constraining the size of a full mesh will very significantly reduce the CSPF computation load and the reflooding overhead and very significantly improve the worst case restoration time.
Multipath load split based on hashing the IP addresses or MPLS labels is far from perfect, though it is widely implemented and widely deployed. For the vast majority of traffic, which is predominantly Internet traffic, the underlying assumption that traffic is quite evenly distributed across a hash space is valid. For a mix of Internet traffic and fairly persistent large microflows, adaptive multipath has proven effective (see Section 4.1.2).
The bandwidth reservations of LSP carrying Internet traffic are merely predictions of required capacity. Often a significant percentage of traffic can shift among a set of LSP. A great deal of efficiency is gained in the presence of such shifts through the ability to dynamically share the available capacity on a multipath.
The introduction of a minority of higher priority (and higher gross margin) services to predominantly Internet traffic yields an additional opportunity to make more efficient use of capacity. These higher priority services on average significantly underutilize their guaranteed capacities. The average over the entire set of such services is fairly predictable. The capacity allocated to these services but unused can be used as Internet capacity. Some small probability exists that these services will make use of significantly more capacity than predicted, up to their guaranteed capacities, but the consequence of this unlikely occurrence is a reduction in capacity available to the Internet traffic for which capacity is not guaranteed. This practice allows high margin services to be delivered at substantially lower cost with very little risk to Internet traffic and no risk at all to the higher priority services.
For the reasons above, current multipath techniques offer efficient use of multipath capacity. Changes to multipath MUST NOT sacrifice this efficiency where it is not necessary to meet other requirements.
Multipath takes many forms. These include the use of ECMP in various protocols, Ethernet Link Aggregation, and Link Bundling. The specifications for each of these forms of multipath provide limited characterization of external behavior, where any guidance is provided at all. This section summarizes current practices among products which are currently or have in the past been deployed successfully in Internet service provider networks and content provider networks.
Much of the existing information on multipath current practices is summarized in Section 1.1. With the exception of the work in PWE3 and minimal mention in LDP, very little consideration for multipath impact on new protocols has been documented.
This section is divided into two parts. First is documentation of techniques common to all forms of multipath in Section 4.1. Second is application of these techniques and unique characteristics of specific forms of multipath in Section 4.2.
There is a dramatic difference between the multipath techniques used for pure Layer-2 Ethernet switches intended for enterprise networks and the multipath techniques used for large provider core networks. Many enterprise switches use only the Ethernet MAC in load balancing, though the argument that such networks may not be carrying IP or MPLS traffic at all is rarely cited as a reason today. The routers and/or LSR used in large provider networks are assumed to be carrying IP traffic and/or MPLS traffic where the MPLS traffic is predominantly carrying IP traffic as its payload.
Most of the multipath techniques used for large provider core networks are common across all types of multipath. This is because the traffic being handled by multipath in large provider networks is predominantly IP or IP over MPLS. The following paragraph is quoted from RFC 4928, Section 2, "Current ECMP Practices" [RFC4928]:
This observation led to the specification of the PW Control Word [RFC4385] such that the values 4 and 6 which could be mistaken for IPv4 or IPv6 were avoided. More accurately, [RFC4928] was written to document the reasons for this decision made in [RFC4385].
IP traffic in a large provider core network contains a very large number of very short lived microflows (refer to the definition of microflow in [RFC2475]). The number of flows has in the past been estimated as many millions or many tens of millions. Many of the flows exchange as few as two packets (DNS for example). Most contain only tens of packets. Most flows exist for a few seconds and some less than a second. A much smaller number of flows (though still a large number) are longer in duration and exchange larger amounts of data.
Attempts to isolate individual IP flows in large provider core networks for the purpose of routing them individually have met with resounding failure. Current practice does not attempt to isolate individual flows, but instead isolates groups of flows. If reordering is minimized or eliminated for groups of flows, then reordering is minimized or eliminated for any single flow within a group.
The method of subdividing IP traffic into groups of flows that has been used successfully for more than two decades (since the T1-NSFNET in 1987 or possibly prior to that) is to use a hash function over the IP source address and destination address. Including the TCP or UDP port numbers might be beneficial for enterprise networks but is not necessary for large provider networks. Omitting port numbers in large provider networks has the desirable characteristic of better enforcing fairness among flows by eliminating or reducing the potential of end users using multiple port numbers to defeat any tendency toward fairness among flows.
In large provider core networks, MPLS LSP (in contrast to IP) are very long lived, generally carry large to very large amounts of traffic, and are relatively few in number. In many large provider core networks, LSP which carry Internet traffic from one major core node to another major core node can very substantially exceed the capacity of a multipath component link.
For MPLS traffic carrying Internet IP traffic, "taking the liberty of guessing the payload" (as described in RFC 4928) was a matter of necessity. The label stack simply did not provide adequate diversity. Initially some LSR did not support this capability. Splitting very large LSP by configuring two or more provided a workaround (which only moved the hashing and load splitting out of the core), however hashing based on label stack was highly ineffective and packing LSP individually into link bundle component links has substantial disadvantages (see Section 4.2.3).
For MPLS that is not carrying IP, the MPLS label stack is used as the basis for the load split hash. Generally the entire label stack is used or as few as three of the bottom labels are used. Using only the bottom label (or only the top label) has proven unsatisfactory in terms of splitting the load. Some forms of PW can be subdivided which has motivated the introduction of a PW flow label [I-D.ietf-pwe3-fat-pw].
Simple multipath generally relies on the mathematical probability that given a very large number of small microflows, these microflows will tend to be distributed evenly across a hash space. A common simple multipath implementation assumes that all component links are of equal capacity and performs a modulo operation on the hashed value. An alternate simple multipath technique uses a table, generally with a power of two size, and distributes the table entries proportionally among component links according to the capacity of each component link.
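Both simple multipath variants can be sketched briefly. The function names, the 256-entry default, and the largest-remainder rounding used to fill the table are assumptions of this example; the text above does not prescribe how table entries are apportioned beyond being proportional to capacity.

```python
def modulo_select(hash_value, num_links):
    """Equal capacity case: modulo of the hash over the number of links."""
    return hash_value % num_links


def build_table(capacities, table_size=256):
    """Apportion a power of two table among links by relative capacity."""
    if table_size <= 0 or table_size & (table_size - 1):
        raise ValueError("table_size must be a power of two")
    total = sum(capacities)
    # Integer share for each link, then hand out leftover entries to the
    # links with the largest remainders (assumed rounding rule).
    counts = [c * table_size // total for c in capacities]
    remainders = [c * table_size % total for c in capacities]
    leftover = table_size - sum(counts)
    by_remainder = sorted(range(len(capacities)),
                          key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    table = []
    for link, count in enumerate(counts):
        table.extend([link] * count)
    return table


def table_select(hash_value, table):
    """Unequal capacity case: index the table with the low hash bits."""
    return table[hash_value & (len(table) - 1)]


table = build_table([10, 30])  # e.g. one 10Gb/s and one 30Gb/s component
print(table.count(0), table.count(1))  # 64 192
```

The table form costs a memory lookup but accommodates unequal component links, and the same table is a natural place for an adaptive technique to adjust the split.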
An adaptive multipath technique is one where the traffic bound to each component link is measured and the load split is adjusted accordingly. As long as the adjustment is done within a single network element, then no protocol extensions are required and there are no interoperability issues.
Specific adaptive multipath techniques are outside of the scope of this document.
The load splitting techniques defined in Section 4.1 and those defined in Section 4.1.2 are both used in splitting traffic over parallel links between the same pair of nodes. The best known technique, though far from being the first, is Ethernet Link Aggregation [IEEE-802.1AX]. This same technique had been applied much earlier using OSPF or ISIS Equal Cost MultiPath (ECMP) over parallel links between the same nodes. Multilink PPP [RFC1717] uses a technique that provides inverse multiplexing. A number of vendors had provided proprietary extensions to PPP over SONET/SDH [RFC2615] that predated Ethernet Link Aggregation but are no longer used.
Link bundling [RFC4201] provides yet another means of handling parallel LSP. RFC4201 explicitly allows a special value of all ones to indicate a split across all component links of the bundle. Use of link bundling is discussed in Section 4.2.3.
All of these techniques, including ECMP, may be used over two or more links between a pair of nodes. The most primitive load split algorithms may require that all links be of the same capacity and may attempt to load balance equally. Somewhat less primitive techniques may allow links to be unequal in capacity. Any of these techniques can also use an adaptive multipath algorithm as described in Section 4.1.2.
OSPF or ISIS Equal Cost MultiPath (ECMP) is a well known form of traffic split over multiple paths that may traverse intermediate nodes. ECMP is often incorrectly equated to only this case, and multipath over multiple diverse paths is often incorrectly equated to an equal division of traffic.
Many implementations are able to create more than one LSP between a pair of nodes, where these LSP are routed diversely to better make use of available capacity. The load on these LSP can be distributed proportionally to the reserved bandwidth of the LSP. These multiple LSP may be advertised as a single PSC FA and any LSP making use of the FA may be split over these multiple LSP.
Link bundling [RFC4201] component links may themselves be LSP. When this technique is used, any LSP which specifies the link bundle may be split across the multiple paths of the LSP that comprise the bundle.
Other forms of multipath may use what appear to be physical component links that are provided by a server layer. For example, the components of an Ethernet LAG may be provided by Ethernet PW [RFC4448].
Techniques which spread traffic over multiple paths may use simple multipath or adaptive multipath as described in Section 4.1.2. When ECMP is used over an IP link or MPLS LDP LSP, visibility of available capacity along the path is limited to the next hop only, therefore load which is split proportionally to the capacity of the immediate hop may not be split optimally for the entire path, even with forwarding that is capable of adaptive multipath. For techniques which split traffic over one or more LSP, the available capacity along the path to the destination is assumed to be known through the bandwidth reservations of the LSP.
Three forms of multipath are considered here: Equal Cost Multipath (ECMP), Ethernet link aggregation (LAG), and MPLS link bundling.
Of these types of multipath, the latter two can be applied to MPLS with RSVP-TE signaling or static configurations.
Equal Cost Multipath has been available in the ISIS and OSPF link state routing protocols for two decades or more. For example, see [RFC1247]. ECMP is also available in BGP. ECMP is declared out of scope in LDP, though widely implemented.
Although ECMP is not applicable to MPLS LSP setup with RSVP-TE signaling, ECMP can be applied at an LER.
At an MPLS LER, ECMP can be applied over two or more MPLS LSP with traffic split proportionally to the LSP reserved bandwidth. This could also be considered to be IP ECMP with an underlying MPLS LSP server layer.
The equivalent to ECMP for an LSP setup can be achieved by creating PSC LSP and concatenating them using link bundling, and using the "all ones" link bundle component (see Section 4.2.3).
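As an illustration of a split proportional to reserved bandwidth, the following is a minimal sketch and not a description of any particular LER implementation. It assumes a hash table whose entries are allocated to each LSP in proportion to its reservation, and a hash over the IP address pair to select an entry. The function names, the table size of 64, and the use of SHA-256 are assumptions made only for this example.

```python
import hashlib


def build_split_table(reserved_bw, table_size=64):
    """Allocate table entries to LSP in proportion to reserved bandwidth.

    reserved_bw maps a (hypothetical) LSP name to its reserved bandwidth.
    Fractional shares are resolved with the largest-remainder method.
    """
    total = sum(reserved_bw.values())
    shares = {name: bw * table_size / total for name, bw in reserved_bw.items()}
    counts = {name: int(share) for name, share in shares.items()}
    leftover = table_size - sum(counts.values())
    by_remainder = sorted(shares, key=lambda n: shares[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    table = []
    for name in sorted(counts):
        table.extend([name] * counts[name])
    return table


def select_lsp(table, src_ip, dst_ip):
    """Pick an LSP for a flow; the same address pair always maps to the same LSP."""
    digest = hashlib.sha256(f"{src_ip}->{dst_ip}".encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(table)
    return table[index]


table = build_split_table({"lsp-a": 30, "lsp-b": 10})
chosen = select_lsp(table, "192.0.2.1", "198.51.100.7")
```

Because the hash is computed per address pair, packets of one flow stay on one LSP and are not reordered, while the aggregate load approaches the reserved bandwidth ratio only when many flows are present.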
Ethernet link aggregation ([IEEE-802.1AX]) concatenates a set of Ethernet member links below the Ethernet link layer, such that the link aggregation group (LAG) appears as a single link with a single Ethernet MAC address. The link aggregation control protocol (LACP) coordinates membership in the LAG such that the member links can be made unavailable to upper layers and added to the LAG on both nodes.
For IP using a link state protocol with ECMP, Ethernet link aggregation had little effect. The load balancing on a LAG was identical to the load balancing using ECMP over the set of member links. ISIS only advertises the adjacencies between nodes. OSPF advertises each link between nodes, so for IP using OSPF, link aggregation only resulted in a reduction in routing protocol overhead and simplification of the SPF.
For MPLS, some vendors had already implemented proprietary extensions to PPP over SONET/SDH [RFC2615] that predated the earliest IEEE work on link aggregation (IEEE 802.3ad) with capabilities similar to LACP. It was not until 10GbE became widely available (about 5 years later) that LAG was used in provider core networks, and began replacing OC-192. MPLS link bundling implementations (prior to RFC status) also predated Ethernet link aggregation.
A network deployment circa 2005 could either configure many Ethernet links and use MPLS link bundling, or configure an Ethernet LAG. If an MPLS link bundle was configured to split load over all link bundle component links the functionality was equivalent to configuring the set of links as a LAG. In core LSR implementations, the load split in these two cases was identical.
MPLS link bundling [RFC4201] was conceived at about the time that it was clear that OC-48 was too slow for IP core links, OC-192 was just becoming available and would soon be too slow, and MPLS had strong support among multiple providers. Link bundling initially solved two problems. First, a few individual vendors had proprietary extensions to PPP over SONET/SDH [RFC2615]. Link bundling could offer equivalent capability and offer vendor interoperability. Second, some vendor hardware was not capable of load splitting and therefore required that each top level LSP be assigned a single path. Further, each side of a link bundle could be configured differently: one could load split and the other could place LSP on individual component links.
If LSP are placed on individual links rather than split over the entire bundle, then bin packing problems can occur. LSP are often large, making this packing error significant. In addition, LSP bandwidth reservations in most IP/MPLS deployments are only predictions of expected bandwidth. With link bundling, as specified, LSP cannot be moved from one link bundle component link to another. If LSP are assigned to links rather than split based on IP address pairs, there is less opportunity for one LSP to make use of capacity left unused by other LSP that are underutilized. The bin packing and loss of opportunity to share capacity both reduce the efficiency of capacity utilization.
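The bin packing problem can be illustrated with a small sketch. The placement policy shown (first fit, largest LSP first) and the function names are assumptions chosen for the example; real LSR may use other placement policies.

```python
def place_whole_lsps(lsp_bw, link_capacity, num_links):
    """Place each LSP entirely on one component link (first fit, largest first).

    Returns (placed, rejected, free) where placed is a list of
    (bandwidth, link_index) pairs and free is the leftover capacity per link.
    """
    free = [link_capacity] * num_links
    placed, rejected = [], []
    for bw in sorted(lsp_bw, reverse=True):
        for index, room in enumerate(free):
            if bw <= room:
                free[index] -= bw
                placed.append((bw, index))
                break
        else:
            # No single component link has enough room for this LSP.
            rejected.append(bw)
    return placed, rejected, free


def fits_when_split(lsp_bw, link_capacity, num_links):
    """With load splitting, only the aggregate capacity matters."""
    return sum(lsp_bw) <= link_capacity * num_links


# Three 6 Gb/s LSP over a bundle of two 10 Gb/s component links.
placed, rejected, free = place_whole_lsps([6, 6, 6], 10, 2)
```

In this example the third LSP is rejected even though 8 Gb/s of the bundle remains unused (4 Gb/s on each component link), whereas all three LSP fit if traffic is split over the whole bundle.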
MPLS link bundling does not currently offer an ability to select which LSP are assigned to a single component link and which LSP are split over the entire set of component links. Most forwarding hardware can support this. Although an LSR could in principle be configured to use some other attribute of an LSP to infer the decision to load split, such as holding priority or an affinity for an administrative attribute, no LSR software provides this capability. Until MPLS-TP there was never a need for that capability.
The purpose of this section is to describe how MPLS-TP and multipath could coexist and to define simple changes to accomplish this.
Three different methods to support MPLS-TP and multipath are described. One method requires simple changes to link bundle and LAG. One method requires no changes but has disadvantages. One method involves no change to multipath but requires relaxation to MPLS-TP OAM requirements.
The best solution makes MPLS over multipath a fully compliant server layer for MPLS-TP meeting all of the requirements stated in the prior sections but cannot be fully supported by most existing LSR without hardware changes. The other two solutions have disadvantages but require little or no change to existing hardware that would otherwise support MPLS-TP. The changes are specified at the level of detail of requirements and/or framework rather than as specific protocol changes.
The largest contributor of provider traffic today is the Internet. All of this traffic is IP with some providers, but not all, using IP over MPLS. IP is used without MPLS with ECMP and LAG, and IP is used with MPLS with all three forms of multipath described in Section 4.2: ECMP, LAG, and link bundling.
In addition to Internet services, many providers currently offer layer-2 and layer-3 VPN services over MPLS today. Other providers offer native layer-2 services with an intention to migrate to MPLS-TP for these services.
A primary purpose of migrating VPN and circuit services from layer-2 to MPLS-TP is to reduce cost relative to a dedicated layer-2 infrastructure for these services. Much of that reduction comes from making use of infrastructure in place to support Internet traffic.
Using the capacity in place for Internet, predictive reservations can be made for higher priority services, with guarantees possible by transferring the risk of exceeding the predictions to the Internet traffic through use of priority queuing. With Internet loads being much larger, the unlikely event of predictive reservations being exceeded would easily be absorbed. This architecture allows VPN and circuit services to be delivered at lower cost.
IP/MPLS requires the use of multipath due to the high traffic levels. MPLS-TP requires a single path for each LSP. With no changes, these two requirements are in conflict. Three possible approaches are examined in the following sections.
Each of these is a separate solution. For example, if changes to MPLS forwarding enable MPLS with multipath to support fully compliant MPLS-TP LSP, then relaxing MPLS-TP OAM is not needed. Conversely, if MPLS forwarding cannot be changed on specific existing equipment to accommodate MPLS-TP, then one of the other two solutions is required. Supporting MPLS-TP OAM at high rates also requires hardware change to most existing LSR, therefore all of these solutions require some form of hardware change.
A desirable solution is one that meets all requirements and is highly cost effective. An undesirable solution is one that either does not meet all requirements or is not cost effective. The ability to use existing hardware is also desirable. A number of solutions and the necessary changes are discussed in the following subsections.
MPLS, which requires multipath, and MPLS-TP, which requires a single path, could potentially coexist in the following ways: MPLS with multipath can serve as a server layer for MPLS-TP, MPLS-TP can serve as a server layer for MPLS, or MPLS-TP OAM requirements can be relaxed so that MPLS-TP LSP can be carried over multipath.
Three solutions are described. As noted in Section 5.1.1 these are three separate solutions. Each can be deployed independently. Most importantly, neither of the first two solutions requires relaxing MPLS-TP OAM requirements. On the other hand, these solutions are not mutually exclusive.
Using MPLS with multipath as a server layer for MPLS-TP has the most advantages with respect to the requirements, and with the exception of inability to run on some (or most) existing hardware, has no disadvantages. This is assuming that the protocol changes suggested in this subsection are implemented in later IETF documents.
Supporting fully conformant MPLS-TP LSP over MPLS LSP which are making use of multipath requires special treatment of the MPLS-TP LSP, such that only those LSP are exempt from the multipath load splitting.
Some hardware which exists today can support requirement MP#2. For example, if a table is used to support multipath and produces satisfactory results given existing traffic patterns, and the number of component links or members is smaller than the table by a factor of N, then an allocation of a multiple of 1/N of a component or member link can be set aside for MPLS-TP traffic. The MPLS-TP traffic can be protected from degraded performance due to an imperfect load split if the MPLS-TP traffic is given queuing priority (using strict priority and policing or shaping at ingress or locally, or weighted queuing locally).
Most existing hardware cannot support requirement MP#5, but some may be able to partially support this requirement by fixing the label stack inspection depth to a fixed number of LSP from the top. Full support for requirement MP#5 requires that the depth over which the hash is computed can be derived from the label number of the label on which a label swap operation is performed.
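One possible reading of the table-based set-aside is sketched below. It assumes a multipath table that would normally hold N entries per component link, so that each entry represents 1/N of a component link, and it withholds entries from the hashed split to leave room for MPLS-TP LSP pinned to that component. The function names and the table layout are assumptions for illustration only; actual hardware tables may be organized differently.

```python
def build_multipath_table(num_links, n, set_aside):
    """Build a hash table of component link indices.

    Each component link normally contributes n entries. set_aside maps a
    link index to the number of 1/n units withheld from the hashed split
    and reserved for MPLS-TP LSP pinned to that link.
    """
    table = []
    for link in range(num_links):
        reserved_units = set_aside.get(link, 0)
        if not 0 <= reserved_units <= n:
            raise ValueError("set-aside must be between 0 and n units")
        table.extend([link] * (n - reserved_units))
    return table


def hashed_share(table, link):
    """Fraction of hashed (non MPLS-TP) traffic sent to a component link."""
    return table.count(link) / len(table)


# Four component links, N = 16, with 4/16 of link 0 set aside for MPLS-TP.
table = build_multipath_table(4, 16, {0: 4})
```

With these values the table has 60 entries, link 0 receives 12/60 of the hashed traffic and each other link receives 16/60, leaving one quarter of link 0 for MPLS-TP LSP that are not subject to the load split.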
Carrying MPLS LSP which are larger than a component link over an MPLS-TP server layer requires that the large MPLS client layer LSP be accommodated by multiple MPLS-TP server layer LSPs. MPLS multipath can be used in the client layer MPLS as described in Section 4.1.4.
Creating multiple MPLS-TP server layer LSP places a greater ILM scaling burden on the LSR (see Section 3.3.1.1 and the examples in Section 3.3.1.3). High bandwidth MPLS cores with a smaller number of nodes have the greatest tendency to require LSP in excess of component links, therefore the reduction in number of nodes offsets the impact of increasing the number of server layer LSP in parallel. Today, only in cases where the ILM is small would this be an issue.
The most significant disadvantage of MPLS-TP as a Server Layer for MPLS is that the MPLS-TP LSP reduces the efficiency of carrying the MPLS client layer. The service which provides by far the largest offered load today is Internet, for which the LSP capacity reservations are predictions of expected load. Many of these MPLS LSP may be smaller than component link capacity. Using MPLS-TP as a server layer results in bin packing problems for these smaller LSP. For those LSP that are larger than component link capacity, their capacity is not a multiple of convenient capacity increments such as 10 Gb/s. Using MPLS-TP as an underlying server layer greatly reduces the ability of the client layer MPLS LSP to share capacity. For example, when one MPLS LSP is underutilizing its predicted capacity, the fixed allocation of MPLS-TP to component links may not allow another LSP to exceed its predicted capacity. A solution which makes less efficient use of resources may be less cost effective, due to the amount of capital equipment required and an increase in space and power required.
No additional requirements beyond MPLS-TP as it is now currently defined are required to support MPLS-TP as a Server Layer for MPLS. It is therefore viable but has some undesirable characteristics discussed above.
If MPLS-TP OAM requirements are not fully met, as currently specified, an LSP is not fully MPLS-TP conformant. That may be little more than a semantic inconvenience and cannot prevent implementations from allowing LSP which are otherwise MPLS-TP compliant to optionally use multipath with some reduction in OAM capability.
Regardless of whether relaxing MPLS-TP OAM requirements makes an LSP no longer an MPLS-TP LSP, this section discusses the consequence of using multipath with regard to MPLS-TP OAM.
If MPLS-TP over multipath is supported by relaxing MPLS-TP OAM requirements, the requirements listed below will improve the behavior of MPLS-TP OAM over multipath.
MPLS-TP CC/CV as currently defined has no means to exercise all paths of a multipath. The label stack is fixed, followed by a GAL label [RFC5586]. As is, only one path along a multipath can be exercised when the ingress to the multipath is not also the ingress to the LSP. For example, if the LSP is carrying PW, the PW themselves can be spread across the multipath, but not the OAM traffic.
If CC/CV OAM is allowed to place a label below the GAL label, the entire set of paths can be tested, though not in a deterministic manner. This is called an entropy label. Using a different random number in this entropy label for each OAM packet allows all links to be exercised on a probabilistic basis.
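The probabilistic coverage can be quantified with a short sketch. It assumes that the multipath hash maps each random entropy label value to a component link uniformly and independently, which real hash functions only approximate. The function names are hypothetical. Label values 0 through 15 are reserved in MPLS, so the sketch draws from the remaining 20 bit label space.

```python
import random
from math import comb


def random_entropy_label(rng):
    """Draw a 20 bit label value, avoiding the reserved range 0..15."""
    return rng.randrange(16, 2 ** 20)


def prob_all_links_exercised(num_links, num_packets):
    """Probability that every component link carried at least one OAM packet.

    Uses inclusion-exclusion over the set of links that were never hit,
    assuming each packet lands on a uniformly random link.
    """
    n, k = num_links, num_packets
    return sum(
        (-1) ** j * comb(n, j) * ((n - j) / n) ** k
        for j in range(n + 1)
    )


label = random_entropy_label(random.Random())
coverage = prob_all_links_exercised(num_links=8, num_packets=50)
```

Under these assumptions the probability of full coverage rises quickly with the number of OAM packets sent, but it never reaches certainty, which is why the test is described as non-deterministic.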
The loss of an isolated OAM CC/CV packet currently has no effect. If the loss of a single OAM packet can be noted by the sender, then the sender can repeatedly use the same value in the entropy label. This requires either a two way OAM or feedback to the ingress. If OAM packets can be reordered, then a sliding window of outstanding OAM packets is required. If OAM CC/CV packets are given high priority (as currently specified), then delay difference should be minimal and reordering may be non-existent if the send interval is longer than the delay difference.
If a multipath component link failure had been detected locally (at a node adjacent to the failure) and the failure corrected locally (i.e., segment protection) or the component link taken out of service, the client LSP would either no longer be affected or it would be preempted. If the client LSP has been preempted, MPLS-TP OAM unmodified would be sufficient to detect this condition. The existing BFD [RFC5884] provides this functionality.
Only in the case where a component link has failed and the server layer has not been able to detect and correct the failure or take the component link out of service would CC/CV OAM on the client LSP serve any purpose. For this purpose, a relaxed OAM may be sufficient. If the client LSP has no control over the multipath itself, the entire multipath must be considered down if any uncorrected component link failure is occurring at the multipath.
The CC/CV as described here can be handled by an OAM mechanism which is bidirectional. LSP Ping provides such a mechanism [RFC4379]. Because the condition being handled by LSP ping should be quite rare, it may be acceptable to use a combination of BFD and MPLS ping to provide OAM with full coverage of all types of fault, but with a slower response to a component link failure which is not detected at the point of the fault.
For LSR implementations which support BFD and MPLS ping "as is", these may be viable as an optional MPLS-TP form of CC/CV OAM. A deployment may use this option if the reliance on IP is acceptable to the provider. Alternatively, MPLS-TP OAM could take such requirements into consideration and provide an additional capability in BFD or provide MPLS-TP extensions to MPLS ping.
A further small complication may occur at the OAM egress. If the egress to the LSP is a multipath egress, then the OAM may arrive at any of the component links at the egress. This requires that the CC/CV OAM be forwarded within the LSR to a common packet processor in order to be handled in hardware (or forwarded to a common CPU). This is also true of other types of OAM.
MPLS-TP LM OAM makes use of the count of payload packets at an egress. If the payload is reordered, even with no consequence to the payload itself, some inaccuracy is introduced to the LM. Some number of payload packets which were transmitted before the LM OAM packet was sent may arrive after the LM packet is received and some payload packets transmitted after the LM OAM packet may arrive before the LM packet.
If the LSP egress is a multipath, then the LM packets may arrive at any packet processor over which the multipath resides. The counters from each of the egress packet processors will have to be sampled. During the sampling interval, additional packets arrive and will be counted. This creates an equivalent out of order problem with respect to the LM OAM and the payload it is counting.
This error is bounded and is not cumulative. For example, if one LM interval counts too few packets, the next LM interval will tend to count too many. Over longer measurement periods the total error retains the same bounds, which over longer intervals becomes less significant.
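The non-cumulative nature of the error can be seen in a small sketch. It models only the case where some payload packets sent before an LM packet arrive after it, so they are missed in one interval and counted in the next. The function name and the sample numbers are assumptions for illustration.

```python
def measured_counts(true_counts, late_at_boundary):
    """Per-interval counts seen at the egress when some packets arrive late.

    true_counts[i] is the number of packets sent in interval i.
    late_at_boundary[i] is the number of those packets that arrive after
    the LM packet closing interval i, and so are counted in interval i+1.
    """
    measured = []
    carried = 0
    for sent, late in zip(true_counts, late_at_boundary):
        measured.append(sent - late + carried)
        carried = late
    return measured


sent = [1000, 1000, 1000, 1000]
late = [5, 0, 3, 2]
seen = measured_counts(sent, late)
total_error = sum(sent) - sum(seen)
```

Each interval is off by a few packets in either direction, but the error in the running total equals only the number of packets late at the most recent boundary, so it does not grow as more intervals are added.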
These errors are most significant when a substantial amount of queuing delay is present (generally an indication of light congestion) and when the queues at various component links differ in delay. Queuing delay differences are generally milliseconds. Delay differences of tens of milliseconds require persistent queues and significant congestion.
The worst case errors over long intervals are reasonably well bounded. For example, with a 10 msec delay difference, a one minute sampling yields less than a 0.02% uncertainty and over a 15 minute interval loss uncertainty is just over 0.001%. Given that congestion is required to achieve these uncertainties, the loss due to congestion is likely to significantly exceed these uncertainties for all but very short measurement intervals.
When loss is zero but short term queues are formed, the queuing delay difference is likely to be under one millisecond for the common case of parallel links that are routed along the same fiber (using WDM). The uncertainties for 1 minute and 15 minute samples are under 0.002% and just over 0.0001% (10^-6). The uncertainty over a 24 hour period is about 0.0000012% or just over 10^-8 for a full one millisecond delay difference. An SLA could easily be supported where loss was guaranteed not to exceed 10^-6 in any hour or 10^-8 in any 24 hour period. Such a guarantee would require that the MPLS-TP LSP be given priority over non-policed or shaped traffic and is itself policed or shaped.
This measurement uncertainty may or may not be acceptable to a given deployment. Providing an option to support MPLS-TP over multipath does introduce a bounded error to LM but it does not remove a provider's option not to use MPLS-TP over multipath.
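The uncertainty figures can be reproduced with simple arithmetic. The sketch below assumes that the worst case LM uncertainty is approximated by the ratio of the delay difference to the measurement interval, that is, the fraction of the interval's traffic that can fall on the wrong side of an LM packet at a constant rate. That approximation and the function name are assumptions of this example.

```python
def lm_uncertainty(delay_difference_s, interval_s):
    """Approximate worst case LM uncertainty as a fraction of packets counted."""
    return delay_difference_s / interval_s


intervals = {"1 minute": 60, "15 minutes": 900, "1 hour": 3600, "24 hours": 86400}

for delay_ms in (10, 1):
    for name, seconds in intervals.items():
        fraction = lm_uncertainty(delay_ms / 1000, seconds)
        # Print both the raw fraction and the equivalent percentage.
        print(f"{delay_ms} msec over {name}: {fraction:.3e} ({fraction * 100:.7f}%)")
```

The ratio scales inversely with the measurement interval, so lengthening the interval by a given factor reduces the uncertainty by the same factor.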
Section 3 enumerates functional requirements. Section 4 describes current practices. Section 5 enumerates functional changes to better meet these requirements. This section provides specific recommendations.
To support MPLS with multipath as a server layer for MPLS-TP the following changes are required.
The current framework documents could be improved with the following additions.
Forwarding changes to multipath necessary to support MPLS with multipath as a server layer for fully compliant MPLS-TP are the following:
This memo includes no request to IANA.
This document specifies requirements with discussion of framework for solutions. The requirements and framework are related to the coexistence of MPLS/GMPLS (without MPLS-TP) when used over a packet network, MPLS-TP, and multipath. The combination of MPLS, MPLS-TP, and multipath does not introduce any new security threats. The security considerations for MPLS/GMPLS and for MPLS-TP are documented in [RFC5920] and [I-D.ietf-mpls-tp-security-framework].
[RFC2119] | Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. |