CCAMP C. Villamizar, Ed.
Internet-Draft Infinera Corporation
Intended status: Informational March 07, 2011
Expires: September 08, 2011

Use of Multipath with MPLS-TP and MPLS
draft-villamizar-mpls-tp-multipath-01

Abstract

Many MPLS implementations have supported multipath techniques and many MPLS deployments have used multipath techniques, particularly in very high bandwidth applications, such as provider IP/MPLS core networks. MPLS-TP has discouraged the use of multipath techniques. Some degradation of MPLS-TP OAM performance cannot be avoided when operating over current high bandwidth multipath implementations.

The tradeoffs involved in using multipath techniques with MPLS and MPLS-TP are described. Requirements are discussed which enable fully MPLS-TP compliant LSP, including full OAM capability, to be carried over MPLS LSP which traverse multipath links. Other means of supporting MPLS-TP coexisting with MPLS and multipath are discussed.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on September 08, 2011.

Copyright Notice

Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.


Table of Contents

1. Introduction

Today, large aggregations of traffic can be handled by a number of techniques which we will collectively call multipath. Multipath applied to parallel links between the same set of nodes includes Ethernet Link Aggregation [IEEE-802.1AX], link bundling [RFC4201], or other aggregation techniques, some of which may be vendor specific. Multipath applied to diverse paths rather than parallel links includes Equal Cost MultiPath (ECMP) as applied to OSPF, ISIS, or BGP, and equal cost LSP, as described in Section 4. Various multipath techniques have strengths and weaknesses described in Section 4.2.

The term composite link is more general than terms such as link aggregation (which is specific to Ethernet) or ECMP (which implies equal cost paths within a routing protocol). The use of the term composite link here is consistent with the broad definition in [ITU-T.G.800]. Multipath is very similar to composite link, but specifically excludes inverse multiplexing.

1.1. Multipath Behavior of Widely Deployed Equipment

Identical load balancing techniques are used for multipath both over parallel links (for example IP/MPLS over Ethernet link aggregation) and over diverse paths (for example, IP ECMP, IP/MPLS ECMP over multiple LSP or link bundling over LSP component links).

Large aggregates of IP traffic do not provide explicit signaling to indicate the expected traffic loads. Large aggregates of MPLS traffic are carried in MPLS tunnels supported by MPLS LSP. LSP which are signaled using RSVP-TE extensions do provide explicit signaling which includes the expected traffic load for the aggregate. LSP which are signaled using LDP do not provide an expected traffic load.

MPLS LSP may contain other MPLS LSP arranged hierarchically. When an MPLS LSR serves as a midpoint LSR in an LSP carrying other LSP as payload, there is no signaling associated with these client (inner) LSP. Therefore even when using RSVP-TE signaling there may be insufficient information provided by signaling to adequately distribute load across a multipath link.

A set of label stack entries that is unique across the ordered set of label numbers can safely be assumed to contain a group of (one or more) flows. The reordering of MPLS traffic (except MPLS-TP) can therefore be considered to be acceptable unless reordering occurs within traffic containing a common unique set of label stack entries. Existing load splitting techniques take advantage of this property in addition to looking beyond the bottom of the label stack and determining if the payload is IPv4 or IPv6 to load balance traffic based on IP addresses.

A large aggregate of IP traffic may be subdivided into groups of flows using a hash on the IP source and destination addresses. IP microflows are described in [RFC2475] and clarified in [RFC3260]. For MPLS traffic that is not carrying IP, a similar hash can be performed on the set of labels in the label stack. These techniques subdivide traffic into groups of flows for the purpose of load balancing traffic across the aggregated capacity of a multipath link.
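
As an illustration only (and not a specification of any deployed algorithm), the following Python sketch shows how a hash over the IP source and destination addresses can subdivide traffic into groups of flows, each of which is then bound to one component link. The use of CRC-32 and the component link count are assumptions made for the example.

   # Illustrative sketch only: group IP microflows by hashing the
   # source and destination addresses, then map each group to a
   # component link.  The hash (zlib.crc32) and the number of
   # component links are assumptions, not deployed behavior.
   import zlib

   COMPONENT_LINKS = 4

   def flow_group(src_ip, dst_ip):
       """Return the component link index for an IP address pair."""
       key = (src_ip + "|" + dst_ip).encode()
       return zlib.crc32(key) % COMPONENT_LINKS

   # All packets of a microflow share the same address pair, so all
   # of them map to the same component link and are not reordered.
   print(flow_group("192.0.2.1", "198.51.100.7"))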

Attempting to resolve years of discussion as to whether a hash based approach provides a sufficiently even load balance using any particular hashing algorithm or method of distributing traffic across a set of component links is outside of the scope of this document. For the purpose of discussing existing widely deployed implementations, it is sufficient to say that hash based techniques have proven to be at least satisfactory through their widespread and increasing deployment over more than two decades.

The current load balancing techniques are referenced in [RFC4385] and [RFC4928], though few specifics are provided in these two RFCs. The use of three hash based approaches is described in [RFC2991] and [RFC2992], though other techniques with very similar outcomes are used. A means to identify flows within pseudowires (when flows are present, since not all PW types contain discernible flows) is described in [I-D.ietf-pwe3-fat-pw].

1.2. New Requirements imposed by MPLS-TP

MPLS-TP OAM violates the assumption made in prior multipath implementations that it is safe to reorder traffic within an LSP. This assumption is common (if not universal) in multipath implementations which use hashing techniques for load balancing. The use of multipath can impact CC/CV (connectivity check, connectivity verification) and LM (loss measurement) and DM (delay measurement) [I-D.ietf-mpls-tp-oam-framework].

MPLS-TP CC/CV, DM, and LM OAM packets must take the same path as the payload. If the label stack for the payload contains an LSP label with a PW label beneath it (one of possibly many PW labels carried by the LSP), then the payload will be load split over the multipath. The OAM packets will have a GAL label beneath the LSP label [RFC5586]. With no other label beneath the GAL label, the OAM traffic will take only one path while the set of PW will take multiple paths (though any one PW will take one path if a flow label is not used).

With the current OAM CC/CV definition and current multipath practices, OAM CC/CV functionality may not cover the forwarding path for a particular PW within the LSP at any given multipath along the path. The existing OAM CC/CV will provide a check for the condition where the entire multipath becomes unavailable (goes down or the particular LSP is preempted due to reduced multipath capacity).

There is no assurance that DM OAM is measuring the delay of the forwarding path for a particular PW within the LSP with the current OAM DM definition and current multipath practices. In addition, if packets are reordered, OAM LM accuracy can be (and generally is) affected.

1.3. Apparently Conflicting Requirements

The existing multipath techniques address specific requirements. MPLS-TP requirements are in conflict with multipath, at least as currently implemented.

The underlying requirements that motivated the current use of multipath are not in conflict with the use of MPLS-TP. Section 3 describes these requirements in greater detail. Section 4 describes current practices in greater detail. Section 5 describes means of better supporting both MPLS-TP and multipath requirements.

1.4. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2. Definitions

Multipath

The term multipath includes all techniques in which
  1. Traffic can take more than one path from one node to a destination.
  2. Individual packets take one path only.
  3. Packets are neither resequenced nor subdivided and reassembled at the receiving end.
  4. The paths may be:
    1. parallel links between two nodes, or
    2. specific paths across a network to a destination node, or
    3. links or paths to a next hop used to reach a common destination.

Link Bundle

Link bundling is a multipath technique specific to MPLS [RFC4201]. Link bundling supports two modes of operation. Either an LSP can be placed on one component link of a link bundle, or an LSP can be load split across all members of the bundle. There is no signaling defined which allows a per LSP preference regarding load split, therefore whether to load split is generally configured per bundle and applied to all LSP across the bundle.
Link Aggregation

The term "link aggregation" generally refers to Ethernet Link Aggregation [IEEE-802.1AX] as defined by the IEEE. Ethernet Link Aggregation defines a Link Aggregation Control Protocol (LACP) which coordinates inclusion of LAG members in the LAG.
Link Aggregation Group (LAG)

A group of physical Ethernet interfaces that are treated as a logical link when using Ethernet Link Aggregation is referred to as a Link Aggregation Group (LAG).
Equal Cost Multipath (ECMP)

Equal Cost Multipath (ECMP) is a specific form of multipath in which the costs of the links or paths must be equal in a given routing protocol. The load may be split equally across all available links (or available paths), or the load may be split proportionally to the capacity of each link (or path).
Loop Free Alternate Paths

"Loop-free alternate paths" (LFA) are defined in RFC 5714, Section 5.2 [RFC5714] as follows. "Such a path exists when a direct neighbor of the router adjacent to the failure has a path to the destination that can be guaranteed not to traverse the failure." Further detail can be found in [RFC5286]. LFA as defined for IPFRR can be used to load balance by relaxing the equal cost criteria of ECMP, though IPFRR defined LFA for use in selecting protection paths. When used with IP, proportional split is generally not used. LFA use in load balancing may be implemented though rare or non-existent in deployments.
Composite Link

The term Composite Link had been a registered trademark of Avici Systems, but was abandoned in 2007. The term composite link is now defined by the ITU in [ITU-T.G.800]. The ITU definition includes multipath as defined here, plus inverse multiplexing which is explicitly excluded from the definition of multipath.
Inverse Multiplexing

Inverse multiplexing either transmits whole packets and resequences the packets at the receiving end or subdivides packets and reassembles the packets at the receiving end. Inverse multiplexing requires that all packets be handled by a common egress packet processing element and is therefore not useful for very high bandwidth applications.
Component Link

The ITU definition of composite link in [ITU-T.G.800] and the IETF definition of link bundling in [RFC4201] both refer to an individual link in the composite link or link bundle as a component link. The term component link is applicable to all multipath.
LAG Member

Ethernet Link Aggregation as defined in [IEEE-802.1AX] refers to an individual link in a LAG as a LAG member.

3. Multipath Requirements

This section enumerates two sets of requirements. The first set includes those requirements imposed by the need for scalability and very large capacity links and very large capacity LSP and are enumerated in Section 3.1. The second set of requirements are those imposed by the needs of MPLS-TP and are enumerated in Section 3.2. Discussion of these requirements is provided in Section 3.3.

Section 4 describes multipath techniques which are implemented and deployed today. Section 5 enumerates derived requirements which focus on means to support the requirements in Section 3.1 and Section 3.2 with minimal modifications to existing multipath techniques. A summary of recommendations is provided in Section 6.

3.1. Scalability and Large Capacity Requirements

Networks today may support thousands or tens of thousands of nodes in total. This large number of nodes is typically arranged in tiers to improve scalability through aggregation of signaling and aggregation of traffic. The innermost tier, most commonly referred to as the network core, may support interconnection of adjacent sites with hundreds of gigabits or terabits of capacity.

The physical interface of choice today is 10GbE with migration toward 100GbE expected to begin in the near future. SONET and OTN are also in use, but are today also limited to 10Gb/s or 40Gb/s, with 100Gb/s availability (OTN ODU4) expected in the near future. With core link capacities of terabits today and tens of terabits expected in the near future, multipath is needed.

R#12
Multipath MUST support multipath links with capacity well in excess of the largest component link and well in excess of the capacity of a single packet processing element.
R#13
Multipath SHOULD support direct service bearing LSP carrying Internet traffic within the network core with capacity in excess of the largest component link and in excess of the capacity of a single packet processing element.
R#14
Aggregation of LSP using hierarchy (as defined in [RFC4206]) may be necessary to reduce the number of MPLS labels in use within a network tier containing a large number of nodes. This aggregation SHOULD NOT be constrained by multipath limitations.
R#15
LSP containing the aggregate of other LSP SHOULD be capable of exceeding the capacity of the largest component link and the capacity of a single packet processing element.
R#16
It SHOULD be possible to support load split of traffic which is very efficient in its utilization of available capacity, subject to some limitations due to conflicting requirements. The load split SHOULD support sharing of total capacity across the entire multipath, where one LSP may make use of capacity set aside for other LSP but currently unused. This load split SHOULD be as free of bin packing issues as possible, except when moving LSP to other component links would conflict with other requirements.

3.2. MPLS-TP Requirements

MPLS-TP requirements related to multipath are primarily related to prohibiting out-of-order delivery of traffic for reasons of OAM fate sharing. Specific requirements related to OAM are provided in "MPLS-TP OAM Framework", Section 4.6, Section 5.5.3, and Section 6.2.3 [I-D.ietf-mpls-tp-oam-framework].

The following requirement is currently met with no changes to existing multipath implementations.

R#17
Traffic within an MPLS-TP PW MUST NOT be reordered unless specifically allowed. This is met if a PW control word is used [RFC4385]. Reordering may be specifically allowed using a PW flow label [I-D.ietf-pwe3-fat-pw].

The following requirement can only be met with existing multipath techniques using MPLS link bundling [RFC4201] if LSR are configured to place an LSP on only a single component rather than splitting some or all LSP across the set of components. Using link bundling with all LSP constrained to use a single component has well known disadvantages (see Section 4.2.3). Other forms of multipath as currently defined do not meet this requirement (see Section 4.2).

R#18
Traffic within an MPLS-TP LSP MUST NOT be reordered if full OAM capability is required of the MPLS-TP LSP [I-D.ietf-mpls-tp-oam-framework].

The remaining MPLS-TP requirements are related to the scale of a deployed MPLS-TP network and have the greatest impact on the network core. These are practical requirements mostly related to scalability but specific to MPLS-TP.

R#19
Service PWs and/or service bearing LSPs may form a fairly dense mesh of LSPs from edge to edge over a very large set of nodes. Some means MUST be available to support such usage of MPLS-TP. See Section 3.3.1.1 for a discussion of ILM size limitations that are relevant to this requirement.
R#20
For an MPLS-TP LSP to be fully compliant, all payload and OAM traffic on the MPLS-TP LSP MUST traverse the same physical path. OAM traffic taking the same path as payload (service bearing) traffic is known as the "fate sharing" requirement (see RFC 5860, Section 2.1.3 [RFC5860]).
R#21
For large networks, MPLS hierarchy [RFC4206] can be used to reduce the number of LSP from the large number which would be needed to carry all service bearing MPLS-TP LSP through the network core. For networks configured through the management plane, label stacking can be used to aggregate LSP, though the signaling described in [RFC4206] is not used. Any MPLS-TP constraints which impact this ability to aggregate LSP SHOULD be optional. If MPLS-TP constraints must be relaxed in some deployments, such deployments MAY be referred to as partially MPLS-TP compliant.
R#22
For large networks using link bundling to support large aggregations of MPLS-TP traffic, and using MPLS hierarchy, PSC LSP (see [RFC4206]) or label stacking which are providing a server layer within the network core and carrying many service bearing MPLS-TP LSP SHOULD be capable of supporting capacity in excess of any single link bundle component. In meeting this requirement the server layer LSP need not be an MPLS-TP LSP as long as it is capable of providing a server layer which can support fully compliant MPLS-TP LSP.

LSP which are configured entirely from the management plane rather than through use of a control plane need not use the MPLS PSC portion of the hierarchy as specified in RFC 4206, however hierarchy is still needed in the label stack.

3.3. Discussion of Requirements

There is a tradeoff between the benefits of using MPLS-TP as a server layer and the benefits of using MPLS as a server layer. The benefits of MPLS-TP include the ability to run without the OSPF-TE, ISIS-TE, and RSVP-TE control protocols, and MPLS-TP OAM. The benefits of MPLS include more efficient use of multipath capacity due to removal of MPLS-TP constraints.

A requirement for very large server layer traffic flows within the network core can be accommodated using multiple parallel MPLS-TP LSP. This increases the number of LSP required, which is itself a drawback. This also results in a bin packing problem if the service bearing MPLS-TP LSP do not require the same capacity and are not all small multiples of a common capacity increment. For example, if LSP are not all 10Gb/s, or are not only 10Gb/s and 40Gb/s, then bin packing problems can occur. This use of MPLS-TP can also result in less opportunity for statistical multiplexing with very large aggregates of lower priority non-TP IP/MPLS traffic (see Section 4.2.3 and Section 5.2.2 for further details on bin packing problems and loss of efficiency with MPLS-TP as a server layer).

The following subsections provide further detail related to the requirements enumerated in Section 3.1 and Section 3.2.

3.3.1. Requirements related to midpoint LSR

Midpoint LSR must support a very large number of LSP. This places requirements on the ILM size. If a control plane is used this also places requirements on the speed of processing RSVP-TE messages. As long as RSVP-TE ERO contain only strict hops, the processing is limited to connection admission, label assignment, and forwarding hardware programming of the label swap operation.

3.3.1.1. MPLS Incoming Label Map (ILM) Size

The MPLS label entry is 32 bits of which the label itself is 20 bits [RFC3032]. This allows 2^20 or 1,048,576 values minus the 16 reserved label values. The Incoming Label Map (ILM) (see RFC 3031, Section 1.11 [RFC3031]) is generally much smaller. Circa 2000, ILM sizes of 4K-32K were common. Circa 2010, ILM sizes of 64K-256K are more common in core LSR.

Putting a bound on ILM size has two effects. It allows LSR designs that offer higher power and space density. For deployments which use a control plane and support restoration, speed of restoration is dramatically improved when a smaller number of LSP are supported.

3.3.1.2. ILM Size Impact on Equipment Density

For some architectures, bounding the ILM size allows the ILM to be supported without forwarding memory external to the forwarding IC. This is a practical consideration as the power reduction and board space reduction can allow an LSR to achieve higher power and space density.

Reducing external memories reduces power consumption and therefore cooling requirements, and also reduces board space. The result is an LSR that requires less space as well as less power.

In today's networks, which predominantly use MPLS/GMPLS OSPF-TE or ISIS-TE and RSVP-TE signaling, the computational limitations described in Section 3.3.2.2 are the limiting factor. Reduction in space and power due to smaller ILM are then a secondary consequence of the signaling scaling issue.

3.3.1.3. Topology Impact on ILM Size

In a network tier with N nodes, a worst case cutset has N/2 nodes on either side of the cutset. Given that a full mesh of LSP connectivity is needed in the network core, the cutset therefore carries N^2/4 LSP. For example, if N is 400, the cutset carries a minimum of 40,000 LSP to achieve a full mesh. If the core has over 2,000 nodes, then the cutset carries over 1,000,000 LSP. Since the MPLS label space is only 20 bits, a full mesh within an entire provider network with no hierarchy could easily exceed the MPLS label number space. Use of hierarchy can solve this problem.
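
The arithmetic above can be restated in a few lines; the node counts are the same example values used in the text.

   # Worst case cutset LSP count for a full mesh of N nodes, with
   # N/2 nodes on either side of the cutset as described above.
   for n in (100, 400, 1000, 2000):
       print(n, "nodes ->", (n * n) // 4, "LSP across the cutset")
   # 400 nodes yield 40,000 LSP; 2,000 nodes yield 1,000,000 LSP,
   # approaching the 2^20 MPLS label number space.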

Typically there are more than one LSP between any pair of LSR in the network core. Protection is one source of additional LSP. More than one LSP may be required to carry traffic with very different requirements. See Section 3.3.1.4.

The result is that even considering only the ILM size, the number of nodes in a full mesh of LSP must be limited to well under 1,000. If two links in a cutset supporting a large number of LSP incur a fault, then the nodes bordering the remaining links in the cutset must process a very large number of RSVP-TE PATH and RESV messages and the connection admission requests and ILM allocation operations that are required as a result.

3.3.1.4. Multiple LSP Between Node Pairs

A full mesh of N nodes will have N*(N-1) unidirectional LSP or N*(N-1)/2 bidirectional LSP if there is only one LSP with any given pair of nodes as ingress and egress. There may be more than one LSP with any given pair of nodes as ingress and egress to meet protection requirements or to meet certain quality of service requirements.

If GMPLS protection [RFC4426] is used, the number of LSP is doubled with end-to-end (path) protection, but more than doubled with span protection. If MPLS FRR [RFC4090] is used, the number of LSP is increased only slightly with the (more common) facilities backup technique, but more than doubled with the one-to-one backup technique.

All services between a pair of core nodes may be carried over a single unsignaled E-LSP [RFC3270] if the eight TC values [RFC5462] are sufficient and the requirements of these services are sufficiently similar. If more than eight PHB are required, more LSP will be required. If services require preemption, or have different protection needs, then multiple LSP per pair of core nodes are required. If services have different delay requirements, this too may require multiple LSP per pair of core nodes.

The total number of LSP at a cutset needs to be constrained for two reasons. First, the number of LSP must fit within the 20 bit label field, or more practically within the smaller number of labels supported by most LSR. Second, there is a need to reduce the amount of signaling that would be required if restoration were needed to cover a multiple fault (if restoration is not supported, multiple faults can result in otherwise avoidable outages which persist until a physical repair or manual intervention is completed).

3.3.2. Requirements related to Ingress LSR

Where traffic enters a provider network tier such as the core, LSR serve as ingress to PSC LSP if hierarchy is used. If RSVP-TE signaling is used, ingress must perform CSPF if fully dynamic MPLS routing is used. Even when working and protection paths are configured with explicit paths computed offline, when a multiple fault occurs, if restoration is supported, then CSPF must be run. It is this multiple fault scenario which generally dictates scalability.

3.3.2.1. Reasons to Use MPLS/GMPLS Signaling

Dynamic routing is necessary in order to provide restoration which is as robust as possible in the presence of multiple faults while still providing efficient utilization of resources.

Legacy transport networks offer protection which requires dedicated protection resources. If resources are allocated through the management plane, then restoration support is either not provided at all or extremely slow at best. More modern transport equipment which supports fast restoration requires signaling which is generally provided using GMPLS.

IP/MPLS networks typically make use of protection which offers sharing of protection resources or, more commonly, make use of zero bandwidth allocation on protection paths. The use of zero bandwidth allocation provides robust protection of preferred traffic as long as preferred traffic is given queuing priority and preferred traffic levels are low enough that adequate protection resources are available for preferred traffic regardless of the protection path taken. This assumption is not violated in networks which are dominated by Internet traffic and carry a minority of preferred traffic.

When a single fault occurs, protection should restore traffic flow quickly, with a typical target being 45 msec. Many deployments are configured such that LSR run CSPF after a fault to obtain a new protection path for what is now effectively the working path, or reroute the working LSP and then create a new protection LSP.

Multiple faults which are not accounted for by SRLG are fairly common. In many cases, such as earthquake, bridge collapse, train wreck, flood, it is impractical to account for the specific multiple fault in the SRLG set. When this does occur, fast restoration is often required for a large number of LSP for which both the working and protect paths are affected. In this case, a long convergence time would result in a more lengthy outage for those LSP for which the multiple fault was service affecting.

For core Internet services and for many non-Internet core services, an inability to reach any one point in the network from another for a significant length of time due to a fault which is correctable, even if it is a multiple fault, is unacceptable. These services require restoration at some layer.

3.3.2.2. MPLS Fault Response and CSPF Scaling

For most core networks MPLS/GMPLS signaling is required at some layer for reasons described in Section 3.3.2.1. In order for restoration to occur quickly, scaling issues must be considered and addressed, including network topology impacts on scaling. These scaling issues are dominated by CSPF computations and OSPF or ISIS flooding impact.

For a given ingress in a full mesh of LSR, a fault can result in a very large number of affected LSP. At midpoint LSR the worst case number of connection acceptance decisions can be very large. The computational load per LSP on connection acceptance at midpoint LSR is small but the reflooding of available bandwidth can also contribute significant load.

At LSP ingress, the number of CSPF computations imposes scaling limitations. CSPF computation time is proportional to the number of nodes in a mesh and the total number of links. If the average node degree remains constant, then the total number of links is proportional to the number of nodes. The result is a single CSPF time with order N*log2(N) time complexity (where N is the number of nodes in the mesh). If the worst case number of LSP affected by a fault also grows proportionally to N, then the total amount of computation is order N^2*log2(N). The amount of computation grows at a rate of greater than the square of the growth in the number of nodes.
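
Purely as an illustration of the growth rate described above, the following sketch tabulates the order N^2*log2(N) worst case computation, normalized to a 100 node mesh; the node counts are arbitrary assumptions.

   # Illustrative only: relative worst case restoration computation,
   # assuming a single CSPF of order N*log2(N) and on the order of N
   # LSP affected by a fault, giving order N^2*log2(N) in total.
   import math

   base = 100 ** 2 * math.log2(100)
   for n in (100, 200, 400, 800):
       total = n ** 2 * math.log2(n)
       print(n, "nodes -> relative CSPF load", round(total / base, 1))
   # Doubling the mesh size more than quadruples the worst case load.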

If restoration is not supported, any multiple fault will result in a lengthy outage. If restoration is supported, constraining the size of a full mesh will very significantly reduce the CSPF computation load and the reflooding overhead and very significantly improve the worst case restoration time.

3.3.3. Efficient Use of Multipath Capacity

Multipath load split based on hashing the IP addresses or MPLS labels is far from perfect, though it is widely implemented and widely deployed. For the vast majority of traffic, which is predominantly Internet traffic, the underlying assumption that traffic is quite evenly distributed across a hash space is valid. For a mix of Internet traffic and fairly persistent large microflows, adaptive multipath has proven effective (see Section 4.1.2).

The bandwidth reservations of LSP carrying Internet traffic are merely predictions of required capacity. Often a significant percentage of traffic can shift among a set of LSP. A great deal of efficiency is gained in the presence of such shifts through the ability to dynamically share the available capacity on a multipath.

The introduction of a minority of higher priority (and higher gross margin) services to predominantly Internet traffic yields an additional opportunity to make more efficient use of capacity. These higher priority services on average significantly underutilize their guaranteed capacities. The average over the entire set of such services is fairly predictable. The capacity allocated to these services but unused can be used as Internet capacity. Some small probability exists that these services will make use of significantly more capacity than predicted, up to their guaranteed capacities, but the consequence of this unlikely occurrence is a reduction in capacity available to the Internet traffic for which capacity is not guaranteed. This practice allows high margin services to be delivered at substantially lower cost with very little risk to Internet traffic and no risk at all to the higher priority services.

For the reasons above, current multipath techniques offer efficient use of multipath capacity. Changes to multipath MUST NOT sacrifice this efficiency where it is not necessary to meet other requirements.

4. Multipath Current Practices

Multipath takes many forms. These include the use of ECMP in various protocols, Ethernet Link Aggregation, and Link Bundling. The specifications for each of these forms of multipath provide limited characterization of external behavior, where any guidance is provided at all. This section summarizes current practices among products which are currently or have in the past been deployed successfully in Internet service provider networks and content provider networks.

Much of the existing information on multipath current practices is summarized in Section 1.1. With the exception of the work in PWE3 and minimal mention in LDP, very little consideration of multipath impact on new protocols has been documented.

This section is divided into two parts. First is documentation of techniques common to all forms of multipath in Section 4.1. Second is application of these techniques and unique characteristics of specific forms of multipath in Section 4.2.

4.1. Techniques Common to Multipath in Provider Networks

There is a dramatic difference between the multipath techniques used for pure Layer-2 Ethernet switches intended for enterprise networks and the multipath techniques used for large provider core networks. Many enterprise switches use only the Ethernet MAC in load balancing, though the argument that such networks may not be carrying IP or MPLS traffic at all is rarely cited as a reason today. The routers and/or LSR used in large provider networks are assumed to be carrying IP traffic and/or MPLS traffic where the MPLS traffic is predominantly carrying IP traffic as its payload.

Most of the multipath techniques used for large provider core networks are common across all types of multipath. This is because the traffic being handled by multipath in large provider networks is predominantly IP or IP over MPLS. The following paragraph is quoted from RFC 4928, Section 2, "Current ECMP Practices" [RFC4928]:

This observation led to the specification of the PW Control Word [RFC4385] such that the values 4 and 6 which could be mistaken for IPv4 or IPv6 were avoided. More accurately, [RFC4928] was written to document the reasons for this decision made in [RFC4385].

4.1.1. Flow Identification

IP traffic in a large provider core network contains a very large number of very short lived microflows (refer to the definition of microflow in [RFC2475]). The number of flows has in the past been estimated as many millions or many tens of millions. Many of the flows exchange as few as two packets (DNS for example). Most contain only tens of packets. Most flows exist for a few seconds and some less than a second. A much smaller number of flows (though still a large number) are longer in duration and exchange larger amounts of data.

Attempts to isolate individual IP flows in large provider core networks for the purpose of routing them individually have met with resounding failure. Current practice does not attempt to isolate individual flows, but instead isolates groups of flows. If reordering is minimized or eliminated for groups of flows, then reordering is minimized or eliminated for any single flow within a group.

The method of subdividing IP traffic into groups of flows that has been used successfully for more than two decades (since the T1-NSFNET in 1987 or possibly prior to that) is to use a hash function over the IP source address and destination address. Including the TCP or UDP port numbers might be beneficial for enterprise networks but is not necessary for large provider networks. Omitting port numbers in large provider networks has the desirable characteristic of better enforcing fairness among flows by eliminating or reducing the potential of end users using multiple port numbers to defeat any tendency toward fairness among flows.

In large provider core networks, MPLS LSP (in contrast to IP flows) are very long lived, generally carry large to very large amounts of traffic, and are relatively few in number. In many large provider core networks, LSP which carry Internet traffic from one major core node to another major core node can very substantially exceed the capacity of a multipath component link.

For MPLS traffic carrying Internet IP traffic, "taking the liberty of guessing the payload" (as described in RFC 4928) was a matter of necessity. The label stack simply did not provide adequate diversity. Initially some LSR did not support this capability. Splitting very large LSP by configuring two or more LSP provided a workaround (which only moved the hashing and load splitting out of the core); however, hashing based on the label stack was highly ineffective and packing LSP individually into link bundle component links has substantial disadvantages (see Section 4.2.3).

For MPLS that is not carrying IP, the MPLS label stack is used as the basis for the load split hash. Generally the entire label stack is used or as few as three of the bottom labels are used. Using only the bottom label (or only the top label) has proven unsatisfactory in terms of splitting the load. Some forms of PW can be subdivided which has motivated the introduction of a PW flow label [I-D.ietf-pwe3-fat-pw].
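
A sketch of a label stack based hash is shown below; the use of the bottom three labels follows the description above, while the specific hash function and stack encoding are assumptions for illustration only.

   # Illustrative label stack hash for MPLS traffic not carrying IP.
   # Up to the bottom three labels are used, as described in the
   # text; zlib.crc32 is an assumed hash function.
   import zlib

   def mpls_hash(label_stack, n_links):
       """label_stack is ordered top to bottom; 20 bit label values."""
       bottom = label_stack[-3:]
       key = b"".join(label.to_bytes(3, "big") for label in bottom)
       return zlib.crc32(key) % n_links

   # Two PW beneath the same LSP label may hash to different links:
   print(mpls_hash([1001, 2001], 4))
   print(mpls_hash([1001, 2002], 4))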

4.1.2. Simple Multipath and Adaptive Multipath

Simple multipath generally relies on the mathematical probability that given a very large number of small microflows, these microflows will tend to be distributed evenly across a hash space. A common simple multipath implementation assumes that all component links are of equal capacity and performs a modulo operation on the hashed value. An alternate simple multipath technique uses a table, generally with a power of two size, and distributes the table entries proportionally among component links according to the capacity of each component link.
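
The two simple techniques described above might be sketched as follows; the component link capacities and the table size are assumed values, not recommendations.

   # Illustrative sketch of the two simple multipath techniques.

   # (a) Modulo split, assuming equal capacity component links.
   def modulo_split(hash_value, n_links):
       return hash_value % n_links

   # (b) Table based split: a power-of-two sized table whose entries
   # are distributed in proportion to component link capacities.
   def build_table(capacities, table_size=256):
       total = sum(capacities)
       table = []
       for link, cap in enumerate(capacities):
           table += [link] * round(table_size * cap / total)
       return table[:table_size]

   table = build_table([10, 10, 40])   # e.g. two 10G links, one 40G

   def table_split(hash_value):
       return table[hash_value % len(table)]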

An adaptive multipath technique is one where the traffic bound to each component link is measured and the load split is adjusted accordingly. As long as the adjustment is done within a single network element, then no protocol extensions are required and there are no interoperability issues.

Specific adaptive multipath techniques are outside of the scope of this document.

4.1.3. Traffic Split over Parallel Links

The load splitting techniques defined in Section 4.1 and those defined in Section 4.1.2 are both used in splitting traffic over parallel links between the same pair of nodes. The best known technique, though far from being the first, is Ethernet Link Aggregation [IEEE-802.1AX]. This same technique had been applied much earlier using OSPF or ISIS Equal Cost MultiPath (ECMP) over parallel links between the same nodes. Multilink PPP [RFC1717] uses a technique that provides inverse multiplexing. A number of vendors had provided proprietary extensions to PPP over SONET/SDH [RFC2615] that predated Ethernet Link Aggregation but are no longer used.

Link bundling [RFC4201] provides yet another means of handling parallel LSP. [RFC4201] explicitly allows a special value of all ones to indicate a split across all component links of the bundle. Use of link bundling is discussed in Section 4.2.3.

All of these techniques, including ECMP, may be used over two or more links between a pair of nodes. The most primitive load split algorithms may require that all links be of the same capacity and may attempt to load balance equally. Somewhat less primitive techniques may allow links to be unequal in capacity. Any of these techniques can also use an adaptive multipath algorithm as described in Section 4.1.2.

4.1.4. Traffic Split over Multiple Paths

OSPF or ISIS Equal Cost MultiPath (ECMP) is a well known form of traffic split over multiple paths that may traverse intermediate nodes. ECMP is often incorrectly equated to only this case, and multipath over multiple diverse paths is often incorrectly equated to an equal division of traffic.

Many implementations are able to create more than one LSP between a pair of nodes, where these LSP are routed diversely to better make use of available capacity. The load on these LSP can be distributed proportionally to the reserved bandwidth of the LSP. These multiple LSP may be advertised as a single PSC FA and any LSP making use of the FA may be split over these multiple LSP.

Link bundling [RFC4201] component links may themselves be LSP. When this technique is used, any LSP which specifies the link bundle may be split across the multiple paths of the LSP that comprise the bundle.

Other forms of multipath may use what appear to be physical component links that are provided by a server layer. For example, the components of an Ethernet LAG may be provided by Ethernet PW [RFC4448].

Techniques which spread traffic over multiple paths may use simple multipath or adaptive multipath as described in Section 4.1.2. When ECMP is used over an IP link or MPLS LDP LSP, visibility of available capacity along the path is limited to the next hop only, therefore load which is split proportionally to the capacity of the immediate hop may not be split optimally for the entire path, even using an adaptive multipath capable forwarding. For techniques which split traffic over one or more LSP, the available capacity along the path to the destination is assumed to be known through the bandwidth reservations of the LSP.

4.2. Specific Types of Multipath

Three forms of multipath are considered here: ECMP (Section 4.2.1), Ethernet Link Aggregation (Section 4.2.2), and MPLS link bundling (Section 4.2.3).

Of these types of multipath, the latter two can be applied to MPLS with RSVP-TE signaling or static configurations.

4.2.1. ECMP Current Practices

Equal Cost Multipath has been available in the ISIS and OSPF link state routing protocols for two decades or more. For example, see [RFC1247]. ECMP is also available in BGP. ECMP is declared out of scope in LDP, though widely implemented.

Although ECMP is not applicable to MPLS LSP setup with RSVP-TE signaling, ECMP can be applied at an LER.

At an MPLS LER ECMP can be applied over two or more MPLS LSP with traffic split proportionally to the LSP reserved bandwidth. This could also be considered to be IP ECMP with an underlying MPLS LSP server layer.

The equivalent of ECMP for LSP setup can be achieved by creating PSC LSP and concatenating them using link bundling, and using the "all ones" link bundle component (see Section 4.2.3).

4.2.2. Ethernet Link Aggregation Current Practices

Ethernet link aggregation ([IEEE-802.1AX]) concatenates a set of Ethernet member links below the Ethernet link layer, such that the link aggregation group (LAG) appears as a single link with a single Ethernet MAC address. The link aggregation control protocol (LACP) coordinates membership in the LAG such that member links can be added to the LAG or made unavailable to upper layers in a coordinated manner on both nodes.

For IP using a link state protocol with ECMP, Ethernet link aggregation had little effect. The load balancing on a LAG was identical to the load balancing using ECMP over the set of member links. ISIS only advertises the adjacencies between nodes. OSPF advertises each link between nodes, so for IP using OSPF, link aggregation only resulted in a reduction in routing protocol overhead and simplification of the SPF.

For MPLS, some vendors had already implemented proprietary extensions to PPP over SONET/SDH [RFC2615] that predated the earliest IEEE work on link aggregation (IEEE 802.3ad) with capabilities similar to LACP. It was not until 10GbE became widely available (about 5 years later) that LAG was used in provider core networks, and began replacing OC-192. MPLS link bundling implementations (prior to RFC status) also predated Ethernet link aggregation.

A network deployment circa 2005 could either configure many Ethernet links and use MPLS link bundling, or configure an Ethernet LAG. If an MPLS link bundle was configured to split load over all link bundle component links the functionality was equivalent to configuring the set of links as a LAG. In core LSR implementations, the load split in these two cases was identical.

4.2.3. MPLS Link Bundling Current Practices

MPLS link bundling [RFC4201] was conceived at about the time that it was clear that OC-48 was too slow for IP core links, OC-192 was just becoming available and would soon be too slow, and MPLS had strong support among multiple providers. Link bundling initially solved two problems. First, a few individual vendors had proprietary extensions to PPP over SONET/SDH [RFC2615]; link bundling could offer equivalent capability with vendor interoperability. Second, some vendor hardware was not capable of load splitting and therefore required that each top level LSP be assigned a single path. Further, each side of a link bundle could be configured differently; one could load split and the other could place each LSP on an individual component link.

If LSP are placed on individual links rather than split over the entire bundle, then bin packing problems can occur. LSP are often large, making this packing error significant. In addition, LSP bandwidth reservations in most IP/MPLS deployments are only predictions of expected bandwidth. With link bundling as currently specified, LSP cannot be moved from one link bundle component link to another. If LSP are assigned to links rather than split based on IP address pairs, there is less opportunity for one LSP to make use of capacity that is reserved by other LSP but not being utilized. The bin packing and loss of opportunity to share capacity both reduce the efficiency of capacity utilization.

MPLS link bundling does not currently offer an ability to select which LSP are assigned to a single component link and which LSP are split over the entire set of component links. Most forwarding hardware can support this. Although an LSR could in principle be configured to use some other attribute of an LSP to infer the decision to load split, such as holding priority or an affinity for an administrative attribute, no LSR software provides this capability. Until MPLS-TP there was never a need for that capability.

5. Improving Support for MPLS-TP and Multipath Requirements

The purpose of this section is to describe how MPLS-TP and multipath could coexist and to define simple changes to accomplish this.

5.1. Characteristics of MPLS-TP Multipath Solutions

Three different methods to support MPLS-TP and multipath are described. One method requires simple changes to link bundle and LAG. One method requires no changes but has disadvantages. One method involves no change to multipath but requires relaxation to MPLS-TP OAM requirements.

The best solution makes MPLS over multipath a fully compliant server layer for MPLS-TP meeting all of the requirements stated in the prior sections but cannot be fully supported by most existing LSR without hardware changes. The other two solutions have disadvantages but require little or no change to existing hardware that would otherwise support MPLS-TP. The changes are specified at the level of detail of requirements and/or framework rather than as specific protocol changes.

5.1.1. Coexistence of MPLS and MPLS-TP

The largest contributor of provider traffic today is the Internet. All of this traffic is IP, with some providers, but not all, using IP over MPLS. IP is used without MPLS with ECMP and LAG, and IP is used with MPLS with all three forms of multipath described in Section 4.2: ECMP, LAG, and link bundling.

In addition to Internet services, many providers currently offer layer-2 and layer-3 VPN services over MPLS today. Other providers offer native layer-2 services with an intention to migrate to MPLS-TP for these services.

A primary purpose of migrating VPN and circuit services from layer-2 to MPLS-TP is to reduce cost relative to a dedicated layer-2 infrastructure for these services. Much of that reduction comes from making use of infrastructure in place to support Internet traffic.

Using the capacity in place for Internet, predictive reservations can be made for higher priority services, with guarantees possible by transferring the risk of exceeding the predictions to the Internet traffic through use of priority queuing. With Internet loads being much larger, the unlikely event of predictive reservations being exceeded would easily be absorbed. This architecture allows VPN and circuit services to be delivered at lower cost.

IP/MPLS requires the use of multipath due to the high traffic levels. MPLS-TP requires a single path for each LSP. With no changes, these two requirements are in conflict. Three possible approaches are examined in the following sections.

  1. Supporting MPLS and MPLS-TP over a common server layer with multipath support as well as MPLS-TP over an MPLS server layer over a multipath capable server layer.
  2. Supporting MPLS over an MPLS-TP server layer using multiple MPLS-TP LSP as MPLS component links where multipath is needed.
  3. Relaxing MPLS-TP OAM and documenting the limitations such that MPLS-TP could be supported over an existing multipath server layer.

Each of these is a separate solution. For example, if changes to MPLS forwarding enable MPLS with multipath to support fully compliant MPLS-TP LSP, then relaxing MPLS-TP OAM is not needed. Conversely, if MPLS forwarding cannot be changed on specific existing equipment to accommodate MPLS-TP, then one of the other two solutions is required. Supporting MPLS-TP OAM at high rates also requires hardware change to most existing LSR, therefore all of these solutions require some form of hardware change.

5.1.2. Advantages and Disadvantages of Solutions

A desirable solution is one that meets all requirements and is highly cost effective. An undesirable solution is one that either does not meet all requirements or is not cost effective. The ability to use existing hardware is also desirable. A number of solutions and the necessary changes are discussed in the following subsections.

MPLS, which requires multipath, and MPLS-TP, which requires a single path, could potentially coexist in the following ways.

MPLS as a Server Layer for MPLS-TP

(Section 5.2.1)
Advantages:
MPLS-TP can be fully accommodated with small signaling changes and forwarding changes. Efficient use of capacity can be achieved.
Disadvantages:
Changes to the fields over which a hash is computed are required, and therefore this method may not be supportable with some existing hardware.

MPLS-TP as a Server Layer for MPLS

(Section 5.2.2)
Advantages:
Some transport providers prefer to offer MPLS-TP due to its ability to support familiar management and operations procedures, involving static configuration of network elements and inband performance monitoring and protection activation.
Disadvantages:
Multipath is moved to the client layer. High bandwidth MPLS LSP must be supported through smaller parallel MPLS-TP LSP. The opportunity to dynamically share capacity of MPLS LSP is diminished when large MPLS LSP are run over smaller MPLS-TP LSP. The use of MPLS-TP LSP across a high bandwidth core will increase the number of LSP required and may impact scalability.

Relax MPLS-TP OAM Requirements

(Section 5.2.3)
Advantages:
Relaxing OAM requirements would allow MPLS-TP LSP to exceed the capacity of a single component (or member) link. MPLS over MPLS-TP becomes more practical.
Disadvantages:
CC/CV requires enhancement to exercise all parts of a multipath and would benefit from further enhancements (see Section 5.2.3). CC/CV must be coordinated across multiple packet processing elements. Reordering of MPLS-TP traffic, even if not harmful to the payload itself, would result in significant short term inaccuracy in loss reported by OAM LM.

5.2. MPLS-TP Multipath Solution Set

Three solutions are described. As noted in Section 5.1.1 these are three separate solutions. Each can be deployed independently. Most importantly, neither of the first two solutions requires relaxing MPLS-TP OAM requirements. On the other hand, these solutions are not mutually exclusive.

5.2.1. MPLS as a Server Layer for MPLS-TP

Using MPLS with multipath as a server layer for MPLS-TP has the most advantages with respect to the requirements, and with the exception of inability to run on some (or most) existing hardware, has no disadvantages. This is assuming that the protocol changes suggested in this subsection are implemented in later IETF documents.

Supporting fully conformant MPLS-TP LSP over MPLS LSP which are making use of multipath requires special treatment of the MPLS-TP LSP such that only those LSP are excluded from the multipath load splitting.

MP#7
It MUST be possible to identify MPLS-TP LSP.
MP#8
It MUST be possible to completely exclude MPLS-TP LSP from the multipath hash and load split, statically assign it to a component link or member, and compensate for this assignment in the MPLS multipath load split.
MP#9
In order to support one or more MPLS-TP LSP contained in an MPLS LSP, it MUST be possible to signal the presence of MPLS-TP LSP within an MPLS LSP.
MP#10
In order to support an MPLS LSP carrying other MPLS LSP some of which in turn carry MPLS-TP LSP, it MUST be possible to determine the minimum depth within the label stack at which an MPLS-TP LSP exists and provide this depth in signaling.
MP#11
The depth within the label stack of the multipath hash for any MPLS LSP that is carrying MPLS-TP LSP MUST be constrained for that MPLS LSP so that the hashing does not include any information past an MPLS-TP label.
MP#12
It MUST be possible for an LSR which is setting up an MPLS-TP or MPLS LSP to determine at CSPF time whether a link can support the MPLS-TP requirements of the LSP.

Some hardware which exists today can support requirement MP#8. For example, if a table is used to support multipath and produces satisfactory results given existing traffic patterns, and the number of component links or members is smaller than the table by a factor of N, then an allocation of a multiple of 1/N of a component or member link can be set aside for MPLS-TP traffic. The MPLS-TP traffic can be protected from degraded performance due to an imperfect load split if the MPLS-TP traffic is given queuing priority (using strict priority with policing or shaping at ingress or locally, or weighted queuing locally).
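
A minimal sketch of this approach follows; the table size, component link count, and the fraction reserved for MPLS-TP are assumptions chosen only to make the example concrete.

   # Illustrative sketch only: a table based load split in which some
   # entries are set aside for MPLS-TP LSP pinned to one component
   # link, while all other traffic is hashed over the remainder.
   N_LINKS = 4
   TABLE_SIZE = 16

   RESERVED = 2        # table entries set aside for MPLS-TP traffic
   TP_LINK = 0         # component link carrying the MPLS-TP LSP

   # Remaining entries are spread over the component links as usual.
   hash_entries = [i % N_LINKS for i in range(TABLE_SIZE - RESERVED)]

   def select_link(is_tp_lsp, hash_value=0):
       if is_tp_lsp:
           # MPLS-TP LSP bypass the hash entirely, so their traffic
           # is never reordered by the load split.
           return TP_LINK
       return hash_entries[hash_value % len(hash_entries)]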

Most existing hardware cannot support requirement MP#11, but some may be able to partially support this requirement by fixing the label stack inspection depth to a fixed number of labels from the top. Full support for requirement MP#11 requires that the depth over which the hash is computed can be derived from the label number of the label on which the label swap operation is performed.

5.2.2. MPLS-TP as a Server Layer for MPLS

Carrying MPLS LSP which are larger than a component link over an MPLS-TP server layer requires that the large MPLS client layer LSP be accommodated by multiple MPLS-TP server layer LSPs. MPLS multipath can be used in the client layer MPLS as described in Section 4.1.4.

Creating multiple MPLS-TP server layer LSP places a greater ILM scaling burden on the LSR (see Section 3.3.1.1 and the examples in Section 3.3.1.3). High bandwidth MPLS cores with a smaller number of nodes have the greatest tendency to require LSP in excess of component links, therefore the reduction in number of nodes offsets the impact of increasing the number of server layer LSP in parallel. Today, only in cases where the ILM is small would this be an issue.

The most significant disadvantage of MPLS-TP as a server layer for MPLS is that it reduces the efficiency of carrying the MPLS client layer. The service which provides by far the largest offered load today is the Internet, for which the LSP capacity reservations are predictions of expected load. Many of these MPLS LSP may be smaller than component link capacity. Using MPLS-TP as a server layer results in bin packing problems for these smaller LSP. For those LSP that are larger than component link capacity, their capacities are generally not multiples of convenient increments such as 10Gb/s. Using MPLS-TP as an underlying server layer greatly reduces the ability of the client layer MPLS LSP to share capacity. For example, when one MPLS LSP is underutilizing its predicted capacity, the fixed allocation of MPLS-TP LSP to component links may not allow another LSP to exceed its predicted capacity. A solution which makes less efficient use of resources may be less cost effective, due to the additional capital equipment required and the increase in space and power required.

No requirements beyond MPLS-TP as currently defined are needed to support MPLS-TP as a server layer for MPLS. It is therefore viable but has some undesirable characteristics discussed above.

5.2.3. Relax MPLS-TP OAM Requirements

If MPLS-TP OAM requirements are not fully met, as currently specified, an LSP is not fully MPLS-TP conformant. That may be little more than a semantic inconvenience and need not prevent implementations from allowing LSP which are otherwise MPLS-TP compliant to optionally use multipath with some reduction in OAM capability.

Regardless as to whether relaxing MPLS-TP OAM requirements makes an LSP no longer an MPLS-TP LSP, this section discusses the consequence of using multipath with regard to MPLS-TP OAM.

If MPLS-TP over multipath is supported by relaxing MPLS-TP OAM requirements, the requirements listed below will improve the behavior of MPLS-TP OAM over multipath.

OAM#4
There MUST be a means of introducing entropy to MPLS-TP OAM.
OAM#5
There SHOULD be a means to focus CC/CV testing on a specific multipath component link.
OAM#6
There MUST be a means to support LM over multipath, even if at best a bounded long term inaccuracy is achieved.

5.2.3.1. MPLS-TP CC/CV OAM with Multipath

MPLS-TP CC/CV as currently defined has no means to exercise all paths of a multipath. The label stack is fixed, followed by a GAL label [RFC5586]. As is, only one path along a multipath can be exercised when the ingress to the multipath is not also the ingress to the LSP. For example, if the LSP is carrying PW, the PW themselves can be spread across the multipath, but not the OAM traffic.

If CC/CV OAM is allowed to place a label below the GAL label, the entire set of paths can be tested, though not in a deterministic manner. This is called an entropy label. Using a different random number in this entropy label for each OAM packet allows all links to be exercised on a probabilistic basis.
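
A sketch of this relaxed encapsulation is shown below (in Python); the helper name is hypothetical and the approach assumes the relaxation of [RFC5586] described in Recommendation #14, with the entropy label carried below the GAL.

   import random

   GAL = 13     # Generic Associated Channel Label [RFC5586]

   def cc_cv_label_stack(lsp_labels):
       # A random entropy label below the GAL lets a hash over the label
       # stack spread successive OAM packets across component links on a
       # probabilistic basis.
       entropy = random.randint(16, (1 << 20) - 1)   # avoid reserved 0-15
       return lsp_labels + [GAL, entropy]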

The loss of an isolated OAM CC/CV packet currently has no effect. If the loss of a single OAM packet can be noted by the sender, then the sender can repeatedly use the same value in the entropy label. This requires either a two way OAM or feedback to the ingress. If OAM packets can be reordered, then a sliding window of outstanding OAM packets is required. If OAM CC/CV packets are given high priority (as currently specified), then delay differences should be minimal and reordering may be non-existent if the send interval is longer than the delay difference.
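
The sliding window described above might look like the following sketch (in Python), which is illustrative only: it remembers the entropy value used for each outstanding packet and re-tests any value whose packet appears to have been lost.

   from collections import OrderedDict

   class EntropyWindow:
       def __init__(self, size=16):
           self.size = size
           self.outstanding = OrderedDict()  # sequence number -> entropy value
           self.retry = []                   # entropy values to re-test

       def next_entropy(self, seq, fresh_value):
           # prefer re-testing a value whose packet was lost
           value = self.retry.pop(0) if self.retry else fresh_value
           self.outstanding[seq] = value
           if len(self.outstanding) > self.size:
               _, lost = self.outstanding.popitem(last=False)  # oldest = lost
               self.retry.append(lost)
           return value

       def acknowledge(self, seq):
           # a reply (two way OAM) or ingress feedback confirmed delivery
           self.outstanding.pop(seq, None)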

If a multipath component link failure had been detected locally (at a node adjacent to the failure) and the failure corrected locally (i.e., segment protection), or the component link taken out of service, the client LSP would either no longer be affected or it would be preempted. If the client LSP has been preempted, unmodified MPLS-TP OAM would be sufficient to detect this condition. The existing BFD [RFC5884] provides this functionality.

Only in the case where a component link has failed and the server layer has not been able to detect and correct the failure or take the component link out of service would CC/CV OAM on the client LSP serve any purpose. For this purpose, a relaxed OAM may be sufficient. If the client LSP has no control over the multipath itself, the entire multipath must be considered down if any uncorrected component link failure is occurring at the multipath.

The CC/CV as described here can be handled by an OAM mechanism which is bidirectional. LSP Ping provides such a mechanism [RFC4379]. Because the condition being handled by LSP Ping should be quite rare, it may be acceptable to use a combination of BFD and LSP Ping to provide OAM with full coverage of all types of faults, but with a slower response to a component link failure which is not detected at the point of the fault.

For LSR implementations which support BFD and MPLS ping "as is", these may be viable as an optional MPLS-TP form of CC/CV OAM. A deployment may use this option if the reliance on IP is acceptable to the provider. Alternately MPLS-TP OAM could take such requirements into consideration and provide an additional capability in BFD or provide MPLS-TP extensions to MPLS ping.

A further small complication may occur at the OAM egress. If the egress to the LSP is a multipath egress, then the OAM may arrive at any of the component links at the egress. This requires that the CC/CV OAM be forwarded within the LSR to a common packet processor in order to be handled in hardware (or forwarded to a common CPU). This is also true of other types of OAM.

5.2.3.2. MPLS-TP LM OAM with Multipath

MPLS-TP LM OAM makes use of the count of payload packets at an egress. If the payload is reordered, even with no consequence to the payload itself, some inaccuracy is introduced to the LM. Some number of payload packets which were transmitted before the LM OAM packet was sent may arrive after the LM packet is received and some payload packets transmitted after the LM OAM packet may arrive before the LM packet.

If the LSP egress is a multipath, then the LM packets may arrive at any packet processor over which the multipath resides. The counters from each of the egress packet processors have to be sampled. During the sampling interval, additional packets arrive and are counted. This creates an out of order problem, equivalent to the one above, with respect to the LM OAM and the payload it is counting.
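
A trivial sketch of the egress aggregation (in Python) follows; the per-processor counter structure is hypothetical. The point is only that the LSP's count is a sum over every packet processor the multipath could deliver to, sampled at slightly different instants.

   def lsp_egress_count(per_processor_counters, lsp):
       # per_processor_counters: one {lsp: packet count} mapping per packet
       # processor at the multipath egress; sampling each at a slightly
       # different time contributes the bounded error described below.
       return sum(counts.get(lsp, 0) for counts in per_processor_counters)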

This error is bounded and is not cumulative. For example, if one LM interval counts too few packets, the next LM interval will tend to count too many. Over longer measurement periods the total error retains the same absolute bound, which becomes less significant as the interval grows.

These errors are most significant when a substantial amount of queuing delay is present (generally an indication of light congestion) and when the queues at various component links differ in delay. Queuing delay differences are generally milliseconds. Delay differences of tens of milliseconds require persistent queues and significant congestion.

The worst case errors over long intervals are reasonably well bounded. For example, with a 10 msec delay difference, a one minute sample yields less than a 0.02% uncertainty and over a 15 minute interval the loss uncertainty is just over 0.001%. Given that congestion is required to achieve these uncertainties, the loss due to congestion is likely to significantly exceed these uncertainties for all but very short measurement intervals.

When loss is zero but short term queues form, the queuing delay difference is likely to be under one millisecond for the common case of parallel links that are routed along the same fiber (using WDM). The uncertainties for 1 minute and 15 minute samples are under 0.002% and just over 0.0001% (10^-6), respectively. The uncertainty over a 24 hour period is about 0.0000012%, or just over 10^-8. An SLA could be supported where loss was guaranteed not to exceed 10^-6 in any hour or 10^-8 in any 24 hour period. Such a guarantee would require that the MPLS-TP LSP be given priority over traffic that is not policed or shaped and that the MPLS-TP LSP itself is policed or shaped.
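
These figures follow from a simple bound, assuming the long term counting error is limited to the packets in flight during the queuing delay difference, so that the uncertainty is approximately the delay difference divided by the measurement interval. A short illustrative computation (in Python):

   def lm_uncertainty(delay_diff_s, interval_s):
       # bound on LM error as a fraction of traffic in the interval
       return delay_diff_s / interval_s

   # 10 msec delay difference (congested case)
   lm_uncertainty(0.010, 60)       # ~1.7e-4, under 0.02%
   lm_uncertainty(0.010, 900)      # ~1.1e-5, just over 0.001%

   # 1 msec delay difference (uncongested, parallel links on one fiber)
   lm_uncertainty(0.001, 60)       # ~1.7e-5, under 0.002%
   lm_uncertainty(0.001, 900)      # ~1.1e-6, just over 10^-6
   lm_uncertainty(0.001, 86400)    # ~1.2e-8 over a 24 hour period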

This measurement uncertainty may or may not be acceptable to a given deployment. Providing an option to support MPLS-TP over multipath introduces a bounded error in LM, but it does not remove a provider's option not to use MPLS-TP over multipath.

6. Summary of Recommendations

Section 3 enumerates functional requirements. Section 4 describes current practices. Section 5 enumerates functional changes to better meet these requirements. This section provides specific recommendations.

To support MPLS with multipath as a server layer for MPLS-TP the following changes are required.

Recommendation #10
Provide a means in RSVP-TE for an LSP to self identify its requirement to be treated as fully compliant MPLS-TP (disallow reordering).
Recommendation #11
Provide a means in RSVP-TE for an LSP that is not an MPLS-TP LSP but is directly carrying MPLS-TP LSP to indicate that hashing may only be performed on the first two labels and indicate the largest MPLS-TP LSP being carried (the largest potential microflow).
Recommendation #12
Provide a means in RSVP-TE for an LSP that is not an MPLS-TP LSP but is carrying MPLS-TP at some depth to indicate the maximum depth in the label stack that hashing can operate on, and the largest MPLS-TP LSP being carried (the largest potential microflow).
Recommendation #13
Provide a means in OSPF-TE and ISIS-TE to indicate the largest microflow that a multipath can accommodate, independent of the largest LSP that can be accommodated with load splitting. An extension to [RFC4201] which separates the Maximum LSP Bandwidth into two values, with backward compatibility, may be the most desirable solution.

The current framework documents could be improved with the following additions.

Recommendation #14
Relax GAL specification in [RFC5586] to allow a label below GAL to provide entropy in OAM traffic over multipath.
Recommendation #15
Preferably in the OAM framework, acknowledge the need for entropy in OAM in some circumstances. Note that if no multipath exists along a path, the entropy is not needed but does no harm. Support optional entropy in MPLS-TP OAM through use of a label under the GAL label.
Recommendation #16
Document the need for MPLS Ping or other two way mechanism to keep a sliding window of outstanding packets at the sender which records the entropy value used, note any single loss, and send repeated packets for an entropy value which has experienced a loss.
Recommendation #17
Preferably in the OAM framework, document the need for CC/CV at a multipath egress to forward OAM packets for an LSP that is load split through an out of band means to a common packet processor or CPU.
Recommendation #18
Preferably in the OAM framework, document the need for LM at multipath egress to collect packet counts on all packet processors that could potentially receive packets for a given LSP.

Forwarding changes to multipath necessary to support MPLS with multipath as a server layer for fully compliant MPLS-TP are the following:

Forwarding #5
Store the maximum depth of multipath hash (or zero for unconstrained depth) in the ILM.
Forwarding #6
Do not hash using IP headers on an LSP which is carrying MPLS-TP. An LSP on which IP headers may be used in the hash can be identified by noting that an LSP with a maximum hash depth of zero (unconstrained) cannot be carrying MPLS-TP, or this can be explicitly indicated, independently of depth. If a CW is not used with PW, then this indication must be explicit.
Forwarding #7
When hashing on the MPLS label stack do not hash beyond the maximum depth of hash for a given LSP.
Forwarding #8
Exclude reserved labels from the hash over the label stack. In particular, the GAL [RFC5586] and the OAM Alert Label [RFC3429] should be skipped.

7. IANA Considerations

This memo includes no request to IANA.

8. Security Considerations

This document specifies requirements and discusses a framework for solutions. The requirements and framework are related to the coexistence of MPLS/GMPLS (without MPLS-TP) when used over a packet network, MPLS-TP, and multipath. The combination of MPLS, MPLS-TP, and multipath does not introduce any new security threats. The security considerations for MPLS/GMPLS and for MPLS-TP are documented in [RFC5920] and [I-D.ietf-mpls-tp-security-framework].

9. References

9.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2. Informative References

[RFC1717] Sklower, K., Lloyd, B., McGregor, G. and D. Carr, "The PPP Multilink Protocol (MP)", RFC 1717, November 1994.
[RFC1247] Moy, J., "OSPF Version 2", RFC 1247, July 1991.
[RFC2475] Blake, S., Black, D.L., Carlson, M.A., Davies, E., Wang, Z. and W. Weiss, "An Architecture for Differentiated Services", RFC 2475, December 1998.
[RFC2615] Malis, A. and W. Simpson, "PPP over SONET/SDH", RFC 2615, June 1999.
[RFC2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and Multicast Next-Hop Selection", RFC 2991, November 2000.
[RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm", RFC 2992, November 2000.
[RFC3031] Rosen, E., Viswanathan, A. and R. Callon, "Multiprotocol Label Switching Architecture", RFC 3031, January 2001.
[RFC3032] Rosen, E., Tappan, D., Fedorkow, G., Rekhter, Y., Farinacci, D., Li, T. and A. Conta, "MPLS Label Stack Encoding", RFC 3032, January 2001.
[RFC3260] Grossman, D., "New Terminology and Clarifications for Diffserv", RFC 3260, April 2002.
[RFC3270] Le Faucheur, F., Wu, L., Davie, B., Davari, S., Vaananen, P., Krishnan, R., Cheval, P. and J. Heinanen, "Multi-Protocol Label Switching (MPLS) Support of Differentiated Services", RFC 3270, May 2002.
[RFC3429] Ohta, H., "Assignment of the 'OAM Alert Label' for Multiprotocol Label Switching Architecture (MPLS) Operation and Maintenance (OAM) Functions", RFC 3429, November 2002.
[RFC4090] Pan, P., Swallow, G. and A. Atlas, "Fast Reroute Extensions to RSVP-TE for LSP Tunnels", RFC 4090, May 2005.
[RFC4201] Kompella, K., Rekhter, Y. and L. Berger, "Link Bundling in MPLS Traffic Engineering (TE)", RFC 4201, October 2005.
[RFC4206] Kompella, K. and Y. Rekhter, "Label Switched Paths (LSP) Hierarchy with Generalized Multi-Protocol Label Switching (GMPLS) Traffic Engineering (TE)", RFC 4206, October 2005.
[RFC4385] Bryant, S., Swallow, G., Martini, L. and D. McPherson, "Pseudowire Emulation Edge-to-Edge (PWE3) Control Word for Use over an MPLS PSN", RFC 4385, February 2006.
[RFC4379] Kompella, K. and G. Swallow, "Detecting Multi-Protocol Label Switched (MPLS) Data Plane Failures", RFC 4379, February 2006.
[RFC4426] Lang, J., Rajagopalan, B. and D. Papadimitriou, "Generalized Multi-Protocol Label Switching (GMPLS) Recovery Functional Specification", RFC 4426, March 2006.
[RFC4448] Martini, L., Rosen, E., El-Aawar, N. and G. Heron, "Encapsulation Methods for Transport of Ethernet over MPLS Networks", RFC 4448, April 2006.
[RFC4928] Swallow, G., Bryant, S. and L. Andersson, "Avoiding Equal Cost Multipath Treatment in MPLS Networks", BCP 128, RFC 4928, June 2007.
[RFC5286] Atlas, A. and A. Zinin, "Basic Specification for IP Fast Reroute: Loop-Free Alternates", RFC 5286, September 2008.
[RFC5462] Andersson, L. and R. Asati, "Multiprotocol Label Switching (MPLS) Label Stack Entry: "EXP" Field Renamed to "Traffic Class" Field", RFC 5462, February 2009.
[RFC5586] Bocci, M., Vigoureux, M. and S. Bryant, "MPLS Generic Associated Channel", RFC 5586, June 2009.
[RFC5714] Shand, M. and S. Bryant, "IP Fast Reroute Framework", RFC 5714, January 2010.
[RFC5860] Vigoureux, M., Ward, D. and M. Betts, "Requirements for Operations, Administration, and Maintenance (OAM) in MPLS Transport Networks", RFC 5860, May 2010.
[RFC5884] Aggarwal, R., Kompella, K., Nadeau, T. and G. Swallow, "Bidirectional Forwarding Detection (BFD) for MPLS Label Switched Paths (LSPs)", RFC 5884, June 2010.
[RFC5920] Fang, L., "Security Framework for MPLS and GMPLS Networks", RFC 5920, July 2010.
[I-D.ietf-pwe3-fat-pw] Bryant, S., Filsfils, C., Drafz, U., Kompella, V., Regan, J. and S. Amante, "Flow Aware Transport of Pseudowires over an MPLS PSN", Internet-Draft draft-ietf-pwe3-fat-pw-05, October 2010.
[I-D.ietf-mpls-tp-oam-framework] Allan, D., Busi, I., Niven-Jenkins, B., Fulignoli, A., Hernandez-Valencia, E., Levrau, L., Sestito, V., Sprecher, N., Helvoort, H., Vigoureux, M., Weingarten, Y. and R. Winter, "Operations, Administration and Maintenance Framework for MPLS-based Transport Networks", Internet-Draft draft-ietf-mpls-tp-oam-framework-11, February 2011.
[I-D.ietf-mpls-tp-security-framework] Bitar, N., Fang, L., Niven-Jenkins, B., Zhang, R., Mansfield, S., Daikoku, M. and L. Wang, "MPLS-TP Security Framework", Internet-Draft draft-ietf-mpls-tp-security-framework-00, February 2011.
[IEEE-802.1AX] IEEE Standards Association, "IEEE Std 802.1AX-2008 IEEE Standard for Local and Metropolitan Area Networks - Link Aggregation", 2008.
[ITU-T.G.800] ITU-T, "Unified functional architecture of transport networks", 2007.

Author's Address

Curtis Villamizar (editor)
Infinera Corporation
169 W. Java Drive
Sunnyvale, CA 94089

EMail: cvillamizar@infinera.com