Internet-Draft | IS-IS Optimal Distributed Flooding for D | January 2023 |
White, et al. | Expires 20 July 2023 | [Page] |
In dense topologies (such as data center fabrics based on Clos and butterfly topologies, though not limited to those), IGP flooding mechanisms originally designed for sparse topologies can "overflood", that is, deliver too many identical copies of topology and reachability information to a given node from other devices. This normally results in slower convergence and higher resource utilization, since the superfluous copies must be processed and discarded. The modifications to the flooding mechanism of the Intermediate System to Intermediate System (IS-IS) link state protocol described in this document significantly reduce resource utilization while improving convergence performance in dense topologies. Besides reducing the extraneous copies, the mechanism uses the density of the topology to "load-balance" flooding across the different possible paths in the network, preventing the build-up of flooding hot-spots.¶
Note that a Clos fabric is used as the primary example of a dense flooding topology throughout this document. However, the flooding optimizations described in this document apply to any arbitrary topology.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 20 July 2023.¶
Copyright (c) 2023 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
The goal of this draft is to solve one of the problems that occur when operating a link state protocol in a densely meshed topology. Such topologies, with their high average fanout, cause too many copies of identical information to be flooded within the network. Analysis and experiments show, for instance, that in a butterfly fabric of around 2,500 intermediate systems, each intermediate system will receive over 40 copies of any changed LSP fragment. This not only wastes bandwidth and processor time, but also dramatically slows convergence under topological changes.¶
This document describes a set of modifications to the existing IS-IS flooding mechanisms which minimize the number of LSP fragments received by individual intermediate systems. In its extreme version, the change leads to only one copy being processed per intermediate system. The mechanisms described in this document are similar to, and based on, those implemented in OSPF to support mobile ad-hoc networks, as described in [RFC5449], [RFC5614], and [RFC7182]. These solutions have been widely implemented and deployed.¶
The following people have contributed to this draft and are listed in no particular order: Abhishek Kumar, Nikos Triantafillis, Ivan Pepelnjak, Christian Franke, Hannes Gredler, Les Ginsberg, Naiming Shen, Uma Chunduri, Nick Russo, and Rodny Molina.¶
Laboratory tests based on a well-known open source codebase show that modifications similar to the ones described in this draft significantly reduce flooding in a large-scale emulated butterfly network topology. Under unmodified flooding procedures, intermediate systems receive, on average, 40 copies of any changed LSP fragment in a 2,500-node butterfly network. With the changes described in this document, the same systems received, on average, two copies of any changed LSP fragment. In many cases, only a single copy of each changed LSP was received and processed per node. In terms of performance, overall convergence times were cut roughly in half.¶
An early version of the mechanisms described in this document has been implemented in the FRRouting open source routing stack as part of the `fabricd` daemon.¶
The following spine-and-leaf fabric is used in the description of the modifications introduced in this document.¶
For readability, the picture above omits the connections between devices. The reader should assume that each device in a given layer is connected to every device in the layer above it, in a butterfly network fashion. For instance:¶
The tiers or stages of the fabric are marked for easier reference. An alternate representation of this topology is a "folded Clos", with T2 being the "top of the fabric" and T0 representing the leaves.¶
This section describes in detail the modifications to the IS-IS flooding process that reduce flooding load in a densely meshed topology. At the same time, the modifications distribute the reduced flooding across the whole topology to prevent hot-spots.¶
The simplest way to conceive of the solution presented here is in two stages:¶
The first stage is best explained through an illustration. In the network above, if 5A transmits a modified Link State Protocol Data Unit (LSP) to 4A-4F, each of the nodes 4A-4F will, in turn, flood this modified LSP to 3A (for instance). As a result, 3A will receive 6 copies of the modified LSP, although a single copy is sufficient for the intermediate systems shown to converge on the same view of the topology. If 4A-4F could determine that they will all flood identical copies of the modified LSP to 3A, all of them except one could decide not to flood the changed LSP to 3A.¶
The technique used in this draft to determine such a flooding group is for each intermediate system to calculate a special SPT (shortest-path spanning tree) from the point of view of the transmitting neighbor. By setting the metric of all links to 1 and truncating the SPT to two hops, the local IS can then find the group of neighbors towards which it will flood any changed LSP, as well as the set of intermediate systems (not necessarily neighbors) which will also flood to this same set of neighbors. If every intermediate system in the flooding set performs this same calculation, they will all obtain the same flooding group.¶
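As an illustration only, the truncated two-hop computation can be sketched in Python. The adjacency map, the node names (taken from the example fabric, with full connectivity between adjacent tiers assumed), and the function name `truncated_hops` are assumptions made for this sketch and are not part of the specification. With every link metric set to 1, the SPT computation reduces to a breadth-first search that stops at depth two.

```python
from collections import deque

def truncated_hops(adjacency, root, max_hops=2):
    """Hop distance from root to every node within max_hops (all metrics = 1)."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # truncate: do not expand beyond the hop limit
        for nbr in adjacency.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

# Hypothetical slice of the example fabric: 5A, 4A-4F and 3A-3F.
tier4 = [f"4{c}" for c in "ABCDEF"]
tier3 = [f"3{c}" for c in "ABCDEF"]
adjacency = {"5A": set(tier4)}
for node in tier4:
    adjacency[node] = {"5A", *tier3}
for node in tier3:
    adjacency[node] = set(tier4)

# SPT rooted at the transmitting neighbor 5A, truncated to two hops.
hops = truncated_hops(adjacency, "5A")
one_hop = sorted(n for n, d in hops.items() if d == 1)
two_hop = sorted(n for n, d in hops.items() if d == 2)
```

In this sketch `one_hop` holds 4A-4F (the candidate reflooders) and `two_hop` holds 3A-3F (the nodes they would all flood towards).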
Once such a flooding group is determined, the members of the flooding group will each (independently) choose which of the members should re-flood the received information. A common hash function is applied to a set of shared variables so that each member of the group comes to the same conclusion as to the designated flooding nodes. The group member `selected` in this way floods the changed LSP normally; the remaining group members initially suppress the flooding of the LSP.¶
Note that the solution requires no signaling between the intermediate systems running this flooding reduction mechanism. Each IS calculates the special, truncated SPT separately and determines independently, based on a common hash function, which IS should flood any changed LSPs. Because these calculations are performed using a shared view of the network (based on the common link state database) and a shared hash function, each member of the flooding group will make the same decision under converged conditions. In transitory states, where nodes may hold different views of the topology, the mechanism may either overflood or, in the worse case, not flood enough. A 'quick-patching' mechanism for the latter case is introduced later; in any event, the network will ultimately converge due to periodic CSNP origination per normal protocol operation.¶
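The absence of signaling can be illustrated with a small Python sketch in which every group member runs the same selection on its own copy of the group. The function name `pick_reflooder`, the node names, the sample LSP-ID, and the stand-in hash (sum of the LSP-ID bytes modulo the group size) are assumptions for illustration; the hash actually specified by this document is given in the normative steps.

```python
import random

def pick_reflooder(group_view, lsp_id: bytes) -> str:
    """Pick one member from a locally held view of the flooding group."""
    ordered = sorted(group_view)        # shared ordering, independent of local storage order
    index = sum(lsp_id) % len(ordered)  # stand-in hash over a shared variable
    return ordered[index]

group = ["4A", "4B", "4C", "4D", "4E", "4F"]
lsp_id = bytes([0x00, 0x00, 0x00, 0x00, 0x00, 0x5A, 0x00, 0x00])  # hypothetical LSP-ID

# Each member decides on its own, holding the group in an arbitrary local order.
choices = set()
for member in group:
    local_view = group[:]
    random.shuffle(local_view)
    choices.add(pick_reflooder(local_view, lsp_id))
```

Because the inputs are identical on every member, `choices` ends up containing a single node even though no member communicated its decision.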
The second stage is simpler, consisting of a single rule: do not flood modified LSPs along the shortest path towards the origin of the modified LSP. This rule relies on the observation that any IS between the origin of the modified LSP and the local IS should receive the modified LSP from some other IS closer to the source of the modified LSP. It is worth observing that if all the nodes that should be designated to flood within a peer group are pruned by the second stage, the receiving node is at the `tail-end` of the flooding chain and no further flooding is necessary. Also, per normal protocol procedures, an LSP is not flooded back to the node from which it was received.¶
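A minimal sketch of this second-stage rule in Python follows. It assumes unit link metrics, so that "on the shortest path towards the origin" simply means one hop closer to the origin; a real implementation would use the metrics from its SPF results. The helper names `hop_distances` and `flood_candidates` and the example adjacency are hypothetical.

```python
from collections import deque

def hop_distances(adjacency, origin):
    """Hop count from origin to every reachable node (unit metrics assumed)."""
    dist = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def flood_candidates(adjacency, local, origin, received_from):
    """Neighbors of local that survive the second-stage pruning."""
    dist = hop_distances(adjacency, origin)
    return sorted(
        nbr for nbr in adjacency[local]
        if nbr != received_from            # never flood back to the sender
        and dist[nbr] + 1 != dist[local]   # skip neighbors on a shortest path to the origin
    )

# Hypothetical slice of the example fabric: 5A, 4A-4F and 3A-3F.
tier4 = [f"4{c}" for c in "ABCDEF"]
tier3 = [f"3{c}" for c in "ABCDEF"]
adjacency = {"5A": set(tier4)}
for node in tier4:
    adjacency[node] = {"5A", *tier3}
for node in tier3:
    adjacency[node] = set(tier4)

from_4a = flood_candidates(adjacency, "4A", "5A", received_from="5A")
from_3a = flood_candidates(adjacency, "3A", "5A", received_from="4A")
```

In this slice 4A keeps 3A-3F as candidates, while 3A has no candidates left: all of its neighbors lie towards the origin, so it is at the tail-end of the flooding chain.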
This section provides the normative description of the specification. Any node implementing this solution MUST exhibit external behavior that conforms to the algorithms provided.¶
Each intermediate system will determine whether it should re-flood LSPs as described below. When a modified LSP arrives from a Transmitting Neighbor (TN), the following algorithm yields the necessary decision:¶
Step 1: Build the Two-Hop List (THL) and Remote Neighbor's List (RNL) by:¶
For each IS that is two hops away (has a metric of two in the truncated SPT) from TN:¶
Step 2: Sort nodes in RNL by system IDs, from the least value to the greatest.¶
Step 3: Calculate a number, N, by first adding together each byte of the LSP-ID under consideration (excluding the fragment ID) and then adding the value of its fragment ID MOD 2 (footnote 1: this allows for some balancing of LSPs coming from the same system ID without introducing an excessive amount of per-originator state in an implementation). Then set N to N MOD the number of neighbors in RNL. As a result, N will be less than the number of members of RNL.¶
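The Step 3 arithmetic can be sketched in Python as follows. The sketch assumes the usual 8-byte IS-IS LSP-ID layout (6-byte system ID, 1-byte pseudonode ID, 1-byte fragment number) and treats N as a zero-based index into the sorted RNL; the function name `reflood_start_index` and the sample values are hypothetical.

```python
def reflood_start_index(lsp_id: bytes, rnl_size: int) -> int:
    """Compute N from Step 3 for an 8-byte LSP-ID and an RNL of rnl_size members."""
    if len(lsp_id) != 8:
        raise ValueError("expected an 8-byte LSP-ID")
    if rnl_size <= 0:
        raise ValueError("RNL must not be empty")
    *head, fragment = lsp_id        # head: system ID + pseudonode ID
    n = sum(head) + (fragment % 2)  # byte sum plus fragment ID MOD 2
    return n % rnl_size             # always less than the number of RNL members

# Hypothetical LSP-ID: system ID 0000.0000.005a, pseudonode 0, fragment 3.
lsp_id = bytes([0x00, 0x00, 0x00, 0x00, 0x00, 0x5A, 0x00, 0x03])
rnl = sorted(["4A", "4B", "4C", "4D", "4E", "4F"])  # Step 2: sorted by system ID
n = reflood_start_index(lsp_id, len(rnl))
start = rnl[n]
```

Here the byte sum is 90 and the fragment parity is 1, so N = 91 MOD 6 = 1 and the walk in Step 4 would start at the second entry of the sorted list. Even and odd fragments of the same originator start at adjacent positions, which is the balancing effect the footnote refers to.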
Step 4: Starting with the Nth member of RNL:¶
Note 1: This description favors clarity over optimal performance when implemented.¶
Note 2: An implementation in a node MAY choose, independently of others, to provide a configurable parameter that allows more than one node in RNL to reflood; for example, it may reflood even when it is only a member that would be chosen from RNL if double coverage of THL were required. The modifications to the algorithm are simple enough not to require further text.¶
It is possible that during initial convergence or in some failure modes the flooding will be incomplete due to the optimizations outlined. Specifically, if a reflooder fails, or is somehow disconnected from all the links across which it should be reflooding, an LSP could be only partially distributed through the topology. Periodic CSNPs will converge the topology under any circumstances, though at a slower pace. To speed up convergence under such partition failures, an intermediate system which does not reflood a specific LSP (or fragment) SHOULD:¶
A node deploying this algorithm SHOULD advertise algorithm value <TBD> in the IS-IS Dynamic Flooding sub-TLV of the Router Capability TLV (242) [RFC7981] as specified in [I-D.ietf-lsr-dynamic-flooding]. It bears repeating that if a node uses a hashing algorithm different from the one in this draft, a different algorithm number must be assigned and used.¶
A node deploying this algorithm on point-to-point links MUST send CSNPs on such links. This does not represent a dramatic change, given that most deployed implementations today already exhibit this behavior to prevent possible slow synchronization of the IS-IS database across such links and to provide additional periodic consistency guarantees.¶
Assume, in the network specified, that 5A floods some modified LSP towards 4A-4F and we only use a single node to reflood. To determine whether 4A should flood this LSP to 3A-3F:¶
The calculations described here seem complex, which might lead the reader to conclude that the cost of calculation is so much higher than the cost of flooding that this optimization is counter-productive. First, the description provided here is designed for clarity rather than optimal calculation. Second, many of the calculations involved can easily be performed in advance and stored, rather than being performed for each LSP occurrence and each neighbor. Optimized versions of the process described here have been implemented, and do result in strong convergence speed gains.¶
This document outlines modifications to the IS-IS protocol for operation on high density network topologies. Implementations SHOULD implement IS-IS cryptographic authentication, as described in [RFC5304], and should enable other security measures in accordance with best common practices for the IS-IS protocol.¶