Internet-Draft | IS-IS Optimal Distributed Flooding for D | March 2022 |
White, et al. | Expires 3 September 2022 | [Page] |
In dense topologies (such as data center fabrics based on the Clos and butterfly, though not limited to these topologies), flooding mechanisms designed for sparse topologies can flood excessively, or carry too many copies of topology and reachability to fabric devices. This results in slower convergence times and higher resource utilization. The modifications to the flooding mechanism in the Intermediate System to Intermediate System (IS-IS) link state protocol described in this document reduce resource utilization to a minimum, while increasing convergence performance in dense topologies.¶
Note that a Clos fabric is used as the primary example of a dense flooding topology throughout this document. However, the flooding optimizations described in this document apply to any dense topology.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 3 September 2022.¶
Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
The goal of this draft is to solve one specific set of problems involved in operating a link state protocol in a dense mesh topology. The problem with such topologies is the connectivity density, which causes too much information to be flooded (or too much repeated state to be flooded). Analysis and experiment show, for instance, that in a butterfly fabric of around 2500 intermediate systems, each intermediate system will receive 40+ copies of any changed LSP fragment. This not only wastes bandwidth and processor time, this dramatically slows convergence speed.¶
This document describes a set of modifications to existing IS-IS flooding mechanisms which minimize the number of LSP framgents received by individual intermediate systems, potentially to one copy per intermediate system. The mechanism described here chooses different flooding intermediates using a hash across the system ID to prevent single failures from having a large impact on flooding, and to spread the processing load of flooding across many systems and prevent bottlenecks.¶
The mechanisms described in this document are similar to those implemented in OSPF to support mobile ad-hoc networks, as described in [RFC5449], [RFC5614], and [RFC7182]. Mechanisms similar to these have been widely implemented and deployed.¶
The following people have contributed to this draft: Abhishek Kumar, Nikos Triantafillis, Ivan Pepelnjak, Christian Franke, Hannes Gredler, Les Ginsberg, Naiming Shen, Uma Chunduri, Nick Russo, and Rodny Molina.¶
Lab testing shows modifications similar to these reduce flooding in a large scale emulated butterfly network topology; without these modifications, intermediate systems receive, on average, 40 copies of any changed LSP fragment. With these modifications, intermediate systems recieve, on average, two copies of any changed LSP fragment. In many cases, each intermediate system receives one copy of each changed LSP. In terms of performance, the modifications described here reduce convergence times by around 50%. A network that converges in about 30-40 seconds without these modifications converged in 15-20 seconds with these modifications. Processor load times were not checked, as this was an emulated environment.¶
A mechanism similar to the one described in this document has been implemented in the FR Routing open source routing stack as part of fabricd.¶
The following spine and leaf fabric will be used to describe these modifications.¶
To reduce confusion (spine and leaf fabrics are difficult to draw in plain text art), this diagram does not contain the connections between devices. The reader should assume that each device in a given layer is connected to every device in the layer above it. For instance:¶
The tiers or stages of the fabric are also marked for easier reference. T0 is assumed to be connected to application servers, or rather they are Top of Rack (ToR) intermediate systems. The remaining tiers, T1 and T2, are connected only to other devices in the fabric itself.¶
Flooding is perhaps the most challenging scaling issue for a link state protocol running on a dense, large scale fabric. This section describes modifications to the IS-IS flooding process to reduce flooding load on a dense or mesh topology.¶
The simplest way to conceive of the solution presented here is in two stages:¶
The first stage is best explained through an illustration. In the network above, if 5A transmits a modified Link State Protocol Data Unit (LSP) to 4A-4F, each of 4A-4F will, in turn, flood this modified LSP to 3A (for instance). 3A will receive 6 copies of the modified LSP, while only one copy is necessary for the intermediate systems shown to converge on a single view of the topology. If 4A-4F could determine they will all flood identical copies of the modified LSP to 3A, it is possible for all of them except one to decide not to flood the changed LSP to 3A.¶
The technique used in this draft to determine the flooding group is for each intermediate system to calculate a special SPT from the point of view of the transmitting neighbor. By setting the metric of all links to 1 and truncating the SPT to two hops, the local IS can find the group of neighbors it will flood any changed LSP towards and the set of intermediate systems (not necessarily neighbors) which will also flood to this same set of neighbors. If every intermediate system in the flooding set performs this same calculation, they will all obtain the same flooding group.¶
Once this flooding group is determined, the members of the flooding group will each (independently) choose which of the members should flood. Each member of the flooding group calculates this independently of all the other members, using a common hash across a set of shared variables so each member of the group comes to the same conclusion. The group member which is selected to flood the changed LSP does so normally; the remaining group members do not flood the LSP.¶
Note there is no signaling between the intermediate systems running this flooding reduction mechanism. Each IS calculates a special, truncated SPT separately, and determines which IS should flood any changed LSPs independently. Because these calculations are performed using a shared view of the network, however (based on the common link state database) and a shared sorting hash, each member of the flooding group will make the same decision.¶
The second stage is simpler, consisting of a single rule: do not flood modified LSPs along the shortest path towards the origin of the modified LSP. This rule relies on the observation that any IS between the origin of the modified LSP and the local IS should receive the modified LSP from some other IS closer to the source of the modified LSP.¶
Each intermediate system will determine whether it should reflood as described below, when a modified LSP arrives from a Transmitting Neighbor (TN), by:¶
Step 1: Build the Two-Hop List (THL) and Remote Neighbor's List (RNL) by:¶
Step 2: Sort RNL by system IDs, from the least to the greatest.¶
Step 3: Calculate a number, N, by adding each byte in the LSP originator's System-ID and then taking MOD on the number of neighbors. N MUST be less than the number of members of RNL.¶
Step 4: Starting with the Nth member of RNL:¶
Note: This description is geared to clarity rather than optimal performance.¶
It is possible in some failure modes for flooding to be incomplete because of the flooding optimizations outlined. Specifically, if a reflooder fails, or is somehow disconnected from all the links across which it should be reflooding, it is possible an LSP is only partially flooded through the fabric. To prevent faliures, an intermediate system which does not reflood an LSP (or fragment) should:¶
Assume, in the network above, 5A floods some modified LSP towards 4A-4F. To determine whether 4A should flood this LSP to 3A-3F:¶
The calculations described here are complex, which might lead the reader to conclude that the cost of calculation is so much higher than the cost of flooding that this optimization is counter-productive. The description provided here is designed for clarity rather than optimal calculation, however. Many of the calculations can be performed in advance and stored, rather than being performed for each LSP and each neighbor. Optimized versions of the process described here have been implemented, and do result in strong convergence speed gains.¶
This document outlines modifications to the IS-IS protocol for operation on high density network topologies. Implementations SHOULD implement IS-IS cryptographic authentication, as described in [RFC5304], and should enable other security measures in accordance with best common practices for the IS-IS protocol.¶