Internet-Draft IS-IS Distributed Flooding Reduction August 2024
White, et al. Expires 15 February 2025 [Page]
Workgroup:
Network Working Group
Published:
Intended Status:
Standards Track
Expires:
Authors:
R. White
Akamai
S. Hegde
Juniper Networks
T. Przygienda
Juniper Networks
L. Jalil
Verizon
D. Voyer
Bell Canada

IS-IS Distributed Flooding Reduction

Abstract

In dense topologies (such as data center fabrics based on the Clos and butterfly though not limited to those; in fact any topology with relatively high degree of connectivity qualifies here) IGP flooding mechanisms designed originally for rather sparse topologies can "overflood", or in other words generate too many identical copies of same information arriving at a given node from other devices. This normally results in slower convergence times and higher resource utilization to process and discard the superfluous copies. Distributed algorithms that restrict the amount of flooding performed can be constructed, as long as they result in a flooding subgraph connecting all nodes on the network in terms of flooding still. Such algorithms can reduce resource utilization significantly, while improving convergence performance. We denote such algorithm as "distributed flooding prunners" (or "prunner" for short) while requiring them to follow some simple, additional rules. The rules presented in detail later allow to deploy mix of nodes any prunning algorithm and multiple prunners at the same time if necessary while ensuring correct flood coverage for the whole network. Additionally, node by node migration, without flag day, from one algorithm to another if necessary is possible. And assuming the algorithms are behaving correctly, the blast radius on algorithm change is normally contained to a single node performing the switch and obviously the convergence of an algorithm on introduction or removal of node running such algorithm.

One such algorithm (modification of previous art), deployable even without configuration, is described in this document. Beside reducing the extraneous copies, the proposed solution does "load-balance" flooding across different possible paths in the network to prevent build up of flooding hot-spots.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 15 February 2025.

Table of Contents

1. Flooding Prunner Framework

1.1. Definitions and Axioms

Following section will outline a framework of definitions and axioms to allow mixing different flood reduction algorithms within a network safely.

As first important observation upfront, it will become clear later in this section that full, non-optimized flooding contains a special case of a prunner itself being an operation including all adjacencies and hence we name it the "zero-prunner" or "zero" for short.

1.1.1. Maximum of One Flooding Prunner on a Node

This framework allows maximum of a single prunner on each node (which was implied by the previous paragraph silently) while it allows changing a specific prunner at any time on any subset of nodes in the network while limiting the impact to the node and the convergence of nodes in its component.

1.1.2. Component

A component is defined as subset of nodes running a prunner A where each of the nodes is connected to all others by a path traversing adjacencies with A on both sides. Another way to think about this is that by removing all adjacencies with different prunners on both sides of the adjacency creates several non-connected components (partitions), each running a different prunner. Observe that there may be in the network very well multiple components which are not connected but run the same prunner. We denote a component for prunner A as A| and if two disjoint components running A are present in the network as A|' and A|''.

Observe that zero-prunner also builds components denoted as Z| and its primes.

1.1.3. Flooding Connected Dominating Sets

A flooding prunner may choose within its component a subset of links to flood on so that the component remains connected. In other words, there must be a path over such links connecting each node in the component of the prunner. We call this a flooding connected dominating set (of which e.g. a simple spanning tree is a special case) or CDS for short, and denote it for a component A| as A|*. Observe that A|* can be different for different information flooded, e.g. LSPs originated by different systems. In simple words again, the algorithm must choose a set of links that guarantee at minimum that flooding reaches all the nodes in the component.

1.1.4. Flooding Prunner

  1. Each node of a flooding prunner, except zero-prunner, MUST advertise in its node information the prunner currently operating on the node. If a node does not advertise this information, it MUST be a zero-prunner except when Section 1.1.4, Paragraph 1, Item 3 exception applies.
  2. A flooding prunner is an algorithm A building a CDS denoted as A|* over its component that MUST additionally include in flooding CDS all adjacencies to adjacent components running non zero-prunner algorithm different from A. A node running algorithm A different from zero-prunner SHOULD include in its flooding CDS all links to zero-prunners but MAY use the known behavior of zero-prunner for further optimizations (though the optimization MUST NOT assume that there is just a single Z| in the network). This is sufficient (but strictly speaking, more than necessary) to guarantee that the overall set of flooding CDS within each component creates an overall flooding CDS over the whole network or in other words, the resulting set of links that still flood connects all nodes in the network.
  3. In case dynamic-flooding is deployed in the same network, any node advertising the dynamic flooding sub-TLV MUST be treated like a node advertising a prunner with an unknown active algorithm and hence perform full flooding on adjacencies to it. A prunner node MUST NOT advertise any dynamic flooding information and disregard all such information except dynamic flooding sub-TLVs.

1.2. Desirable Properties of the Flooding Prunner Framework

Nodes within a component are free to use any kind of prunner algorithm to calculate optimized flooding. Any mode of computation, distributed or centralized will work fine as long as Section 1.1.4 is respected.

The framework is completely distributed without the need for any centralized instance or election. Computation and communication within each component is completely independent of other components.

Except determining which prunner is run on a node no configuration is necessary if the prunner algorithm itself does not need configuration, i.e. is completely distributed.

A node is free to choose a different prunner or zero-prunner at any point in time independent of all other nodes. It may end up in another component or become a zero-prunner and the maximum impact is re-computation within two components that see such node leave or join but more likely, only adjoining nodes have to adjust their prunning decisions. In simple words, the framework allows for a node by node deployment or even migration of prunners without network wide re-computation of optimized flooding. This is obviously critical to stability of large networks that may not even converge within reasonable time anymore if the whole network reverts back to zero-prunning due to network wide impact based on election, misconfiguration of a single node or deployment of a single node that affects the flooding optimization of the complete network.

Though the network provides extreme flexibility in deployment of prunners operationally the most likely scenario is a node-by-node deployment of a single prunner algorithm in the network in addition to zero-prunner and in case of necessity the node-by-node migration to another new prunner.

1.3. Example

A|' A|'' Z|' Z|'' Z|''' B|
Figure 1: Network of Mixed Prunners

Figure 1 visualizes a network with three prunners running, two components with prunner A, one with prunner B and three components running zero-prunner, annotated hence as Z|', Z|'' and Z|'''. CDS within components are not visualized since they do not contribute to further understanding but the links that are included to connect the CDS of the component following Section 1.1.4 are made fat. Obviously the overall graph is connected despite several algorithms and components the network encompasses on such, most likely not very likely, deployment.

Figure 1 also visualizes why the overall CDS can be easily more than a spanning tree of the overall network. A node seeing locally its neighbor running another algorithm cannot decide easily based simply on local knowledge whether the link should be included in flooding but could do so based on the overall view of the network of course and by some tie-breaking an algorithm to prune overall coverage to a spanning tree could be devised. Due to possible resulting long flooding paths and one link minimal cuts such an algorithm is not considered here. Of course in the future such an algorithm can be proposed with the nodes advertising whether they run such a 'prunner-of-prunners' while the absence of prunning can be denoted as 'zero-meta-prunner' to extend the symmetry of this solution recursively.

1.4. Signalling

The only signalling necessary is a Sub-TLV of the IS-IS Router Capability TLV-242 that is defined in [RFC7981] with the following format. The Sub-TLV MUST be advertised by a node that is actively running any prunner except zero-prunner and the absence of this Sub-TLV signifies a node being a 'zero-prunner'.

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |     Type      |     Length    |        Algorithm              |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 2
  • Type: TBD1
  • Length: 2
  • Algorithm: a numeric identifier in the range 0 .. 2^16-1 that identifies the algorithm used to calculate CDS (flooding topology) of the component from the IGP Flooding Prunner Registry as assigned per Section 4.

2. Algorithm 256: MANET-Based, Load-Balancing Algorithm

The following section describes a distributed algorithm similar to and based on those implemented in OSPF to support mobile ad-hoc networks, as described in [RFC5449],[RFC5614]. These solutions have been widely implemented and deployed.

2.1. Experimental Evidence

Laboratory tests based on a well known open source codebase show that modifications similar to the algorithm presented here reduce flooding in a large scale emulated butterfly network topology significantly. Under unmodified flooding procedures intermediate systems receive, on average, 40 copies of any changed LSP fragment in a 2'500 nodes butterfly network. With the changes described in this document said systems received, on average, two copies of any changed LSP fragment. In many cases, only a single copy of each changed LSP was received and processed per node. In terms of performance, overall convergence times were cut in roughly half.

An early version of mechanisms described here has been implemented in the FR Routing open source routing stack as part of `fabricd` daemon and the described modification has been implement by commercial vendors.

2.2. Example Network

Following spine and leaf fabric will be used in further description of the introduced modifications.

+====+ +====+ +====+ +====+ +====+ +====+
| 1A | | 1B | | 1C | | 1D | | 1E | | 1F | (T0)
+====+ +====+ +====+ +====+ +====+ +====+

+====+ +====+ +====+ +====+ +====+ +====+
| 2A | | 2B | | 2C | | 2D | | 2E | | 2F | (T1)
+====+ +====+ +====+ +====+ +====+ +====+

+====+ +====+ +====+ +====+ +====+ +====+
| 3A | | 3B | | 3C | | 3D | | 3E | | 3F | (T2)
+====+ +====+ +====+ +====+ +====+ +====+

+====+ +====+ +====+ +====+ +====+ +====+
| 4A | | 4B | | 4C | | 4D | | 4E | | 4F | (T1)
+====+ +====+ +====+ +====+ +====+ +====+

+====+ +====+ +====+ +====+ +====+ +====+
| 5A | | 5B | | 5C | | 5D | | 5E | | 5F | (T0)
+====+ +====+ +====+ +====+ +====+ +====+
Figure 3

The above picture does not contain the connections between devices for readability purposes. The reader should assume that each device in a given layer is connected to every device in the layer above it in a butterfly network fashion. For instance:

The tiers or stages of the fabric are marked for easier reference. Alternate representation of this topology is a "folded Clos" with T2 being the "top of the fabric" and T0 representing the leaves.

2.3. Flooding Modifications

This section describes detailed modifications to the IS-IS flooding process to reduce the full topology to a dominating connected set of links used for flooding. It does at the same time balance the remaining flooding across all links in the topology to prevent hot-spots.

2.3.1. Optimizing Flooding

The simplest way to conceive of the solution presented here is in two stages:

  • Stage 1: Forward Optimization

    • Find the group of intermediate systems that will all flood to the same set of neighbors as the local IS
    • Decide (deterministically) which subset of the intermediate systems within this group should re-flood any received LSPs
  • Stage 2: Reverse Optimization

    • Find neighbors on the shortest path towards the origin of the change
    • Do not flood towards these neighbors

The first stage is best explained through an illustration. In the network above, if 5A transmits a modified Link State Protocol Data Unit (LSP) to 4A-4F, each of 4A-4F nodes will, in turn, flood this modified LSP to 3A (for instance). With this, 3A will receive 6 copies of the modified LSP, while only one copy is necessary for the intermediate systems shown to converge on the same view of the topology. If 4A-4F could determine that all of them will all flood identical copies of the modified LSP to 3A, it would be possible for all of them except one to decide not to flood the changed LSP to 3A.

The technique used in this draft to determine such flooding group is for each intermediate system to calculate a special SPT (shortest-path spanning tree) from the point of view of the transmitting neighbor. As next step, by setting the metric of all links to 1 and truncating the SPT to two hops, the local IS can find the group of neighbors it will flood any changed LSP towards and the set of intermediate systems (not necessarily neighbors) which will also flood to this same set of neighbors. If every intermediate system in the flooding set performs this same calculation, they will all obtain the same flooding group.

Once such a flooding group is determined, the members of the flooding group will each (independently) choose which of the members should re-flood the received information. A common hash function is used across a set of shared variables so each member of the group comes to the same conclusion as to the designated flooding nodes. The group member which is in such a way `selected` to flood the changed LSP does so normally; the remaining group members suppress the flooding of the LSP initially.

Each IS calculates the special, truncated SPT separately, and determines which IS should flood any changed LSPs independently based on a common hash function. Because these calculations are performed using a shared view of the network, however (based on the common link state database) and such a shared hash function, each member of the flooding group will make the same decision under converged conditions. In the transitory state of nodes having potentially different view of topologies the flooding may either overflood or in worse case not flood enough for which we introduce a 'quick-patching' mechanism later but ultimately will converge due to periodic CSNP origination per normal protocol operation.

The second stage is simpler, consisting of a single rule: do not flood modified LSPs along the shortest path towards the origin of the modified LSP. This rule relies on the observation that any IS between the origin of the modified LSP and the local IS should receive the modified LSP from some other IS closer to the source of the modified LSP. It is worth to observe that if all the nodes that should be designated to flood within a peer group are pruned by the second stage the receiving node is at the `tail-end` of the flooding chain and no further flooding will be necessary. Also, per normal protocol procedures flooding to the node from which the LSP has been received will not be performed.

2.3.2. Optimization Process Details

This section provides normative description of the specification. Any node implementing this solution MUST exhibit external behavior that conforms to the algorithms provided.

Each intermediate system will determine whether it should re-flood LSPs as described below. When a modified LSP arrives from a Transmitting Neighbor (TN), the result of the following algorithm obtains the necessary decision:

Step 1: Build the Two-Hop List (THL) and Remote Neighbor's List (RNL) of nodes running this algorithm or zero-prunner by:

A)
Set all link metrics to 1
B)
Calculate an SPT truncated to 2 hops from the perspective of TN
C)

For each IS that is two hops away (has a metric of two in the truncated SPT) from TN:

i.
If the IS is the LSP originator, skip
ii.
If the IS is a neighbor of the LSP originator, skip
iii.
If the IS is on the shortest path from the TN towards towards the originator of the modified LSP, skip
iv.
If the IS is *not* on the shortest path from the TN towards the originator of the modified LSP, add it to THL
D)
Add each IS that is one hop away from TN to the RNL

Step 2: Sort nodes in RNL by system IDs, from the least value to the greatest.

Step 3: Calculate a number, H, by taking as initial value the fragment ID, shifting it by 1 bit to the right and starting at last byte of the system ID for all bytes XOR each with the current value and rotate the current value to the left by 4 bits. The shifting of the fragment ID will put 2 sequent fragments onto the same flooding tree to minimize re-ordering and the subsequent XOR'ing and rotating will maximize the amount of entropy obtained for a good hash value. RNum is the number of nodes in the RNL. Consequently, set N to the H MOD of RNum (N=H MOD RNum). With that N will be less than the number of members of RNL. (footnote 1: this allows for some balancing of LSPs coming from same system ID).

Step 4: Starting with the Nth member of RNL: where N is the index into the members in RNL, with index starting from zero (Index zero assigned to the IS with lowest system-id):

A)
If THL is empty, move to Step 5
B)
If this member of RNL is the local calculating IS, it MUST reflood the modified LSP to at least the remaining members in the THL it is adjacent to; move to Step 5
C)
Remove all members of THL connected to (adjacent to) this member of RNL
D)
Move to the next member of RNL, wrapping to the beginning of RNL if necessary

Step 5: To adhere to Section 1.1.4 include yourself as reflooder for LSPs arriving from all TNs running a different prunner unless it is zero-prunner.

Note 1: This description is leaning towards clarity rather than optimal performance when implemented.

Note 2: An implementation in a node MAY choose independently of others to provide a configurable parameter to allow for more than one node in RNL to reflood, e.g. it may reflood even if it's only the member that would be chosen from the RNL if a double coverage of THL is required. The modifications to the algorithm are simple enough to not require further text.

2.3.3. Flooding Failures

It is possible that during initial convergence or in some failure modes the flooding will be incomplete due to the optimizations outlined. Specifically, if a reflooder fails, or is somehow disconnected from all the links across which it should be reflooding, an LSP could be only partially distributed through the topology. To speed up convergence under such partition failures (observe that periodic CSNPs will under any circumstances converge the topology though at a slower pace), an intermediate system which does not reflood a specific LSP (or fragment) SHOULD:

A)
Set a short, configurable timer which should be significantly shorter than CSNP interval used.
B)
When the timer expires, send Partial Sequence Number Packet (PSNP) of all LSPs that have *not* been reflooded during the timer runtime to all neighbors unless an up-to-date PSNP or CSNP has been already received from the neighbor.
C)
Per normal protocol procedures process any Partial Sequence Number Packets (PSNPs) received that indicate that neighbors still have older versions of the LSP will lead to the usual synchronization of the databases that are out of sync due to optimized flooding.
D)
If such resynchronizations above a configurable threshold are required (i.e. PSNPs are sent to the neighbors and are answered with requests), an implementation SHOULD notify the network operator via the according mechanism about the condition.

2.3.4. Signaling Considerations

It bares repeating that in case the hashing algorithm a node uses is different from this draft a different algorithm number must be assigned and used.

2.3.5. Additional Deployment Considerations

A node deploying this algorithm on point-to-point links MUST send CSNPs on such links. This does not represent a dramatic change given most deployed implementations today already exhibit this behavior to prevent possible slow synchronization of IS-IS database across such links and to provide additional periodic consistency guarantees.

2.3.6. Flooding Example

Assume, in the network specified, that 5A floods some modified LSP towards 4A-4F and we only use a single node to reflood. To determine whether 4A should flood this LSP to 3A-3F:

  • 5A is TN; 4A calculates a truncated SPT from 5A's perspective with all link metrics set to 1
  • 4A builds THL, which contains 3A, 3B, 3C, 3D, 3E, 3F, 5B, 5C, 5D, 5E and 5F
  • 4A builds RNL, which contains 4A,4B,4C,4D,4E and 4F, sorting it by the system ID
  • 4A computes hash on the received LSP-ID to get N; assume N is 1 in this case
  • Since 4A is the 1st member of RNL and there are members in THL, 4A must reflood; the loop exits

2.3.7. A Note on Performance

The calculations described here seem complex, which might lead the reader to conclude that the cost of calculation is so much higher than the cost of flooding that this optimization is counter-productive. First, The description provided here is designed for clarity rather than optimal calculation. Second, many of the involved calculations can be easily performed in advance and stored, rather than being performed for each LSP occurence and each neighbor. Optimized versions of the process described here have been implemented, and do result in strong convergence speed gains.

3. Security Considerations

This document outlines framework for modifications to the IS-IS protocol for operation on high density network topologies. Implementations SHOULD implement IS-IS cryptographic authentication, as described in [RFC5304], and should enable other security measures in accordance with best common practices for the IS-IS protocol.

4. IANA Section

IANA is requested to set up a registry called "IGP Flooding Prunner Type" under the existing "Interior Gateway Protocol (IGP) Parameters" IANA registry.

Values in this registry come from the range 0 .. 2^16-1.

The following values are defined:

5. Contributors

The following people have contributed to this draft and are mentioned without any particular order: Abhishek Kumar, Nikos Triantafillis, Ivan Pepelnjak, Christian Franke, Hannes Gredler, Les Ginsberg, Naiming Shen, Uma Chunduri, Nick Russo, and Rodny Molina.

6. Normative References

[RFC7981]
Ginsberg, L., Previdi, S., and M. Chen, "IS-IS Extensions for Advertising Router Information", RFC 7981, DOI 10.17487/RFC7981, , <https://www.rfc-editor.org/info/rfc7981>.

7. Informative References

[RFC5304]
Li, T. and R. Atkinson, "IS-IS Cryptographic Authentication", RFC 5304, DOI 10.17487/RFC5304, , <https://www.rfc-editor.org/info/rfc5304>.
[RFC5449]
Baccelli, E., Jacquet, P., Nguyen, D., and T. Clausen, "OSPF Multipoint Relay (MPR) Extension for Ad Hoc Networks", RFC 5449, DOI 10.17487/RFC5449, , <https://www.rfc-editor.org/info/rfc5449>.
[RFC5614]
Ogier, R. and P. Spagnolo, "Mobile Ad Hoc Network (MANET) Extension of OSPF Using Connected Dominating Set (CDS) Flooding", RFC 5614, DOI 10.17487/RFC5614, , <https://www.rfc-editor.org/info/rfc5614>.
[RFC8126]
Cotton, M., Leiba, B., and T. Narten, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 8126, DOI 10.17487/RFC8126, , <https://www.rfc-editor.org/info/rfc8126>.

Authors' Addresses

Russ White
Akamai
Shraddha Hegde
Juniper Networks
Tony Przygienda
Juniper Networks
Luay Jalil
Verizon
Daniel Voyer
Bell Canada