Internet-Draft | Flood Reduction | May 2022 |
Chen, et al. | Expires 16 November 2022 | [Page] |
This document describes extensions to Border Gateway Protocol (BGP) for flooding the link states on a topology that is a subgraph of the complete topology of a BGP-SPF domain, so that the amount of flooding traffic in the domain is greatly reduced. This would reduce convergence time with a more stable and optimized routing environment.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 16 November 2022.¶
Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
For some networks such as dense Data Center (DC) networks with BGP-SPF, the existing Link State (LS) flooding mechanism defined in [I-D.ietf-lsvr-bgp-spf] for a BGP-SPF domain may not be efficient and may have some issues. The extra LS flooding consumes network bandwidth. Processing the extra LS flooding, including receiving, buffering and decoding the extra LSs, wastes memory space and processor time. This may cause scalability issues and affect the network convergence negatively.¶
This document describes extensions to Border Gateway Protocol (BGP) for flooding the link states on a topology that is a subgraph of the complete topology of a BGP-SPF domain, so that the amount of flooding traffic in the domain is greatly reduced.¶
The following terminologies are used in this document.¶
[I-D.ietf-lsvr-bgp-spf] defines three BGP peering models:¶
This section briefs the BGP-SPF Link State Flooding in each of these models.¶
In RR model, BGP-SPF speakers/nodes peer solely with one or more Route Reflectors (RRs) or controllers over eBGP sessions. A BGP-SPF speaker sends/advertises its BGP-LS-SPF Link NLRI in a BGP update message to the RRs or controllers that the speaker peers with when it discovers that its corresponding link is up. After receiving the Link NLRI, each of the RRs or controllers sends the NLRI in a BGP update message to the other BGP-SPF speakers that peer with the RRs or controllers.¶
For example, Figure 1 shows a BGP-SPF domain, which contains two RRs RR1 and RR2, and three network nodes A, B and C. RR1 peers with all three nodes A, B and C in the network. RR2 also peers with all three nodes A, B and C in the network. There is a link between A and B, a link between A and C, and a link between B and C.¶
Each of the nodes A, B and C in the network sends/advertises its link NLRIs in BGP update messages to both RR1 and RR2. After receiving a link NLRI in a BGP update message from a node (e.g., node A), each of RR1 and RR2 sends the NLRI in a BGP update message to the other nodes (e.g., nodes B and C). Each of the other nodes receives two copies of the same NLRI, one from RR1 and the other from RR2. One copy is enough, the other redundant copy should be reduced.¶
In Node Connections model, EBGP single-hop sessions are established over direct point-to-point links interconnecting the nodes in the BGP-SPF routing domain. Once the session has been established and the BGP-LS-SPF AFI/SAFI capability has been exchanged for the corresponding session, then the link is considered up from a BGP-SPF perspective and the corresponding BGP-LS-SPF Link NLRI is advertised to all the nodes in the domain through all the BGP sessions over the links. If the session goes down, the corresponding Link NLRI will be withdrawn. The withdrawal is done through advertising a BGP update containing the NLRI in MP_UNREACH_NLRI to all the nodes in the domain using all BGP sessions over the links.¶
For example, Figure 2 shows a BGP-SPF domain, which contains four nodes A, B, C and D. These four nodes are connected by six links. There are two parallel links between A and B, a link between A and C, a link between A and D, a link between B and C and a link between C and D.¶
Suppose that the BGP sessions over all the links except for the session over the link between A and D have been established and the BGP-LS-SPF AFI/SAFI capability has been exchanged for the corresponding sessions. When the BGP session over the link between A and D is established and the BGP-LS-SPF AFI/SAFI capability is exchanged for the corresponding session, node A considers that the link from A to D is up and sends the BGP-LS-SPF Link NLRI for the link through its four BGP sessions (i.e., the session between A and B over the first parallel link between A and B, the session between A and B over the second parallel link between A and B, the session between A and C over the link between A and C, and the session between A and D over the link between A and D) to nodes B, C and D. After receiving the NLRI from node A, each of the nodes B, C and D sends the NLRI to the other nodes that have BGP sessions with the node. Node B sends the NLRI to node C. Node C sends the NLRI to nodes B and D. Node D sends the NLRI to node C.¶
Similarly, when the BGP session over the link between A and D is established and the BGP-LS-SPF AFI/SAFI capability is exchanged for the corresponding session, node D considers that the link from D to A is up and sends the BGP-LS-SPF Link NLRI for the link through its two BGP sessions (i.e., the session between D and C over the link between D and C, and the session between D and A over the link between D and A) to nodes C and A. After receiving the NLRI from node D, each of the nodes A and C sends the NLRI to the other nodes that have BGP sessions with the node. Node C sends the NLRI to nodes A and B. Node A sends the NLRI to nodes B and C through two parallel BGP sessions to B and the BGP session to C.¶
In Directly-Connected Nodes model, BGP-SPF speakers peer with all directly-connected nodes but the sessions may be between loopback addresses. Consequently, there will be a single BGP session even if there are multiple direct connections between BGP-SPF speakers. BGP-LS-SPF Link NLRI is advertised as long as a BGP session has been established, the BGP-LS-SPF AFI/SAFI capability has been exchanged. Since there are BGP sessions between every directly-connected nodes in the BGP-SPF routing domain, there is only a reduction in BGP sessions when there are parallel links between nodes comparing to node connections model.¶
This section describes the revised flooding procedures to support flooding reduction for different models, including RR Model and Node Connections Model. These procedures are backward compatible. In a network with some nodes (including RRs) not supporting flooding reduction, a link NLRI originated from any node will be distributed to every node in the network.¶
In RR model, the revised flooding procedure is as follows:¶
For example, for the BGP-SPF domain in Figure 1, using the revised flooding procedure, speaker/Node A sends its Link NLRI for link A to B to RR1 when A discovers that link A to B is up. Node A does not send the NLRI to RR2. After receiving the Link NLRI for link A to B from speaker/node A, RR1 sends the NLRI to the other nodes B and C. Each of the other nodes receives only one copy of the same NLRI, which is from RR1. There is no redundant copy of the same NLRI. Comparing to the normal flooding in RR model as illustrated in Figure 1, the revised flooding procedure reduces the amount of link states flooding by half.¶
In Node Connections model, the revised flooding procedure is as follows:¶
Given a real network topology (RT), a flooding topology (FT) of the RT is a sub network topology of the RT and connects all the nodes in the RT.¶
For example, Figure 3 shows a flooding topology of the real topology in Figure 2.¶
The flooding topology in Figure 3 is a sub network topology of the RT in Figure 2 and connects all the nodes (i.e., nodes A, B, C and D) in the RT in Figure 2.¶
Figure 4 shows a reduced flooding flow of a link NLRI in a BGP update message for a link up or down in the BGP-SPF domain, which is the same as the one in Figure 2.¶
Speaker/Node A sends the NLRI in a BGP update message for its link to its peers B and D. Nodes B and D are peers of node A and are directly connected to A on the flooding topology (FT). Node A does not send the NLRI to its peer C since C is not directly connected to A on the FT.¶
After receiving the NLRI in the message from A, node B sends the NLRI in a BGP update message to B's other peer C (which is directly connected to B on the FT). After receiving the NLRI in a BGP update message from A, node D sends the NLRI in a BGP update message to D's other peer C (which is directly connected to D on the FT).¶
The number of NLRIs in messages flooded in Figure 4 is much less than that in Figure 2. The performance of network is improved using the revised flooding procedure.¶
This section specifies BGP extensions for flooding reduction in two models: RR model and Node Connections model. The extensions for Directly-Connected Node model are included in the extensions for Node Connections model.¶
A single RR for a BGP-SPF domain is elected as a leader RR of the domain. The leader RR is the RR with the highest priority to become a leader in the domain. If there are more than one RRs having the same highest priority, the RR with the highest Node ID and the highest priority is the leader RR in the domain. In a deployment, only every RR advertises its priority for becoming a leader using a Leader Priority TLV defined below.¶
Two new TLVs are defined for flooding reduction in RR model.¶
The format of Leader Priority TLV is illustrated in Figure 5.¶
The format of Node Flood TLV is illustrated in Figure 6.¶
0 - Reserved. 1 - send link states to the RR with the minimum ID 2 - send link states to the RR with the maximum ID 3 - send link states to 2 RRs with smaller IDs 4 - send link states to 2 RRs with larger IDs 6-127 - Standardized flooding behaviors for RR Model 128-254 - Private flooding behaviors for RR Model.¶
In a deployment, the flooding behavior for every node is configured on a RR or controller such as the leader RR and the RR advertises the behavior to the other RRs and every node in the network though using a Node Flood TLV.¶
For example, if we want every node in the network to send its link states to only one RR, we configure this behavior on a RR and the RR advertises the behavior to every node using a Node Flood TLV with Flood-behavior set to one, which tells every node to send its link states to the RR with the minimum ID. If we want every node in the network to send its link states to two RRs for redundancy, we configure this behavior on a RR and the RR advertises the behavior to every node using a Node Flood TLV with Flood-behavior set to 3, which tells every node to send its link states to the two RRs with smaller IDs (i.e., the RR with the minimum ID and the RR with the second minimum ID).¶
There are two modes for the flooding topology computation: centralized mode and distributed mode. In a centralized mode, one BGP-SPF node is elected as a leader. The leader computes a flooding topology for the BGP-SPF domain and advertises the flooding topology to every BGP-SPF node in the domain. In a distributed mode, every BGP-SPF node computes a flooding topology for the BGP-SPF domain using a same algorithm. There is not any flooding topology distribution.¶
This section defines the new TLVs for the two modes, describes the flooding topology distribution in centralized mode and an algorithm that can be used by every node to compute its flooding topology in distributed mode.¶
Five new TLVs are defined for flooding reduction in Node Connections model.¶
The format of Node Algorithm TLV is illustrated in Figure 7.¶
0 - The leader computes a flooding topology using its own algorithm and advertises the flooding topology to every node. 1-127 - Every node computes its flooding topology using this standardized distributed algorithm. 128-254 - Private distributed algorithms.¶
A node such as the leader node can use this TLV to tell every node in the domain to use the flooding topology from the leader for flooding the link states through advertising the TLV with the Algorithm field set to zero, or to tell every node to compute its own flooding topology using the algorithm given by the Algorithm field in the TLV containing an identifier of an algorithm when the Algorithm field is not zero.¶
The format of Algorithms Support TLV is illustrated in Figure 8.¶
The format of Node IDs TLV is illustrated in Figure 9.¶
The format of Paths TLV is illustrated in Figure 10. A leader uses this TLV to advertise a part of flooding topology for centralized mode. A path may be described as a sequence of indices: (Index 1, Index 2, Index 3, ...), denoting a connection between the node with index 1 and the node with index 2, a connection between the node with index 2 and the node with index 3, and so on. A single link/connection is a simple case of a path that only connects two nodes. A single link path may be encoded in a paths TLV of 8 bytes with two indices.¶
Multiple such as N paths may be encoded in one paths TLV. Each of the multiple paths is represented as a sequence of indices of the nodes on the path, and two paths (i.e., two sequences of indices for the two paths) are separated by a special index value such as 0xFFFF. In this case, there are (N - 1) special indices as separators to separate N paths, and the Length field has a value of 2 * (number of indices in N paths + N - 1).¶
When there are a number such as N of single link paths, using one paths TLV to represent them is more efficient than using N paths TLVs to represent them (i.e., each paths TLV represents a single link path). Using one TLV consumes 4 + 2 * (2*N + N - 1) = 6*N + 2 bytes. Using N TLVs occupies N * (4 + 4) = 8*N bytes. The space used by the former is about three quarters of the space used by the latter for a big N such as 30.¶
The format of Connection Used for Flooding TLV is illustrated in Figure 11.¶
In centralized mode, the leader computes a flooding topology for the domain whenever there is a change in the real network topology of the domain and advertises the flooding topology to every node in the domain.¶
After the current leader has failed, a new leader is elected. The new leader computes a flooding topology for the domain and advertises the flooding topology to every node in the domain.¶
For a brand new flooding topology of the domain computed, the leader advertises the whole flooding topology to every node in the domain. The leader advertises the mappings between all the node IDs and their indices to every node in the domain using a number of node IDs TLVs first. These node IDs TLVs contain the IDs of all the nodes in the domain and indicates the index corresponding to each of the node IDs and are advertised under MP_REACH_NLRI in BGP update messages. And then the leader advertises the connections/links on the flooding topology to every node in the domain using a number of paths TLVs. These paths TLVs contain all the connections/links on the flooding topology and are advertised under MP_REACH_NLRI in BGP update messages.¶
After advertising a flooding topology to every node in the domain, which is called the current flooding topology, for a new flooding topology computed for the updated real network topology of the domain, the leader advertises only the changes in the new flooding topology comparing to the current flooding topology to every node in the domain. The leader advertises the changes in the mappings between all the node IDs and their indices to every node in the domain using node IDs TLVs first, and then advertises the changes in the flooding topology to every node in the domain using paths TLVs.¶
For the new nodes added into the domain, the leader advertises the mappings between the IDs of the new nodes and their indices using a node IDs TLV under MP_REACH_NLRI in a BGP update message to add the mappings. For the dead nodes removed from the domain, the leader advertises the mappings between the IDs of the dead nodes and their indices using a node IDs TLV under MP_UNREACH_NLRI in a BGP update message to withdraw the mappings.¶
For the new connections/links added into the current flooding topology, the leader advertises the new connections/links using a paths TLV under MP_REACH_NLRI in a BGP update message to add the new connections/inks to the current flooding topology. For the old connections/links removed from the current flooding topology, the leader advertises the old connections/links using a paths TLV under MP_UNREACH_NLRI in a BGP update message to withdraw the old connections/links from the current flooding topology.¶
This section specifies an algorithm that can be used by every node to compute its flooding topology.¶
The algorithm for computing a flooding topology of a BGP-SPF domain (real topology) is described as follows.¶
The algorithm starts from¶
The authors of this document would like to thank Donald E. Eastlake, Acee Lindem and Keyur Patel for the comments.¶