Network Working Group                                           W. Cheng
Internet Draft                                              China Mobile
Intended status: Informational                                    C. Lin
Expires: January 2, 2025                            New H3C Technologies
                                                                   J. Ye
                                                            China Mobile
                                                            July 4, 2024

                       Adaptive Routing Framework
            draft-cheng-rtgwg-adaptive-routing-framework-00

Abstract

This document describes a framework for Adaptive Routing.
Specifically, it identifies a set of adaptive routing components,
explains their interactions, and exemplifies the workflow mechanism.

Status of this Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

This Internet-Draft will expire on January 2, 2025.

Copyright Notice

Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.

Cheng, et al.           Expires January 2, 2025                 [Page 1]

Internet-Draft         Adaptive Routing Framework              July 2024

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.
Code Components extracted from this document must include Simplified
BSD License text as described in Section 4.e of the Trust Legal
Provisions and are provided without warranty as described in the
Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. Requirements Language
   2. Problem Analysis
         2.1.1. Use Case 1
         2.1.2. Use Case 2
   3. Solution
   4. Framework
      4.1. Framework Overview
      4.2. Remote Path Info
      4.3. Routing Plane
      4.4. Forwarding Plane
      4.5. Adaptive Routing Mode
      4.6. Congestion Detection
      4.7. Congestion Notify
   5. Work Flow
      5.1. Remote Link Congestion Adjustment
      5.2. Remote Flow Congestion Adjustment
   6. Security Considerations
   7. IANA Considerations
   8. References
      8.1. Normative References
   Authors' Addresses

1. Introduction

In many cases, ECMP flow-based hashing leads to high congestion and
variable flow completion time. This reduces application performance.
Load balancing based on local link quality is not always optimal. A
global view of congestion, with information from remote links, is
needed for optimal balancing.

Adaptive routing is a network routing mechanism that dynamically
adjusts routing paths based on changes in network conditions, thereby
optimizing network performance and resource utilization.

This document describes a framework for Adaptive Routing.
Specifically, it identifies a set of adaptive routing components,
explains their interactions, and exemplifies the workflow mechanism.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.

2. Problem Analysis

Current AI networks carry a small number of flows, each with a heavy
load. The commonly used load balancing strategy applies an N-tuple
hash algorithm to forward traffic on a per-flow basis. In AI and
computing networks, this strategy can easily lead to load imbalance
and, in turn, to network congestion.

When network congestion occurs, the current adjustment strategy
typically has devices near the congestion point switch links based on
the local link congestion state. This approach is inefficient because
adjustments made by devices near the congestion point have limited
impact. If load balancing adjustments could instead be initiated at
the earliest routing devices on the path, the efficiency of load
balancing would improve significantly.

2.1.1. Use Case 1

                 +--+     +--+
       Spine     |R1|     |R2|
                 +--+     +--+
                  | \    / |
                  |  \  /  |
                  |   \/   |
                  |   /\   X <- congested
                  |  /  \  |
                  | /    \ |
                 +--+     +--+
       Leaf      |R3|     |R4|
                 +--+     +--+
                  ^        |
                  |        v
               Source  Destination

                 Figure 1 Spine-Leaf network

In the Spine-Leaf network shown in Figure 1, assume that the R2-R4
link becomes congested. R3 continues to send traffic to both R1 and
R2. Forwarding traffic through R2 at the current rate exacerbates the
congestion on that link and causes some traffic to be lost.

2.1.2. Use Case 2

              Source
                 |
                 v
            +---------+
            |         |
            | Group 1 |-------------+
            |         |             |
            +---------+             |
                 |             +---------+
                 |             |         |
                 X<- congested | Group 3 |
                 |             |         |
                 |             +---------+
            +---------+             |
            |         |             |
            | Group 2 |-------------+
            |         |
            +---------+
                 |
                 v
            Destination

                 Figure 2 Dragon-fly network

In the dragon-fly network shown in Figure 2, the ECMP paths used for
load balancing are Group1->Group2 and Group1->Group3->Group2. When
the link between Group1 and Group2 becomes congested, Group1
continues to send traffic at the current rate through the
Group1->Group2 link, exacerbating the congestion and causing some
traffic to be lost.

3. Solution

Using a weighted load balancing strategy instead of a hash-based
strategy can more fully utilize the bandwidth of multiple links. By
assigning forwarding weights based on the state of each link, the
load can be balanced more evenly. Additionally, dynamically adjusting
the weight of each link according to congestion conditions allows
better adaptation to bursty traffic in AI networks.

For example, in Figure 1, when R2 detects congestion on the R2->R4
link, it sends the congestion information to R3 via the control plane.
R3 then dynamically adjusts the forwarding weights of the ECMP paths
based on the congestion status, reducing the forwarding weight for
the congested link, thereby decreasing the traffic directed to that
link and alleviating its load. Once the congestion is cleared, R2
sends a congestion clearance message to R3 via the control plane, and
R3 restores the original forwarding weight for that link.

In Figure 2, the egress router in Group 1 detects inter-group link
congestion and sends a congestion message to the ingress router via
the control plane. The ingress router dynamically adjusts the
forwarding weights of the ECMP paths based on the congestion status,
reducing the traffic through the Group1->Group2 link to alleviate the
load on the congested link. Once the congestion is cleared, the
egress router in Group 1 sends a congestion-cleared message to the
ingress router in Group 1, and the ingress router restores the ECMP
link weights.

4. Framework

4.1. Framework Overview

A high-level view of the Adaptive Routing framework, without
expanding the functional entities in the network, is illustrated in
Figure 3.

   +-------------+
   |Routing Plane|
   +-------------+
          |
          | Remote Path Info
          v
   +----------------+       +-----------------------+
   |Forwarding Plane|<------|Adaptive Routing Policy|
   +----------------+       +-----------------------+
                                        ^
                                        | Congestion Notify
                                        |
                            +----------------------------+
                            |Remote Congestion Detection |
                            +----------------------------+

            Figure 3 Adaptive Routing Framework Overview

Starting from the top of Figure 3 and moving to the bottom, the
following components are defined:

* Routing Plane: Responsible for the transmission and calculation of
  routes. The calculated routes should include remote path
  information. The routes and remote path info should be correlated
  and installed in the Forwarding Plane.

* Forwarding Plane: Responsible for adjusting paths based on the
  Adaptive Routing Policy and remote link congestion information, and
  for forwarding traffic according to the adjusted forwarding
  strategy.

* Adaptive Routing Policy: Responsible for processing remote link
  congestion information or flow information, dynamically adjusting
  routing accordingly, and updating the Forwarding Plane.

* Remote Congestion Detection: Responsible for detecting link
  congestion and sending Congestion Notifications to neighboring
  devices.

4.2. Remote Path Info

Currently, the forwarding table contains the route destination, next
hop, and exit interface. Local dynamic load balancing can adjust the
weight of load distribution based on link metrics of local
interfaces, such as interface traffic load and queue size.

Load balancing based on local link quality is not always optimal.
Global congestion awareness, with information from remote links, is
needed for optimal balancing. Therefore, the forwarding table needs
to contain not only local exit interface information but also remote
path info and remote link congestion information.

Remote path info can identify remote links or remote nodes,
specifically as follows:

* For BGP-based networks: Remote path info can be the BGP identifier
  corresponding to the next-next-hop, as described in
  [I-D.wang-idr-next-next-hop-nodes]. It can also be the BGP AS-PATH
  information or the BGP router-id, which is not detailed in this
  document.

* For IGP-based networks: Remote path info can be the interface
  information of the link from the next-hop neighbor to the
  next-next-hop device, which could be the interface index or the
  interface's local address.

By using remote path info, routes can be associated with remote
paths.

4.3. Routing Plane

When calculating routes, the path needs to be perceived, and the path
information is attached to the next hop.

In a BGP-based network, a BGP route may carry the router-id of the
peer from which that route is received, and the router-id is added to
the path information when that route is calculated. BGP may need some
extensions to support such a feature. For the specific extensions,
see [I-D.wang-idr-next-next-hop-nodes] or other extensions, which are
not detailed in this document.

In an IGP-based network, a router may compute the path information
based on the SPF tree and attach it to the next hop. Path info can be
a link-local address, an interface ID, a Link Local Identifier, or
other extensions. The detailed mechanisms are out of the scope of
this document.

4.4. Forwarding Plane

Figure 4 is a schematic of forwarding table maintenance. For each
prefix, the next hop and weight corresponding to each path are
recorded. The next hop of the prefix is constructed from the local
next hop and the remote path information. The forwarding weight is
determined by the quality of the local next-hop interface (local(q))
and the quality of the remote link in the remote path (remote(q)).

When responding to local congestion events, the next-hop address in
the congestion event is used to find the corresponding ECMP entry,
and the weight of this ECMP entry is modified according to the
congestion level.

When responding to remote congestion events, the path info in the
congestion message is used to find the corresponding ECMP entry. The
link quality of the remote path is updated according to the
congestion level, a new weight value is calculated from the local and
remote link quality, and the weight of this ECMP entry is set to the
new value.

   +------+       +--------------------------+ local(q)+remote(q)
   |Prefix|---+-->|Next-hop: to R1, Weight w1|<----------------|
   +------+   |   +--------------------------+                 |
              |              |           +------------+   +--------+
              |              +---------->|Path: R1->R4|-->|Quality1|
              |                          +------------+   +--------+
              |   +--------------------------+ local(q)+remote(q)
              +-->|Next-hop: to R2, Weight w2|<----------------|
                  +--------------------------+                 |
                             |           +------------+   +--------+
                             +---------->|Path: R2->R4|-->|Quality2|
                                         +------------+   +--------+

            Figure 4 Forwarding table for Adaptive Routing

When the number of flows is small or when there are elephant flows,
adaptive routing needs to be performed through flow redirection.
Figure 5 is a schematic of flow table maintenance in the forwarding
plane. The flow table is keyed by the five-tuple of the traffic and
records the path information corresponding to each flow. When
responding to remote flow congestion events as described in Section
4.7, the flow is rehashed to choose an ECMP path and is redirected to
the least loaded ECMP path.

   +------+
   |SAddr |
   |DAddr |
   |SPort |       +------------------+
   |DPort |------>|Next-hop: to R1   |
   |Proto |       +------------------+
   +------+

                        Figure 5 Flow table

4.5. Adaptive Routing Mode

Network congestion can be detected either on a per-link basis or on a
per-flow basis. Link-based and flow-based congestion detection can
also be used in combination.

For link-level congestion events, the forwarding weights of the
corresponding ECMP links in the forwarding table are adjusted. This
changes how subsequent traffic is distributed for load balancing and
reduces the share of traffic sent over the congested link. The
forwarding weights are calculated based on the quality of the local
link and the quality of the remote link.
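
As an illustration of the weight calculation described above, the
following Python sketch derives per-next-hop weights from local(q)
and remote(q). It is not part of the framework. The 0.0 to 1.0
quality scale, the plain sum used to combine the two qualities
(suggested only by the "local(q)+remote(q)" label in Figure 4), and
the names EcmpEntry and forwarding_weights are assumptions made for
this example.

```python
from dataclasses import dataclass


@dataclass
class EcmpEntry:
    """One ECMP next hop with its remote path (hypothetical layout)."""
    next_hop: str      # local next hop, e.g. "R1"
    remote_path: str   # remote path info, e.g. "R1->R4"
    local_q: float     # assumed scale: 0.0 (unusable) .. 1.0 (idle)
    remote_q: float    # same assumed scale, for the remote link


def forwarding_weights(entries, scale=100):
    """Return an integer weight per next hop, proportional to
    local_q + remote_q. The sum is an assumption, not a rule
    defined by the draft."""
    if not entries:
        return {}
    scores = {e.next_hop: e.local_q + e.remote_q for e in entries}
    total = sum(scores.values())
    if total == 0:
        # Every path looks unusable: fall back to equal weights.
        return {nh: scale // len(scores) for nh in scores}
    return {nh: round(scale * s / total) for nh, s in scores.items()}


# The R2->R4 link is fully congested, the R1->R4 link is idle.
table = [
    EcmpEntry("R1", "R1->R4", local_q=1.0, remote_q=1.0),
    EcmpEntry("R2", "R2->R4", local_q=1.0, remote_q=0.0),
]
print(forwarding_weights(table))  # {'R1': 67, 'R2': 33}
```

A deployment could equally take the minimum of the two qualities, so
that the worse link dominates. The choice of combining function is
left open by this document.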

For flow-level congestion events, the corresponding flow is
redirected to ECMP links with lower loads.

Based on its severity, network congestion can be divided into
multiple levels, such as levels 1 to 7 corresponding to link
congestion from mild to severe. The Congestion Response Module
adjusts the ECMP link weights according to the congestion level.

4.6. Congestion Detection

Congestion detection is generally performed by devices near the
congestion point and covers both the onset of link congestion and its
clearance. Network performance and congestion points can be
identified by sending test traffic. A queue whose depth exceeds a
threshold may trigger a congestion notification. Congestion can also
be inferred by monitoring the packet loss rate of a link. Specific
congestion detection methods are beyond the scope of this document.

4.7. Congestion Notify

When a change in congestion status is detected, it needs to be
communicated to remote devices so that traffic scheduling can be
adjusted from the source.

Congestion messages can be of two types:

1) The first type includes Path information, which helps in
   identifying the corresponding route for adjustments. It also
   includes the congestion information of the link corresponding to
   the Path. With this information, global congestion calculation can
   be performed to derive the weight information for the forwarding
   table. For details, refer to Section 4.4.

2) The second type includes the five-tuple information of the
   congested flow. With this information, the congested flow can be
   redirected. For details, refer to Sections 4.4 and 4.5.

Congestion messages can be conveyed by extending the IGP to transmit
link state information within the IGP domain, or by extending BGP and
setting up BGP route reflectors to communicate between BGP neighbors.
Alternatively, new protocols can be designed for this purpose.
Congestion messages can be transmitted in-band or out-of-band. For
high-performance solutions, additional protocols may be needed for
efficient out-of-band message transmission. Specific methods are
beyond the scope of this document.

5. Work Flow

5.1. Remote Link Congestion Adjustment

Using the network in Figure 1, the workflow for handling remote link
congestion is as follows:

1) In the initial state, there are two paths from R3 to R4:
   R3->R1->R4 and R3->R2->R4. Assume the initial weights are equal,
   set to 50 for both. The initial table entries are shown in
   Figure 6.

2) R2 detects a change in congestion on the R2->R4 link using
   congestion detection methods and classifies the congestion into a
   level according to its severity.

3) R2 notifies the remote device R3 of the congestion change event,
   including the congested node (R2), the next-hop information (R4),
   and the congestion level.

4) R3 receives the remote notification and, based on the congested
   node (R2) and next-hop information (R4), looks up its local
   forwarding table. It then adjusts the forwarding weight of the
   corresponding ECMP entry according to the congestion level. In
   this example the weight is adjusted to 10, as shown in Figure 7.

5) When R3 receives new traffic, it performs load balancing according
   to the adjusted forwarding weights.
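
The steps above can be sketched in Python as follows. This is an
illustration under stated assumptions, not a specified mechanism. The
LEVEL_TO_WEIGHT table is hypothetical, chosen only so that level 7
yields the weight of 10 used in the example. The function names and
the weighted random selection in pick_next_hop are likewise
assumptions, since this document does not define how weights are
applied to traffic.

```python
import random

BASE_WEIGHT = 50

# Hypothetical mapping from congestion level to forwarding weight.
# Level 0 is used here to mean "congestion cleared".
LEVEL_TO_WEIGHT = {0: 50, 1: 40, 2: 35, 3: 30, 4: 25, 5: 20, 6: 15, 7: 10}


def make_table():
    """R3's ECMP entries for the prefix behind R4 (step 1), keyed by
    remote path info: (congested node, its next hop)."""
    return {
        ("R1", "R4"): {"next_hop": "R1", "weight": BASE_WEIGHT},
        ("R2", "R4"): {"next_hop": "R2", "weight": BASE_WEIGHT},
    }


def on_remote_congestion(table, node, next_hop, level):
    """Steps 3 and 4: look up the entry named in the notification
    and set its weight from the congestion level."""
    if level not in LEVEL_TO_WEIGHT:
        raise ValueError(f"unknown congestion level: {level}")
    entry = table.get((node, next_hop))
    if entry is None:
        return False  # notification does not match any ECMP entry
    entry["weight"] = LEVEL_TO_WEIGHT[level]
    return True


def pick_next_hop(table, rng):
    """Step 5: choose a next hop for new traffic in proportion to
    the current weights (weighted random choice is an assumption)."""
    hops = [e["next_hop"] for e in table.values()]
    weights = [e["weight"] for e in table.values()]
    return rng.choices(hops, weights=weights, k=1)[0]


table = make_table()
on_remote_congestion(table, "R2", "R4", level=7)
print(table[("R1", "R4")]["weight"], table[("R2", "R4")]["weight"])  # 50 10
```

With weights of 50 and 10, roughly five of every six new selections
go through R1. Calling on_remote_congestion with level 0 restores the
original weight once R2 reports that the congestion has cleared.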

   +------+       +--------------------------+
   |Prefix|---+-->|Next-hop: to R1, Weight 50|
   +------+   |   +--------------------------+
              |              |           +----------------+
              |              +---------->|Path: R1->R4    |
              |                          +----------------+
              |   +--------------------------+
              +-->|Next-hop: to R2, Weight 50|
                  +--------------------------+
                             |           +----------------+
                             +---------->|Path: R2->R4    |
                                         +----------------+

                  Figure 6 Initial forwarding table

   +------+       +--------------------------+
   |Prefix|---+-->|Next-hop: to R1, Weight 50|
   +------+   |   +--------------------------+
              |              |           +----------------+
              |              +---------->|Path: R1->R4    |
              |                          +----------------+
              |   +--------------------------+
              +-->|Next-hop: to R2, Weight 10|
                  +--------------------------+
                             |           +----------------+
                             +---------->|Path: R2->R4    |
                                         +----------------+

                  Figure 7 Adaptive forwarding table

5.2. Remote Flow Congestion Adjustment

Using the network in Figure 1, the workflow for handling remote flow
congestion is as follows:

1) R2 detects congestion on a specific flow passing through the
   R2->R4 link using congestion detection methods.

2) R2 notifies the remote device R3 of the congestion change event,
   including the congested path info and the flow information.

3) R3 receives the flow congestion event and looks up the flow table
   based on the flow information, redirecting the flow to the least
   loaded link among the ECMP links.

4) Subsequently, the flow is forwarded according to the new flow
   table entry.

6. Security Considerations

TBD.

7. IANA Considerations

TBD.

8. References

8.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
             2119 Key Words", BCP 14, RFC 8174, May 2017.

Authors' Addresses

   Weiqiang Cheng
   China Mobile
   China

   Email: chengweiqiang@chinamobile.com

   Changwang Lin
   New H3C Technologies
   China

   Email: linchangwang.04414@h3c.com

   Jiaming Ye
   China Mobile
   China

   Email: yejiaming@chinamobile.com