Workgroup: Network Working Group
Internet-Draft: draft-xu-lsr-fare-02
Published: February 2024
Intended Status: Standards Track
Expires: 28 August 2024
Authors:
X. Xu
China Mobile
Z. He
Broadcom
J. Wang
Centec
H. Huang
Huawei
Q. Zhang
H3C
H. Wu
Ruijie Networks
Y. Liu
Tencent
Y. Xia
Tencent
P. Wang
Baidu
S. Hegde
Juniper

Fully Adaptive Routing Ethernet

Abstract

Large language models (LLMs) like ChatGPT have become increasingly popular in recent years due to their impressive performance in various natural language processing tasks. These models are built by training deep neural networks on massive amounts of text data and often consist of billions or even trillions of parameters. However, the training process for these models can be extremely resource-intensive, requiring the deployment of thousands or even tens of thousands of GPUs in a single AI training cluster. Therefore, three-stage or even five-stage CLOS networks are commonly adopted for AI networks. The non-blocking nature of the network becomes increasingly critical for large-scale AI models. Therefore, adaptive routing is necessary to dynamically load-balance traffic to the same destination over multiple ECMP paths, based on network capacity and even congestion information along those paths.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 28 August 2024.

Table of Contents

1. Introduction

Large language models (LLMs) like ChatGPT have become increasingly popular in recent years due to their impressive performance in various natural language processing tasks. These models are built by training deep neural networks on massive amounts of text data and often consist of billions or even trillions of parameters. However, the training process for these models can be extremely resource-intensive, requiring the deployment of thousands or even tens of thousands of GPUs in a single AI training cluster. Therefore, three-stage or even five-stage CLOS networks are commonly adopted for AI networks. Furthermore, in rail-optimized CLOS topologies with standard GPU servers (an HB domain of eight GPUs), the Nth GPU of each server in a group of servers is connected to the Nth leaf switch, which provides higher bandwidth and non-blocking connectivity between the GPUs in the same rail. In a rail-optimized topology, most traffic between GPU servers traverses the intra-rail networks rather than the inter-rail networks.

The non-blocking nature of the network, especially the network for intra-rail communication, becomes increasingly critical for large-scale AI models. AI workloads tend to be extremely bandwidth-hungry, and they usually generate a few elephant flows simultaneously. If traditional hash-based ECMP load-balancing were used without any optimization, serious congestion and high latency would be highly likely once multiple elephant flows were routed to the same link. Since the job completion time depends on worst-case performance, serious congestion will make model training take longer than expected. Therefore, adaptive routing is necessary to dynamically load-balance traffic to the same destination over multiple ECMP paths, based on network capacity and even congestion information along those paths. In other words, adaptive routing is a capacity-aware and even congestion-aware path selection algorithm.

Furthermore, to reduce the congestion risk to the maximum extent, the routing should be as granular as possible. Flow-granular adaptive routing still has a certain statistical possibility of congestion. Therefore, packet-granular adaptive routing is more desirable, although packet spraying causes out-of-order delivery issues. A flexible reordering mechanism must therefore be put in place (e.g., at egress ToRs or at the receiving servers). Recent optimizations for RoCE, as well as newly invented transport protocols that serve as alternatives to RoCE, no longer require handling out-of-order delivery at the network layer; instead, the message processing layer is used to address it.

To enable adaptive routing, whether flow-granular or packet-granular, it is necessary to propagate network topology information, including link capacity and even available link capacity (i.e., link capacity minus link load), across the CLOS network. Therefore, it seems straightforward to use link-state protocols such as OSPF or ISIS, instead of BGP, as the underlay routing protocol in the CLOS network, propagating this information via the OSPF or ISIS TE Metric or Extended TE Metric [RFC3630] [RFC7471] [RFC5305] [RFC7810]. More specifically, the Maximum Link Bandwidth sub-TLV and the Unidirectional Utilized Bandwidth sub-TLV could be used for advertising the link capacity and the utilized bandwidth (from which the available link capacity can be derived), respectively.
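The relationship between the two advertised quantities can be sketched as follows. This is an illustrative fragment only, not part of the draft: the function name and units are hypothetical, and a real implementation would read both values from the TE sub-TLVs mentioned above.

```python
# Hypothetical sketch: the two per-link values relevant to adaptive
# routing. Available link capacity is simply the Maximum Link Bandwidth
# minus the currently utilized bandwidth (as carried in the
# Unidirectional Utilized Bandwidth sub-TLV), floored at zero.

def advertised_metrics(link_capacity_gbps: float, utilized_gbps: float):
    """Return (link capacity, available link capacity) for one link."""
    available = max(link_capacity_gbps - utilized_gbps, 0.0)
    return link_capacity_gbps, available

# A 400G link currently carrying 150G has 250G of available capacity.
cap, avail = advertised_metrics(400.0, 150.0)
```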

For information on resolving flooding issues caused by link-state protocols in large CLOS networks, please refer to the following draft [I-D.xu-lsr-flooding-reduction-in-clos].

Note that while adaptive routing especially at the packet-granular level can help reduce congestion between switches in the network, thereby achieving a non-blocking fabric, it does not address the incast congestion issue which is commonly experienced in last-hop switches that are connected to the receivers in many-to-one communication patterns. Therefore, a congestion control mechanism is always necessary between the sending and receiving servers to mitigate such congestion.

2. Terminology

This memo makes use of the terms defined in [RFC2328] and [RFC1195].

3. Solution Description

3.1. Adaptive Routing in 3-stage CLOS

   +----+ +----+ +----+ +----+
   | S1 | | S2 | | S3 | | S4 |  (Spine)
   +----+ +----+ +----+ +----+

   +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
   | L1 | | L2 | | L3 | | L4 | | L5 | | L6 | | L7 | | L8 |  (Leaf)
   +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+


                              Figure 1

(Note that the diagram above does not include the connections between nodes. However, it can be assumed that leaf nodes are connected to every spine node in their CLOS topology.)

In a three-stage CLOS network as shown in Figure 1, also known as a leaf-spine network, all nodes MAY be in OSPF area zero or ISIS Level-2.

Leaf nodes are enabled for adaptive routing in OSPF area zero or ISIS Level-2.

When a leaf node, such as L1, calculates the shortest path to a specific IP prefix originated by another leaf node in the same OSPF area or ISIS Level-2 area, say L2, four equal-cost multi-path (ECMP) routes will be created via the four spine nodes S1, S2, S3, and S4. To enable adaptive routing, weight values based on the link capacity, or even the available link capacity, of the upstream and downstream links SHOULD be considered for global load-balancing. In particular, the minimum of the capacity of the upstream link (e.g., L1->S1) and the capacity of the downstream link (e.g., S1->L2) of a given path (e.g., L1->S1->L2) is used as the weight of that path when performing weighted ECMP load-balancing.
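The weight computation above can be sketched as follows. This is a minimal illustration using the node names of Figure 1; the capacity figures are invented for the example, and a real router would take them from the advertised TE metrics rather than a static table.

```python
# Sketch of the weighted-ECMP computation described above, for traffic
# from L1 to a prefix behind L2 in the 3-stage CLOS of Figure 1.
# Capacities are illustrative, in Gbps.

def path_weight(upstream_gbps: float, downstream_gbps: float) -> float:
    """The weight of a leaf->spine->leaf path is its bottleneck:
    min(upstream link capacity, downstream link capacity)."""
    return min(upstream_gbps, downstream_gbps)

# capacity[(a, b)] = capacity of the directed link a->b, in Gbps.
capacity = {
    ("L1", "S1"): 400, ("S1", "L2"): 400,
    ("L1", "S2"): 400, ("S2", "L2"): 200,  # degraded downstream link
    ("L1", "S3"): 400, ("S3", "L2"): 400,
    ("L1", "S4"): 200, ("S4", "L2"): 400,  # degraded upstream link
}

spines = ["S1", "S2", "S3", "S4"]
weights = {s: path_weight(capacity[("L1", s)], capacity[(s, "L2")])
           for s in spines}

# Fraction of L1->L2 traffic sent via each spine under weighted ECMP.
total = sum(weights.values())
shares = {s: weights[s] / total for s in spines}
```

With these figures, the two healthy paths (via S1 and S3) each carry a third of the traffic, while the two paths with a degraded link each carry a sixth, instead of the even quarter split that plain ECMP would produce.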

3.2. Adaptive Routing in 5-stage CLOS

   =========================================
   # +----+ +----+ +----+ +----+           #
   # | L1 | | L2 | | L3 | | L4 | (Leaf)    #
   # +----+ +----+ +----+ +----+           #
   #                                PoD-1  #
   # +----+ +----+ +----+ +----+           #
   # | S1 | | S2 | | S3 | | S4 | (Spine)   #
   # +----+ +----+ +----+ +----+           #
   =========================================

   ===============================     ===============================
   # +----+ +----+ +----+ +----+ #     # +----+ +----+ +----+ +----+ #
   # |SS1 | |SS2 | |SS3 | |SS4 | #     # |SS1 | |SS2 | |SS3 | |SS4 | #
   # +----+ +----+ +----+ +----+ #     # +----+ +----+ +----+ +----+ #
   #   (Super-Spine@Plane-1)     #     #   (Super-Spine@Plane-4)     #
   #============================== ... ===============================

   =========================================
   # +----+ +----+ +----+ +----+           #
   # | S1 | | S2 | | S3 | | S4 | (Spine)   #
   # +----+ +----+ +----+ +----+           #
   #                                PoD-8  #
   # +----+ +----+ +----+ +----+           #
   # | L1 | | L2 | | L3 | | L4 | (Leaf)    #
   # +----+ +----+ +----+ +----+           #
   =========================================

                              Figure 2

(Note that the diagram above does not include the connections between nodes. However, it can be assumed that the leaf nodes in a given PoD are connected to every spine node in that PoD. Similarly, each spine node (e.g., S1) is connected to all super-spine nodes in the corresponding PoD-interconnect plane (e.g., Plane-1).)

For a five-stage CLOS network as illustrated in Figure 2, each Pod consisting of leaf and spine nodes is configured as an OSPF non-zero area or an ISIS Level-1 area. The PoD-interconnect plane consisting of spine and super-spine nodes is configured as an OSPF area zero or an ISIS Level-2 area. Therefore, spine nodes play the role of OSPF area border routers or ISIS Level-1-2 routers.

In a rail-optimized topology, intra-rail communication with high bandwidth requirements is restricted to a single PoD, while inter-rail communication with lower bandwidth requirements can traverse PoDs through the PoD-interconnect planes. Therefore, enabling adaptive routing only within PoD networks is sufficient. In particular, only leaf nodes are enabled for adaptive routing in their associated OSPF non-zero area or ISIS Level-1 area.

When a leaf node within a given PoD (i.e., in a given OSPF non-zero area or ISIS Level-1 area), such as L1 in PoD-1, calculates the shortest path to a specific IP prefix originated by another leaf node in the same PoD, say L2 in PoD-1, four equal-cost multi-path (ECMP) routes will be created via the four spine nodes S1, S2, S3, and S4 in the same PoD. To enable adaptive routing, weight values based on the link capacity, or even the available link capacity, of the upstream and downstream links SHOULD be considered for global load-balancing. In particular, the minimum of the capacity of the upstream link (e.g., L1->S1) and the capacity of the downstream link (e.g., S1->L2) of a given path (e.g., L1->S1->L2) is used as the weight of that path.

4. Modifications to OSPF and ISIS Behavior

Once an OSPF or ISIS router is enabled for adaptive routing, the capacity, or even the available capacity, of each SPF path SHOULD be calculated and used as a weight value for global load-balancing purposes.

When advertising the available link capacity metric alongside the link capacity metric, it is important to keep adaptive routing sufficiently stable. To achieve this, a threshold SHOULD be set on available-link-capacity fluctuation to avoid frequent LSA or LSP advertisements. That is to say, any update that would otherwise be triggered by a minor available-link-capacity fluctuation below that threshold is suppressed. More specifically, the announcement suppression mechanisms defined in Sections 5, 6, and 7 of [RFC7810] can be applied here.
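The threshold behavior described above can be sketched as follows. This is a hedged illustration, not the mechanism of [RFC7810] itself: the class and parameter names are invented, and a relative (percentage-based) threshold is assumed for the example.

```python
# Illustrative sketch (names and threshold semantics are assumptions):
# re-advertise the available-capacity metric only when it has moved by
# more than a configured fraction of the last advertised value, so that
# minor fluctuations do not trigger LSA/LSP churn.

class AvailableCapacityAdvertiser:
    def __init__(self, threshold_fraction: float = 0.1):
        self.threshold = threshold_fraction
        self.last_advertised = None  # Gbps; None until first sample

    def should_advertise(self, available_gbps: float) -> bool:
        """Decide whether a new measurement warrants an advertisement."""
        if self.last_advertised is None:
            self.last_advertised = available_gbps
            return True  # nothing advertised yet: always send
        change = abs(available_gbps - self.last_advertised)
        if change > self.threshold * self.last_advertised:
            self.last_advertised = available_gbps
            return True  # significant change: advertise and remember it
        return False  # minor fluctuation below threshold: suppress
```

Note that the comparison baseline is the last *advertised* value, not the last *measured* one; otherwise a slow drift made of many sub-threshold steps would never be advertised at all.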

5. Acknowledgements

TBD.

6. IANA Considerations

TBD.

7. Security Considerations

TBD.

8. References

8.1. Normative References

[RFC1195]
Callon, R., "Use of OSI IS-IS for routing in TCP/IP and dual environments", RFC 1195, DOI 10.17487/RFC1195, December 1990, <https://www.rfc-editor.org/info/rfc1195>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC2328]
Moy, J., "OSPF Version 2", STD 54, RFC 2328, DOI 10.17487/RFC2328, April 1998, <https://www.rfc-editor.org/info/rfc2328>.
[RFC5340]
Coltun, R., Ferguson, D., Moy, J., and A. Lindem, "OSPF for IPv6", RFC 5340, DOI 10.17487/RFC5340, July 2008, <https://www.rfc-editor.org/info/rfc5340>.

8.2. Informative References

[I-D.xu-lsr-flooding-reduction-in-clos]
Xu, X., "Flooding Reduction in CLOS Networks", Work in Progress, Internet-Draft, draft-xu-lsr-flooding-reduction-in-clos-01, <https://datatracker.ietf.org/doc/html/draft-xu-lsr-flooding-reduction-in-clos-01>.
[RFC3630]
Katz, D., Kompella, K., and D. Yeung, "Traffic Engineering (TE) Extensions to OSPF Version 2", RFC 3630, DOI 10.17487/RFC3630, September 2003, <https://www.rfc-editor.org/info/rfc3630>.
[RFC5305]
Li, T. and H. Smit, "IS-IS Extensions for Traffic Engineering", RFC 5305, DOI 10.17487/RFC5305, October 2008, <https://www.rfc-editor.org/info/rfc5305>.
[RFC7471]
Giacalone, S., Ward, D., Drake, J., Atlas, A., and S. Previdi, "OSPF Traffic Engineering (TE) Metric Extensions", RFC 7471, DOI 10.17487/RFC7471, March 2015, <https://www.rfc-editor.org/info/rfc7471>.
[RFC7810]
Previdi, S., Ed., Giacalone, S., Ward, D., Drake, J., and Q. Wu, "IS-IS Traffic Engineering (TE) Metric Extensions", RFC 7810, DOI 10.17487/RFC7810, May 2016, <https://www.rfc-editor.org/info/rfc7810>.

Authors' Addresses

Xiaohu Xu
China Mobile
Zongying He
Broadcom
Junjie Wang
Centec
Hongyi Huang
Huawei
Qingliang Zhang
H3C
Hang Wu
Ruijie Networks
Yadong Liu
Tencent
Yinben Xia
Tencent
Peilong Wang
Baidu
Shraddha Hegde
Juniper