| Internet-Draft | Fast CNP in RoCEv2 Networks | April 2024 | 
| Min & Li | Expires 27 October 2024 |
This document describes a Remote Direct Memory Access (RDMA) over Converged Ethernet version 2 (RoCEv2) congestion control mechanism called Fast Congestion Notification Packet (Fast CNP), which is similar to Really Explicit Congestion Notification (RECN) described in RFC 7514. By extending the RoCEv2 CNP, Fast CNP allows a switch to send the notification directly to the sender, advising the sender to reduce the rate at which it transmits the affected flow of RoCEv2 data traffic.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 27 October 2024.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Remote Direct Memory Access (RDMA) is a method of accessing memory on a remote system without interrupting the processing of the Central Processing Unit (CPU) on that system. RDMA enables lower latency and higher throughput on the network and lower CPU utilization for the servers and storage systems. High Performance Computing (HPC) and Artificial Intelligence (AI) applications can be accelerated by RDMA.¶
InfiniBand is a lossless network optimized for HPC and AI. It typically supports RDMA, enabling machines to communicate and share data without interrupting the host CPU.¶
RDMA over Converged Ethernet (RoCE) is an open standard enabling RDMA and network offloads over an Ethernet network. The current and most popular implementation is RDMA over Converged Ethernet version 2 (RoCEv2) [IBTA-Spec]. RoCEv2 runs the InfiniBand transport layer over UDP and IP protocols on an Ethernet network, bringing many of the advantages of InfiniBand to Ethernet networks.¶
The RoCEv2 networks often implement a proactive congestion control mechanism analogous to Explicit Congestion Notification (ECN) [RFC3168], in which the switches mark packets if congestion occurs in the network. The marked packets alert the receiver that congestion is imminent, and the receiver alerts the sender with a Congestion Notification Packet (CNP). After receiving the CNP, the sender knows to back off, slowing down the transmission rate temporarily until the flow path is ready to handle a higher rate of traffic.¶
This document describes a RoCEv2 congestion control mechanism called Fast CNP, which is similar to Really Explicit Congestion Notification (RECN) [RFC7514]. By extending the RoCEv2 CNP, Fast CNP allows a switch to send the notification directly to the sender, advising the sender to reduce the rate at which it transmits the affected flow of RoCEv2 data traffic. As the name indicates, the primary benefit of Fast CNP is that it reaches the sender sooner than the receiver-originated CNP, because the congestion signal does not have to travel to the receiver first.¶
AI: Artificial Intelligence¶
CNP: Congestion Notification Packet¶
CPU: Central Processing Unit¶
DoS: Denial-of-Service¶
ECN: Explicit Congestion Notification¶
ECMP: Equal-Cost Multipath¶
HPC: High Performance Computing¶
HPCC++: Enhanced High Precision Congestion Control¶
IBTA: InfiniBand Trade Association¶
IOAM: In situ Operations, Administration, and Maintenance¶
RDMA: Remote Direct Memory Access¶
RECN: Really Explicit Congestion Notification¶
RoCE: RDMA over Converged Ethernet¶
RoCEv2: RDMA over Converged Ethernet version 2¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
RoCEv2 packets use the well-known UDP Destination Port number 4791, which unambiguously distinguishes them in a stateless manner. The RoCEv2 data packet format and the RoCEv2 Congestion Notification Packet (CNP) format are shown in Figure 1 and Figure 2, respectively.¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                        Ethernet Header                        ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                          IPv6 Header                          ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                          UDP Header                           ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                InfiniBand Transport Header(s)                 |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                            Payload                            ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Invariant CRC                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              FCS                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
               Figure 1: RoCEv2 Data Packet Format
In a RoCEv2 data packet, the InfiniBand Transport Header(s) must start with an InfiniBand Base Transport Header, followed by zero or more InfiniBand Extended Transport Headers.¶
Within the InfiniBand Base Transport Header, there is a 24-bit field called Destination Queue Pair (QP), indicating the Work Queue Pair Number at the destination. The QP is the virtual interface that the hardware provides to an InfiniBand architecture consumer, and it serves as a virtual communication port for the consumer. The operation on each QP is independent from the others.¶
Note that, to save space, the InfiniBand Base Transport Header does not include the Source QP, which indicates the Work Queue Pair Number at the source. The receiver is assumed to be able to determine the Source QP of a RoCEv2 data packet, because both the sender and the receiver know the mapping between the Source QP and the Destination QP.¶
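As an illustration of the two fields discussed above, the following Python sketch checks the UDP Destination Port and extracts the 24-bit Destination QP from a Base Transport Header. The 12-byte header length and the byte offsets are assumptions based on the general InfiniBand Base Transport Header layout and are not defined by this document; the function names and the example header are hypothetical.

```python
ROCEV2_UDP_PORT = 4791  # well-known UDP Destination Port for RoCEv2
BTH_LEN = 12            # assumed Base Transport Header length in bytes


def is_rocev2(udp_dst_port: int) -> bool:
    """Stateless classification: only the UDP Destination Port is inspected."""
    return udp_dst_port == ROCEV2_UDP_PORT


def bth_dest_qp(bth: bytes) -> int:
    """Return the 24-bit Destination QP, assumed to occupy bytes 5..7."""
    if len(bth) < BTH_LEN:
        raise ValueError("truncated Base Transport Header")
    return int.from_bytes(bth[5:8], "big")


# Hypothetical header: Destination QP 0x000123, every other field zero.
example_bth = bytes(5) + (0x123).to_bytes(3, "big") + bytes(4)
```

Because the Source QP is not carried, a receiver would combine the value returned by `bth_dest_qp` with its own per-QP state to learn the Source QP.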
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                        Ethernet Header                        ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                          IPv6 Header                          ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                          UDP Header                           ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|               InfiniBand Base Transport Header                |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                           Reserved                            |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Invariant CRC                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              FCS                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                  Figure 2: RoCEv2 CNP Format
In a RoCEv2 congestion notification packet, the IP/UDP headers are followed by the InfiniBand Base Transport Header only; no other InfiniBand Transport Header is present. This document considers only IPv6; IPv4 is out of scope.¶
The RoCEv2 CNP is generated by the receiver when it receives a RoCEv2 data packet with the ECN bits set. The Destination QP field within the InfiniBand Base Transport Header of the CNP is set to the Work Queue Pair Number at the sender that corresponds to the Destination QP of the received RoCEv2 data packet.¶
After receiving the RoCEv2 CNP, the sender reduces the rate at which it transmits RoCEv2 data packets for the Queue Pair identified by the Destination QP of the RoCEv2 CNP. The congestion control algorithm used by the sender to reduce the transmission rate is outside the scope of this document.¶
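The receiver behaviour described above can be sketched as follows. The sketch assumes that the ECN field is the two low-order bits of the IPv6 Traffic Class and that binary 11 means Congestion Experienced [RFC3168]. The `Cnp` record, the `peer_qp` table, and the function name are hypothetical; they stand in for state that an implementation would keep per Queue Pair.

```python
from dataclasses import dataclass
from typing import Dict, Optional

ECN_MASK = 0b11  # ECN field: two low-order bits of the Traffic Class
ECN_CE = 0b11    # Congestion Experienced codepoint (RFC 3168)


@dataclass(frozen=True)
class Cnp:
    src_addr: str  # receiver's IPv6 address
    dst_addr: str  # sender's IPv6 address
    dest_qp: int   # Work Queue Pair Number at the sender


def receiver_on_data(traffic_class: int, src_addr: str, dst_addr: str,
                     local_qp: int, peer_qp: Dict[int, int]) -> Optional[Cnp]:
    """Return a CNP if the data packet carries the CE mark, else None.

    peer_qp maps the receiver's local QP (the Destination QP of the data
    packet) to the corresponding QP number at the sender.
    """
    if (traffic_class & ECN_MASK) != ECN_CE:
        return None
    # The CNP travels in the reverse direction of the data packet.
    return Cnp(src_addr=dst_addr, dst_addr=src_addr, dest_qp=peer_qp[local_qp])
```

For example, a data packet from 2001:db8::1 to 2001:db8::2 that arrives with the CE mark on local QP 7 yields a CNP addressed to 2001:db8::1 whose Destination QP is whatever `peer_qp[7]` holds.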
Fast CNP is an extended CNP generated by the switch at which congestion occurs, rather than by the receiver. The switch sends the Fast CNP to the sender of the RoCEv2 data packet causing congestion. If the switch does not know whether the sender is able to process the Fast CNP, then the switch MAY also mark the ECN bits of the RoCEv2 data packet when it sends the Fast CNP. The marked ECN bits cause the receiver to send a RoCEv2 CNP to the sender, so in this case the sender receives both the Fast CNP and the receiver-originated CNP. If the switch knows that the sender is able to process the Fast CNP, then the switch MUST NOT mark the ECN bits of the RoCEv2 data packet when it sends the Fast CNP. How the switch learns whether the sender can process Fast CNP is outside the scope of this document.¶
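The marking rule above reduces to a small decision function. The following Python sketch is illustrative only: the names are hypothetical, and `mark_when_unknown` models the local policy choice that the MAY leaves open.

```python
from typing import NamedTuple


class CongestionAction(NamedTuple):
    send_fast_cnp: bool
    mark_ecn: bool


def switch_on_congestion(sender_known_capable: bool,
                         mark_when_unknown: bool = True) -> CongestionAction:
    """Decide what a congested switch does for one RoCEv2 data packet.

    sender_known_capable: True only if the switch knows that the sender
    can process Fast CNP; False means the capability is unknown.
    """
    if sender_known_capable:
        # MUST NOT mark ECN together with the Fast CNP.
        return CongestionAction(send_fast_cnp=True, mark_ecn=False)
    # Capability unknown: marking ECN as well is optional (MAY).
    return CongestionAction(send_fast_cnp=True, mark_ecn=mark_when_unknown)
```

With `mark_when_unknown=True`, a sender of unknown capability receives both the Fast CNP and, later, the receiver-originated CNP triggered by the marked data packet.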
The Source IPv6 address of the Fast CNP is set to the IPv6 loopback address of the switch that sends it, and the Destination IPv6 address is copied from the Source IPv6 address of the RoCEv2 data packet causing congestion. The sender can use the Source IPv6 address to differentiate between a Fast CNP and a receiver-originated CNP: if the Source IPv6 address of the received CNP is an address of a receiver, it is a receiver-originated CNP; otherwise, it is a Fast CNP sent by a switch. Furthermore, if the sender knows how to detour around the congested switch (e.g., by changing the ECMP field(s) of the flow of RoCEv2 data packets that were subject to forward congestion), then the sender can also use the Source IPv6 address of the Fast CNP to identify the switch to avoid.¶
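A minimal Python sketch of the addressing rule and of the sender-side classification follows. It assumes the sender keeps a set of the receiver addresses it is communicating with; that set (`known_receivers`) and the function names are hypothetical.

```python
import ipaddress
from typing import Set, Tuple

IPv6 = ipaddress.IPv6Address


def fast_cnp_addresses(switch_loopback: str, data_src: str) -> Tuple[IPv6, IPv6]:
    """Return the (Source, Destination) IPv6 addresses of a Fast CNP.

    The Source is the switch's loopback address; the Destination is copied
    from the Source address of the data packet causing congestion.
    """
    return IPv6(switch_loopback), IPv6(data_src)


def classify_cnp(cnp_src: str, known_receivers: Set[IPv6]) -> str:
    """Label a received CNP by comparing its Source address with receivers."""
    if IPv6(cnp_src) in known_receivers:
        return "receiver-originated"
    return "fast-cnp"
```

A sender that is able to steer around congestion could additionally record the Source address of each CNP classified as `"fast-cnp"` as the switch to avoid.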
The Destination QP field within the InfiniBand Base Transport Header of the Fast CNP is copied from the Destination QP field within the InfiniBand Base Transport Header of the RoCEv2 data packet causing congestion.¶
Fast CNP adds an IPv6 extension header [RFC8200] to the RoCEv2 CNP: specifically, an IPv6 Destination Options header carrying one IPv6 destination option. Two types of IPv6 destination option are defined, and one of them is carried depending on whether the RoCEv2 data packet causing congestion carries IOAM data.¶
When the RoCEv2 data packet causing congestion doesn't carry an IPv6 In situ OAM (IOAM) Hop-by-Hop Trace Option [RFC9486], the following IPv6 destination option is carried in the Fast CNP.¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                |  Option Type  |  Opt Data Len |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|       Destination IPv6 address of the RoCEv2 data packet      |
|              that was subject to forward congestion           |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Option Type: 8-bit identifier of the type of Option that needs to be allocated. [RFC8200] defines how to encode the three high-order bits of the Option Type field. The two high-order bits specify the action that must be taken if the processing IPv6 node does not recognize the Option Type; for this Option, these two bits MUST be set to 10 (discard the packet and, regardless of whether or not the packet's Destination Address was a multicast address, send an ICMP Parameter Problem, Code 2, message to the packet's Source Address, pointing to the unrecognized Option Type). The third-highest-order bit specifies whether the Option Data can change en route to the packet's final destination; for this Option, the value of this bit MUST be set to 0 (Option Data does not change en route).¶
Opt Data Len: 16. It is the length of the Option Data Field of this Option in bytes.¶
Option Data: Destination IPv6 address of the RoCEv2 data packet that was subject to forward congestion. The sender uses the Option Data, combined with the Destination QP within the InfiniBand Base Transport Header, to identify the Work Queue Pair Number whose transmission rate is to be reduced.¶
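The field definitions above can be exercised with a short Python sketch that encodes and decodes this option and then performs the sender-side lookup. The Option Type value is not yet assigned, so `FAST_CNP_OPT1 = 0x9E` is only a placeholder chosen because its three high-order bits are 100. The `flows` table and all function names are hypothetical.

```python
import ipaddress
from typing import Dict, Tuple

# Placeholder only: the real value is TBD1 until assigned.
# 0x9E = 0b100_11110, i.e. act = 10 and chg = 0.
FAST_CNP_OPT1 = 0x9E
OPT1_DATA_LEN = 16  # one IPv6 address


def has_required_high_bits(option_type: int) -> bool:
    """True if the two act bits are 10 and the chg bit is 0."""
    return (option_type >> 5) == 0b100


def encode_opt1(data_dst: str) -> bytes:
    """Option Type, Opt Data Len = 16, then the data packet's destination."""
    addr = ipaddress.IPv6Address(data_dst).packed
    return bytes([FAST_CNP_OPT1, OPT1_DATA_LEN]) + addr


def decode_opt1(option: bytes) -> ipaddress.IPv6Address:
    """Return the Destination IPv6 address carried as Option Data."""
    if len(option) != 2 + OPT1_DATA_LEN or option[1] != OPT1_DATA_LEN:
        raise ValueError("malformed Fast CNP destination option")
    if option[0] != FAST_CNP_OPT1:
        raise ValueError("unexpected Option Type")
    return ipaddress.IPv6Address(option[2:])


def local_qp_to_slow(flows: Dict[Tuple[ipaddress.IPv6Address, int], int],
                     option: bytes, dest_qp: int) -> int:
    """Map (data packet destination, Destination QP) to the sender's QP."""
    return flows[(decode_opt1(option), dest_qp)]
```

The lookup key needs both values because the Destination QP copied into the Fast CNP is a number allocated by the receiver, so it identifies a flow at the sender only together with that receiver's address.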
When the RoCEv2 data packet causing congestion carries an IPv6 IOAM Hop-by-Hop Trace Option, the following IPv6 destination option is carried in the Fast CNP.¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Option Type  |  Opt Data Len |   Reserved    | IOAM Opt-Type |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+
|                                                               |  |
.                                                               .  I
.                                                               .  O
.                                                               .  A
.                                                               .  M
.                                                               .  .
.                          Option Data                          .  O
.                                                               .  P
.                                                               .  T
.                                                               .  I
.                                                               .  O
.                                                               .  N
|                                                               |  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+
|                                                               |
|       Destination IPv6 address of the RoCEv2 data packet      |
|              that was subject to forward congestion           |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Option Type: 8-bit identifier of the type of Option that needs to be allocated. For this Option, the two high-order bits MUST be set to 10 (discard the packet and, regardless of whether or not the packet's Destination Address was a multicast address, send an ICMP Parameter Problem, Code 2, message to the packet's Source Address, pointing to the unrecognized Option Type). The third-highest-order bit MUST be set to 0 (Option Data does not change en route).¶
Opt Data Len: 8-bit unsigned integer. It is the length of the Option Data Field of this Option in bytes.¶
Option Data: IOAM Trace Option Data and Destination IPv6 address of the RoCEv2 data packet that was subject to forward congestion. The IOAM Trace Option Data is copied from the IPv6 Hop-by-Hop Options header of the RoCEv2 data packet. The sender uses the Destination IPv6 address of the RoCEv2 data packet, combined with the Destination QP within the InfiniBand Base Transport Header, to identify the Work Queue Pair Number whose transmission rate is to be reduced. The sender uses the IOAM Trace Option Data to decide how to reduce the transmission rate, based on a congestion control algorithm. One example of the IOAM Trace Option Data and the congestion control algorithm is Enhanced High Precision Congestion Control (HPCC++) [I-D.miao-ccwg-hpcc] [I-D.miao-ccwg-hpcc-info].¶
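The IOAM-carrying option can be sketched in Python as shown below. Two points are assumptions rather than statements of this document: the Reserved octet is sent as zero, and Opt Data Len counts every octet after the Opt Data Len field (Reserved, IOAM Opt-Type, IOAM data, and the address), following the general option encoding of [RFC8200]. The Option Type is passed in as a parameter because its value is still TBD2; the function names are hypothetical.

```python
import ipaddress
from typing import Tuple

ADDR_LEN = 16  # one IPv6 address


def encode_opt2(option_type: int, ioam_opt_type: int,
                ioam_data: bytes, data_dst: str) -> bytes:
    """Build the option used when the data packet carried an IOAM Trace Option."""
    addr = ipaddress.IPv6Address(data_dst).packed
    # Reserved (assumed 0), IOAM Opt-Type, copied IOAM data, then the address.
    body = bytes([0, ioam_opt_type]) + ioam_data + addr
    if len(body) > 255:
        raise ValueError("option data does not fit an 8-bit length field")
    return bytes([option_type, len(body)]) + body


def decode_opt2(option: bytes) -> Tuple[int, bytes, ipaddress.IPv6Address]:
    """Return (IOAM Opt-Type, IOAM trace data, data packet destination)."""
    if len(option) < 4 + ADDR_LEN or option[1] != len(option) - 2:
        raise ValueError("malformed Fast CNP IOAM destination option")
    ioam_opt_type = option[3]
    ioam_data = option[4:-ADDR_LEN]
    return ioam_opt_type, ioam_data, ipaddress.IPv6Address(option[-ADDR_LEN:])
```

The decoded address feeds the same Queue Pair lookup as in the option without IOAM data, while the decoded trace data would be handed to the congestion control algorithm.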
The Fast CNP MUST be applied in a specific controlled domain. A limited administrative domain provides the network administrator with the means to select, monitor, and control the access to the network, making it a trusted domain.¶
To avoid potential Denial-of-Service (DoS) attacks, it is RECOMMENDED that implementations apply rate-limiting to incoming Fast CNPs.¶
To protect against unauthorized sources sending Fast CNP to the host, implementations MUST provide a means of checking the source addresses of Fast CNP against an access list before accepting the packet.¶
A deployment MUST ensure that border filtering drops inbound Fast CNPs arriving from outside the domain and outbound Fast CNPs leaving the domain.¶
A deployment MUST support the configuration option to enable or disable the Fast CNP feature defined in this document. By default, the Fast CNP feature MUST be disabled.¶
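One possible way for a host to combine the configuration switch, the source access list, and the rate limit is sketched below in Python. The class name, its parameters, and the token-bucket policy are illustrative assumptions; this document does not mandate a particular rate-limiting algorithm.

```python
import ipaddress
import time
from typing import Callable, Iterable


class FastCnpGuard:
    """Admission check for incoming Fast CNPs: feature flag, ACL, rate limit."""

    def __init__(self, allowed_sources: Iterable[str], rate_per_s: float,
                 burst: float, enabled: bool = False,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.enabled = enabled  # the feature is disabled by default
        self.acl = [ipaddress.IPv6Network(p) for p in allowed_sources]
        self.rate = rate_per_s
        self.burst = burst
        self.clock = clock
        self.tokens = burst
        self.last = clock()

    def accept(self, src: str) -> bool:
        """Return True if a Fast CNP from src should be processed."""
        if not self.enabled:
            return False
        addr = ipaddress.IPv6Address(src)
        if not any(addr in net for net in self.acl):
            return False  # source is not on the access list
        # Token bucket: refill by elapsed time, then spend one token.
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1.0:
            return False
        self.tokens -= 1.0
        return True
```

In this sketch the access list would typically hold the prefix from which switch loopback addresses are allocated, so that Fast CNPs from any other source are rejected before they consume rate-limit budget.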
As this document describes new options for IPv6, containing IOAM data or not, the security considerations of [RFC8200], [RFC9098], and [RFC9486] apply.¶
This document requests the following IPv6 Option Type assignments from the Destination Options and Hop-by-Hop Options sub-registry of Internet Protocol Version 6 (IPv6) Parameters (https://www.iana.org/assignments/ipv6-parameters/).¶
Hex Value Binary Value Description                  Reference
          act chg rest
----------------------------------------------------------------
TBD1      10   0  tbd1 Fast CNP Destination Option1 [This draft]
TBD2      10   0  tbd2 Fast CNP Destination Option2 [This draft]
                            Table 1
TBD.¶