Internet-Draft | CSIG | February 2024 |
Ravi, et al. | Expires 5 August 2024 | [Page] |
This document presents Congestion Signaling (CSIG), an in-band network telemetry protocol that allows end-hosts to obtain visibility into fine-grained network signals for congestion control, traffic management, and network debuggability. CSIG provides a simple, low-overhead, and extensible packet header mechanism to obtain fixed-length summaries from bottleneck devices along a packet path. This summarized information is collected in L2 CSIG-tags in a compare-and-replace manner across network devices along the path. Receivers can reflect this information back to senders via L4+ CSIG reflection headers.¶
CSIG builds upon the successful aspects of prior work such as switch in-band network telemetry (INT) that incorporates multibit signals in live data packets. At the same time, CSIG's end-to-end mechanism for carrying the signals via fixed size header is simple, practical and deployable akin to Explicit Congestion Notification (ECN).¶
In addition to a detailed description of the end-to-end protocol, this document also motivates the use cases for CSIG and the rationale for design choices made in CSIG. It describes a set of signals of interest to applications (minimum available bandwidth, maximum link utilization, and maximum hop delay), methods to compute these signals in network devices, and how these signals can be leveraged in applications. Additionally, it describes how attributes about the bottleneck's location can be carried and made useful to applications. It also provides the framework to incorporate future signals. Finally, this document addresses incremental deployment, backward compatibility and nuances of CSIG's applicability in a range of scenarios.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 5 August 2024.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Many network control loops, including Congestion Control, Traffic Engineering and Network Operations, make decisions based on the congestion experienced by application flows. The signals used to determine congestion are often implicitly derived from end-to-end signals, approximated over larger timescales than desired, or obtained out-of-band from the network. This can lead to suboptimal performance for applications or inefficiency in network usage. CSIG (Congestion Signaling) provides direct, real-time, inband signals that network control loops can incorporate for performance and efficiency.¶
A number of congestion control algorithms (CCA) are deployed in datacenters, including Swift [SWIFT], BBR [BBR], DCTCP [RFC8257], DCQCN [DCQCN] and HPCC++ [I-D.miao-tsv-hpcc]. These CCA vary in the congestion signals they use and in how they increase/decrease flow rates in response to the signals. Swift uses precise measurements of round-trip time (RTT) to modulate its congestion window. BBR uses a combination of flow's delivery rate and RTT measurements. DCTCP and DCQCN rely on Explicit Congestion Notification (ECN [RFC3168]) from switches that indicate if the queue build up is above a threshold. HPCC++ leverages per-hop queue depth and transmit bytes along the flow's path, obtained via inband telemetry probes, to update flow rates.¶
Despite the advances in sophisticated signals on when to slow down transfers, there continue to be blind spots for CCA when it comes to increasing flow rates, e.g., What is the appropriate starting rate for a flow? How quickly should a flow ramp up in the absence of congestion? Without explicit information from the network, end-to-end CCA have come to rely on heuristics that can either undershoot or overshoot the bottleneck bandwidth, which can lead to slower Flow Completion Times (FCT), increased round-trip times, or packet losses. At the same time, applications' appetite for fast network performance is rising: AI/ML applications are pushing for fast network transfers to avoid idling expensive Tensor Processing Units (TPUs) and Graphics Processing Units (GPUs). Similarly, storage disaggregation needs fast transfers to make a remote storage device appear as a local device at the host.¶
In this document we introduce Congestion Signaling (CSIG) to explicitly notify the hosts of the bottleneck link metrics. There are several important use cases for CSIG, including:¶
Congestion Control Algorithms for making decisions on sending rate: CCA at senders can use CSIG for quickly and safely ramping up to the maximum feasible rate as determined by the bottleneck link, and react with precision to the bottleneck hop both in the presence and absence of congestion. The motivation for quick ramp-up stems from making maximal use of datacenter bandwidth, and decreasing latency even for large transfers. There are several ways in which CSIG can help complete transfers quickly, e.g., transfers belonging to an ML collective communication can ramp up quickly to maximally use all network bandwidth and complete close to the ideal transfer completion time.¶
Traffic Management systems including Traffic Engineering (TE), Load Balancing and Multipathing too benefit from CSIG. TE systems infer congested flows through an offline multi-minute process via superimposition of network traffic stats, topology and routing information. With CSIG, TE has more up to date information on the congested points and the application flows experiencing congestion. Using such finer-grained information can lead to more efficient and timely provisioning for bursty traffic. Similarly, CSIG-enabled multipathed transport flows can choose paths in real time with the most available bandwidth.¶
Troubleshooting and Performance Optimization. We also envision CSIG to assist with debugging the network-level performance of datacenter applications. Large-scale applications, including ML training workloads, open thousands of connections at the transport layer. When the network is slow for an application, it is almost impossible to identify the bottleneck hops without joining many data sources across switches and hosts. Because CSIG conveys the path bottleneck characteristics, it is valuable in pinpointing choke points in the network. Knowledge of these choke points can lead to better bandwidth provisioning, timely repair processes, and real-time control, such as better load balancing.¶
CSIG provides simple, fixed-length summaries of bottleneck links along a path, such as maximum hop delay, minimum available bandwidth, and maximum link utilization. Information is collected at L2 from network devices along a packet path. Each data receiver then returns the collected information to the data sender via L4 transport options or payloads. CSIG uses a simple compare-and-replace operation at network devices, which allows it to scale with network topology, link speeds, and packet rates.¶
CSIG builds on the successful aspects of prior explicit feedback schemes, but is more capable. CSIG carries rich multi-bit switch telemetry in live data packets, drawing from the advancements in in-band network telemetry, also generally known as INT. At the same time, CSIG retains the fixed-size headers and reflection in L4 transports akin to Explicit Congestion Notification (ECN). The industry has three key variants of INT: the one first specified in P4.org [P4-INT], the IOAM (In Situ Operations, Administration, and Maintenance) standard [RFC9378] in IETF and the Inband Flow Analyzer (IFA) spec [I-D.kumar-ippm-ifa] that is used in HPCC deployment [HPCCPLUS]. While they differ in the header definitions and encapsulation mechanisms, they all stack per-switch telemetry data for every hop on the path of a packet. The packet size grows in proportion to the number of metrics per switch and the number of forwarding devices along its path. Depending on the use case and header definition, the per-packet overhead ranges from 20B to above 100B. The large and variable header overhead makes it challenging to conform to the end-to-end MTU limit and to parse the packet header data in the forwarding or receiving devices.¶
There exist several efforts to address the challenges incurred in INT variants, including: 1) carrying INT data in synthetically generated non-data packets also known as probe packets, and 2) carrying only the fixed-size INT instructions (e.g., specifying which data to collect per hop) in data packets, while hop devices generate separate report packets that deliver the requested per-hop data. While these techniques reduced the per-data-packet overhead, they did not fundamentally reduce the total amount of bytes or PPS overhead on the network devices or the data collector. TCP-INT [TCP-INT] was developed in parallel to carry fixed-size min/max/sum aggregate metric over the hops together with a hop locator in live data packets. However, it is limited to TCP Options, hence not applicable to various modern transports for AI/HPC, and furthermore there is no flexible way to introduce a new metric. CSIG's type-value format ensures a constant size overhead with future-proofness. The guaranteed constant size is small enough to fit into the 4B or 8B tag, enabling the unique placement of CSIG in L2, which frees the operators from the concerns around tunneling and encryption in deploying CSIG.¶
In the rest of the document, we describe the design of end-to-end CSIG at hosts and network devices.¶
ABW: Available Bandwidth¶
AQM: Active Queue Management¶
CCA: Congestion Control Algorithms¶
Flow: A 5-tuple transport connection, e.g., a TCP connection¶
CSIG: Congestion Signaling¶
CSIG data fields: Fields in the CSIG tag excluding the TPID.¶
CSIG packets: Packets that contain the CSIG-tag and optionally the CSIG reflection header¶
CSIG-capable path: A path is termed CSIG-capable if all transit devices along the path support the CSIG protocol and end hosts have at least pass-through support for CSIG packets¶
CSIG-tagged packets: Packets that contain the CSIG-tag in the packet header¶
CSIG domain: Secure network deployment domain where all devices in the domain have complete CSIG support or pass-through CSIG support¶
PD: Per-hop delay¶
E2E: End-to-End¶
IPSec: Internet Protocol Security¶
MTU: Maximum Transmission Unit¶
MSS: Maximum Segment Size¶
NIC: Network Interface Card¶
Path: The port-by-port network path taken by a given packet specified as a sequence of device interfaces¶
PSP: PSP Security Protocol¶
TPID: Tag Protocol ID¶
TE: Traffic Engineering¶
Transit device: Any switch, router or middlebox in the path of a CSIG packet¶
WRR: Weighted Round Robin¶
CSIG was conceived to address problems in congestion control, traffic management and network debuggability in production networks. We describe below the design principles that shaped CSIG, with simplicity and ease of deployment being at the forefront. Section 7 discusses the rationale behind the specific design choices made in CSIG.¶
Simple Signals driven by Use Cases: Simple device port or queue metrics that solve concrete use cases are at the heart of CSIG's design principles. This simplicity is not only important to applications, but also keeps the area, power and cost of implementation low on network devices. Signals in CSIG are designed to be implementable in ASICs at line rate. Signals that track per-flow state at the switch, for example, are harder to implement and deploy, and are hence avoided in CSIG. CSIG is also flexible enough to accommodate new signals and use cases beyond those described in this document.¶
End-to-End Perspective: CSIG's design stems from an end-to-end perspective of requirements and trade-offs for both applications and the network. This document covers the necessary end-to-end aspects and the resulting design choices that make CSIG both useful to applications and practical to deploy.¶
Small and Fixed Packet Overhead: It is important that the packet size does not increase as it traverses the network, which means that the MTU does not need to be changed. Any overhead that is introduced should be fixed and small, minimizing the cost of implementation in switch / NIC pipelines. Low protocol overhead also means low bandwidth overhead for small packets, minimizing impact to packet-per-second (PPS) load and bandwidth efficiency. We make very few assumptions about which packets and devices CSIG is enabled on. Device implementations must be able to process CSIG on packets at line rate with minimal CPU involvement. Keeping the overhead small and fixed allows for CSIG to be enabled on every single packet at line rate. This is important because deployments may choose to enable CSIG on every packet rather than on a small sample of packets.¶
Works easily under Tunneling and Encryption: Tunnels are broadly used in modern deployments e.g., Traffic-engineering systems and Cloud traffic frequently use tunnels. CSIG is designed to easily support end-to-end signaling on devices even in the presence of complex tunneling deployments. This is in contrast to other in-band telemetry schemes that put more pressure on the ASICs to relocate metadata across inner and outer headers to work in the presence of tunnels. In addition, CSIG also works with encrypted packets, including PSP, IPSec and 802.1AE MAC Security.¶
Incremental Deployability: CSIG allows incremental deployment, where the mechanism can be deployed gradually into domains where some devices may support the new protocol and others may not. This document addresses interoperability in heterogeneous networks, and addresses backward compatibility with legacy devices. We envision CSIG to be broadly valuable across wired networks, although our target domain for initial usage is datacenter networks. We make minimal assumptions about the network architecture around tunneling, number of hops (diameter), routing, topology etc. Configuring CSIG for end-to-end consistency in a private network, or deployments over the Internet are not in scope for this document.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119]. In this document, these words will appear with that interpretation only when in ALL CAPS. Lower case uses of these words are not to be interpreted as carrying significance described in RFC 2119.¶
CSIG protocol defines two components in the packet header to achieve end to end congestion signaling in a production network.¶
CSIG-tag: An L2 protocol that end hosts and transit devices participate in.¶
CSIG Reflection: A flexible L4+ protocol that only end hosts participate in.¶
CSIG-tag is the core component of the CSIG specification. It enables end hosts to request network signals of interest and for transit devices to provide these signals to end hosts over the specified packet header bits.¶
However, to achieve end-to-end CSIG, CSIG-tag MAY be combined with the CSIG reflection protocol to expose the signals of interest to the relevant endpoints or consumers where the signals are needed.¶
This section first describes the header formats for CSIG-tag and CSIG reflection. Then it describes the life of a CSIG packet, outlining the different roles of network devices in the context of CSIG, and how these two packet header mechanisms work together to achieve end-to-end signaling.¶
CSIG tag is a fixed size tag at the layer 2 header.¶
CSIG-tag placement in various packet encapsulations is shown below for completeness. It is always the last tag in the layer 2 header.¶
ARPA: dstmac / srcmac / csig-tag / ethertype / payload¶
802.1q: dstmac / srcmac / vlan-tag / csig-tag / ethertype / payload¶
802.1ad: dstmac / srcmac / vlan-tag / vlan-tag / csig-tag / ethertype / payload¶
802.1ad tunnel: dstmac / srcmac / vlan-tag / vlan-tag / vlan-tag / vlan-tag / csig-tag / ethertype / payload¶
802.1ae: dstmac / srcmac / security-tag / vlan-tag / csig-tag / ethertype / payload¶
Consequently, the placement / offset of the CSIG tag is not affected by the headers and payload at layers 3 and above. Layer 2.5 headers, such as MPLS, are also placed after the CSIG tag and do not impact its offset.¶
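The placement rule above can be illustrated with a minimal parsing sketch. It assumes placeholder values for the two CSIG TPIDs (this section does not give the codepoints, so `CSIG_TPID_COMPACT` and `CSIG_TPID_EXPANDED` are hypothetical), uses the standard 802.1Q and 802.1ad TPIDs for VLAN tags, and does not handle the 802.1ae security tag.

```python
import struct

# Standard VLAN TPIDs (IEEE 802.1Q C-tag and 802.1ad S-tag).
VLAN_TPIDS = {0x8100, 0x88A8}

# Placeholder CSIG TPIDs: NOT real codepoints, chosen only for this sketch.
CSIG_TPID_COMPACT = 0xFF01
CSIG_TPID_EXPANDED = 0xFF02
# Total tag length in bytes: 2B TPID + 2B (compact) or 6B (expanded) data.
CSIG_TAG_LEN = {CSIG_TPID_COMPACT: 4, CSIG_TPID_EXPANDED: 8}


def find_csig_tag(frame: bytes):
    """Return (offset, length) of the CSIG-tag, or None if the frame has none."""
    offset = 12  # skip dstmac (6B) + srcmac (6B)
    while offset + 2 <= len(frame):
        (tpid,) = struct.unpack_from("!H", frame, offset)
        if tpid in VLAN_TPIDS:
            offset += 4  # 2B TPID + 2B tag control information
        elif tpid in CSIG_TAG_LEN:
            return offset, CSIG_TAG_LEN[tpid]
        else:
            return None  # reached the EtherType: no CSIG-tag present
    return None
```

Because the walk stops at the first non-tag EtherType, nothing at layer 2.5 or above is inspected, which is the property the paragraph above relies on.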
CSIG-tag is defined in two variants - Compact and Expanded. Each variant has a dedicated TPID codepoint to allow devices to infer which variant is in use. Each variant supports a distinct set of requirements with respect to production deployment and identifies contrasting trade-off points in the solution space. Deployment considerations are discussed in Section 6.¶
Structurally, the compact CSIG-tag variant resembles a single VLAN tag and the expanded CSIG-tag variant resembles a double VLAN tag. This structural similarity is intentional and the reasons are elaborated in Section 6.4.¶
CSIG-tag compact format is as shown, with 2B allocated for the CSIG Tag Protocol ID (TPID) and 2B allocated for the data fields.¶
CSIG-tag expanded format is as shown, with 2B allocated for the Tag Protocol ID (TPID) and 6B allocated for the data fields¶
This section describes the format and usage of data fields within the CSIG-tag¶
The Signal Type field T is three (four) bits long in the compact (expanded) format and indicates the type of signal being carried in the CSIG-tag. End hosts set the signal type T and request it on each packet of interest. Up to 8 signal types are supported in the compact format, and up to 16 signal types are supported in the expanded format. This draft concretely defines three signals: min(ABW), min(ABW/C) and max(PD), elaborated in Section 5 and Section 8. The remaining codepoints are reserved for future signals, and may be defined and used in future versions of CSIG.¶
A single packet can carry at most one Congestion Signal. However, end hosts MAY obtain multiple signals for a single 5-tuple flow by requesting different signal types on alternating packets of a flow or in a round-robin fashion across packets. Therefore, end hosts need not tie a single flow to a specific signal type, and MAY obtain all supported CSIG signals for a single flow.¶
The Signal Value field S is 5 bits (20 bits) long in the compact (expanded) format and captures the value of the signal specified by Signal Type T. End hosts set the initial Signal Value S alongside the requested Signal Type T, and each transit device along the packet path in the network MAY modify S in accordance with the e2e signal being computed. E.g., For signals that are min() aggregations, end hosts set the initial value of S to the maximum allowable value of the signal or its encoding thereof, and transit devices perform compare-and-replace to compute the min() across signals of individual devices on the packet path.¶
In the compact format, the 5-bit Signal Value is bucketed with 32 fully configurable buckets. Each bucket is configured with (low, high) value range. This configuration is specific to each Signal Type and MAY vary across Signal Types. This allows the Signal Value representation to be tailored to the specific needs of each Signal Type. For example, in typical use cases of available bandwidth, it is more useful to have higher granularity at lower values of the signal (i.e., when ABW is close to 0) than at higher values of the signal. This is because lower values of ABW have greater impact on application control decisions e.g., knowing whether there was 0 Gbps vs 1 Gbps available on a path makes a larger difference than knowing if there was 399 Gbps vs 400 Gbps available. Appendix A shows how the buckets could be defined in order to provide such a non-linear encoding of value-ranges to buckets. Such configurable encodings allow capturing useful information about the signal with fewer bits and is a core feature of the compact CSIG format.¶
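As a rough illustration of non-linear bucketing, the sketch below maps an available-bandwidth value to one of 32 buckets. The bucket boundaries in `BUCKET_LOWS_MBPS` are invented for this example (they are not the Appendix A configuration); they are simply denser near 0 and coarser near 400 Gbps.

```python
from bisect import bisect_right

# Hypothetical lower edges (in Mbps) of the 32 buckets for min(ABW).
# Bucket i covers [BUCKET_LOWS_MBPS[i], BUCKET_LOWS_MBPS[i + 1]).
BUCKET_LOWS_MBPS = [
    0, 100, 200, 400, 600, 800, 1000, 1500,
    2000, 3000, 4000, 5000, 6000, 8000, 10000, 15000,
    20000, 25000, 30000, 40000, 50000, 60000, 80000, 100000,
    125000, 150000, 175000, 200000, 250000, 300000, 350000, 400000,
]


def encode_bucket(abw_mbps):
    """Return the 5-bit bucket index whose range contains abw_mbps."""
    index = bisect_right(BUCKET_LOWS_MBPS, abw_mbps) - 1
    return max(0, min(index, len(BUCKET_LOWS_MBPS) - 1))


def bucket_range(index):
    """Return (low, high) in Mbps; high is None for the open-ended last bucket."""
    low = BUCKET_LOWS_MBPS[index]
    is_last = index + 1 == len(BUCKET_LOWS_MBPS)
    return low, (None if is_last else BUCKET_LOWS_MBPS[index + 1])
```

With these assumed edges, 0 Gbps and 1 Gbps land in buckets 0 and 6, while 360 Gbps and 399 Gbps share bucket 30. Since the mapping is monotonic, a min() over bucket indices selects the same hop as a min() over the underlying values, up to bucket resolution.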
In the expanded format, Signal Value is uniformly quantized into a 20-bit value. The unit of quantization is configurable on a per Signal Type basis, depending on the minimum and maximum value that needs to be represented with the given bits. The higher bit length allows for enhanced signal granularity and fewer configuration knobs in domains where the expanded CSIG format is viable to deploy (Section 6.5). 20 bits are sufficient to represent a wide range of values with high granularity. As an example, with an 8 Mbps quantum for min(ABW), the signal value field can represent up to a maximum of about 8 Tbps. With a 128 ns quantum for max(PD), the signal value field can represent up to a maximum of about 128 ms. More discussion on signal-specific quanta is in Appendix A.¶
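The arithmetic behind these ranges can be checked with a short sketch. The quanta come from the text; rounding down and saturating at the largest code are assumptions of this example, not behavior stated in the document.

```python
SIGNAL_VALUE_BITS = 20
MAX_CODE = (1 << SIGNAL_VALUE_BITS) - 1  # largest 20-bit code: 1,048,575

ABW_QUANTUM_BPS = 8_000_000  # 8 Mbps quantum for min(ABW), from the text
PD_QUANTUM_NS = 128          # 128 ns quantum for max(PD), from the text


def quantize(value, quantum):
    """Map a raw value to a 20-bit code (assumed: floor, then saturate)."""
    return min(int(value // quantum), MAX_CODE)


def dequantize(code, quantum):
    """Return the lower edge of the range represented by code."""
    return code * quantum


max_abw_bps = dequantize(MAX_CODE, ABW_QUANTUM_BPS)  # 8,388,600,000,000 bps
max_pd_ns = dequantize(MAX_CODE, PD_QUANTUM_NS)      # 134,217,600 ns
```

The exact maxima are roughly 8.39 Tbps and 134.2 ms; the rounder figures of 8 Tbps and 128 ms follow from treating 2^20 as approximately 10^6.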
Signal quantization / bucketing parameters are configured directly at the transit devices where the signal is computed. End hosts do not explicitly request or negotiate these parameters. As described in Section 5, all devices MUST be configured with the same quantization / bucketing parameters for each signal type, in order to correctly compute the requested signal along packet paths.¶
The Locator Metadata field LM is an optional 7 bits (16 bits) in the compact (expanded) format. It captures relevant metadata about the bottleneck port or device, where the notion of bottleneck is specific to individual signal types. Locator Metadata MAY include compressed attributes about the bottleneck that are relevant for the use case, e.g., capacity of the bottleneck port, stage of the bottleneck device in the data center topology, or orientation of the bottleneck port (uplink / downlink). LM MAY also include expanded attributes of the bottleneck (e.g., port ID, TTL). This document provides recommendations for the type of information that locator metadata MAY carry, but it does not require any specific set of metadata to be supported. Metadata that is useful and viable to support will depend on the production setting, which is out of scope for this document. Instances of CSIG deployment MAY include locator metadata with custom-defined metadata beyond those described in this document. Section 5.5 discusses requirements for supporting LM in devices.¶
End hosts initialize LM to a default value. Transit devices that do not update the Signal Value S on a given packet MUST NOT alter LM on the packet. Transit devices that update S on a packet MUST update LM on the same packet.¶
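A minimal sketch of the compare-and-replace step at a transit device follows, including the rule that LM changes if and only if S changes. The numeric type codepoints, the `CsigTag` class, and the choice to leave unknown signal types untouched are assumptions made for illustration; the values are already-encoded Signal Values.

```python
from dataclasses import dataclass


@dataclass
class CsigTag:
    signal_type: int   # T
    signal_value: int  # S, already encoded (bucket index or quantized code)
    locator: int       # LM


# Hypothetical codepoints; this section does not assign numeric values.
MIN_ABW, MIN_ABW_OVER_C, MAX_PD = 0, 1, 2
AGGREGATION = {MIN_ABW: min, MIN_ABW_OVER_C: min, MAX_PD: max}


def compare_and_replace(tag, local_value, local_locator):
    """Fold this device's value into the tag; touch LM only when S changes."""
    aggregate = AGGREGATION.get(tag.signal_type)
    if aggregate is None:
        return tag  # assumed behavior: pass unknown signal types through
    new_value = aggregate(tag.signal_value, local_value)
    if new_value != tag.signal_value:
        tag.signal_value = new_value
        tag.locator = local_locator
    return tag
```

For a min() signal the sender would start S at the largest encodable value (31 in the compact format). On a tie the tag is left untouched, so LM keeps pointing at the first device that reported the bottleneck value.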
CSIG reflection enables consumption of tag data fields at the point where the signals are needed for telemetry or control. This mechanism is particularly relevant for sender-driven / source-based telemetry and control. For receiver-driven transports and controllers, CSIG reflection may not be necessary as the signals on the CSIG tag are available at the receiver without reflection (See Section 4.3).¶
This document provides recommendations on how CSIG reflection SHOULD be implemented, and provides the framework to make the implementation deployment-specific.¶
CSIG reflection header is a separate header from the CSIG tag, implemented at layer 4 or above. The location of the header and the choice of which packets carry the header are transport-specific. As an example, the header can be carried on TCP ACK packets from the receiver back to the sender. Note that the presence of ACK coalescing, piggybacked ACKs, Selective Acknowledgements (SACK) etc. can impact the behavior of CSIG reflection. More generally, there may not be a 1:1 mapping between forward and reverse path packets. In a scenario where the transport implements ACK coalescing, the CSIG reflection header SHOULD reflect the latest CSIG-tag data fields received across the packets being acknowledged or a more advanced summary of the CSIG-tag data fields across the packets being acknowledged. It is important to note that since Signal Type is chosen on a per-packet granularity, a coalesced ACK may acknowledge multiple packets that carry different signal types in their CSIG-tags. In such a scenario, the reflection header MAY only reflect one of the signals. The sender transport should choose Signal Type for packets in a way that ensures that it can continue to receive all signals of interest.¶
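The simplest of the policies described above, reflecting only the latest data fields seen since the previous ACK, could be sketched as follows. The class and method names are hypothetical, and a real transport would hook this into its own ACK generation path.

```python
class CoalescingReflector:
    """Receiver-side state that reflects the most recent CSIG data fields."""

    def __init__(self):
        self._latest = None  # (signal_type, signal_value, locator) or None

    def on_data_packet(self, signal_type, signal_value, locator):
        # Overwrite: only the newest CSIG-tag among the coalesced packets is kept.
        self._latest = (signal_type, signal_value, locator)

    def build_ack_reflection(self):
        # Hand the stored fields to the outgoing ACK and reset the state.
        # None means there is nothing new to reflect on this ACK.
        fields, self._latest = self._latest, None
        return fields
```

Under this policy a coalesced ACK covering packets with different signal types carries only the last one, which is why the sender needs to vary its per-packet Signal Type choice so that no signal is systematically dropped.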
CSIG reflection header MAY include all of the CSIG data fields i.e., 2B for the compact version and 6B for the expanded version. However, one could optimize header space and include only a subset of the data fields if the consumer is interested only in a subset of signals or locator metadata.¶
CSIG reflection is an end-host-only protocol and transit devices do not participate in it. Therefore, CSIG reflection header can be incorporated in portions of the packet that are e2e encrypted via PSP or IPSec.¶
The following subsections discuss locations in the packet header where CSIG reflection could be implemented for different transports¶
Reflection in TCP is typically achieved via TCP options. CSIG Reflection can be implemented via a new TCP Option, identified by a unique Kind.¶
Several transports such as QUIC [RFC9000] and PonyExpress [PONYEXPRESS] are built atop UDP. Reflection in UDP can be achieved by including CSIG data fields in the UDP payload from receiver to sender. For unidirectional UDP traffic, an out-of-band reverse connection from the receiver to the sender may be necessary for CSIG reflection.¶
As an example, PonyExpress [PONYEXPRESS] is a custom transport implemented within a userspace host networking stack. It supports a flexible L4 wire protocol that periodically changes as new features are added (Sec 3.1 in Snap). CSIG reflection can be implemented as additional bytes within this wire format.¶
For simplicity and to avoid the need for negotiation, the CSIG reflection header can be carried on all packets independent of whether CSIG is enabled on them. The Valid bit in the Flags field can be set to 1 for packets that carry valid data fields in the reflection header. In certain deployments, negotiation is unavoidable for a variety of reasons. Section 6.3.3 provides details regarding options for negotiation.¶
This section describes the end-to-end operation of CSIG by walking through the life of a packet. It assumes that all nodes in the path are CSIG-capable and omits the negotiation phase. Details of negotiation are covered in Section 6.3.3.¶
The sender end-host first constructs a CSIG-tagged packet for a flow of interest and sends out the packet with the tag data fields initialized. The transport determines these initial values for the packet, including Signal Type to request and default values for the other data fields. Each transit device performs a compare-and-replace on the CSIG-tag to optionally update the Signal Value and Locator Metadata fields on the tag. As the packet traverses through the network, the CSIG-tag data fields accumulate the desired aggregation of the requested signal.¶
When the CSIG-tagged packet reaches the receiver end-host, the data fields in the CSIG tag are extracted and delivered to the transport layer at the receiver. The transport stores the data fields of the packet to be reflected, or a summary of these fields across packets. It reflects these data fields in the layer-4 CSIG reflection header on packets traversing the reverse path from receiver to sender. The CSIG reflection header is unmodified as the packet travels from receiver to sender. The sender extracts the CSIG data fields from the CSIG reflection header of the incoming packet, and hands it to the transport layer for use in applications at the sender. As a result, the sender transport learns the desired signal for a flow within approximately one round-trip time.¶
The transport layer has a significant role to play in making CSIG usable. Although the CSIG data fields are carried on packets, the measurements are ultimately relevant at the flow / connection level for specific paths. If the sender transport desires to obtain multiple signals for the same flow, it MAY choose Signal Type on a per-packet basis (e.g., in a round robin fashion across the flow's packets), and internally keep track of all of the requested signals as part of the flow's state variables. This approach allows the sender transport to use all supported CSIG signals for use cases such as congestion control, load balancing and multipathing.¶
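One way a sender transport might rotate Signal Types across a flow's packets and keep the results as flow state is sketched below. The class, its method names, and the string labels used for signal types are illustrative assumptions, not part of the protocol.

```python
from itertools import cycle


class FlowCsigState:
    """Per-flow sender state: which signal to request next, and the latest results."""

    def __init__(self, signal_types):
        self._rotation = cycle(signal_types)
        self.latest = {}  # signal type -> (signal_value, locator)

    def next_signal_type(self):
        # Called once per outgoing CSIG-tagged packet (round robin).
        return next(self._rotation)

    def on_reflection(self, signal_type, signal_value, locator):
        # Called when a CSIG reflection header arrives from the receiver.
        self.latest[signal_type] = (signal_value, locator)
```

For example, a flow created with `FlowCsigState(["min_abw", "min_abw_over_c", "max_pd"])` requests each signal on every third packet, and `latest` converges to one recent value per signal after a few round trips.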
CSIG has three participating entities, each with their own roles and responsibilities for achieving end-to-end congestion signaling.¶
The sender host is responsible for¶
(i) Constructing CSIG-tagged packets for flows of interest and initializing the CSIG-tag data fields on each packet as specified by the transport, and¶
(ii) Parsing the CSIG reflection header received in incoming packets and extracting CSIG data fields for use in the sender transport / applications.¶
Only the sender is allowed to insert CSIG-tags into packets.¶
Transit devices are responsible for¶
(i) Computing and tracking Congestion signals such as ABW and ABW/C of each port and hop delay per packet¶
(ii) Parsing the CSIG-tag based on the TPID code point on incoming packets to identify the Signal type being requested, and¶
(iii) Performing compare-and-replace on the Signal value and locator metadata fields on the CSIG-tag based on the aggregation corresponding to the requested signal type (min / max)¶
Transit devices MUST NOT add CSIG tags to incoming packets that are not already CSIG-tagged. Transit devices MAY delete the CSIG tag before forwarding the packet. This functionality can be exercised when downstream devices are not CSIG-capable. Further discussion on this topic is in Section 6 on Incremental Deployment of CSIG.¶
The receiver host is responsible for¶
(i) Extracting the CSIG-tag on incoming packets and exposing the data fields to the transport layer and/or receiver-driven applications¶
(ii) Inserting and populating the CSIG Reflection header at the transport layer for packets traversing the reverse path to the sender.¶
Note that for bi-directional flows, the Sender and Receiver are specific to each direction within the flow. For a bi-directional flow between hosts A and B,¶
(i) A plays the Sender host role and B plays the Receiver host role for data packets traveling from A to B, and similarly¶
(ii) B plays the Sender host role and A plays the Receiver host role for data packets traveling from B to A.¶
In this scenario, packets traversing from A to B contain both a CSIG-tag that captures the congestion signals on the forward A-->B path, and a CSIG reflection header that captures the CSIG data fields of the reverse B-->A path. Equivalently, packets traversing from B to A contain both a CSIG-tag that captures the congestion signals on the forward B-->A path, and a CSIG reflection header that captures the CSIG data fields of the reverse A-->B path.¶
As described in the previous section, Signal Type indicates the type of congestion signal that CSIG-tag carries on each packet. Up to 8 signal types are supported by the compact format and up to 16 signal types are supported by the expanded format.¶
In this section, we concretely define three signals driven by use cases described in Section 8. While Section 8 covers how these three signals are useful to applications, this section focuses on precise definitions of these signals and how they may be implemented on transit devices.¶
Note for future extensions: Signals in CSIG are intended to be aggregation functions of individual per-hop or per-port signals across the path of a packet. The typical definition of such signals with max / min aggregations captures the notion of a path bottleneck for different definitions of bottleneck. However, structurally, the format supports arbitrary read-modify-write operations, including aggregations such as max, min, count and sum, allowing future use cases to leverage this structure for new signals.¶
min(ABW) captures the minimum absolute available bandwidth (in bps) across all the ports in the packet path. Available bandwidth is defined per egress port on each device.¶
ABW can be computed using one of many algorithm variants, each having implications on HW or SW implementation complexity, timescales of computation and accuracy of the signal. In its rudimentary form, the raw ABW for a given egress port p over a time interval delta_t can be computed as follows:¶
// delta_txbit is the number of bits that exited on the wire
utilization_bps[p] = (delta_txbit[p]) / delta_t;
// capacity_bps[p] captures the link speed of port p
abw_bps[p] = capacity_bps[p] - utilization_bps[p];¶
Implementation of these computations relies on at least one of the following capabilities in the devices:¶
Timer-based computations: Most networking ASICs maintain hardware counters that track the number of bits that exit on each egress port. To compute available bandwidth, a periodic-timer thread in SW or HW triggers the computation and update of available bandwidth every delta_t time interval, where delta_t is a configurable parameter.¶
Per-packet computations: In this alternative, available bandwidth is computed and updated on every packet that is processed via the egress pipeline, typically in HW, e.g., via Exponential Weighted Moving Average (EWMA) estimation where the weights are configurable. delta_t is not an explicit parameter in this approach; it is implicitly determined by the EWMA weights.¶
Variants such as Discounted Rate Estimator (DRE) [CONGA] use a combination of per-packet updates and timer-based approaches.¶
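The two families of computation above can be contrasted with a minimal sketch. It is illustrative only: the class name `EwmaAbw`, the default weight, and the use of inter-packet gaps to derive an instantaneous rate are assumptions made for this example, not part of the CSIG specification.

```python
from dataclasses import dataclass
from typing import Optional


def timer_based_abw(delta_txbit: int, delta_t: float, capacity_bps: float) -> float:
    """Raw ABW for one port over one timer interval (formula from the text)."""
    utilization_bps = delta_txbit / delta_t
    # Clamp at zero so measurement jitter never yields a negative ABW.
    return max(0.0, capacity_bps - utilization_bps)


@dataclass
class EwmaAbw:
    """Per-packet estimator: smooths the observed rate with an EWMA.

    The weight is a hypothetical tuning knob; a smaller weight behaves
    like a longer implicit delta_t.
    """
    capacity_bps: float
    weight: float = 0.125
    rate_bps: float = 0.0
    last_ts: Optional[float] = None

    def on_packet(self, ts: float, pkt_bits: int) -> float:
        """Update the rate estimate for a packet seen at time ts (seconds)."""
        if self.last_ts is not None and ts > self.last_ts:
            inst_rate = pkt_bits / (ts - self.last_ts)
            self.rate_bps += self.weight * (inst_rate - self.rate_bps)
        self.last_ts = ts
        return max(0.0, self.capacity_bps - self.rate_bps)
```

A hardware implementation would use fixed-point arithmetic and would not keep floating-point timestamps; the sketch only shows where delta_t is explicit (timer-based) and where it is implied by the weight (per-packet).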
ABW/C captures the fraction or percentage of available bandwidth on a given link relative to the link's capacity. min(ABW/C) captures the link utilization bottleneck along the path of the packet. This signal is most relevant in paths with heterogeneous link speeds, where it distinguishes itself from min(ABW). min(ABW/C) is equivalent to max(U/C), where¶
U = utilization of a given egress port in bps
C = capacity of a given egress port in bps
ABW = available bandwidth of a given egress port in bps¶
Therefore, since U = C - ABW on each port, max(U/C) = max(1 - ABW/C) = 1 - min(ABW/C).¶
ABW/C can be computed from ABW as follows:¶
// Represents fraction of available bandwidth on port p
// relative to the port's capacity.
abwc_frac[p] = abw_bps[p] / capacity_bps[p];¶
Algorithms for ABW computation described in Section 5.1.1 also apply to ABW/C computation, except that the resulting value is normalized by the port capacity. Quantization / bucketing is performed after normalization.¶
On paths with heterogeneous link speeds, min(ABW) and min(ABW/C) bottlenecks are not necessarily the same ports. Figure 2 shows an example where these two bottlenecks are different. Each type of bottleneck has its own value, as demonstrated in Section 8.¶
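A small worked example may help show why the two bottlenecks can differ. The two-port path and its load figures below are hypothetical values chosen for illustration; they are not taken from Figure 2.

```python
def path_bottlenecks(ports):
    """Return (name of min(ABW) port, name of min(ABW/C) port).

    ports: list of (name, capacity_bps, utilization_bps) along the path.
    """
    by_abw = min(ports, key=lambda p: p[1] - p[2])
    by_abwc = min(ports, key=lambda p: (p[1] - p[2]) / p[1])
    return by_abw[0], by_abwc[0]


def max_utilization(ports):
    """max(U/C) across the path, computed directly from utilization."""
    return max(p[2] / p[1] for p in ports)


# Hypothetical path: a lightly loaded 100G link and a busy 400G link.
path = [
    ("port-a", 100e9, 20e9),    # ABW = 80G,  ABW/C = 0.80
    ("port-b", 400e9, 300e9),   # ABW = 100G, ABW/C = 0.25
]
```

Here `path_bottlenecks(path)` identifies port-a as the min(ABW) bottleneck and port-b as the min(ABW/C) bottleneck, and `max_utilization(path)` equals 1 - min(ABW/C), matching the identity above.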
max(PD) captures the maximum per-hop delay experienced by a packet among all the hops in the packet path. Per-hop delay PD is the time spent by the packet in the device pipeline. It MAY include link layer delays or it MAY only include the delays observed in the forwarding pipeline.¶
Unlike ABW and ABW/C, which are per-port signals, PD is a per-packet signal. It consists of PHY, MAC and switch pipeline delay experienced by the packet. Pipeline delay is the most relevant component as it captures congestion-related queueing delay. Device implementations MAY track ingress and egress timestamps explicitly for each packet and perform a diff in the final stages of the pipeline. Precise definitions of these stages depend on the architecture of the device. For example, some devices could leverage existing timestamping support from tail timestamping capabilities for this purpose.¶
To support max(PD) in CSIG, the device SHOULD support per-packet tracking of delay experienced through the device.¶
It is desirable to have minimal gaps in the components of packet delays captured by the device. However, CSIG does NOT set strict requirements on the accuracy of PD to be supported by the implementation.¶
The computed delay values MUST be compressed to fit in the available Signal value bits on the CSIG-tag. The device MUST support 32 fully configurable delay buckets for compact CSIG, and configurable quanta for uniform quantization in expanded CSIG. All devices along the packet path MUST be configured with the same buckets / quanta to correctly compute max(PD) along the path.¶
Each transit device performs a compare-and-replace, i.e., updates the signal value on the CSIG tag if the incoming delay signal value on the packet is lower than the device's locally computed delay for the packet, post bucketization / quantization. E.g.,¶
// Update the signal value on packet if current hop is the bottleneck
pkt->csig_tag->pd = max(pkt->csig_tag->pd, device->pkt->pd)¶
Delay experienced by the packet on a device, as defined, is implicitly a QoS-specific signal. This is because the packet is subject to QoS policies as it traverses through the device pipeline, including prioritization, scheduling and buffering. For example, a high priority packet may see smaller delays than low priority packets. Therefore, the delay measured for the packet SHOULD include components in the pipeline where QoS policies are applied.¶
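The bucketization and compare-and-replace steps described above can be sketched as follows. The bucket edges are hypothetical (finer steps at low delay, coarser at high delay); a real deployment would configure its own edges, identically on every device along the path.

```python
from bisect import bisect_right

# Hypothetical bucket edges in microseconds. 31 edges define 32 buckets
# (indices 0..31), matching the 32 configurable buckets of compact CSIG.
BUCKET_EDGES_US = list(range(1, 17)) + [20 + 10 * i for i in range(15)]


def bucketize(delay_us: float) -> int:
    """Map a measured per-hop delay to its bucket index."""
    return bisect_right(BUCKET_EDGES_US, delay_us)


def update_tag(tag_pd: int, local_delay_us: float) -> int:
    """Compare-and-replace: keep the larger of the incoming and local value."""
    return max(tag_pd, bucketize(local_delay_us))


# A packet crossing three hops with local delays of 3.2, 45 and 7 us.
tag = 0
for hop_delay_us in (3.2, 45.0, 7.0):
    tag = update_tag(tag, hop_delay_us)
```

Because bucket indices are monotonic in delay, comparing indices is equivalent to comparing quantized delays, which is why all devices need the same edges for max(PD) to be meaningful.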
Locator metadata (LM) captures information about the bottleneck device or port, as described in Section 4.1.3.3. In this section, we discuss requirements for supporting LM in CSIG, and provide recommendations for commonly useful attributes to carry in LM.¶
A single deployment MAY choose a subset of the attributes in Section 5.5.2 and/or newly defined attributes beyond those listed in Section 5.5.2 to include in LM. However, the total size of the individual attributes MUST be within 7 bits for Compact CSIG and within 16 bits for Expanded CSIG.¶
CSIG does not set strict requirements on the LM internal format i.e., how the individual attributes are organized among the available LM bits. However, this LM internal format MUST be consistent across devices in the deployment domain so that the end hosts can consistently interpret these bits. The LM internal format MAY be specific to each signal type.¶
Devices SHOULD support configuring per-port values for LM to be written on the CSIG-tag. Devices MAY provide more granular configurability of LM based on Signal type as well. CSIG packets egressing on a given port that have their Signal Value updated by the device MUST be updated with the LM corresponding to the port and Signal Type.¶
Attributes can be designed to capture the level of resolution desired by use cases for pinpointing the bottleneck. Attributes may be encoded to fit within the limited number of LM bits available in CSIG.¶
We separate the list of attributes into compact attributes and expanded attributes. Compact attributes are motivated by the limited number of LM bits available in Compact CSIG, and therefore capture only the essential information about the bottleneck that is necessary for the use cases i.e., to inform control decisions or telemetry. Expanded attributes provide higher resolution information about the bottleneck, and can aid in directly pinpointing bottleneck devices or ports. Expanded attributes typically require more bits and are hence more suited for Expanded CSIG.¶
Examples of attributes are listed below.¶
Link capacity: Encodes the capacity of the bottleneck link. In typical deployments, the set of deployed link speeds is small and can be encoded using <= 5 bits.¶
Stage of the bottleneck: Encodes the stage of the topology where the bottleneck device / port is located. For example, in a 5-stage Clos topology, the stage of the device can be encoded with 3 bits.¶
Link orientation: Encodes the direction of a port in the context of the network topology. For example, with three categories - uplinks, downlinks and side-links - link orientation can be encoded using 2 bits.¶
Port ID: Encodes a unique identifier for each port within a deployment domain.¶
Device ID: Encodes a unique identifier for each device within a deployment domain.¶
TTL (Time-to-live): Captures the TTL value of the packet at the bottleneck device, represented using 8 bits. End hosts can use this attribute to infer the hop number at which the packet was bottlenecked.¶
LM attributes and encoding schemes are ultimately deployment specific and use-case specific. CSIG supports a flexible specification of LM to accommodate a variety of requirements and future applications.¶
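As one illustration of a deployment-specific LM internal format, the sketch below packs three attributes into the 7 LM bits of Compact CSIG. The layout, the field names, and the 2-bit `speed_class` (which assumes only four distinct link speeds) are assumptions made for this example; CSIG itself does not prescribe them.

```python
# Hypothetical compact layout, most significant field first:
# [stage:3][orientation:2][speed_class:2] = 7 bits.
LM_LAYOUT = (("stage", 3), ("orientation", 2), ("speed_class", 2))
LM_BITS = 7


def pack_lm(values: dict, layout=LM_LAYOUT, total_bits=LM_BITS) -> int:
    """Pack attribute values into a single LM integer."""
    if sum(width for _, width in layout) > total_bits:
        raise ValueError("layout exceeds available LM bits")
    lm = 0
    for name, width in layout:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        lm = (lm << width) | value
    return lm


def unpack_lm(lm: int, layout=LM_LAYOUT) -> dict:
    """Recover the attribute values; end hosts must use the same layout."""
    out = {}
    for name, width in reversed(layout):
        out[name] = lm & ((1 << width) - 1)
        lm >>= width
    return out
```

Keeping the layout in one shared table reflects the requirement that the LM internal format be consistent across devices and end hosts in the deployment domain.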
Most production networks are heterogeneous, with a mix of network devices across generations. This document addresses the brownfield deployment of CSIG in a heterogeneous network, where there may be a mix of devices that offer varying degrees of support for CSIG packet construction and processing.¶
Before describing incremental deployment, we introduce the idea of CSIG stripping, an action primitive which is foundational to deploying CSIG in a heterogeneous environment.¶
Devices that support CSIG MUST be capable of removing the CSIG tag before forwarding the packet. Devices MUST allow configuring CSIG-stripping on a per egress-port basis. If a port is configured to strip CSIG, then all CSIG-tagged packets that egress on this port must have the tag removed before being forwarded.¶
In the following sections, we describe how this capability can enable incremental deployment.¶
We first classify devices into three simplified categories based on their level of CSIG support. In the subsequent sections we describe how CSIG can interoperate with each category of device. Note that the level of support is a function of the tag placement and whether the compact or expanded CSIG tag format is used as shown in Section 4.1.¶
Devices in this category are not capable of recognizing or parsing CSIG tagged packets. If such packets are received, they will simply be dropped.¶
Devices in this category are able to recognize and parse CSIG tagged packets, and transparently forward the packet with the tag intact or with the tag stripped to neighboring devices (in the case of transit devices) or to the end host transport layer (in the case of end hosts). However, they do not support updating the CSIG data fields on the tag.¶
Some devices that do not natively support CSIG may be configured to support pass-through mode for CSIG if they support VLAN tags with configurable TPIDs. This is discussed in more detail in Section 6.4.¶
Devices in this category support the complete CSIG protocol, including recognition, parsing, forwarding, tag-stripping, signal computation, and signal updates on the tag. However, only a subset of signal types may be supported.¶
It is noteworthy that in some devices that do not natively support CSIG, resources available for VLAN tag processing can be repurposed to support CSIG for certain signal types using a combination of software and hardware capabilities. We refer to this level of support as software-assisted support. This capability is discussed in more detail in Section 6.4.¶
Devices that natively support CSIG are explicitly equipped with the hardware capabilities required to implement the CSIG protocol.¶
A CSIG domain is a deployment domain where all network devices have complete support or pass-through support for CSIG.¶
In this section, we first define the requirements for CSIG Interoperability in brownfield deployments. Then, we consider devices with all levels of support described in Section 6.2 and describe how these devices MAY be configured to achieve interoperability. Note that the following descriptions apply separately to both Compact and Expanded CSIG-tags.¶
Device category | Interop support |
---|---|
Discard | Upstream devices must strip CSIG tags before packets reach this device |
Pass-through support only | Device may strip the tag or transparently forward the packet with the tag unmodified, depending on end-to-end signal accuracy requirements |
Native CSIG support | Device updates CSIG-tag as per protocol |
SW-assisted CSIG support | Device updates CSIG-tag using VLAN match/action with approximate signals computed in a software agent |
Forwarding: The fundamental requirement is that no CSIG-tagged packet should be dropped in the network due to a lack of CSIG support on a device. This requirement means packets with CSIG-tags MUST never reach devices in the Discard category, or MUST have their CSIG-tag stripped before reaching such devices.¶
Negotiation: End hosts / flows SHOULD ensure that the path (including end hosts and transit devices) is CSIG-capable before enabling CSIG-tagging on packets. Devices in the Discard category should not require any changes in order to achieve negotiation. This requirement is to ensure correctness of data fields in end-to-end CSIG operation, and to interoperate with legacy devices or software stacks.¶
To achieve the forwarding interoperability requirements for CSIG, CSIG stripping may be exercised in the following cases:¶
When a neighboring device connected to a given egress port is a Discard device and cannot parse CSIG packets, this egress port MUST be configured to strip the tag on outgoing packets to ensure that the packet does not get dropped downstream.¶
When a device supports Pass-through only or does not support the requested signal type on a CSIG packet, egress ports on this device MAY be configured to strip the tag on outgoing packets to ensure that CSIG does not carry inaccurate information. In some use cases where it is acceptable for CSIG to miss capturing signals on certain hops, pass-through devices MAY transparently forward the packet with the CSIG tag intact.¶
At the boundary of a CSIG domain, device ports that are connected to devices outside of the CSIG domain MUST strip the tag to ensure that packets exiting the domain do not contain CSIG-tags. Only egress ports connected to devices within the CSIG domain SHOULD retain CSIG-tags on outgoing packets.¶
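The three stripping rules above can be summarized as a per-egress-port decision. This is a sketch under stated assumptions: the `Support` enum, the function names, and the `require_accuracy` knob are hypothetical and only model how an operator might derive the per-port strip configuration.

```python
from enum import Enum


class Support(Enum):
    DISCARD = "discard"
    PASS_THROUGH = "pass-through"
    COMPLETE = "complete"


def egress_must_strip(neighbor: Support, neighbor_in_domain: bool) -> bool:
    """Mandatory cases: Discard neighbor, or neighbor outside the CSIG domain."""
    return neighbor is Support.DISCARD or not neighbor_in_domain


def egress_strips(local: Support, supports_signal: bool, neighbor: Support,
                  neighbor_in_domain: bool, require_accuracy: bool) -> bool:
    """Full decision, including the optional (MAY) accuracy-driven case."""
    if egress_must_strip(neighbor, neighbor_in_domain):
        return True
    # Optional case: this hop cannot update the requested signal, so the
    # operator may prefer no signal over a signal that skips this hop.
    local_cannot_update = local is Support.PASS_THROUGH or not supports_signal
    return local_cannot_update and require_accuracy
```

The decision depends only on local knowledge (the device's own support level and its neighbor's), which is what allows it to be expressed as static per-port configuration.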
CSIG packets and non-CSIG packets can be used together in a brownfield setting. This requirement means that end hosts MUST be capable of transmitting and receiving both CSIG packets and non-CSIG packets, including for the same flow. A packet marked with CSIG-tag at the sender host may arrive at the receiver host without the tag. In addition, Compact CSIG and Expanded CSIG packets may be used together on the same network.¶
Support for sending and receiving CSIG-tagged packets may require software and/or hardware changes on transit devices and end hosts. In many deployments, particularly those requiring hardware upgrades to support CSIG (such as Switch or NIC support), version stragglers continue to exist for long time horizons for a variety of reasons, and interoperability with such stragglers is a critical requirement. Without negotiation for CSIG capability, devices that are not CSIG-compliant may drop CSIG packets and thus blackhole traffic. Negotiating for CSIG-capability of a path is critical to ensure that CSIG protocol operates safely end-to-end in a brownfield deployment.¶
A path is considered CSIG-capable if end-hosts have at least Pass-through CSIG support and transit devices have Complete CSIG support (native or software-assisted). Before sending CSIG-tagged packets on a network flow, end-hosts must negotiate for path CSIG-capability. We discuss one approach to negotiation for path CSIG-capability, which involves two parts: negotiation for transit device support and negotiation for end host support.¶
In this section, we describe one simple approach to negotiate CSIG support on transit devices with CSIG stripping.¶
CSIG stripping can be used to implicitly achieve negotiation by removing the CSIG-tag from the packet header at or before devices on the packet path that do not have the desired level of CSIG support. If the receiver end host receives a CSIG-tagged packet, it serves as an explicit indication that all devices on the packet path, including transit devices and end-hosts, have the desired CSIG support. If the receiver end host receives a packet without a CSIG-tag, it is an indication that one or more devices do not have the desired CSIG support, or that the packet was not tagged at the sender to begin with. This indication can be implicitly reported to the sender via an empty / invalid CSIG reflection header and the sender can determine whether the packet path was CSIG-capable.¶
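The implicit negotiation described above can be modeled end to end in a few lines. The function names are hypothetical, `None` stands in for an empty / invalid CSIG reflection header, and the sketch assumes that a stripped tag is not re-added by any downstream device.

```python
from typing import List, Optional


def tag_survives(path_strips: List[bool], tagged_at_sender: bool = True) -> bool:
    """Return True if the packet still carries a CSIG-tag at the receiver.

    path_strips[i] is True when hop i's egress port is configured to strip.
    """
    tagged = tagged_at_sender
    for strips in path_strips:
        if strips:
            tagged = False  # assumption: never re-tagged downstream
    return tagged


def reflect(tag_present: bool, signal_value: int) -> Optional[int]:
    """Receiver: reflect the signal, or None for an empty/invalid reflection."""
    return signal_value if tag_present else None


def path_is_csig_capable(reflection: Optional[int]) -> bool:
    """Sender: a valid reflection implies every hop kept the tag."""
    return reflection is not None
```

Note that the sender cannot tell from an empty reflection which hop stripped the tag; it only learns that the path as a whole did not offer the desired level of support.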
This approach assumes that each device has knowledge about the level of CSIG support in its immediate neighboring devices, which is viable through configuration in typical private SDN networks. In the absence of centralization, mechanisms such as a new LLDP TLV may be defined to advertise aspects of CSIG support on the device, including compact vs expanded CSIG-tag support, signal types that are supported, pass-through vs complete support etc. We leave the details of such an LLDP extension for future extensions of the protocol.¶
A sender end host may need to explicitly negotiate with the remote end-host to ensure that the host networking stack at the remote host has the desired level of CSIG support. Ideally such explicit CSIG negotiation should be performed during or before the initial connection handshake, after which CSIG is enabled / disabled on packets post connection establishment. It may also be necessary to explicitly negotiate the use of CSIG Reflection in transports, separately from the negotiation for path CSIG-capability. For example, in TCP, negotiation is required to use the CSIG Reflection TCP Option. We leave the details of such negotiation schemes for future extensions of the protocol.¶
Transit devices without native CSIG support MAY participate in CSIG protocol via a Software-assisted approach. This allows brownfield deployments to reap incremental benefits of CSIG without having to upgrade a significant fraction of device HW on their networks.¶
Since compact and expanded CSIG tags are structurally similar to single VLAN-tags and double VLAN-tags respectively, VLAN resources in a transit device can be repurposed to support CSIG updates. More specifically, configurable TPIDs for VLAN tags can be used to treat CSIG tags as VLAN tags, and VLAN match/action resources for tag updates in the device can be leveraged to support updating CSIG data fields on the tag.¶
For signals such as ABW and ABW/C, a software agent running on the CPU of a transit device can periodically compute these signals based on hardware byte counters, and program VLAN match/action rules in the dataplane to update CSIG data fields based on the computed signals. Since the match/action rules are in the dataplane, CSIG packets can be processed at line rate without CPU involvement. However, the match/action rules themselves are updated at a slower cadence by the software agent.¶
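A minimal sketch of such an agent for a single egress port is given below. The class name, the uniform bucketing, and the "return a bucket only when it changes" behavior are assumptions for illustration; the actual rule-programming interface is device-specific and is not modeled here, and counter wrap-around is ignored.

```python
class AbwcAgent:
    """Hypothetical control-plane agent computing ABW/C for one egress port."""

    def __init__(self, capacity_bps: float, num_buckets: int = 32):
        self.capacity_bps = capacity_bps
        self.num_buckets = num_buckets
        self.last_bytes = None          # previous cumulative byte counter
        self.programmed_bucket = None   # value currently in the dataplane rule

    def tick(self, tx_bytes: int, delta_t: float):
        """Run every delta_t seconds with the cumulative TX byte counter.

        Returns the bucket to program into the match/action rule, or None
        when the currently programmed rule is still valid.
        """
        if self.last_bytes is None:
            self.last_bytes = tx_bytes
            return None
        util_bps = (tx_bytes - self.last_bytes) * 8 / delta_t
        self.last_bytes = tx_bytes
        abwc = min(1.0, max(0.0, 1.0 - util_bps / self.capacity_bps))
        bucket = min(self.num_buckets - 1, int(abwc * self.num_buckets))
        if bucket == self.programmed_bucket:
            return None
        self.programmed_bucket = bucket
        return bucket
```

Reprogramming only on a bucket change keeps the rule-update rate low, which matters because rule updates go through the slower control-plane path while packets continue to be processed at line rate.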
Compact CSIG is designed to enable software-assisted backward compatibility while operating within the constraints of commonly available VLAN resources on transit devices. Backward compatibility via software is a fundamental feature in the design of Compact CSIG.¶
Note that it may not be possible to track signal types such as hop delay per packet in a software agent. However, approximations of the signal based on available hardware counters and registers (such as latency histograms) can be implemented in the agent if software-assisted support is desired for such signal types.¶
In greenfield deployments of CSIG domains, all devices in the domain natively support the CSIG protocol.¶
Expanded CSIG is designed to leverage greenfield deployments where backward compatibility, negotiation and interoperability are not requirements. It provides enhanced signal resolution via higher bit width for signal values and locator metadata in comparison to Compact CSIG. Expanded CSIG can also support up to 16 signal types.¶
Devices in Greenfield CSIG domains MUST support CSIG stripping at the domain boundary to ensure that CSIG packets don't exit the domain.¶
CSIG's design choices are shaped by an end-to-end perspective of what matters to applications and where tradeoffs can be made towards simplicity and practicality. In this section, we discuss the rationale behind CSIG's design and the advantages it provides over existing state of the art.¶
CSIG-tag offsets at layer 2 are independent of headers and payload at layer 3 and above, which means that only a small set of tag placement offsets need to be supported for reading and updating the header. This makes device implementations of CSIG simpler. In contrast, in-band network telemetry schemes implemented at layer 3 or higher require support for a large set of packet formats as this set grows by the cross-product of formats / encapsulations at each layer. This complexity forces device implementations to restrict support for only a fraction of packet formats / encapsulations, hindering the adoption and deployment of such schemes. CSIG-tagging, on the other hand, is simpler to support and deploy since it is at layer 2 and has a fixed offset despite various formats / encapsulation at layer 3 and above.¶
The choice of layer 2 also makes compatibility with in-network tunneling and encryption simpler, which are common features in data center deployments.¶
CSIG-tags are, by design, compatible with PSP encrypted packets and IPsec encrypted packets, where Layer 4 headers and payloads may be encrypted.¶
CSIG tags are carried through Layer 3 tunnels, e.g., IP-in-IP, VXLAN, Geneve, at a fixed offset in the packet header. This avoids the need to copy and relocate CSIG tags across inner / outer headers during encapsulation and decapsulation of packets, which would be necessary if implemented instead at layers 3 or higher.¶
CSIG tags are placed as the last header in the Layer 2 header stack to ensure compatibility with layer 2 and layer 2.5 tunneled domains as well. The placement of CSIG tags in MACsec and other Layer 2 encapsulations is shown in the table in Section 4.1.¶
Most in-band network telemetry schemes are not backward compatible. However, CSIG tag's structural similarity to VLAN tags enables backward compatibility with many devices that don't have native CSIG support as described in Section 6.4. This allows deployments to reap the benefits of CSIG without having to upgrade a significant portion of their network hardware.¶
In addition, since expanded CSIG is limited to 8 bytes, i.e., the size of double VLAN tags, the packet parsing depth required on devices to read and process headers at layer 3 and above is not affected.¶
In summary, the choice of Layer 2 for CSIG-tag is a key part of CSIG's simplicity and efficiency, since it keeps device implementations simple while supporting multiple encapsulations and backward compatibility.¶
CSIG's design separates the CSIG-tag and CSIG reflection headers into distinct layers. This decoupling enables end hosts to develop different transport-specific implementations of CSIG reflection while sharing the underlying CSIG-tag mechanism. This means that transit device behaviors are not impacted by innovations in CSIG reflection.¶
In addition, this decoupling enables the separate tracking of forward and reverse path bottlenecks. This is important since CCAs typically prefer to react to congestion on the forward path only and not react to congestion on the reverse path. In contrast, in-band schemes that mix signaling and reflection into the same header do not provide distinctions between forward and reverse path.¶
CSIG's fixed-size headers constitute less than 0.2% bandwidth overhead in packets with 4k or 9k MTU. This means that there is no need for fragmentation or increasing MTU size for the purposes of supporting multiple congestion signals. Furthermore, the performance of network device packets per second (PPS) is minimally impacted by the inclusion of CSIG tag and reflection headers.¶
The low overhead allows CSIG to be enabled on all live data packets or explicit probe packets or sampled packets. This is an important capability because it allows for the direct quantification of the bottlenecks experienced by the data packets themselves instead of having to rely on probes. However, leveraging CSIG on probes or sampled packets is still an option for deployments that require such visibility.¶
CSIG is designed to perform compare-and-replace (or more generally read-modify-write for future extensions), with a fixed size header. Therefore, CSIG is not limited by the number of hops in a network path (i.e., diameter of the network) unlike schemes that append information at each hop.¶
CSIG's signal design focuses on simple, aggregate signals that are driven by use cases, as demonstrated in Section 5 and Section 8.¶
CSIG allows a single packet to carry only one congestion signal. To obtain multiple signals at the end hosts, it takes advantage of the fact that the end host can request different signal types across multiple packets of a flow. In contrast, other schemes tend to overload each packet with a lot of information, including metadata about multiple signals, which can be limiting. Moreover, CSIG-tag's format is also extensible, which means that it can be adapted to support additional signal types and locator metadata in the future without compromising the advantages of CSIG's design.¶
A unique feature of Compact CSIG's design is the ability to fully configure signal value buckets, which allows for efficient signal representations with a limited number of bits. For example, the encodings can be adjusted to provide greater granularity at value ranges that are more important to the application, and lower granularity at ranges that are less important. Similarly, locator metadata can be efficiently represented by carrying fewer bits of relevant compressed attributes of the bottleneck that are important to applications. Expanded CSIG, on the other hand, uses uniform signal quantization for more accuracy and provides even more flexibility in defining signals and locator metadata with a larger bit width.¶
The use cases for CSIG are motivated by congestion control, traffic management and network debuggability. These use cases have always existed in production before CSIG, often using signals that are measured end-to-end (such as packet loss and delay), or out-of-band signals from network devices such as port utilization. CSIG provides a boost in performance, efficiency and debuggability by augmenting existing use cases with explicit in-band measurements.¶
In this document, we present the use cases for the three signals defined in Section 5. At the crux of a signal is the definition of bottleneck. Over time we envision use cases for other signals that would define a bottleneck, e.g., the maximum number of co-sharing flows on a link. For each of these new signals, locator metadata can continue to provide attributes about the bottleneck port such as port capacity.¶
CCA can make use of CSIG signals in at least two different ways. First, existing CCA can use CSIG values to address blindspots in end-to-end signals such as packet loss, delay, and delivery rates. This use case is immediately relevant as most production networks deploy some form of end-to-end congestion control, including Swift [SWIFT] and BBR [BBR]. A second way to use CSIG is to design entirely new congestion control algorithms that use CSIG as their primary signal. We focus below on the former category.¶
E2E CCA comes in various forms, and for simplicity we describe the use cases taking Swift CC [SWIFT] as the baseline. Swift is a delay-based congestion control algorithm that uses accurate round-trip time (RTT) measurements taken via NIC hardware timestamps. The CSIG signals can be applied to other CCA and are not limited to Swift.¶
The interpretation and applications of CSIG for congestion control in lossless networks and networks that use packet spraying is a topic for future research.¶
E2E RTT measurements used in Swift include the queueing delays on all hops along the flow's path, including the forward and reverse paths. A consequence of using a lumped delay signal is that a flow reduces its sending rate in response to delays that it may not be able to directly control. Furthermore, in deployments where there can be multiple congested links along the path of a flow, it is desirable to modulate the sending rate of a flow in response to just the maximum of the per-hop delays, max(PD), along the flow's path. Replacing the end-to-end measured delay with the bottleneck delay in Swift's equation yields the following:¶
// Reduce the congestion window when bottleneck hop delay
// exceeds a chosen target hop delay
if (max(PD) > target_delay) then
    md = beta * (max(PD) - target_delay) / max(PD)
    cwnd = (1 - md) * cwnd¶
Poseidon [POSEIDON] is a CC proposed in the literature that exemplifies the use of maximum per-hop delay in reducing its congestion window. By incorporating bottleneck information in the congestion control response, Poseidon flows achieve higher throughput in the presence of reverse path congestion and of congestion across multiple network hops. Algorithm 1 in [POSEIDON] details the use of maximum per-hop delay in both the increase and the decrease of the congestion window.¶
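The multiplicative-decrease rule based on max(PD) can be written as a small runnable function. This is a sketch of the update given earlier in this section only; the default `beta` is a hypothetical value, and the safeguards that a production CCA would add (for example a cap on the decrease factor or a minimum window) are intentionally left out.

```python
def decrease_cwnd(cwnd: float, max_pd: float, target_delay: float,
                  beta: float = 0.8) -> float:
    """Shrink cwnd when the bottleneck hop delay exceeds the target.

    max_pd and target_delay must use the same unit (e.g. microseconds).
    With 0 < beta < 1 the decrease factor md stays below 1, so the
    window never reaches zero.
    """
    if max_pd <= target_delay:
        return cwnd
    md = beta * (max_pd - target_delay) / max_pd
    return (1.0 - md) * cwnd
```

For example, with `beta = 0.5`, a bottleneck hop delay of 50 against a target of 25 gives md = 0.25, so a window of 100 shrinks to 75. Delay on hops other than the bottleneck does not enter the computation at all.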
E2E CC uses heuristics to determine by how much to increase the congestion window, e.g., in the case of Swift, when the measured round-trip time is lower than the target delay, Swift increments the congestion window by one per round-trip time. BBR [BBR] increases the rate as a function of the flow's measured delivery rate.¶
The problem with these heuristics is that they don't get the rate or window adjustments just right and either under or overshoot. Undershooting the rate would mean that transfers take longer to complete even when the bottleneck link has a low utilization, while overshooting can cause an unnecessary increase in queueing delay and packet losses.¶
In the following example, we integrate the maximum utilization signal into Swift's congestion window update equation to ramp up adaptively faster when the bottleneck link has low utilization. The congestion window evolution is represented below:¶
// Increase congestion window in proportion
// to the utilization headroom
if (rtt < target_rtt) then
    fcwnd <-- fcwnd + additive_increment + kLambda * fcwnd * (1 - max(U/C))¶
As an example, the fixed additive increase in Swift of rate <-- rate + Additive Increment means that it takes 200 RTTs to reach 80 Gbps of bandwidth with an Additive Increment of 400 Mbps. The fast ramp-up with CSIG using the bottleneck link utilization takes fewer than 10 RTTs to safely ramp to 80 Gbps.¶
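The difference between the two ramp-up behaviors can be explored with a toy model. The model makes several simplifying assumptions that are not from the text: the update is applied to a rate rather than a window, the flow starts from zero and is alone on the bottleneck (so U equals its own rate), the bottleneck capacity is 100 Gbps, and `k_lambda` is 1.0.

```python
def rtts_to_reach(target_mbps: float, additive_mbps: float,
                  capacity_mbps: float, k_lambda: float = 0.0) -> int:
    """Count RTTs until a single flow's rate reaches target_mbps.

    k_lambda = 0 gives plain additive increase; k_lambda > 0 adds the
    term proportional to the utilization headroom, 1 - max(U/C).
    """
    rate, rtts = 0.0, 0
    while rate < target_mbps:
        headroom = max(0.0, 1.0 - rate / capacity_mbps)
        rate += additive_mbps + k_lambda * rate * headroom
        rtts += 1
    return rtts


# Additive increase only: 80 Gbps at 400 Mbps per RTT.
additive_rtts = rtts_to_reach(80_000, 400, 100_000)
# With the headroom term (hypothetical k_lambda and capacity).
csig_rtts = rtts_to_reach(80_000, 400, 100_000, k_lambda=1.0)
```

In this model the additive case takes 80000 / 400 = 200 RTTs, while the headroom term roughly doubles the rate each RTT at low utilization and tapers off as the link fills. The exact RTT count for the CSIG-assisted case depends on the chosen `k_lambda` and on competing traffic.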
E2E CC uses heuristics to determine the initial transfer rate for newly established connections. Starting too slowly would cause the transfer to take longer than necessary while wasting available bandwidth, whereas starting too quickly would cause queue delays and packet drops. The same dilemma exists for transfers that are starting on a connection that has been idle for multiple round-trip times.¶
In networks where we know ahead of time that the degree of multiplexing is low i.e., just a handful of flows co-existing on the link at any point in time, transfers complete quickly when they "jump-start" to use up all of the bottleneck bandwidth. This is especially helpful when transports employ robust loss recovery mechanisms such that even if the queue overflows, any lost packets can be quickly recovered.¶
As an example, on an empty network of 200Gbps, a single transfer can use up the entire 200Gbps in the second RTT, after the CSIG feedback in the first RTT indicates the availability of 200Gbps at the bottleneck link.¶
CSIG's min(ABW) bottleneck bandwidth allows transfers to start safely at line-rate.¶
CSIG encodes the most notable information about the path for each flow by carrying bottleneck link signals and bottleneck locator metadata. This path-level information, which is obtained directly from application data packets rather than synthetic probes, is directly attributable to the flow and is valuable for traffic engineering and application performance debugging.¶
Datacenter topologies employ a diverse set of paths between any source-destination pairs. Transports employ techniques such as Protective Load Balancing [PLB] and Multipathing [RFC8684] to spread traffic across the multitude of paths. Load balancing and multipathing in transports use a combination of end-to-end signals and heuristics to select which paths to use and how much traffic to channel in each of the paths.¶
Using CSIG signals from bottleneck links along the diverse set of paths, load balancing and multipathing schemes can select high quality paths with lower congestion, and spread traffic across them in a congestion-aware manner.¶
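A minimal sketch of congestion-aware path selection, assuming the sender keeps the most recently reflected min(ABW/C) value for each candidate path. The path identifiers, the dictionary layout, and the proportional weighting rule are hypothetical choices made for illustration.¶

```python
def pick_path(path_headroom):
    """Return the path id with the largest bottleneck headroom.

    path_headroom: dict mapping a hypothetical path id to the latest
                   min(ABW/C) reflected for that path, in [0.0, 1.0].
    """
    return max(path_headroom, key=path_headroom.get)


def traffic_weights(path_headroom):
    """Split traffic across paths in proportion to their min(ABW/C)."""
    total = sum(path_headroom.values())
    if total == 0:
        # Every path reports zero headroom: fall back to an even split.
        return {p: 1.0 / len(path_headroom) for p in path_headroom}
    return {p: h / total for p, h in path_headroom.items()}
```

For example, with headroom values of 0.25, 0.75, and 0.5 on three paths, `pick_path` selects the second path and `traffic_weights` assigns it half of the traffic.¶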
Locator metadata can also be used to distinguish between incast congestion and core network congestion, which can then be used to adjust load balancing / multipathing actions. For instance, the stage of the bottleneck and link orientation attributes are enough to determine whether the last hop is the bottleneck or not. When the last hop is the bottleneck, flow-level load balancing / multipathing actions may not be effective and may, in fact, worsen incasts. Such cases may require application-level load balancing or job scheduling techniques to distribute traffic. However, when congestion is instead known to be in the core network, flow-level load balancing / multipathing actions can route around congested areas and improve performance.¶
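The distinction above can be sketched as a small classifier. The representation of the locator attributes is assumed here: `stage` is taken to be an integer with 0 denoting the stage closest to the hosts, and orientation is modeled as uplink or downlink. The actual encoding of these attributes is deployment-specific, and the names below are hypothetical.¶

```python
from enum import Enum


class Orientation(Enum):
    UPLINK = "uplink"       # toward the core
    DOWNLINK = "downlink"   # toward the hosts


# ASSUMPTION: stage 0 is the stage directly attached to end hosts.
LOWEST_STAGE = 0


def is_last_hop_bottleneck(stage, orientation):
    """True when the bottleneck is the final link toward the receiver."""
    return stage == LOWEST_STAGE and orientation is Orientation.DOWNLINK


def recommended_action(stage, orientation):
    """Map the bottleneck location to the kind of remedy discussed above."""
    if is_last_hop_bottleneck(stage, orientation):
        # Incast: re-pathing flows cannot help and may worsen the incast.
        return "application-level load balancing or job scheduling"
    # Core congestion: flow-level actions can route around it.
    return "flow-level load balancing / multipathing"
```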
Traffic Engineering (TE) carves out paths with appropriate bandwidth across aggregate source-destination pairs. Examples within a datacenter include the Datacenter Network Interconnection Layer (DCNI) [JUPITEREVOL]. CSIG can be used to provide fine-grained path-level information, including short-timescale microburst congestion, to TE systems. By using summarized CSIG signals aggregated both spatially and temporally across flows, TE can select paths and balance traffic at the datacenter level to accommodate bursty traffic, e.g., from ML.¶
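One way to picture the spatial and temporal aggregation is the sketch below. It groups per-flow min(ABW/C) samples by aggregate source-destination pair and by time window, and keeps the worst (lowest) value so that short microbursts remain visible. The sample tuple layout, the block names, and the one-second window are assumptions for illustration only.¶

```python
from collections import defaultdict


def aggregate(samples, window_s=1.0):
    """Summarize CSIG samples per (src block, dst block, time window).

    samples: iterable of (src_block, dst_block, timestamp_s, min_abw_frac)
             tuples collected from many flows (hypothetical layout).
    Returns a dict keyed by (src_block, dst_block, window_index) whose value
    is the lowest min(ABW/C) observed in that window.
    """
    worst = defaultdict(lambda: 1.0)  # 1.0 means a fully idle path
    for src, dst, ts, frac in samples:
        key = (src, dst, int(ts // window_s))
        worst[key] = min(worst[key], frac)
    return dict(worst)
```

Keeping the minimum rather than the mean is a deliberate choice in this sketch: an average over the window would hide exactly the short-lived congestion that the paragraph above highlights.¶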
Applications often complain that the network is slow, but it can be challenging to identify the specific segment of the network that is causing the problem. This is especially true with the scale of datacenters, where flows can traverse up to nine hops [JUPITEREVOL]. Figuring out where the bottleneck is and the timescales at which the path poses a bottleneck is like searching for a needle in a haystack for an application with thousands of flows across various source-destination pairs.¶
When enabled on application flows, CSIG information, together with its bottleneck locator, can quickly and precisely indicate why flows are slow and where the network or path bottlenecks are.¶
CSIG can also be enabled on mesh prober systems similar to [PINGMESH] to augment end-to-end probe measurements between any two servers with bottleneck information to aid troubleshooting.¶
Construction, initialization, and insertion of a CSIG tag into packets MUST be restricted to trusted sender hosts and authorized flows. Depending on the deployment, this authorization can be enforced at the NICs or at the switches, akin to firewall rules. CSIG stripping may also be employed as a fencing rule at domain boundaries to ensure that unauthorized CSIG tags do not cross these boundaries.¶
A rogue or broken network device in a private network might write arbitrary CSIG values, or insert a CSIG tag into packets at a transit node. We expect there to be checks and balances to identify non-functioning or rogue network devices and take them out of a private network, as they can cause greater harm than distributing misleading CSIG values.¶
There are no IANA considerations. The CSIG Tag Protocol Identifier (TPID) is requested from the IEEE.¶
With the increased deployment of applications that are sensitive to delay and bandwidth usage in data centers, e.g., AI/ML/HPC workloads and RDMA-based applications, relying solely on end-to-end signals is insufficient under dynamically changing traffic patterns. Simple and timely signals from network devices to end-hosts can augment end-host transports so that they make better use of datacenter bandwidth. CSIG is a simple, practical and deployable protocol for distributing congestion information in networks that builds on the successful aspects of prior work and is grounded in the use cases of congestion control, traffic management and network debuggability.¶
This work would not be possible without the following individuals whose various engineering and design contributions shaped CSIG and its use cases:¶
Christopher Alfeld, Neelesh Bansod, Jis Ben, Neal Cardwell, Yongzhou Chen, Yuchung Cheng, Dal Chand Choudhary, Mick Fingleton, Mahmudul Hasan, Jeffrey Ji, Marc De Kruijf, Praveen Kumar, Rich Lane, Chang Liu, Morley Mao, Carl Mauer, Sachin Menezes, Nipen Mody, Masoud Moshref, Alex Rumyantsev, Gerald Schmidt, Arjun Singh, Arjun Singhvi, Babru Thatikunta, Jeff Tikkanen, Frank Uyeda, Brian Vasquez, Rui Wang, Hassan Wassel, Yong Xia, Zhengxu Xia, Kevin Yang, Liangcheng Yu.¶
We would like to thank Arjun Singh, David Wetherall, Neal Cardwell, Akash Deshpande and Arvind Krishnamurthy for their feedback on several portions of this document.¶
The following table demonstrates an example encoding of a 3-bit signal value. Note that this is an example ONLY. The encoding that is meaningful to a certain deployment is specific to the use cases in consideration.¶
Note that the CSIG tag supports a 5-bit signal value in the compact format and a 20-bit signal value in the expanded format.¶
Value | min(ABW/C) | min(ABW) | max(PD) |
---|---|---|---|
0x0 | 0%-1% | 0-1Gbps | 0-10us |
0x1 | 1%-5% | 1-5Gbps | 10-50us |
0x2 | 5%-10% | 5-10Gbps | 50-100us |
0x3 | 10%-20% | 10-20Gbps | 100-200us |
0x4 | 20%-50% | 20-50Gbps | 200-400us |
0x5 | 50%-75% | 50-75Gbps | 400-800us |
0x6 | 75%-90% | 75-90Gbps | 800-2000us |
0x7 | 90%-100% | >90Gbps | >2000us |
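
The example encoding of the min(ABW/C) column can be sketched as a simple quantizer. The table does not say which bucket a boundary value such as 5% belongs to; the code below assumes each bucket includes its lower bound, and the function names are hypothetical.¶

```python
from bisect import bisect_right

# Bucket boundaries in percent, taken from the min(ABW/C) column above.
ABW_PCT_EDGES = [1, 5, 10, 20, 50, 75, 90]


def encode_abw_pct(pct):
    """Map a min(ABW/C) percentage in [0, 100] to the example 3-bit code.

    ASSUMPTION: a value equal to a boundary falls into the higher bucket.
    """
    if not 0 <= pct <= 100:
        raise ValueError("percentage must be within [0, 100]")
    return bisect_right(ABW_PCT_EDGES, pct)


def decode_abw_pct(code):
    """Return the (low, high) percentage range represented by a 3-bit code."""
    if not 0 <= code <= 7:
        raise ValueError("code must fit in 3 bits")
    bounds = [0] + ABW_PCT_EDGES + [100]
    return bounds[code], bounds[code + 1]
```

A receiver only learns the range, not the exact value: a link with 37% of its capacity available is encoded as 0x4 and decoded as the 20%-50% range.¶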