Congestion Exposure (ConEx) Working Group | M. Mathis |
Internet-Draft | Google, Inc |
Intended status: Informational | B. Briscoe |
Expires: May 03, 2012 | BT |
October 31, 2011 |
Congestion Exposure (ConEx) Concepts and Abstract Mechanism
draft-ietf-conex-abstract-mech-03
This document describes an abstract mechanism by which senders inform the network about the congestion encountered by packets earlier in the same flow. Today, the network may signal congestion to the receiver by ECN markings or by dropping packets, and the receiver passes this information back to the sender in transport-layer feedback. The mechanism to be developed by the ConEx WG will enable the sender to also relay this congestion information back into the network in-band at the IP layer, such that the total level of congestion is visible to all IP devices along the path, where it could, for example, be used to provide input to traffic management.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 03, 2012.
Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
One of the required functions of a transport protocol is controlling congestion in the network. There are three techniques in use today for the network to signal congestion to a transport:
In all cases the congestion signals follow the route indicated in Figure 1. A congested network device sends a signal in the data stream on the forward path to the transport receiver, the receiver passes it back to the sender through transport level feedback, and the sender makes some congestion control adjustment.
This document proposes to extend the capabilities of the Internet protocol suite with the addition of a ConEx Signal that, to a first approximation, relays the congestion information from the transport sender back through the internetwork layer. That signal is shown in Figure 1. It would be visible to all internetwork layer devices along the forward (data) path and is intended to support a variety of new policy-controlled mechanisms that might be used to manage traffic.
For the avoidance of doubt, there is no expectation that internetwork layer devices will do fine-grained congestion control using ConEx information. That is still probably best done at the transport sender. Rather, network operators will be able to use ConEx information to do better bulk traffic management, which in turn should incentivize end-system transports to be more careful about congesting others.
The ConEx signals are anticipated to be most useful at longer time scales. For example, the total congestion caused by a user might serve as an input to higher-level policy or billing functions designed to create incentives for improving user behavior, such as choosing to send large quantities of data at off-peak times, at lower rates, or with less aggressive protocols such as LEDBAT [I-D.ietf-ledbat-congestion]. For this reason many algorithms and analyses are described in terms of "volume", or the time integral of various parameters. For example, the "congestion volume" is defined to be the total number of bytes marked as congested [I-D.ietf-conex-concepts-uses]. Note that although the ConEx protocol only signals individual congestion events to the whole path, the policy and audit functions described below are most likely to act on accumulated counts of these signals.
,---------.                                               ,---------.
|Transport|                                               |Transport|
| Sender  |   .                                           |Receiver |
|         |  /|___________________________________________|         |
|     ,-<---------------Congestion-Feedback-Signals--<--------.     |
|     |   |/                                              |   |     |
|     |   |\           Transport Layer Feedback Flow      |   |     |
|     |   | \  ___________________________________________|   |     |
|     |   |  \|                                           |   |     |
|     |   |   '         ,-----------.               .     |   |     |
|     |   |_____________|           |_______________|\    |   |     |
|     |   |    IP Layer |           |  Data Flow      \   |   |     |
|     |   |             |(Congested)|                  \  |   |     |
|     |   |             |  Network  |--Congestion-Signals--->-'     |
|     |   |             |  Device   |                    \|         |
|     |   |             |           |                    /|         |
|     `----------->--(new)-IP-Layer-ConEx-Signals-------->|         |
|         |             |           |                  /  |         |
|         |_____________|           |_______________  /   |         |
|         |             |           |               |/    |         |
`---------'             `-----------'               '     `---------'

                              Figure 1
Not shown are policy devices along the data path that observe the ConEx Signal, and use the information to monitor or manage traffic. These are discussed in Section 4.5.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].
ConEx signals in IP packet headers from the sender to the network {ToDo: These are placeholders for whatever words we decide to use}:
Ideally, all the following requirements would be met by a Congestion Exposure Signal. However it is already known that some compromises will be necessary, therefore all the requirements are expressed with the keyword 'SHOULD' rather than 'MUST'. The only mandatory requirement is that a concrete protocol description MUST give sound reasoning if it chooses not to meet any of these requirements:
It is important to note that the auditing requirement implies a number of additional constraints: The basic auditing technique is to count both actual congestion signals and ConEx Signals someplace along the data path:
Given that loss-based and ECN-based ConEx might sometimes be best audited at different locations, having distinct encodings would widen the design space for the auditing function. Using the same encoding for both signals is likely to make one of the auditing techniques infeasible, and the others less accurate.
Most protocol specifications start with a description of packet formats and codepoints with their associated meanings. This document does not: it is already known that choosing the encoding for the ConEx Signal is likely to entail some engineering compromises that have the potential to reduce the protocol's usefulness in some settings. Rather than making these engineering choices prematurely, this document sidesteps the encoding problem by describing an abstract representation of ConEx Signals. All of the elements of the protocol can be defined in terms of this abstract representation. Most importantly, the preliminary use cases for the protocol are described in terms of the abstract representation in companion documents [I-D.ietf-conex-concepts-uses].
Once we have some experience of example use cases we can evaluate different encoding schemes. Any encoding chosen for ConEx experiments may include compromises: some code points may be conflated, and some information may be lost, weakening or disabling some of the algorithms and eliminating some use cases. For instance, the experimental ConEx encoding chosen for IPv6 [I-D.ietf-conex-destopt] had to make compromises on tunnelling. The abstract encoding requirements that follow still stand despite this choice, in case experience shows these were not the best compromises to make.
The goal of this approach is to be as complete as possible for discovering the potential usage and capabilities of the ConEx protocol, so we have some hope of making optimal design decisions when choosing the encoding.
For tutorial purposes, it is helpful to describe a naïve encoding of the ConEx protocol for TCP and similar protocols: set a bit (not specified here) in the IP header on all retransmissions or once per ECN-signaled window reduction. Clearly, network devices along the forward path can see this bit and act on it. For example, any device along the path can count marked and unmarked packets to estimate the total congestion levels along the entire path.
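As a rough illustration of the counting described above, the following Python sketch estimates whole-path congestion from the fraction of observed bytes that carry the naïve ConEx bit. The `Packet` type and its field names are hypothetical and exist only for this example; no header bit or packet format is specified by this document.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Packet:
    size: int           # packet size in bytes
    conex_marked: bool  # hypothetical naive ConEx bit (not a real header field)


def path_congestion_fraction(packets: Iterable[Packet]) -> float:
    """Fraction of observed bytes that carry the naive ConEx mark."""
    total = 0
    marked = 0
    for p in packets:
        total += p.size
        if p.conex_marked:
            marked += p.size
    # With no traffic observed there is nothing to estimate.
    return marked / total if total else 0.0


if __name__ == "__main__":
    sample = [Packet(1500, False)] * 98 + [Packet(1500, True)] * 2
    print(path_congestion_fraction(sample))
```

The sketch weights by bytes rather than packets, on the assumption that congestion is ultimately measured as a volume of marked data; counting packets instead would be an equally valid reading of the naïve encoding.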
This simple encoding is sufficient to provide many of the envisioned benefits for ConEx and could be unilaterally deployed across a significant fraction of all Internet traffic by agreement among a small number of OS vendors and content providers. However, this encoding does not support sufficient auditing and might motivate users and/or applications to misrepresent the congestion that they are causing. As a consequence, the naïve encoding is not likely to be trusted and thus creates its own disincentives for further deployment.
To be successful, ConEx not only has to function while partially deployed, but at all stages of partial deployment it has to create incentives for further deployment. Central to making this work are strong auditing capabilities that do not permit congestion to be misrepresented as either non-congested or non-ConEx capable traffic.
Nonetheless, this naïve encoding does present a clear mental model of how the ConEx protocol might function under various uses. It is useful for thought experiments where it can be stipulated that all participants are honest, and it can be used to understand the incentives that might be introduced by ConEx.
Ideally, ConEx and ECN are orthogonal signals and SHOULD be entirely independent. However, given the limited number of header bits and/or code points, these signals may have to share code points, at least partially.
The re-ECN specification [I-D.briscoe-tsvwg-re-ecn-tcp] presents an implementation of ConEx that had to be tightly integrated with the encoding of ECN in order to fit into the IP header. The central theme of the re-ECN work is an audit mechanism that can provide sufficient disincentives against misrepresenting congestion [I-D.briscoe-tsvwg-re-ecn-motiv], which is analyzed extensively in Briscoe's PhD dissertation [Refb-dis].
Re-ECN is a good example of one chosen set of compromises attempting to meet the requirements of Section 2. However, the present document takes a step back, aiming to state the ideal requirements in order to allow the Internet community to assess whether other compromises are possible.
In particular, different incremental deployment choices may be desirable to meet the partial deployment requirement of Section 2. Re-ECN requires the receiver to be at least ECN-capable as well as requiring an update to the sender. Although ConEx will inherently require change at the sender, it would be preferable if it could work, even partially, with any receiver.
The chosen ConEx protocol certainly must not require ECN to be deployed in any network. In this respect re-ECN is already a good example: it acts perfectly well as a loss-based ConEx protocol if the loss-based audit techniques in Section 4.4 are used. However, it would still be desirable to avoid the dependence on an ECN receiver.
For a tutorial background on re-ECN techniques, see [Re-fb] and [FairerFaster].
Although the re-ECN protocol requires no changes to the network part of the ECN protocol, it is important to note that it does propose some relatively minor modifications to the host-to-host aspects of the ECN protocol specified in RFC 3168. They include: redefining the ECT(1) code point (the change is consistent with RFC 3168 but requires deprecating the experimental ECN nonce [RFC3540]); modifications to the ECN negotiations carried on the SYN and SYN-ACK; and using a different state machine to carry ECN signals in the transport acknowledgments from a modified Receiver to the Sender. This last change is optional, but it permits the transport protocol to carry multiple congestion signals per round trip. It greatly simplifies accurate auditing, and is likely to be useful in other transports, e.g. DCTCP [DCTCP].
All of these adjustments to RFC 3168 may also be needed in a future standardized ConEx protocol. There will need to be very careful consideration of any proposed changes to ECN or other existing protocols, because any such changes increase the cost of deployment.
Ideally, this document would not describe encoding at all, and leave that little detail to some future document. However, given the protocol engineering mindset of most readers, we have discovered that nearly everybody invents an encoding in order to help themselves understand the document. We sketch here two different plausible encodings: independently settable bits or an enumerated set of mutually exclusive codepoints.
In both cases, the amount of congestion is signaled by the volume of marked data—just as the volume of lost data or ECN marked data signals the amount of congestion experienced. Thus the size of each packet carrying a ConEx Signal is significant.
This encoding involves flag bits, each of which the sender can set independently to indicate to the network one of the following four signals:
This encoding does not imply any exclusion property among the signals. Multiple types of congestion (ECN, loss) can be signalled on the same ACKs.
This encoding involves signaling one of the following five codepoints:
ENUM {Not-ConEx, ConEx-Not-Marked, Re-Echo-Loss, Re-Echo-ECN, Credit}
Each named codepoint has the same meaning as in the encoding using independent bits (Section 3.3.1). The use of any one codepoint implies the negative of all the others.
Inherently, the semantics of most of the enumerated codepoints are mutually exclusive. 'Credit' is the only one that might need to be used in combination with either Re-Echo-Loss or Re-Echo-ECN, but even that requirement is questionable. It must not be forgotten that the enumerated encoding loses the flexibility to signal these two combinations, whereas the encoding with four independent bits is not so limited. Alternatively two extra codepoints could be assigned to these two combinations of semantics.
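To make the difference between the two sketched encodings concrete, the following Python fragment models both. It is only an illustration: the four flag names are inferred from the enumerated codepoint names, the bit positions and numeric values are arbitrary, and the rule that a marked packet also sets the ConEx flag is an assumption of this sketch rather than something stated here.

```python
from enum import Enum, IntFlag
from typing import Optional


class ConExFlags(IntFlag):
    """Independent-bits encoding; bit positions are illustrative only."""
    CONEX = 0x1         # packet belongs to a ConEx-capable transport
    RE_ECHO_LOSS = 0x2
    RE_ECHO_ECN = 0x4
    CREDIT = 0x8


class ConExCodepoint(Enum):
    """Enumerated encoding; numeric values are illustrative only."""
    NOT_CONEX = 0
    CONEX_NOT_MARKED = 1
    RE_ECHO_LOSS = 2
    RE_ECHO_ECN = 3
    CREDIT = 4


# Flag combinations that have an exact enumerated equivalent.
_EQUIVALENT = {
    ConExFlags(0): ConExCodepoint.NOT_CONEX,
    ConExFlags.CONEX: ConExCodepoint.CONEX_NOT_MARKED,
    ConExFlags.CONEX | ConExFlags.RE_ECHO_LOSS: ConExCodepoint.RE_ECHO_LOSS,
    ConExFlags.CONEX | ConExFlags.RE_ECHO_ECN: ConExCodepoint.RE_ECHO_ECN,
    ConExFlags.CONEX | ConExFlags.CREDIT: ConExCodepoint.CREDIT,
}


def to_codepoint(flags: ConExFlags) -> Optional[ConExCodepoint]:
    """Map a flag combination to a codepoint, or None if it cannot be expressed."""
    return _EQUIVALENT.get(flags)


if __name__ == "__main__":
    print(to_codepoint(ConExFlags.CONEX | ConExFlags.CREDIT))
    # Credit combined with Re-Echo-Loss has no single enumerated codepoint.
    print(to_codepoint(ConExFlags.CONEX | ConExFlags.CREDIT | ConExFlags.RE_ECHO_LOSS))
```

Adding two more members to the hypothetical `ConExCodepoint` enumeration, one for each Credit plus Re-Echo combination, would correspond to the alternative of assigning two extra codepoints.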
Figure 1 shows three of the main components of Congestion Exposure: network devices subject to congestion, transport sender and transport receiver. There are two additional components that, in principle, could be placed anywhere along the data path. They are a ConEx auditor and a Policy Device.
The role of the auditor is to encourage accurate ConEx signals by detecting and sanctioning flows that misrepresent the amount of congestion that they are causing. The auditor compares the ConEx signals to some direct observation of the congestion, to verify that the ConEx signals are accurate.
The policy device is the natural ultimate consumer of ConEx signals. It uses ConEx to facilitate better traffic management through improved instrumentation, monitoring or control of the traffic.
All five components are described in more detail below.
Congestion signals originate from network devices as they do today. A congested router, switch or other network device can discard or ECN mark packets when it is congested.
The sending transport needs to be modified to send Congestion Exposure Signals in response to congestion feedback signals (see [I-D.conex-tcp-mods]). We want to permit ConEx senders to turn off ECN (e.g. if the receiver does not support ECN). However, we want to encourage a ConEx sender to at least attempt to negotiate ECN, because it is known that ConEx without ECN is harder to audit, and thus potentially exposed to fraud. Since honest users have the potential to benefit from stronger mechanisms to manage traffic, they have an incentive to deploy ConEx and ECN together. This incentive is not sufficient to prevent a dishonest user from constructing (or configuring) a sender that enables ConEx after choosing not to negotiate ECN, but it should be sufficient to prevent this from being the sustained default case for any significant pool of users.
Permitting ConEx without ECN is necessary to facilitate bootstrapping other parts of ConEx deployment.
A receiving transport may already feed back sufficiently useful signals to the sender, in which case it does not need to be altered.
If the transport receiver does not support ECN, then its native loss signaling mechanism (required for compliance with existing congestion control standards) will be sufficient for the Sender to generate ConEx signals.
A traditional ECN implementation (RFC 3168 for TCP) signals congestion no more than once per round trip. The sender may require more precise feedback from the receiver; otherwise it is at risk of appearing to understate its ConEx Signals (see Section 3.2.1).
Ideally, ConEx should be added to a transport like TCP without mandatory modifications to the receiver. But an optional modification to the receiver could be recommended for precision (see [I-D.conex-accurate-ecn]). This was the approach taken when adding re-ECN to TCP [I-D.briscoe-tsvwg-re-ecn-tcp].
To audit ConEx Signals against actual losses (as opposed to ECN) an auditor could use one of the following techniques:
To audit ConEx Signals against actual ECN markings or losses, the auditor could work as follows: monitor flows or aggregates of flows, only holding state on a flow if it first sends a ConEx-Marked packet (Credit or either Re-Echo marking). Count the number of bytes marked with Credit or Re-Echo-ECN. Separately count the number of bytes marked with ECN. Use Credits to assure that {#ECN} <= {#Re-Echo-ECN} + {#Credit}, even though the Re-Echo-ECN markings are delayed by at least one RTT.
At the audit function, there will be an inherent delay of at least one round trip between a congestion signal and the subsequent ConEx signal it triggers, as it makes the two passes of the feedback loop in Figure 1. However, the audit function cannot be expected to wait for a round trip to check that one signal balances the other, because it is hard for a network device to know the RTT of each transport.
Instead, it considerably simplifies the audit function if the source transport is made responsible for removing the round trip delay in ConEx signals. The transport SHOULD signal sufficient credit in advance to cover any reasonably expected congestion during its feedback delay. Then, the audit function does not need to make allowance for round trip delays that it cannot quantify. This design choice correctly makes the transport responsible for both minimizing feedback delay and for the risk that packets in flight will cause congestion to others before the source can react.
For example, imagine the audit function keeps a running account of the balance between actual congestion signals (loss or ECN), which it counts as negative, and ConEx signals, which it counts as positive. Having made the transport responsible for round trip delays, it will be expected to have pre-loaded the audit function with some credit at the start. Therefore, if ever the balance does go negative, the audit function can immediately start punishing a flow, without any grace period.
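The running account described above might be sketched as follows. This is a minimal illustration under stated assumptions, not a design for an audit function: the class and parameter names are hypothetical, the balance is kept in bytes, and the `congested` argument stands for whatever direct observation of congestion (an ECN mark or an inferred loss) the auditor has available for that packet.

```python
from dataclasses import dataclass
from typing import Dict, Hashable


@dataclass
class FlowAccount:
    balance: int = 0  # bytes: ConEx signals count positive, congestion negative


class Auditor:
    """Toy per-flow audit account (all names are hypothetical)."""

    def __init__(self) -> None:
        self.flows: Dict[Hashable, FlowAccount] = {}

    def observe(self, flow_id: Hashable, size: int,
                conex_marked: bool = False, congested: bool = False) -> bool:
        """Account for one packet; return True while the flow is in credit."""
        account = self.flows.get(flow_id)
        if account is None:
            if not conex_marked:
                # No state is held until a flow sends a ConEx-marked packet.
                return True
            account = self.flows[flow_id] = FlowAccount()
        if conex_marked:   # Credit or either Re-Echo marking
            account.balance += size
        if congested:      # actual congestion signal seen for this packet
            account.balance -= size
        # A negative balance means congestion exceeded what was signaled.
        return account.balance >= 0


if __name__ == "__main__":
    auditor = Auditor()
    print(auditor.observe("flow-a", 1500, conex_marked=True))  # credit sent in advance
    print(auditor.observe("flow-a", 1500, congested=True))     # covered by the credit
    print(auditor.observe("flow-a", 1500, congested=True))     # balance goes negative
```

Because the transport is expected to have sent credit in advance, the sketch applies no grace period: the return value turns false as soon as the balance drops below zero. What sanction follows is deliberately left out.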
The one-way nature of packet forwarding probably makes per-flow state unavoidable for the audit function. This was a necessary sacrifice to avoid per-flow state elsewhere in the wider ConEx architecture. Nonetheless, care was taken to ensure that packets could bring soft-state to the audit function, so that it would continue to work if a flow shifted to a different audit device, perhaps after a reroute or an audit device failure. Therefore, although the audit function is likely to need flow state memory, at least it complies with the 'fate-sharing' design principle of the Internet [IntDesPrinciples], and at least per-flow audit is only required at the outer edges of the internetwork, where it is less of a scalability concern.
Note also that ConEx does not intend to embed rules in the network on how individual flows behave. The audit function only does per-flow processing to check the integrity of ConEx information.
There is no intention to standardise how to design or implement the audit function. However, it is necessary to lay down the following normative constraints on audit behaviour so that transport designers will know what to design against and implementers of audit devices will know what pitfalls to avoid:
Policy devices are characterised by a need to be configured with a policy related to the users or neighboring networks being served. In contrast, the auditing devices referred to in the previous section primarily enforce compliance with the ConEx protocol and do not need to be configured with any client-specific policy.
Policy devices can typically be decomposed into two functions: i) monitoring the ConEx signal to compare it with a policy, then ii) acting in some way on the result. Various actions might be invoked against 'out of contract' traffic, such as policing (see Section 4.5.3), re-routing, or downgrading the class of service.
Alternatively, a policy device might not act directly on the traffic, but instead report to management systems that are designed to control congestion indirectly. For instance, the reports might trigger capacity upgrades, invoke penalty clauses in contracts, levy charges between networks based on congestion, or merely send warnings to clients who are causing excessive congestion.
Nonetheless, whatever action is invoked, the congestion monitoring function will always be a necessary part of any policy device.
ConEx signals indicate the level of congestion along a whole path from source to destination. In contrast, when ECN signals are monitored in the middle of a network, they indicate the level of congestion experienced so far on the path.
If a monitor in the middle of a network (e.g. at a border) measures both of these signals, it can subtract the level of ECN (path so far) from the level of ConEx (whole path) to derive a measure of the congestion that packets are likely to experience between the monitoring point and their destination (rest-of-path congestion).
It will often be preferable for policy devices to monitor rest-of-path congestion if they can, because it is a measure of the downstream congestion that the policy device can directly influence by controlling the traffic passing through it.
A monitor cannot reliably measure upstream congestion if it is signaled by losses rather than ECN. Therefore a monitor can only accurately measure rest-of-path congestion if it ignores traffic from non-ECN-capable transports (Not-ECT) and if the congested queues upstream of the monitor are ECN-enabled.
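The subtraction described in this section can be illustrated with a short Python sketch. The `Packet` fields are hypothetical stand-ins for the abstract signals (no concrete header encoding is implied), and the volumes are measured in bytes. Traffic from non-ECN-capable transports is skipped, following the constraint stated above.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Packet:
    size: int           # bytes
    ect: bool           # sent by an ECN-capable transport (i.e. not Not-ECT)
    ce: bool            # carries an ECN congestion mark set upstream
    conex_marked: bool  # carries a ConEx signal (abstract, hypothetical field)


def rest_of_path_congestion(packets: Iterable[Packet]) -> Tuple[int, int, int]:
    """Return (whole_path, upstream, rest_of_path) congestion volumes in bytes."""
    whole_path = 0
    upstream = 0
    for p in packets:
        if not p.ect:
            continue  # upstream losses cannot be measured reliably
        if p.conex_marked:
            whole_path += p.size   # ConEx: congestion along the whole path
        if p.ce:
            upstream += p.size     # ECN: congestion on the path so far
    return whole_path, upstream, whole_path - upstream


if __name__ == "__main__":
    traffic = (
        [Packet(1000, True, False, False)] * 96
        + [Packet(1000, True, False, True)] * 3
        + [Packet(1000, True, True, False)]
    )
    print(rest_of_path_congestion(traffic))
```

Over short intervals the difference can be negative, because ConEx signals lag the congestion that triggers them by at least a round trip; a practical monitor would presumably compare counts accumulated over longer periods.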
A congestion policer can be implemented in a very similar way to a bit-rate policer, but its effect can be focused solely on traffic causing congestion downstream, which ConEx signals make visible. Without ConEx signals, the only way to mitigate congestion is to blindly limit traffic bit-rate, on the assumption that high bit-rate is more likely to cause congestion.
A congestion policer monitors all ConEx traffic entering a network, or some identifiable subset. Using ConEx signals (and preferably subtracting ECN signals), it measures the amount of congestion that this traffic is contributing somewhere downstream. If this exceeds a policy-configured 'congestion-bit-rate' the congestion policer can limit all the monitored ConEx traffic.
A congestion policer can be implemented by a simple token bucket. But unlike a bit-rate policer, it removes a token only when it forwards a packet that is ConEx-Marked, effectively treating Not-ConEx-Marked packets as invisible. Consequently, because tokens give the right to send congested bits, the fill-rate of the token bucket will represent the allowed congestion-bit-rate. This should provide sufficient traffic management without having to additionally constrain the straight bit-rate at all. See [CongPol] for details.
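A minimal sketch of such a token bucket is shown below. It is an illustration only: the class name, the choice of bytes as the token unit, the explicit `now` timestamp, and the decision to hold back all monitored traffic while the bucket is empty are assumptions of this example rather than details taken from [CongPol].

```python
class CongestionPolicer:
    """Toy congestion policer: tokens are spent only by ConEx-marked bytes."""

    def __init__(self, fill_rate: float, depth: float) -> None:
        self.fill_rate = fill_rate  # allowed congestion rate, bytes per second
        self.depth = depth          # maximum burst of congestion, bytes
        self.tokens = depth
        self.last = 0.0             # time of the previous update, seconds

    def forward(self, now: float, size: int, conex_marked: bool) -> bool:
        """Return True if the packet may be forwarded at time `now`."""
        # Refill at the configured congestion rate, up to the bucket depth.
        elapsed = now - self.last
        self.tokens = min(self.depth, self.tokens + elapsed * self.fill_rate)
        self.last = now
        if self.tokens <= 0:
            return False  # allowance used up: limit the monitored traffic
        if conex_marked:
            # Only ConEx-marked packets consume tokens; others are invisible.
            self.tokens -= size
        return True


if __name__ == "__main__":
    policer = CongestionPolicer(fill_rate=1000.0, depth=3000.0)
    print(policer.forward(0.0, 1500, conex_marked=False))  # costs nothing
    print(policer.forward(0.0, 1500, conex_marked=True))
    print(policer.forward(0.0, 1500, conex_marked=True))   # bucket now empty
    print(policer.forward(0.0, 1500, conex_marked=False))  # held back
    print(policer.forward(1.0, 1500, conex_marked=False))  # refilled after 1 s
```

Note that an unmarked packet never reduces the token count, so a flow that causes no congestion is not constrained by this policer regardless of its bit-rate.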
The ConEx abstract protocol described so far is intended to support incremental deployment in every possible respect. For convenience, the following list collects together all the features of ConEx that support incremental deployment, and points to further information on each:
This memo includes no request to IANA.
Note to RFC Editor: this section may be removed on publication as an RFC.
Significant parts of this whole document are about auditability of ConEx Signals, in particular Section 4.4.
{ToDo:}
This document was improved by review comments from Toby Moncaster, Nandita Dukkipati, Mirja Kuehlewind and Caitlin Bestler.
Comments and questions are encouraged and very welcome. They can be addressed to the IETF Congestion Exposure (ConEx) working group mailing list <conex@ietf.org>, and/or to the authors.
[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.