Internet-Draft Safe CC March 2023
Mathis Expires 11 September 2023 [Page]
Workgroup:
WG Working Group
Internet-Draft:
draft-mathis-tsvwg-safecc-01
Published:
Intended Status:
Experimental
Expires:
Author:
M. Mathis
MLab

Safe Congestion Control

Abstract

We present criteria for evaluating Congestion Control for behaviors that have the potential to cause harm to other Internet applications or users.

Although our primary focus is the safety of transport layer congestion control, many of these criteria need to be applied to all protocol layers: entire stacks, libraries and applications themselves.

About This Document

This note is to be removed before publishing as an RFC.

Status information for this document may be found at https://datatracker.ietf.org/doc/draft-mathis-tsvwg-safecc/.

Discussion of this document takes place on the TSVWG Working Group mailing list (mailto:tsvwg@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/tsvwg/. Subscribe at https://www.ietf.org/mailman/listinfo/tsvwg/.

Source for this draft and an issue tracker can be found at https://github.com/mattmathis/safeCC/.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 11 September 2023.

Table of Contents

1. Introduction

We present criteria for evaluating Congestion Control for behaviors that have the potential to cause harm to other Internet applications or users. Although our primary focus is the safety of congestion control, many of these criteria need to be applied to all protocol layers: entire stacks, libraries and applications themselves.

Ideally we would like to cast these criteria as requirements; however such an effort will fail because many of them have known exceptions that don't seem to be important.

As an interim position: all implementations SHOULD comply with all criteria, and MUST document all exceptions: under what circumstances and how severely they fail to comply?

The open research question will be deciding which exceptions can be tolerated and which are grounds for preventing protocols or algorithms from progressing into full standards.

To prove the criteria described in the note they should be used to evaluate current and legacy algorithms: we expect to find alignment between known implementation pathologies and failed criteria. Discrepancies may suggest additional criteria or sharpen our understanding of how to decide if a failed criteria is material or not.

The phrase "under adverse conditions" refers to any increase in any congestion signals (loss, delay, marks or reduced queue space or capacity) from any starting state. For example introducing 1 Mb/s cross traffic to an otherwise ideal 10 Gb/s link is an adverse condition that SHOULD NOT trigger any of the misbehaviors indicated below.

2. Conventions and Definitions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3. Tentative list of criteria

  1. Free from regenerative congestion - Adverse conditions do not cause additional presented load.
  2. Free from congestion collapse - Adverse conditions do not cause declining goodput / overhead ratio
  3. Bound control frequency - Control frequency scales with 1/rtt but is insensitive to data rate.
  4. Bound steady state losses - Steady state bulk transport should not cause more than 2\% loss over any unchanging network.
  5. Bound slowstart duration and loss - Slowstart into a droptail queue should not cause more than one RTT of loss nor cause more than 50% loss for that RTT. e.g. Provisional window/rate reductions should start when losses/disorder is first detected, even before the loss recovery can decide if the missing segments are due to reordering or loss.
  6. Bound losses on link changes - Step changes in link properties (RTT, bandwidth or queue size) or cross traffic should not cause losses that are larger than the change in maximum flight size supported by the link. Specifically, during loss recovery the transport is not permitted to send more data than reported at the receiver. (Conservative property from PRR.)
  7. No unnecessary slowstarts - All application stacks must use connection caching, CC state caching or some other mechanism such that application workloads are prevented from causing persistent or repeated overlapping slowstarts.
  8. Monotonic response - The CCA should have monotonic response to all congestion signals that it responds to (loss, marks, delay, etc) otherwise it will have multiple stable operating points for the same network conditions. It would be likely to exhibit stable pathologies such as latecomer (dis)advantage.
  9. Freedom from starvation - (need a new strong definition). Flows below some resource threshold (data rate, window size, ConEx marks, etc) will successfully search upwards, as long as there is either idle capacity or other flows above the same(?) threshold.
  10. Bound standing queue - Do not create steady state standing queues larger than k*minRTT*maxBW, for some prescribed k, to be defined.
  11. Maintain queue headroom - Individual flows do not keep queues pegged at full even if the queues are substantially smaller than minRTT*maxBW. When there is queue full, CC should reduce its window enough to create some small headroom to prevent locking out new flows
  12. Balanced probe size - Balance the worst case queue backlog against the need to trigger mode shifting in links that batch data. This should become a global (policy) parameter of the Internet, because the queue backlogs force jitter on flows trying to do realtime without QoS.
  13. Self scaling - at all layers. If the network is too slow, the application must also slow down to avoid "stacking" requests.

4. Security Considerations

TODO Security

This document provides evaluation criteria for Congestion control and other implementation or algorithms that might be deployed on the internet. It has no direct security considerations of its own.

Over the long haul it is expected to increase the overall robustness of the Internet by helping to eliminate pathological congestion behaviors that have the potential cause the Internet to be fragile under some conditions.

5. IANA Considerations

This document has no IANA actions.

6. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, , <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, , <https://www.rfc-editor.org/rfc/rfc8174>.

Acknowledgments

TODO acknowledge.

Author's Address

Matt Mathis
Freelance, Measurement Lab