Workgroup: Network
Internet-Draft: xxxxx
Published:
Intended Status: Informational
Expires: 31 August 2023
Authors: M. Rossberg (TU Ilmenau)
         S. Klassert (secunet)
         M. Pfeiffer (TU Ilmenau)

Problem statements and use cases for lightweight Child Security Associations

Abstract

IKE SAs may have one or more child SAs that are used for traffic protection. This document collects arguments for (and against) having more fine-grained sub-child SAs. They can be used to separate data streams for various technical reasons while sharing the same security properties and traffic selectors. This shall allow for a more flexible use of IPsec in multiple scenarios.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 31 August 2023.

Table of Contents

1. Introduction

This document does not (yet) describe an addition to IPsec. Rather, it attempts to describe scenarios that do not fit the concept of a Security Association (SA) protecting a single stream of packets. Afterwards, possible solutions for those scenarios are discussed and evaluated.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2. Envisioned Scenarios

Especially for, but not limited to, intra-data-center traffic, there are several challenges when deploying coarse-grained IPsec child SAs. In particular, these challenges originate in implementing one of the following techniques, or a combination thereof:

  1. Multicore Software Processing
  2. QoS support
  3. Multipath
  4. Multicast

As these challenges share the same root cause, they are addressed jointly in this document. This root cause is the idea of an SA forming a single stream of packets that is generally in-order. In practice, the common implementation of replay protection using anti-replay windows sets an upper limit on how late packets may arrive. Before discussing possible solutions, the following sections elaborate on how the techniques above collide with the idea of a single packet stream.

2.1. Multicore Software Processing

Since IPsec is often processed in software, small-packet throughputs significantly above 10 Gbit/s are currently only achievable by scaling to multiple CPU cores. However, this scaling only works if cores do not have to synchronize tightly. In particular, it is impossible to synchronize anti-replay windows and sequence counters efficiently, even when using atomic CPU instructions. Detailed explanations may be found in [I-D.pwouters-ipsecme-multi-sa-performance]. Consequently, scaling over multiple cores leads to multiple packet streams, one per processing core. These streams may advance independently, and thus introduce packet reordering. This reordering contradicts the concept of an anti-replay window, which does not allow packets to arrive too far out of order. Consequently, packets might be dropped unpredictably.
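To illustrate the effect, the following minimal sketch shows the classic sliding-window replay check as it is commonly implemented for ESP receivers (a 64-bit bitmap; all structure and function names are purely illustrative). Once a fast core has advanced the highest accepted sequence number, packets from a slower core that lag behind by more than the window size are unconditionally dropped.

   #include <stdbool.h>
   #include <stdint.h>

   #define WINDOW_SIZE 64

   struct replay_state {
       uint64_t top;    /* highest sequence number accepted so far */
       uint64_t bitmap; /* bit i set => (top - i) already seen     */
   };

   static bool replay_check_and_update(struct replay_state *st, uint64_t seq)
   {
       if (seq > st->top) {                  /* window moves forward */
           uint64_t shift = seq - st->top;
           st->bitmap = (shift >= WINDOW_SIZE) ? 0 : st->bitmap << shift;
           st->bitmap |= 1;                  /* mark the new top     */
           st->top = seq;
           return true;
       }
       if (st->top - seq >= WINDOW_SIZE)     /* too old: dropped     */
           return false;
       uint64_t bit = 1ULL << (st->top - seq);
       if (st->bitmap & bit)                 /* replayed: dropped    */
           return false;
       st->bitmap |= bit;
       return true;
   }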

2.2. Implementing QoS mechanisms

Similarly, traffic may be categorized into different classes to provide quality of service. QoS classes are not part of the traffic selector of a child SA, so using different QoS classes for the same traffic selector will introduce reordering of packets within a child SA. In contrast to multicore software processing, this type of packet reordering is intentional rather than accidental. The consequences, however, are comparable.

2.3. Multipath

A sender may also decide to send packets to a single receiver via multiple paths, e.g., by using multiple uplinks in an SD-WAN scenario. Depending on the characteristics of the uplinks, this shows similarities to the multicore scenario (uplinks with relatively similar characteristics) or the QoS scenario (uplinks with rather different characteristics).

2.4. Multicast

A multicast scenario with only a single sender does not pose an issue, as the sender can simply increment its sequence counter. Each receiver has a complete view of the traffic and can thus maintain its replay window as usual. But as soon as there are multiple senders, they would need to coordinate their sequence number usage, which is even less efficiently implementable than in the multicore case. Therefore, replay protection is usually disabled in multicast scenarios.

3. Requirements

Besides the obvious requirement of not impairing security, the following shall be considered:

  1. Deterministic performance
  2. Scalability
  3. Robustness
  4. Simple implementation

4. Discussion of possible approaches

There are several approaches to deal with the presence of multiple independent packet streams introduced by the scenarios in Section 2:

  1. Disabling replay protection
  2. Using multiple IKE SAs
  3. Using multiple child SAs
  4. Increasing anti-replay window sizes
  5. Using sub-child SAs

4.1. Disabling Replay Protection

A straightforward solution would be to simply disable replay protection.

Advantages:

  • Trivially solves all the reordering and synchronization issues discussed previously. Note: This may still violate existing RFCs, which require sequence numbers to be generated in order, but this violation should not have an impact.

Disadvantages:

  • The approach significantly lowers the level of security. Although most upper-layer protocols (e.g., TCP) provide protection from duplicated data, this cannot be assumed in the general case. Even if the duplicates are never delivered to a user application, they usually do trigger responses from the receiver's network stack, e.g., TCP RSTs or ICMP errors. This in turn enables an attacker to trigger ciphertext generation, possibly facilitating subsequent attacks. Such attacks were used in practice against WiFi encryption in the early 2000s.
  • It is unclear how an SA protecting multiple plaintext flows can be distributed to multiple cores on the receiver. Receive-Side Scaling (RSS) or explicit steering rules need some indication which packets carry the same plaintext flow and thus need to be sent to the same core. Otherwise, intra-flow reordering is introduced, which may severely disturb higher level protocols, e.g., TCP's congestion control or VoIP audio streams. Thus, efficient multicore processing is not possible for the receiver.

This approach may be acceptable for specific scenarios (e.g., multicast), but not for the general case. It is especially problematic for any multicore scenarios, as the status quo without parallelization provides replay protection. This approach is therefore not discussed any further.

4.2. Using multiple IKE SAs

For some scenarios, it might be reasonable to set up multiple, separate IKE SAs.

Advantages:

  • As there are independent sequence numbers and anti-replay windows, there is no need to synchronize between multiple CPU cores or senders.
  • Distinct SPIs allow RSS or explicit steering, and thus enable processing without reordering.
  • No changes to existing standards required.

Disadvantages:

  • There is time and communication overhead, as the negotiation of every IKE SA requires network round trips, packet processing, asymmetric cryptography, etc. The initial setup could be accelerated by a reactive instead of a proactive SA negotiation, i.e., delaying the setup of the SA for a specific core or QoS class until the first packet arrives on that core or with the respective QoS tag. However, this is a highly debatable strategy, as it induces either drops or large delays for the initial packets of these flows.
  • There is a state/memory overhead due to completely separate state of every SA, e.g., traffic selectors, keys, lifetimes. To a large extent, these states will hold identical information.
  • During operation, there is overhead due to the regular rekeying of each SA and, if enabled, dead peer detection.
  • Additional effort must be made to configure the required number of SAs. Furthermore, monitoring larger networks becomes more complex, because multiple SAs now map to identical connections.
  • The failure model is unspecified if a subset of the IKE SAs cannot be established. For example, in the multicore scenario, this leads to packet loss or at least performance fluctuations on some plaintext flows, depending on the core they are processed on. Such situations historically have a bad track record, e.g., partially loading websites with (non-persistent) HTTP; SIP-working-but-RTP-failing conditions in VoIP, etc.

In summary, the main issue of this approach is scalability. It may be appropriate for certain scenarios where the total number of additional IKE SAs is low. It is not suited for general usage in large deployments. In particular, deploying several of the techniques described in Section 2 leads to a combinatorial explosion in the number of required SAs. For example, if one intends to transport traffic with 8 QoS classes between two gateways with 32 cores, 256 SAs would be required between these two gateways alone. Even if the data plane and IKE daemon can support such a setup, too much complexity may be pushed into the operational domain. Therefore, this approach is not generally applicable.

4.3. Using multiple (per-CPU) child SAs

This approach has been proposed recently as a draft [I-D.pwouters-ipsecme-multi-sa-performance]. The draft is restricted to the multicore scenario outlined in Section 2.1. It is similar to establishing multiple IKE SAs, but avoids a significant portion of their overhead by restricting the multiple instantiations to child SAs.

Advantages:

  • There is significantly less overhead compared to setting up independent IKE SAs.
  • As there are independent sequence numbers and anti-replay windows, there is no need to synchronize between multiple CPU cores or senders.
  • Distinct SPIs allow RSS or explicit steering, and thus enable processing without reordering.
  • The draft incurs only a small change in standards and existing source code, as multiple child SAs are already possible in IKEv2 [RFC7296], and the draft simply adds a mechanism to negotiate them explicitly.

Disadvantages:

  • Due to the setup of child SAs via separate CREATE_CHILD_SA exchanges, there is still communication overhead, especially for larger numbers of SAs. As for multiple IKE SAs, both a proactive and a reactive setup are possible, resulting in a longer establishment time or less predictable runtime behavior, respectively.
  • There is still some per-child-SA state overhead in the data plane. However, as the IKE daemon knows that those SAs are per-queue children of the same IKE SA, an optimized implementation might be able to reduce that overhead to a minimum.
  • During operation, there is overhead traffic due to the regular rekeying.
  • Similar to separate IKE SAs, there is the possibility of a partially working SA if some of the child SAs fail to set up. It is not immediately clear what the correct reaction should be, especially in the scope of a large VPN deployment, compared to the all-or-nothing failure model when parallel child SAs are not used.

Using multiple child SAs is a significant step forward for the multicore scenario. It is a simple (in the positive sense), straightforward solution harvesting low-hanging fruit. But this simplicity inherits some drawbacks from the multiple-IKE-SAs approach, caused by the independence of the child SAs with regard to setup, state, rekeying, and failure. These disadvantages get worse the more child SAs are required. Therefore, the per-CPU child SA approach is not an ideal fit for the other scenarios described in Section 2, or for a combination of the scenarios.

4.4. Increasing anti-replay window sizes

This approach differs from the previous two as it does not attempt to create multiple replay windows, but to accommodate the traffic within a single anti-replay window. This fits the QoS scenario depicted in Section 2.2, provided that higher-prioritized traffic does not advance the anti-replay window too far for the lower-prioritized traffic. The idea is not applicable to the multicore or multicast scenarios, as larger windows can only solve the problem of packets being reordered by the network, but do not allow unsynchronized sequence counters (as, e.g., [RFC4303] requires strict monotonicity).

Advantages:

  • No changes to standards are required, as the anti-replay window size is a local matter.
  • The approach inherits the advantages of a single child SA, e.g., there is no setup overhead, less state overhead than with multiple child SAs (only the larger replay windows) and no complex failure model.

Disadvantages:

  • Even in software implementations, the anti-replay windows cannot grow indefinitely large. Especially in latency-sensitive deployments, i.e., where one would use QoS, achieving throughput above 10 Gbit/s depends on the ability to keep state in the CPU caches, even for a larger number of peers.
  • Complex configuration: Choosing a correct window size depends not only on the number of QoS classes, but also on the maximum divergence of sequence numbers, which in turn depends on the QoS configuration, the possible throughput, and the traffic mix.

As discussed previously, this approach is only suitable for the QoS and multipath scenarios. A comparison with other mechanisms requires an estimation of the required window sizes. The time low-priority packets may be delayed by shapers and queues depends on many parameters, e.g., the actual and admitted traffic rates, the admissible burst sizes, strict-priority scheduling, etc.

An attempt to simplify the problem is to make windows large enough to admit packets that are delayed up to a certain time threshold T. Consider a packet being "stuck" in the network due to other packets being prioritized. Those packets advance the replay window. Let their Ethernet size be S and their throughput TP. It makes sense for TP to be an interface speed; otherwise, the delayed packet would not be stuck. We therefore end up with the following packet rates R:

Table 1: Packet rates

   S [byte]   TP [Gbit/s]   R [Mp/s]
   64         10             14.881
   1518       10              0.813
   64         100           148.810
   1518       100             8.127

For T = 100 ms, this would mean that the windows must, in the worst case, accommodate between 80,000 and 14.8 million packets. It might be argued that the higher boundary is currently unrealistic, as it would require a 100 Gbit/s link to be saturated with small, prioritized packets. On the other hand, 100 ms is the acceptable delay for VoIP, whereas for applications with low priority demands, it might make sense to deliver even older packets.
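The rates in Table 1 and the resulting window sizes can be reproduced by dividing the line rate by the on-wire frame size; the short sketch below assumes the usual 20 bytes of preamble and inter-frame gap per Ethernet frame (which yields the values in the table) and uses purely illustrative names.

   #include <stdio.h>

   /* Reproduces the packet rates of Table 1 and the worst-case window
    * sizes for a maximum tolerated delay T. Assumes 20 bytes of preamble
    * and inter-frame gap per Ethernet frame. */
   int main(void)
   {
       const double T = 0.100;                 /* tolerated delay [s] */
       const double sizes[] = { 64, 1518 };    /* frame size [byte]   */
       const double rates[] = { 10e9, 100e9 }; /* line rate [bit/s]   */

       for (int r = 0; r < 2; r++) {
           for (int s = 0; s < 2; s++) {
               double pps = rates[r] / ((sizes[s] + 20.0) * 8.0);
               printf("S=%4.0f byte, TP=%3.0f Gbit/s: R=%7.3f Mp/s, "
                      "window >= %.0f packets\n",
                      sizes[s], rates[r] / 1e9, pps / 1e6, pps * T);
           }
       }
       return 0;
   }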

4.5. Using Sub-Child SAs

The final possibility is standardizing a new approach that tries to combine the advantages of the approaches discussed previously. In essence, it is the idea of allowing multiple sequence counters (and thus multiple anti-replay windows) per child SA. These sequence counters must be able to increment independently of each other, making the approach applicable to all outlined scenarios. It is also possible to think of the individual counter/window pairs as sub-SAs within a child SA.

First of all, receivers must be able to distinguish those sub-SAs. There are multiple possibilities to achieve this:

  • Using the SPI: The SPI would be allocated per sub-SA, i.e., a range of SPIs would belong to a single child SA. Therefore, it is possible to embed, e.g., the ID of the sending core in some bits of the SPI (see the sketch after this list).
  • Using the sequence number: Some bits of the sequence number would be used to indicate the sub-SA, as proposed in [I-D.ponchon-ipsecme-anti-replay-subspaces]. This approach reduces the available sequence numbers. Note that the consequences depend on whether the traffic is distributed evenly among the individual sub-SAs (e.g., multicore scenario) or not (e.g., QoS scenario).
  • Using an additional field: Of course, it is also possible to introduce a new field to the ESP header. This can lead to a simpler design, but also constitutes the largest change to existing standards.
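As an illustration of the first option, a sub-SA index could be carried in the low-order bits of the SPI. The bit split, constants, and helper names below are hypothetical and only meant as an example, not a proposal.

   #include <stdint.h>

   /* Hypothetical SPI layout: the upper bits identify the child SA, the
    * lower SUB_SA_BITS select the sub-SA (e.g., the sending core). */
   #define SUB_SA_BITS 6                 /* up to 64 sub-SAs per child SA */
   #define SUB_SA_MASK ((1u << SUB_SA_BITS) - 1)

   static inline uint32_t spi_compose(uint32_t child_base, uint32_t sub_sa_id)
   {
       return (child_base << SUB_SA_BITS) | (sub_sa_id & SUB_SA_MASK);
   }

   static inline uint32_t spi_sub_sa(uint32_t spi)
   {
       return spi & SUB_SA_MASK;
   }

   static inline uint32_t spi_child(uint32_t spi)
   {
       return spi >> SUB_SA_BITS;
   }

A receiver could use the low-order bits both to select the correct anti-replay window and as a match key for NIC steering rules.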

In any case, the approach necessitates some additional clarifications:

  • The receiver may use the steering capabilities of its NIC to map ingress packets to its sub-SAs, e.g., to different queues, to allow for efficient multicore utilization. This is especially important for the multicore scenario, as software redirects to other cores must be avoided for performance reasons. The simplest case is the sub-SA being encoded in the SPI, as many NICs already provide features for matching on SPIs. For the other two distinguishing mechanisms, flexible or raw matchers may be used.
  • The setup and renewal of sub-SAs should happen in bulk, i.e., there is only one exchange to set up the child SA. This leads to reliable performance characteristics, as there is no on-demand sub-SA creation. Furthermore, the failure model is very simple: The child SA with all its sub-SAs exists, or it does not.
  • Only the sequence counters and anti-replay windows would be allocated per sub-SA.
  • All other properties of the SA are kept per child SA, i.e., traffic selectors and mode, but also the key material (see the state sketch after this list). Using the same key for all sub-SAs needs to be done with care to avoid effects on security (details follow below). However, if there were different keys, neither the scalability (bulk setup and rekeying) nor the predictable failure model would be achievable.
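The following sketch of a possible state layout illustrates this split; all names and sizes are assumptions for illustration only, not part of any proposal.

   #include <stdint.h>

   /* Per-sub-SA state: only what must advance independently. */
   struct sub_sa_state {
       uint64_t tx_seq;       /* independent sequence counter            */
       uint64_t rx_top;       /* highest sequence number accepted so far */
       uint64_t rx_bitmap;    /* 64-packet anti-replay window            */
   };

   /* Per-child-SA state: everything else is shared by all sub-SAs. */
   struct child_sa_state {
       uint32_t spi_base;     /* e.g., first SPI of the allocated range  */
       uint8_t  key[32];      /* single key shared by all sub-SAs        */
       /* traffic selectors, mode, lifetimes, ... are also shared */
       unsigned n_sub_sas;
       struct sub_sa_state sub[];
   };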

Using a single key for multiple sub-SAs has implications on security:

  • It must be ensured that this approach cannot lead to reused IVs for counter modes. For example, in the case of AES-GCM [RFC4106], this means either the salt must be different for each sub-SA, or the IV space must be partitioned accordingly. Note that partitioning the IV space is not possible with implicit IV modes ([RFC8750]), as [RFC4303] requires sequence numbers to be initialized to zero.
  • Hard limits for packet and byte counters must be scaled accordingly. For example, if no more than 2^64 packets should be transmitted using a given key, and the child SA consists of 2^8 sub-SAs, then no sub-SA must be allowed to send more than 2^56 packets if no fine-grained synchronization is possible. If transmission happens on the same CPU core, overcommitting may be possible as long as it is ensured that the total packet or byte limit is never exceeded.
  • Rekey limits must apply to all sub-SAs combined. For example, if a child SA is configured to be rekeyed after transmission of X bytes or Y packets, then the rekey must be triggered when the sum of bytes or packets over all sub-SAs reaches X or Y. For situations where overcommitting is not possible, we suggest referencing the sub-SA with the maximum number of bytes/packets already sent, say X'_max and Y'_max. X'_max and Y'_max are multiplied by the number of sub-SAs, and if either value exceeds X or Y, respectively, a rekeying is initiated (a sketch follows after this list).
  • In case SPIs or an explicit header field are used to encode sub-SAs, it may (theoretically) become possible to send more than 2^64 packets using a single key. This may pose a problem for ciphers such as AES-GCM. In this case, a hard limit of at most 2^64 packets MUST be enforced.
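To make the suggested conservative rekey rule concrete, the sketch below scales the busiest sub-SA's counters by the number of sub-SAs and compares the result against the child SA's limits; it avoids any cross-core synchronization, since each counter is written only by its owning core. All names are illustrative.

   #include <stdbool.h>
   #include <stdint.h>

   struct sub_sa_counters {
       uint64_t bytes;     /* written only by the owning core */
       uint64_t packets;
   };

   /* Conservative check: rekey once the busiest sub-SA, scaled by the
    * number of sub-SAs, would exceed the per-child-SA limits X and Y. */
   static bool child_sa_needs_rekey(const struct sub_sa_counters *subs,
                                    unsigned n_subs,
                                    uint64_t limit_bytes,   /* X */
                                    uint64_t limit_packets) /* Y */
   {
       uint64_t max_bytes = 0, max_packets = 0;

       for (unsigned i = 0; i < n_subs; i++) {
           if (subs[i].bytes > max_bytes)
               max_bytes = subs[i].bytes;
           if (subs[i].packets > max_packets)
               max_packets = subs[i].packets;
       }
       /* Comparing against limit / n_subs avoids 64-bit overflow and is
        * equivalent to the "multiply by the number of sub-SAs" rule up
        * to integer rounding. */
       return max_bytes >= limit_bytes / n_subs ||
              max_packets >= limit_packets / n_subs;
   }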

Advantages:

  • Independent sequence numbers and anti-replay windows are available.
  • The approach allows for RSS or explicit steering, especially if the SPI-encoding is used.
  • Most scalable approach: The child SA setup requires exchanging, e.g., an SPI range, but does not depend on the number of sub-SAs allocated. Similarly, only an ID, a sequence counter, and an anti-replay window need to be stored per sub-SA; the remainder of the state can be shared.
  • There is no additional rekeying overhead, as just a single child SA needs to be rekeyed.
  • Predictable performance characteristics due to the batched, proactive establishment.
  • Clean failure model due to the all-or-nothing setup.

Disadvantages:

  • There are potential security implications, which must be discussed thoroughly, to avoid weakening security at any point.
  • The change in the data plane may seem a bit more complex compared to per-CPU child SAs. Nevertheless, fallback SAs as mentioned in [I-D.pwouters-ipsecme-multi-sa-performance] are avoided.

Compared to setting up separate IKE or child SAs, it might be argued that the idea of sub-SAs keeps the complexity and overhead away from the VPN's operation. Furthermore, storing an SPI, a 64-bit sequence number, and a replay window for 64 packets for each of 64 different QoS classes requires a total of 10240 bit. This is significantly less than even the lower boundary established for the approach described in Section 4.4. However, of the discussed alternatives, it is the most complex change to existing standards and implementation semantics.
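For reference, this figure follows from a 32-bit SPI, a 64-bit sequence counter, and a 64-bit replay bitmap per sub-SA: (32 + 64 + 64) bit * 64 = 160 bit * 64 = 10240 bit.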

5. Remark on steering

Please note: For any of the approaches, it is essential that the receiver steers traffic generated by a given CPU core of the sender to a determined CPU core that handles the incoming traffic. For example, if two CPU cores at the sender generate large amounts of traffic in one QoS class, it is not sufficient to only perform RSS over the child SAs or sub-child SAs, as this would not prevent the two streams from being mapped to the same receiver CPU.

6. IANA Considerations

This memo includes no request to IANA.

7. Security Considerations

TODO: In its current state, this draft discusses multiple alternatives. Please refer to Section 4 for a discussion including remarks on security.

8. References

8.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/info/rfc8174>.
[RFC7296]
Kaufman, C., Hoffman, P., Nir, Y., Eronen, P., and T. Kivinen, "Internet Key Exchange Protocol Version 2 (IKEv2)", STD 79, RFC 7296, DOI 10.17487/RFC7296, <https://www.rfc-editor.org/info/rfc7296>.
[RFC4303]
Kent, S., "IP Encapsulating Security Payload (ESP)", RFC 4303, DOI 10.17487/RFC4303, <https://www.rfc-editor.org/info/rfc4303>.
[RFC4106]
Viega, J. and D. McGrew, "The Use of Galois/Counter Mode (GCM) in IPsec Encapsulating Security Payload (ESP)", RFC 4106, DOI 10.17487/RFC4106, <https://www.rfc-editor.org/info/rfc4106>.
[RFC8750]
Migault, D., Guggemos, T., and Y. Nir, "Implicit Initialization Vector (IV) for Counter-Based Ciphers in Encapsulating Security Payload (ESP)", RFC 8750, DOI 10.17487/RFC8750, <https://www.rfc-editor.org/info/rfc8750>.

8.2. Informative References

[I-D.pwouters-ipsecme-multi-sa-performance]
Antony, A., Brunner, T., Klassert, S., and P. Wouters, "IKEv2 support for per-queue Child SAs", Work in Progress, Internet-Draft, draft-pwouters-ipsecme-multi-sa-performance-05, <https://datatracker.ietf.org/doc/html/draft-pwouters-ipsecme-multi-sa-performance-05>.
[I-D.ponchon-ipsecme-anti-replay-subspaces]
Ponchon, P., Shaikh, M., Pfister, P., and G. Solignac, "IPsec and IKE anti-replay sequence number subspaces for multi-path tunnels and multi-core processing", Work in Progress, Internet-Draft, draft-ponchon-ipsecme-anti-replay-subspaces-00, <https://datatracker.ietf.org/doc/html/draft-ponchon-ipsecme-anti-replay-subspaces-00>.

Authors' Addresses

Michael Rossberg
Technische Universität Ilmenau
Helmholtzplatz 5
98693 Ilmenau
Germany
Steffen Klassert
secunet Security Networks AG
Ammonstrasse 74
01067 Dresden
Germany
Michael Pfeiffer
Technische Universität Ilmenau
Helmholtzplatz 5
98693 Ilmenau
Germany