Problem statements and use cases for lightweight Child Security Associations

Internet-Draft	Case for Sub-Child SAs	February 2023
Rossberg, et al.	Expires 31 August 2023

Abstract

IKE SAs may have one or more child SAs that are used for traffic protection. This document collects arguments for (and against) having more fine-grained sub-child-SAs. They can be used to separate data streams for various technical reasons but share the same security properties and traffic selectors. This shall allow for a more flexible use of IPsec in multiple scenarios.¶

4. Discussion of possible approaches

There are several approaches to deal with the presence of multiple independent packet streams introduced by the scenarios in Section 2:¶

Disabling replay protection¶
Using multiple IKE SAs¶
Using multiple child SAs¶
Increasing anti-replay window sizes¶
Using sub-child SAs¶

4.1. Disabling Replay Protection

A straightforward solution would be to simply disable replay protection.¶

Advantages:¶

Trivially solves all the reordering and synchronization issues discussed previously. Note: This may still violate existing RFCs, which require sequence numbers to be generated in order, but this violation should not have an impact.¶

Disadvantages:¶

The approach significantly lowers the level of security. Although most upper layer protocols (e.g., TCP) provide protection from duplicated data, this cannot be assumed for the general case. Even if the duplicates are never delivered to a user application, they usually do trigger responses from the receiver's network stack, e.g., TCP RSTs or ICMP errors. This in turn enables an attacker to trigger ciphertext generation, possibly facilitating subsequent attacks. Such attacks were used in practice against WiFi encryption in the early 2000s.¶
It is unclear how an SA protecting multiple plaintext flows can be distributed to multiple cores on the receiver. Receive-Side Scaling (RSS) or explicit steering rules need some indication of which packets carry the same plaintext flow and thus need to be sent to the same core. Otherwise, intra-flow reordering is introduced, which may severely disturb higher level protocols, e.g., TCP's congestion control or VoIP audio streams. Thus, efficient multicore processing is not possible for the receiver.¶

This approach may be acceptable for specific scenarios (e.g., multicast), but not for the general case. It is especially problematic for any multicore scenarios, as the status quo without parallelization provides replay protection. This approach is therefore not discussed any further.¶

4.2. Using multiple IKE SAs

For some scenarios, it might be reasonable to set up multiple, separate IKE SAs.¶

Advantages:¶

As there are independent sequence numbers and anti-replay windows, there is no need to synchronize between multiple CPU cores or senders.¶
Distinct SPIs allow RSS or explicit steering, and thus enable processing without reordering.¶
No changes to existing standards required.¶

Disadvantages:¶

There is a time and communication overhead due to the negotiation of every IKE SA requiring network round trips, packet processing, asymmetric cryptography, etc. The initial setup could be accelerated by a reactive instead of a proactive SA negotiation, i.e., delaying the setup of the SA for a specific core or QoS class until the first packet arrives on the core or with the respective QoS tag. However, this is a highly debatable strategy, as it induces either drops or large delays for the initial packets of these flows.¶
There is a state/memory overhead due to completely separate state of every SA, e.g., traffic selectors, keys, lifetimes. To a large extent, these states will hold identical information.¶
During operation, there is overhead due to the regular rekeying of each SA and, if enabled, dead peer detection.¶
Additional effort is required to configure the necessary number of SAs. Furthermore, monitoring larger networks becomes more complex because multiple SAs now map to the same connection.¶
The failure model is unspecified if a subset of the IKE SAs cannot be established. For example, in the multicore scenario, this leads to packet loss or at least performance fluctuations on some plaintext flows, depending on the core they are processed on. Such situations historically have a bad track record, e.g., partially loading websites with (non-persistent) HTTP; SIP-working-but-RTP-failing conditions in VoIP, etc.¶

In summary, the main issue of this approach is scalability. It may be appropriate for certain scenarios where the total number of additional IKE SAs is low, but it is not suited for general usage in large deployments. In particular, combining several of the techniques described in Section 2 leads to a combinatorial explosion of the number of required SAs. For example, if one intends to transport traffic with 8 QoS classes between two gateways with 32 cores, 8 x 32 = 256 SAs would already be required solely between these two gateways. Even if the data plane and IKE daemon can support such a setup, too much complexity may be pushed into the operational domain. Therefore, this approach is not generally applicable.¶
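The growth in the number of SAs can be illustrated with a small calculation. The sketch below assumes that every combination of sending core, QoS class, and (optionally) path needs its own SA; the function name and the `paths` parameter are illustrative and not part of any specification.

```python
def required_sas(cores: int, qos_classes: int = 1, paths: int = 1) -> int:
    """Number of SAs if every core/QoS class/path combination gets its own SA."""
    return cores * qos_classes * paths


# The example from the text: 8 QoS classes between gateways with 32 cores.
print(required_sas(cores=32, qos_classes=8))

# Assumed extension: adding two paths doubles the count again.
print(required_sas(cores=32, qos_classes=8, paths=2))
```

Each additional dimension multiplies the count, which is why the approach scales poorly once scenarios are combined.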

4.3. Using multiple (per-CPU) child SAs

This approach has been proposed recently as a draft [I-D.pwouters-ipsecme-multi-sa-performance]. The draft is restricted to the multicore scenario outlined in Section 2.1. It is similar to establishing multiple IKE SAs, but avoids a significant portion of their overhead by restricting the multiple instantiations to child SAs.¶

Advantages:¶

There is significantly less overhead compared to setting up independent IKE SAs.¶
As there are independent sequence numbers and anti-replay windows, there is no need to synchronize between multiple CPU cores or senders.¶
Distinct SPIs allow RSS or explicit steering, and thus enable processing without reordering.¶
The draft incurs only a small change in standards and existing source code, as multiple child SAs are already possible in IKEv2 [RFC7296], and the draft simply adds a mechanism to negotiate them explicitly.¶

Disadvantages:¶

Because child SAs are set up via separate CREATE_CHILD_SA exchanges, there is still communication overhead, especially for larger numbers of SAs. As with multiple IKE SAs, either a proactive or a reactive setup is possible, resulting in a longer establishment time or a less predictable runtime behavior, respectively.¶
There is still some per-child-SA state overhead in the data plane. However, as the IKE daemon knows that those SAs are per-queue children of the same IKE SA, an optimized implementation might be able to reduce that overhead to a minimum.¶
During operation, there is overhead traffic due to the regular rekeying.¶
Similar to separate IKE SAs, there is the possibility of a partially working SA if some of the child SAs fail to set up. It is not immediately clear what the correct reaction should be, especially in the scope of a large VPN deployment, compared to the all-or-nothing failure model when parallel child SAs are not used.¶

Using multiple child SAs is a significant step forward for the multicore scenario. It is a simple (in the positive sense), straightforward solution that harvests low-hanging fruit. But this simplicity inherits some drawbacks from the multiple-IKE-SAs approach, caused by the independence of the child SAs regarding setup, state, rekeying and failure. These disadvantages get worse the more child SAs are required. Therefore, the per-CPU child SAs approach is not an ideal fit for the other scenarios described in Section 2, or for a combination of the scenarios.¶

4.4. Increasing anti-replay window sizes

This approach differs from the previous two as it does not attempt to create multiple replay windows, but to accommodate the traffic within a single anti-replay window. This fits the QoS scenario depicted in Section 2.2, provided that higher-prioritized traffic does not advance the anti-replay window too far for the lower-prioritized traffic. The idea is not applicable to the multicore or multicast scenarios, as larger windows can only solve the problem of packets being reordered by the network, but do not allow unsynchronized sequence counters (as, e.g., [RFC4303] requires strict monotonicity).¶
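The effect of the window size on delayed low-priority packets can be sketched with a simple bitmap-based sliding window. This is a minimal illustration in the spirit of the anti-replay check in [RFC4303], not a reference implementation; the class and function names are assumptions made for this example, and 32-bit sequence number wrap-around and ESN handling are ignored.

```python
class ReplayWindow:
    """Sliding anti-replay window; the bitmap is kept in a Python int."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.top = 0      # highest sequence number accepted so far
        self.bitmap = 0   # bit i set => sequence number (top - i) was seen

    def accept(self, seq: int) -> bool:
        if seq < 1:
            return False  # sequence numbers are assumed to start at 1
        if seq > self.top:
            # Packet advances the window: shift the bitmap and mark bit 0.
            shift = seq - self.top
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.size) - 1)
            self.top = seq
            return True
        offset = self.top - seq
        if offset >= self.size:
            return False  # fell out of the window (too old)
        if self.bitmap & (1 << offset):
            return False  # already seen (replay)
        self.bitmap |= 1 << offset
        return True


def delayed_packet_survives(window_size: int, overtaking: int) -> bool:
    """Packet 1 is delayed while `overtaking` prioritized packets pass it."""
    window = ReplayWindow(window_size)
    for seq in range(2, 2 + overtaking):
        window.accept(seq)
    return window.accept(1)


# 99 prioritized packets overtake the delayed one.
print(delayed_packet_survives(window_size=64, overtaking=99))   # dropped
print(delayed_packet_survives(window_size=128, overtaking=99))  # accepted
```

The delayed packet is only accepted if the window is larger than the number of packets that overtook it, which is what ties the window size to the QoS configuration and traffic mix.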

Advantages:¶

No changes to standards are required, as the anti-replay window size is a local matter.¶
The approach inherits the advantages of a single child SA, e.g., there is no setup overhead, less state overhead than with multiple child SAs (only the larger replay windows) and no complex failure model.¶

Disadvantages:¶

Even in software implementations, the anti-replay windows cannot grow indefinitely large. Especially in latency-sensitive deployments, i.e., where one would use QoS, achieving throughput above 10 Gbit/s depends on the ability to keep state in the CPU caches, even for a larger number of peers.¶
Complex configuration: Choosing a correct value for the window size depends not only on the number of QoS classes, but also on the maximum divergence of sequence numbers, which in turn depends on the QoS configuration, the possible throughput and the traffic mix.¶

As discussed previously, this approach is only suitable for the QoS and multipath scenarios. A comparison with other mechanisms requires an estimation of the required window sizes. The time by which low-priority packets may be delayed by shapers and queues depends on many parameters, e.g., the actual and admitted traffic rates, the admissible burst sizes, strict-priority scheduling, etc.¶

An attempt to simplify the problem is to make windows large enough to admit packets that are delayed up to a certain time threshold T. Consider a packet being "stuck" in the network due to other packets being prioritized. Those packets advance the replay window. Let their Ethernet size be S and their throughput TP. It makes sense for TP to be an interface speed; otherwise, the delayed packet would not be stuck. We therefore end up with the following packet rates R:¶

Table 1: Packet rates
S [byte]	TP [Gbit/s]	R [Mp/s]
64	10	14.881
1518	10	0.813
64	100	148.810
1518	100	8.127

For T = 100 ms, this would mean that the windows must, in the worst case, accommodate between 80,000 and 14.8 million packets. It might be argued that the upper bound is currently unrealistic, as it would require a 100 Gbit/s link to be saturated with small, prioritized packets. On the other hand, 100 ms is the acceptable delay for VoIP, whereas for applications with low priority demands, it might make sense to deliver even older packets.¶
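The values in Table 1 and the resulting window sizes can be reproduced with the sketch below. It assumes that each frame occupies an additional 20 bytes on the wire (8 bytes preamble and start-of-frame delimiter plus 12 bytes inter-frame gap); this overhead is not stated in the text but matches the table. The function names are illustrative.

```python
WIRE_OVERHEAD_BYTES = 20  # assumed: 8 bytes preamble/SFD + 12 bytes inter-frame gap


def packet_rate_mpps(size_bytes: int, throughput_gbps: float) -> float:
    """Packet rate R in Mp/s for Ethernet size S and throughput TP."""
    bits_on_wire = (size_bytes + WIRE_OVERHEAD_BYTES) * 8
    return throughput_gbps * 1e9 / bits_on_wire / 1e6


def window_packets(size_bytes: int, throughput_gbps: float, delay_s: float) -> int:
    """Packets that can overtake a packet delayed by `delay_s` seconds."""
    return int(packet_rate_mpps(size_bytes, throughput_gbps) * 1e6 * delay_s)


for size, tp in [(64, 10), (1518, 10), (64, 100), (1518, 100)]:
    rate = packet_rate_mpps(size, tp)
    window = window_packets(size, tp, delay_s=0.1)
    print(f"S={size:5d} byte  TP={tp:3d} Gbit/s  R={rate:8.3f} Mp/s  window(T=100 ms)={window}")
```

The smallest and largest window sizes printed for T = 100 ms correspond to the roughly 80,000 and 14.8 million packets given in the text.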

4.5. Using Sub-Child SAs

The final possibility is standardizing a new approach that tries to combine the advantages of the approaches discussed previously. In essence, it is the idea of allowing multiple sequence counters (and thus multiple anti-replay windows) per child SA. These sequence counters must be allowed to increment independently of each other, making the approach applicable to all outlined scenarios. It is also possible to think of the individual counter/window pairs as sub-SAs within a child SA.¶

First of all, receivers must be able to distinguish those sub-SAs. There are multiple possibilities to achieve this:¶

Using the SPI: The SPI would be allocated per sub-SA, i.e., a range of SPIs would belong to a single child SA. Therefore, it is possible to embed, e.g., the ID of the sending core in some bits of the SPI.¶
Using the sequence number: Some bits of the sequence number would be used to indicate the sub-SA, as proposed in [I-D.ponchon-ipsecme-anti-replay-subspaces]. This approach reduces the available sequence numbers. Note that the consequences depend on whether the traffic is distributed evenly among the individual sub-SAs (e.g., multicore scenario) or not (e.g., QoS scenario).¶
Using an additional field: Of course, it is also possible to introduce a new field to the ESP header. This can lead to a simpler design, but also constitutes the largest change to existing standards.¶
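The SPI-based variant can be sketched as follows. The example assumes that a child SA owns an aligned, power-of-two range of SPIs and that the low-order bits carry the sub-SA ID (e.g., the sending core); the bit width, the example SPI value, and the function names are hypothetical and not taken from any draft.

```python
from typing import Tuple

SUB_SA_BITS = 5  # assumed: 2**5 = 32 sub-SAs, e.g., one per sending core


def sub_sa_spi(base_spi: int, sub_sa_id: int, bits: int = SUB_SA_BITS) -> int:
    """Return the SPI that identifies one sub-SA within the child SA's range."""
    mask = (1 << bits) - 1
    if base_spi & mask:
        raise ValueError("base SPI must be aligned to the size of the SPI range")
    if not 0 <= sub_sa_id <= mask:
        raise ValueError("sub-SA ID does not fit into the reserved bits")
    return base_spi | sub_sa_id


def split_spi(spi: int, bits: int = SUB_SA_BITS) -> Tuple[int, int]:
    """Recover (base SPI of the child SA, sub-SA ID) from a received SPI."""
    mask = (1 << bits) - 1
    return spi & ~mask & 0xFFFFFFFF, spi & mask


spi = sub_sa_spi(0xC0FFEE00, sub_sa_id=7)
base, sub_id = split_spi(spi)
print(hex(spi), hex(base), sub_id)
```

A receiver could then use the base SPI to look up the shared child SA state and the sub-SA ID to select the sequence counter and anti-replay window.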

In any case, the approach necessitates some additional clarifications:¶

The receiver may use the steering capabilities of its NIC to map ingress packets to its sub-SAs, e.g., to different queues, to allow for efficient multicore utilization. This is especially important for the multicore scenario, as software redirects to other cores must be avoided for performance reasons. The simplest case is the sub-SA being encoded in the SPI, as many NICs already provide features for matching on SPIs. For the other two distinguishing mechanisms, flexible or raw matchers may be used.¶
The setup and renewal of sub-SAs should happen in bulk, i.e., there is only one exchange to set up the child SA. This leads to reliable performance characteristics, as there is no on-demand sub-SA creation. Furthermore, the failure model is very simple: The child SA with all its sub-SAs exists, or it does not.¶
Only the sequence counters and anti-replay windows would be allocated per sub-SA.¶
All other properties of the SA are per child SA, e.g., traffic selectors and mode, but also the key material. Using the same key for all sub-SAs needs to be done with care to avoid effects on security (details follow below). However, if there were different keys, neither the scalability (bulk setup and rekeying) nor the predictable failure model would be possible.¶
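The split between shared and per-sub-SA state described above might be modeled as in the following sketch. All class, field, and method names are assumptions for illustration; real data planes would use fixed-size, cache-friendly structures rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SubSA:
    """State that exists once per sub-SA: only counter and window."""
    sub_sa_id: int
    next_seq: int = 1       # sender side: next sequence number to use
    replay_top: int = 0     # receiver side: highest sequence number seen
    replay_bitmap: int = 0  # receiver side: anti-replay bitmap


@dataclass
class ChildSA:
    """State shared by all sub-SAs of one child SA."""
    base_spi: int
    traffic_selectors: List[str]
    mode: str
    key: bytes
    sub_sas: Dict[int, SubSA] = field(default_factory=dict)

    @classmethod
    def establish(cls, base_spi: int, traffic_selectors: List[str],
                  mode: str, key: bytes, count: int) -> "ChildSA":
        """Bulk setup: all sub-SAs are created together or not at all."""
        sa = cls(base_spi, traffic_selectors, mode, key)
        sa.sub_sas = {i: SubSA(i) for i in range(count)}
        return sa

    def next_sequence(self, sub_sa_id: int) -> int:
        """Each sub-SA increments its own counter without synchronization."""
        sub = self.sub_sas[sub_sa_id]
        seq = sub.next_seq
        sub.next_seq += 1
        return seq


sa = ChildSA.establish(0x1000, ["10.0.0.0/24 <-> 10.0.1.0/24"], "tunnel",
                       key=bytes(32), count=4)
print(sa.next_sequence(0), sa.next_sequence(0), sa.next_sequence(3))
```

Because the sub-SAs only exist as part of the child SA object, there is no state in which some of them are missing, which mirrors the all-or-nothing failure model.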

Using a single key for multiple sub-SAs has implications on security:¶

It must be ensured that this approach cannot lead to reused IVs for counter modes. For example, in the case of AES-GCM [RFC4106], this means either the salt must be different for each sub-SA, or the IV space must be partitioned accordingly. Note that partitioning the IV space is not possible with implicit IV modes ([RFC8750]), as [RFC4303] requires sequence numbers to be initialized to zero.¶
Hard limits for packet and byte counters must be scaled accordingly. For example, if no more than 2^64 packets should be transmitted using a given key, and the child SA consists of 2^8 sub-SAs, then every sub-SA must not be allowed to send more than 2^56 packets, in case no fine-grained synchronization is possible. In case transmission happens on the same CPU core, overcommitting may be possible as long as the total number of packets or bytes is ensured to be never exceeded.¶
Rekey limits must apply to all sub-SAs combined. For example, if a child SA is configured to be rekeyed after transmission of X bytes or Y packets, then the rekey must be triggered if the sum of bytes or packets on all sub-SAs reaches X or Y. For situations where overcommitting is not possible, we suggest using the sub-SA with the maximum number of bytes/packets already sent, say X'_max and Y'_max, as a reference. X'_max and Y'_max are multiplied by the number of sub-SAs, and if that value exceeds X or Y, a rekeying is initiated.¶
In case SPIs or an explicit header field are used to encode sub-SAs, it may (theoretically) be possible to send more than 2^64 packets using a single key. This may pose a problem for ciphers such as AES-GCM. In this case, a hard limit of at most 2^64 packets MUST be enforced.¶
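The limit scaling, the two rekey triggers, and one possible way of partitioning the IV space can be sketched as follows. The function names are illustrative. The IV layout (sub-SA ID in the high-order bits of the 8-byte explicit IV, a per-sub-SA counter in the remaining bits) is an assumption made for this example and is not mandated by [RFC4106].

```python
from typing import Sequence


def per_sub_sa_hard_limit(key_limit: int, sub_sas: int) -> int:
    """Static split of a per-key packet limit when no synchronization exists."""
    return key_limit // sub_sas


def rekey_due_exact(counts: Sequence[int], limit: int) -> bool:
    """Rekey once the sum over all sub-SAs reaches the limit (needs a shared view)."""
    return sum(counts) >= limit


def rekey_due_conservative(counts: Sequence[int], limit: int) -> bool:
    """Rekey based on the busiest sub-SA times the number of sub-SAs.

    Uses >= so that this check never triggers later than the exact one.
    """
    return max(counts) * len(counts) >= limit


def partitioned_iv(sub_sa_id: int, counter: int, sub_sa_bits: int) -> bytes:
    """Assumed layout: 8-byte IV = sub-SA ID (high bits) || counter (low bits)."""
    counter_bits = 64 - sub_sa_bits
    if not 0 <= sub_sa_id < (1 << sub_sa_bits):
        raise ValueError("sub-SA ID out of range")
    if not 0 <= counter < (1 << counter_bits):
        raise ValueError("counter exhausted for this sub-SA")
    return ((sub_sa_id << counter_bits) | counter).to_bytes(8, "big")


print(per_sub_sa_hard_limit(2**64, 2**8) == 2**56)
print(rekey_due_exact([10, 20, 30], limit=90),
      rekey_due_conservative([10, 20, 30], limit=90))
print(partitioned_iv(1, 0, sub_sa_bits=8).hex())
```

The conservative trigger may rekey earlier than strictly necessary when traffic is distributed unevenly, which is the price for not synchronizing counters between sub-SAs.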

Advantages:¶

Independent sequence numbers and anti-replay windows are available.¶
The approach allows for RSS or explicit steering, especially if the SPI-encoding is used.¶
Most scalable approach: The child SA setup requires exchanging, e.g., an SPI range but does not depend on the number of sub-SAs allocated. Similarly, there is only an ID, sequence counters, and an anti-replay window to store per sub-SA. The remainder of state can be shared.¶
There is no additional rekeying overhead, as only a single child SA needs to be rekeyed.¶
Predictable performance characteristics due to the batched, proactive establishment.¶
Clean failure model due to the all-or-nothing setup.¶

Disadvantages:¶

There are potential security implications, which must be discussed thoroughly, to avoid weakening security at any point.¶
The change in the data plane may seem a bit more complex than for per-CPU child SAs. Nevertheless, fallback SAs as mentioned in [I-D.pwouters-ipsecme-multi-sa-performance] are avoided.¶

Compared to setting up separate IKE or child SAs, it might be argued that the idea of sub-SAs keeps the complexity and overhead away from the VPN's operation. Furthermore, storing an SPI, a 64-bit sequence number, and a replay window for 64 packets for 64 different QoS classes requires a total of 10,240 bits. This is significantly less than even the lower bound established for the approach described in Section 4.4. However, of the discussed alternatives, it is the most complex change to existing standards and implementation semantics.¶
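The state estimate can be checked with a short calculation. The sketch assumes a 32-bit SPI and a bitmap window that uses one bit per packet; the comparison value for Section 4.4 likewise assumes one bit per packet for the smallest window of roughly 80,000 packets. The names are illustrative.

```python
SPI_BITS = 32  # ESP SPI size
SEQ_BITS = 64  # 64-bit sequence number, as in the text


def sub_sa_state_bits(sub_sas: int, window_packets: int = 64) -> int:
    """Per-sub-SA state: SPI + sequence counter + one window bit per packet."""
    return sub_sas * (SPI_BITS + SEQ_BITS + window_packets)


# Assumed: one bit per packet for the smallest window from Section 4.4.
SINGLE_WINDOW_LOWER_BOUND_BITS = 80_000

total = sub_sa_state_bits(64)
print(total, total < SINGLE_WINDOW_LOWER_BOUND_BITS)
```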
