Internet-Draft | LMTA and PTB Notification | March 2023 |
Liu, et al. | Expires 11 September 2023 | [Page] |
This document defines the IKEv2 Link Maximum Atomic Packet Notification and Packet Too Big Extension. This extension enables an egress security gateway to notify its ingress counterpart that fragmentation is happening or that a too big packet has been received (and cannot be decrypted). In both cases, the egress node provides MTU information that enables the ingress node to appropriately configure its Tunnel Maximum Transmission Unit, or Tunnel MTU (TMTU), to prevent fragmentation or the transmission of too big packets.¶
This extension is not intended to replace ICMP. It provides information that ICMP does not provide, and even when that information could be provided by ICMP, this extension provides a reliable, authenticated channel that ensures the ingress node receives this information even when it cannot receive ICMP messages.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 11 September 2023.¶
Copyright (c) 2023 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Reassembling fragments at the egress security gateway requires additional resources, which under heavy load results in service degradation. Moreover, as detailed in [RFC4963], [RFC6864] or [RFC8900], fragmentation is considered fragile and not sufficiently robust at high data rates. Typically, the 16-bit IPv4 Identification field is not large enough to prevent duplicate identifiers at such rates. In IPv6, the 32-bit Identification field makes collisions less frequent.¶
Figure 1 depicts various fragmentation scenarios that can occur when Tunnel Transit Packets (TTP) are encapsulated over an IPsec tunnel and carried between an ingress and an egress security gateway as Tunnel Link Packets (TLP).¶
Reassembly is performed by the egress node in two cases. The first is when mid-tunnel fragmentation happens (see 2 in Figure 1), in which case the TLP header, or outer header, uses IPv4 with its Don't Fragment bit set to 0 (DF=0). The second is when outer fragmentation is performed by the ingress node (see 3 in Figure 1). The main difference between the two scenarios is that with outer fragmentation, the ingress node is aware that the egress node performs reassembly. Note also that in both cases, reassembling the TLP does not in itself prevent the TTP from being deciphered, unless the reassembled TLP exceeds the effective MTU to receive (EMTU_R), that is, the maximum size of the IPsec protected packet that the egress node can accept in order to perform the ESP decapsulation.¶
Figure 2 summarizes the operations that are expected depending on the size of an IPsec protected TTP. The optimal size is the Tunnel maximum atomic packet (TMAP), that is, the maximum TTP size that avoids fragmentation. Such a TTP generates a Link maximum atomic packet (LMAP) LTP. Note that in this case, and unless explicitly specified otherwise, the link considered is the physical link between the ingress and egress nodes.¶
This document enables an egress node to inform the ingress node that:
* a received packet is fragmented
* a too big packet is received¶
As depicted in Figure 3, by supporting this extension, the ingress and egress nodes commit to optimizing the processing of the IPsec tunnel and to preventing, or at least limiting, reassembly operations. More specifically, the ingress security gateway limits the use of outer fragmentation as much as possible and commits to setting its TMTU value so that TTPs are not fragmented and reassembled. In addition, for TTPs with IPv4 addresses and DF=0, the ingress node commits to performing inner fragmentation to prevent reassembly at the egress node.¶
The mechanism is especially useful when the tunnel between the ingress and egress nodes uses IPv4 outer IP addresses with DF=0, as fragmentation may occur without the ingress node being aware of it. With IPv4 DF=1 or IPv6, the mechanism essentially enables the egress node to send the TTP ICMP PTB information (being sent to the Source) to the ingress interface, as well as the LTP ICMP PTB information (being sent to the router of the ingress node), over an authenticated channel (IKEv2).¶
This extension does not impact or interfere with ICMP processing; ICMP messages are sent, received and processed as usual. This extension may result in the ingress node receiving MTU indications via two different channels (IKEv2 and ICMP). The use of IKEv2 may provide more trust than ICMP (see the Security Considerations section). The Source is not involved.¶
This extension is not intended to replace ICMP. It provides information that ICMP does not provide (see Section 1.3 for more details), and when that information could be provided by ICMP, this extension provides a reliable, authenticated channel that ensures the ingress node receives this information even when it cannot receive ICMP messages.¶
This section provides an illustrative example to provide a high level overview.¶
One can reasonably question why setting the IPv4 DF=1 is not sufficient to avoid fragmentation. The reason is that setting DF=1 might lead to black holing, because it relies on the ICMP PTB message making it back to the ingress node, being validated, and then being used to adjust the TMAP. Setting DF=0 is the way to mitigate this. Suppose the Don't Fragment bit is set to 1 in the IPv4 header of the Tunnel Link Packet. If that packet is larger than the link Maximum Transmission Unit (LMTU), the packet is dropped by an on-path router and an ICMPv4 Packet Too Big (PTB) message [RFC0792] is returned to the sending address. The ICMPv4 PTB message is a Destination Unreachable message with Code equal to 4 and was augmented by [RFC1191] to indicate the acceptable MTU. Unfortunately, one cannot rely on this procedure because in practice some routers do not check the MTU and therefore do not send ICMPv4 messages. In addition, when ICMPv4 messages are sent, they are unprotected and may be blocked by firewalls or ignored. This results in IPv4 packets being dropped without the security gateways being aware of it, which is also known as black holing. To prevent this situation, IPv4 packets are often sent with their DF bit set to 0. In this case, as described in [RFC0792], when the size of a packet exceeds the MTU, the node splits the incoming packet into multiple fragments.¶
Besides the reasons above why DF=1 is not appropriate for ESP, there is another important one: ICMP PTB is almost entirely ineffective for ESP.¶
This is because ICMPv4 PTB has the following format, defined in RFC 1191:¶
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Type = 3    |   Code = 4    |           Checksum            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           unused = 0          |         Next-Hop MTU          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Internet Header + 64 bits of Original Datagram Data      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
This means that an ICMP PTB message contains only the outer (tunnel) IP header of the ESP packet and the 8 bytes that follow it. Normally, when the ingress node processes a returned ICMP PTB packet, it needs to identify the traffic selector based on the ICMP packet and then either forward the ICMP PTB packet to the device that sent the original packet or process the packet itself. But this does not work for ESP.¶
For scenarios with UDP or TCP encapsulation, such as NAT traversal, the 8 bytes only cover the start of the UDP or TCP header (such as the port numbers) and do not even contain the SPI of the Child SA. Therefore, the ingress node cannot identify the traffic selector and proceed to the next step. For scenarios without L4 encapsulation, these 8 bytes are the SPI and Sequence Number, so the ingress node can tell which Child SA is concerned from the SPI. This information is still not enough, because the traffic selector of a Child SA can be a range. For example, if the traffic selector of the Child SA is 4.4.4.0/24==7.7.7.0/24, then 4.4.4.4==7.7.7.7 and 4.4.4.28==7.7.7.28 both match, so the ingress node has no way of knowing which stream sent the too big packet.¶
Since the datagram quoted in the ICMP PTB message is incomplete, the ingress node cannot decrypt it and inspect its content to find the exact stream to handle. Therefore, setting DF=1 and relying on the resulting ICMP PTB messages is of no practical use in the IPsec ESP scenario.¶
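The limitation described above can be illustrated with a minimal Python sketch. It is not part of the extension; the function names are hypothetical, and the traffic selector is assumed to be the /24 pair from the example, written as 4.4.4.0/24 and 7.7.7.0/24. The first function interprets the "Internet Header + 64 bits" quoted in an ICMPv4 PTB message; the second shows that several flows match the same Child SA selector.

```python
import ipaddress
import struct

ESP_PROTO = 50  # IP protocol number for ESP
UDP_PROTO = 17  # IP protocol number for UDP


def describe_quoted_datagram(quoted: bytes) -> dict:
    """Interpret the 'Internet Header + 64 bits' quoted in an ICMPv4 PTB."""
    ihl = (quoted[0] & 0x0F) * 4   # outer IPv4 header length in bytes
    proto = quoted[9]              # outer IPv4 Protocol field
    first8 = quoted[ihl:ihl + 8]   # the only payload bytes that are quoted
    if proto == ESP_PROTO:
        # Without L4 encapsulation: SPI and Sequence Number are visible.
        spi, seq = struct.unpack("!II", first8)
        return {"kind": "esp", "spi": spi, "seq": seq}
    if proto == UDP_PROTO:
        # With UDP encapsulation: only the UDP header, no SPI at all.
        sport, dport, _length, _checksum = struct.unpack("!HHHH", first8)
        return {"kind": "udp", "sport": sport, "dport": dport}
    return {"kind": "other", "proto": proto}


def flows_matching(ts_src: str, ts_dst: str, flows):
    """Return the (src, dst) flows covered by one Child SA traffic selector."""
    src_net = ipaddress.ip_network(ts_src)
    dst_net = ipaddress.ip_network(ts_dst)
    return [
        (src, dst)
        for src, dst in flows
        if ipaddress.ip_address(src) in src_net
        and ipaddress.ip_address(dst) in dst_net
    ]
```

Even in the best case, where the SPI is recovered, `flows_matching` returns more than one candidate flow for a range selector, so the sender of the too big packet remains ambiguous.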
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Tunnel, Ingress node, Egress node, Ingress Interface, Egress interface, Tunnel Transit Packets (TTP), Tunnel Link Packet (TLP), Link MTU (LMTU), Tunnel MTU (TMTU), Tunnel maximum atomic packet (TMAP), effective MTU to receive (EMTU_R) are defined in [I-D.ietf-intarea-tunnels].¶
During an IKEv2 negotiation, the initiator and the responder indicate their support for the Link Maximum Atomic Packet and Packet Too Big Extension by exchanging LMAP_AND_PTB_SUPPORTED notifications. This notification MUST be sent in the IKE_AUTH exchange (in case of multiple IKE_AUTH exchanges, in the first IKE_AUTH message from the initiator and in the last IKE_AUTH message from the responder). If both the initiator and the responder send this notification during the IKE_AUTH exchange, peers may notify each other with an IPv4 Link Maximum Atomic Packet Notification when fragmentation is observed. Upon receiving such notifications, the peers may take the necessary actions to prevent such fragmentation from occurring.¶
Initiator                         Responder
-------------------------------------------------------------------
HDR, SA, KEi, Ni -->
                             <-- HDR, SA, KEr, Nr
HDR, SK {IDi, AUTH, SA, TSi, TSr,
    N(LMAP_AND_PTB_SUPPORTED)} -->
                             <-- HDR, SK {IDr, AUTH, SA, TSi, TSr,
                                     N(LMAP_AND_PTB_SUPPORTED)}
The egress security gateway detects that fragmentation occurred when it receives an initial fragment, i.e., a packet with the More Fragments flag set to 1 and the Fragment Offset set to 0. Upon receiving such a packet, the egress node determines the IP version (IPVersion) and the fragment length (FragLen). For an IPv4 packet, FragLen is the Total Length field (see [RFC0791]). For an IPv6 packet, FragLen is the Payload Length (see [RFC8200], Section 3). Note that these values have different meanings: for an IPv6 fragment, FragLen does not include the IPv6 header but only the payload.¶
The egress node sends the LMAP notification payload that contains IPVersion and FragLen.¶
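The detection step can be sketched as follows. This is an illustrative Python sketch with a hypothetical function name, operating on a raw IP packet. As a simplifying assumption, an IPv6 Fragment header is only recognised when it immediately follows the fixed IPv6 header; a real implementation would walk the extension header chain.

```python
import struct
from typing import Optional, Tuple

IPV6_FRAGMENT_HEADER = 44  # Next Header value of the IPv6 Fragment header


def initial_fragment_info(packet: bytes) -> Optional[Tuple[int, int]]:
    """Return (IPVersion, FragLen) for an initial fragment, else None."""
    version = packet[0] >> 4
    if version == 4:
        total_length, _ident, flags_frag = struct.unpack("!HHH", packet[2:8])
        more_fragments = bool(flags_frag & 0x2000)  # MF flag
        offset = flags_frag & 0x1FFF                # Fragment Offset
        if more_fragments and offset == 0:
            return 4, total_length                  # FragLen = Total Length
    elif version == 6:
        payload_length, next_header = struct.unpack("!HB", packet[4:7])
        # Assumption: the Fragment header directly follows the IPv6 header.
        if next_header == IPV6_FRAGMENT_HEADER and len(packet) >= 48:
            offset_flags = struct.unpack("!H", packet[42:44])[0]
            more_fragments = bool(offset_flags & 0x0001)  # M flag
            offset = offset_flags >> 3                    # Fragment Offset
            if more_fragments and offset == 0:
                return 6, payload_length            # FragLen = Payload Length
    return None
```

The returned pair corresponds to the IPVersion and FragLen values carried in the LMAP notification.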
The egress node SHOULD send a maximum of one LMAP notification per (reassembled) received packet.
However, since this extension is especially expected to be used by nodes dealing with high traffic rates, the notification is expected to be sent at a reasonable rate per Security Association.
More specifically, IKEv2 provides a reliable channel, which makes sending redundant notifications unnecessary.
In addition, the notification rate needs to account for the time the ingress node takes to adjust the TMTU and for the new TMTU to take effect.
More details are provided in Section 10.¶
Egress Security Gateway                  Ingress Security Gateway
-------------------------------------------------------------------
HDR SK { N(LMAP)} -->
Upon receiving an LMAP notification, the ingress node derives the Tunnel MAP (TMAP) from the Link MAP (LMAP), which is itself derived from FragLen and IPVersion.¶
IPVersion:
The IP version of the fragment provided by the LMAP notification (see Section 4).¶
FragLen:
The fragment length provided by the LMAP notification (see Section 4).¶
LMAP:
For an IPv4 packet, LMAP is directly given by the fragment length of the LMAP notification. For an IPv6 packet, LMAP is the fragment length of the LMAP notification plus the IPv6 header length (40 bytes).¶
outer IP header:
The IP header of the LTP.¶
encapsulation overhead:
Contains the ESP header and the ESP trailer, including the variable Pad field.
When the padding minimizes the Pad Length, the encapsulation overhead is 14 bytes (plus the size of the ICV).
The overhead SHOULD also include an estimate for IP options or IP extension headers.¶
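The derivation can be sketched in Python as follows. This is a minimal sketch under a stated assumption: the TMAP is taken to be the LMAP minus the outer IP header and minus the encapsulation overhead, which is what the list of terms above suggests but which is not spelled out here. The function names, the `icv_len` parameter and the `extra_overhead` parameter are hypothetical.

```python
IPV6_HEADER_LEN = 40   # fixed IPv6 header length, as stated in the text
ESP_MIN_OVERHEAD = 14  # ESP overhead with minimal padding, ICV excluded


def lmap_from_notification(ip_version: int, frag_len: int) -> int:
    """Derive the LMAP from the IPVersion and FragLen of an LMAP notification."""
    if ip_version == 4:
        return frag_len                    # Total Length already covers the header
    if ip_version == 6:
        return frag_len + IPV6_HEADER_LEN  # Payload Length excludes the header
    raise ValueError("unsupported IP version: %r" % ip_version)


def tmap_from_lmap(lmap: int, outer_ip_header_len: int, icv_len: int,
                   extra_overhead: int = 0) -> int:
    """ASSUMED formula: TMAP = LMAP - outer IP header - encapsulation overhead.

    extra_overhead stands for the estimate of IP options or extension headers.
    """
    encapsulation_overhead = ESP_MIN_OVERHEAD + icv_len + extra_overhead
    return lmap - outer_ip_header_len - encapsulation_overhead
```

For example, an IPv4 notification with FragLen 1400, a 20-byte outer header without options and an assumed 16-byte ICV gives `tmap_from_lmap(1400, 20, 16)`, that is 1350 bytes.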
The ingress security gateway SHOULD propagate the TMAP as the tunnel MTU back to the Source so that the size of future TTPs does not exceed the TMAP, with the Source eventually performing source fragmentation.
To do so, the ingress node sets the LMTU to the TMAP for all traffic designated by the SA.
In this case, the LMTU is the MTU associated with the link of the ingress node's router interface that faces the Source's network.
When a TLP larger than the TMAP is received, the packet is discarded and an ICMP PTB message is returned to the Source, which then performs Source Fragmentation (5) (see Section 8.2.1 of [RFC4301]).
It is worth mentioning that only future packets are affected, not those that caused the fragmentation.¶
When the TLP is an IPv4 packet with DF=0, the ingress node SHOULD perform Source Fragmentation of the TTP, also represented as Inner Fragmentation (3), sending chunks that do not exceed the TMAP.¶
Figure 11 in Section 4.2.2 of [I-D.ietf-intarea-tunnels] with tunnel MTU set to TMAP achieves both recommendations, while Figure 12 in Section 4.2.2 describes the inner fragmentation.¶
A packet can be rejected because the size of the LTP exceeds the LMTU (of the router component) or when the (reassembled) LTP exceeds the EMTU_R (of the interface component) and so IPsec decapsulation cannot be done.¶
When the LTP size exceeds the EMTU_R, the egress node SHOULD send a Packet Too Big (PTB) notification that includes the EMTU_R and the LMTU. If the packet results from a reassembly operation, the egress node MUST send an LMAP notification with the LMAP. If the packet does not result from a reassembly operation, the egress node MUST NOT send an LMAP notification.¶
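The rule above amounts to a small decision table, sketched here in Python. The function name and the string labels are hypothetical and only serve the illustration.

```python
from typing import List


def notifications_for_oversized_ltp(ltp_size: int, emtu_r: int,
                                    reassembled: bool) -> List[str]:
    """List the notifications sent when an LTP is checked against EMTU_R."""
    notifications = []
    if ltp_size > emtu_r:
        notifications.append("PTB")        # carries EMTU_R and LMTU (SHOULD)
        if reassembled:
            notifications.append("LMAP")   # MUST accompany the PTB
        # Not reassembled: an LMAP notification MUST NOT be sent.
    # An LTP within EMTU_R is not rejected here; reporting fragmentation of
    # accepted packets follows the LMAP rules described earlier.
    return notifications
```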
Egress Security Gateway                  Ingress Security Gateway
-------------------------------------------------------------------
HDR SK { N(PTB)} -->
Upon receiving a PTB notification, the ingress node computes the Tunnel MTU (TMTU) as follows:¶
EMTU_R:
The value provided in the PTB notification related to the MTU associated with the egress interface (see Section 6).¶
LMTU:
The value provided in the PTB notification related to the LMTU associated with the egress router (see Section 6).¶
The ingress node SHOULD proceed with TMAP as described in Section 5.¶
The ingress node MUST ensure that the size of the TTP does not exceed the computed TMTU and MUST ensure that the size of the LTP does not exceed the LMTU provided in the PTB notification.¶
Figure 6 illustrates the Notify Payload packet format as described in Section 3.10 of [RFC7296], with a 4-byte path allowed MTU value as notification data. This format is used for the LMAP_AND_PTB_SUPPORTED, LMAP and PTB notifications.¶
The fields Next Payload, Critical Bit, RESERVED and Payload Length are defined in [RFC7296]. Specific fields defined in this document are:¶
Protocol ID (1 octet):
set to zero.¶
SPI Size (1 octet):
set to zero.¶
Notify Message Type (2 octets):
Specifies the type of notification message. It is set to TBD1 for the LMAP_AND_PTB_SUPPORTED notification, TBD2 for the LMAP notification and TBD3 for the PTB notification.¶
Notification Data:
Specifies the data associated with the notification message. It is empty for the LMAP_AND_PTB_SUPPORTED notification, or consists of 4 octets that contain the MTU value for the LMAP and PTB notifications, as represented in Figure 7 and Figure 8.¶
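with:¶
IPVersion:
The IP version of the received packet.¶
Reserved:
Reserved bytes MUST be set to zero by the egress node and MUST be ignored by the ingress node.¶
FragLen (2 bytes):
The length of the received initial fragment, as determined by the egress node (see Section 4).¶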
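As an illustration, the LMAP Notify payload could be encoded and decoded as in the Python sketch below. Several points are assumptions and not taken from this document: the 4-octet Notification Data is assumed to be laid out as IPVersion (1 octet), Reserved (1 octet) and FragLen (2 octets); the Notify Message Type value is an arbitrary placeholder in the status-type range (16384 and above), since TBD2 is not yet assigned; and the function names are hypothetical. The 8-octet prefix follows the Notify payload format of Section 3.10 of [RFC7296].

```python
import struct

# Hypothetical placeholder: the real value is TBD2 until assigned by IANA.
LMAP_NOTIFY_TYPE = 40960


def build_lmap_notify(ip_version: int, frag_len: int,
                      next_payload: int = 0) -> bytes:
    """Encode an LMAP Notify payload (ASSUMED Notification Data layout)."""
    # Assumed layout: IPVersion (1 octet), Reserved (1 octet), FragLen (2 octets)
    data = struct.pack("!BBH", ip_version, 0, frag_len)
    payload_length = 8 + len(data)  # generic header (4) + Notify fields (4) + data
    header = struct.pack("!BBHBBH",
                         next_payload,      # Next Payload
                         0,                 # Critical bit and RESERVED
                         payload_length,    # Payload Length
                         0,                 # Protocol ID, set to zero
                         0,                 # SPI Size, set to zero
                         LMAP_NOTIFY_TYPE)  # Notify Message Type
    return header + data


def parse_lmap_notify(payload: bytes):
    """Decode a payload built above into (message type, IPVersion, FragLen)."""
    _np, _flags, length, _proto, spi_size, msg_type = struct.unpack(
        "!BBHBBH", payload[:8])
    ip_version, _reserved, frag_len = struct.unpack(
        "!BBH", payload[8 + spi_size:length])
    return msg_type, ip_version, frag_len
```

The Reserved octet is written as zero by the sender and discarded by the parser, matching the rule that it is ignored on receipt.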
IANA is requested to allocate three values in the "IKEv2 Notify Message Types - Status Types" registry (available at https://www.iana.org/assignments/ikev2-parameters/ikev2-parameters.xhtml#ikev2-parameters-16) with the following definition:¶
+=======+================================+
| Value | NOTIFY MESSAGES - STATUS TYPES |
+=======+================================+
| TBD1  | LMAP_AND_PTB_SUPPORTED         |
+-------+--------------------------------+
| TBD2  | LMAP                           |
+-------+--------------------------------+
| TBD3  | PTB                            |
+-------+--------------------------------+
This document defines an IKEv2 extension that enables an egress node to notify an ingress node that fragmentation is happening, along with the observed fragment length. In addition, the extension enables an egress node to notify an ingress node that a too big packet has been discarded, together with complementary information needed to update the MTU appropriately.¶
These pieces of information are transferred over the authenticated IKEv2 channel, which ensures the origin of the message.
Assuming the egress node is trusted, the ingress node can trust that what is reported was actually observed by the egress node (fragmentation is happening, the observed fragment length, a too big packet has been received) and that some information, such as the egress LMTU and EMTU_R, is accurate.
When fragmentation happens and an LMAP notification is sent, the egress node MUST send the notification after the reassembled packet has been decapsulated.
This ensures that fragmentation has been performed on an authenticated TLP and that the TLP has not been forged by an attacker.
With IPv6, only outer fragmentation is permitted, so the ingress node can validate the provided information.
However, sending the notification after the IPsec decapsulation enables the egress node to detect potential injection attacks and to avoid sending an unnecessary notification that may be part of a DDoS attack targeting the ingress node itself.
With IPv4, an attacker could set DF=0, which would allow any mid-tunnel fragmentation.
IPsec (ESP or AH) does not cover the DF flag, so the egress node cannot be sure that the observed fragment length has not been forged, and the security considerations related to MTU discovery [RFC0791], [RFC8900], [RFC4963], [RFC6864], [RFC1191] apply here.
Note that the information carried by the LMAP notification is never carried by ICMP; all that LMAP shares with ICMP is that this information is used to update the MTU.¶
The egress node may not be able to decrypt the encrypted TTP if the full encrypted TTP cannot be rebuilt.
One possibility is that too many fragments are sent over too long a period of time (Slowloris-like attacks) (see [RFC8900], Section 3.7).
Another possibility is that one fragment exceeds the LMTU or that the reassembled (unverified) encrypted TTP exceeds the EMTU_R.
In both cases, a PTB notification SHOULD be sent, and if fragmentation is observed, an LMAP notification MUST be sent together with the PTB notification.
Information carried by the PTB notification (LMTU and EMTU_R) can be trusted.
Without this extension, this information would have been carried by ICMP.
In many deployments, the ICMP channel may be unprotected and ICMP packets may be discarded by firewalls and never reach the ingress node.
In addition, the description provided by [I-D.ietf-intarea-tunnels] tends to indicate that ICMP messages remain between the router components of the ingress and egress nodes and, as such, are not provided to the interface components.
Finally, as detailed in Section 1.3, an ICMP PTB message contains only a portion of the encrypted ESP packet, which may not be sufficient to deduce the SPI and associated traffic selectors, and as such prevents the ingress node from identifying the traffic flow that generates the fragmentation.
In any case, this results in the information not being available to take the appropriate action.
Sending the PTB notification over IKEv2 solves these issues and eases the correlation with the LMAP notification.
In terms of trust, when sufficient information can be sent both on the IKEv2 channel and via a protected ICMP PTB message, the PTB notification achieves a level of trust similar to that of an ICMP PTB message sent over an IPsec protected channel.
For that reason, the ICMP messages SHOULD be protected by IPsec.
The use of two different paths may provide some additional reliability, as the same information takes two different paths and the IKEv2 window ensures that the information is received, as opposed to the (encrypted) ICMP message, which can be dropped.
However, information carried by the LMAP notification cannot be trusted, and similar security considerations related to MTU discovery [RFC0791], [RFC8900], [RFC4963], [RFC6864], [RFC1191] apply here.¶
Fragmentation happens on a per-LTP basis and packet sizes exceeding the EMTU_R happen on a per-TTP basis. At high packet rates, sending a notification for each of these packets could be used by an attacker to trigger a DDoS attack against both the egress and ingress nodes. As a result, the egress node SHOULD be able to configure the maximum rate at which notifications are sent. This includes the ability to indicate that LMAP notifications (without PTB) are not sent when the outer IP addresses are IPv6 addresses. The reasoning is that with IPv6, the egress node observes outer fragmentation, in which case the ingress node is already aware of it. In addition, an egress node SHOULD be able to configure a threshold for the number of alerts per SA before a notification is sent, as well as a rate limit per SA.¶
The authors would like to thank Magnus Westerlund, Paul Wouters, and Joe Touch for their reviews and valuable comments and suggestions.¶
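One possible way to combine the per-SA threshold and the per-SA rate limit is sketched below in Python. The class name, its parameters and the policy of resetting the alert counter after each notification are assumptions made for the illustration; the text above only requires that such limits be configurable.

```python
import time
from collections import defaultdict


class NotificationLimiter:
    """Per-SA alert threshold combined with a minimum notification interval."""

    def __init__(self, threshold: int, min_interval: float,
                 clock=time.monotonic):
        self.threshold = threshold        # alerts required before notifying
        self.min_interval = min_interval  # seconds between notifications per SA
        self.clock = clock                # injectable clock, eases testing
        self.alerts = defaultdict(int)    # SA id -> alerts since last notification
        self.last_sent = {}               # SA id -> time of last notification

    def should_notify(self, sa_id) -> bool:
        """Record one alert for sa_id and tell whether to notify now."""
        self.alerts[sa_id] += 1
        if self.alerts[sa_id] < self.threshold:
            return False
        now = self.clock()
        last = self.last_sent.get(sa_id)
        if last is not None and now - last < self.min_interval:
            return False                  # rate limit not yet elapsed
        self.last_sent[sa_id] = now
        self.alerts[sa_id] = 0            # assumed policy: restart counting
        return True
```

An egress node would call `should_notify` once per observed fragmentation or too big packet event and send the LMAP or PTB notification only when it returns True.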