Internet-Draft | IP Parcels | February 2022 |
Templin | Expires 13 August 2022 | [Page] |
IP packets (both IPv4 and IPv6) are understood to contain a unit of data which becomes the retransmission unit in case of loss. Upper layer protocols such as the Transmission Control Protocol (TCP) prepare data units known as "segments", with traditional arrangements including a single segment per packet. This document presents a new construct known as the "IP Parcel" which permits a single packet to carry multiple segments, essentially creating a "packet-of-packets". IP parcels provide an essential building block for accommodating larger Maximum Transmission Units (MTUs) in the Internet as discussed in this document.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 13 August 2022.¶
Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
IP packets (both IPv4 [RFC0791] and IPv6 [RFC8200]) are understood to contain a unit of data which becomes the retransmission unit in case of loss. Upper layer protocols such as the Transmission Control Protocol (TCP) [RFC0793], QUIC [RFC9000], LTP [RFC5326] and others prepare data units known as "segments", with traditional arrangements including a single segment per packet. This document presents a new construct known as the "IP Parcel" which permits a single packet to carry multiple segments. This essentially creates a "packet-of-packets" with the IP layer headers appearing only once but with possibly multiple upper layer protocol segments.¶
Parcels are formed when an upper layer protocol entity (identified by the "5-tuple" source IP address/port number, destination IP address/port number and protocol number) prepares a buffer of data with the concatenation of up to 64 properly-formed segments that can be broken out into smaller parcels using a copy of the IP header. All segments except the final segment must be equal in size and no larger than 65535 octets (minus headers), while the final segment must be no larger than the others but may be smaller. The upper layer protocol entity then delivers the buffer and non-final segment size to the IP layer, which appends the necessary IP headers to identify this as a parcel and not an ordinary packet.¶
Each original parcel can traverse arbitrarily many parcel-capable IP links in the path until arriving at a parcel-capable ingress middlebox at the edge of a wide area Internetwork. The ingress middlebox may break the parcel out into smaller (sub-)parcels and encapsulate them in headers suitable for traversing the Internetwork. These smaller parcels may then be rejoined into one or more larger parcels at an egress middlebox which forwards them further over parcel-capable IP links toward the final destination. Repackaging of parcels is therefore commonplace, while reordering of segments within a parcel or even loss of individual segments is possible but not desirable. What matters is that the number of parcels delivered to the final destination should be kept to a minimum, and that loss or receipt of individual segments (and not parcel size) determines the retransmission unit.¶
The following sections discuss rationale for creating and shipping parcels as well as the actual protocol constructs and procedures involved. IP parcels provide an essential building block for accommodating larger Maximum Transmission Units (MTUs) in the Internet. It is further expected that the parcel concept may drive future innovation in applications, operating systems, network equipment and data links.¶
A "parcel" is defined as "a thing or collection of things wrapped in paper in order to be carried or sent by mail". Indeed, there are many examples of parcel delivery services worldwide that provide an essential transit backbone for efficient business and consumer transactions.¶
In this same spirit, an "IP parcel" is simply a collection of up to 64 upper layer protocol segments wrapped in an efficient package for transmission and delivery (i.e., a "packet of packets") while a "singleton IP parcel" is simply a parcel that contains a single segment. IP parcels are distinguished from ordinary packets through the special header constructions discussed in this document.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119][RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Studies have shown that by sending and receiving larger packets applications can realize greater performance due to reduced numbers of system calls and interrupts as well as larger atomic data copies between kernel and user space. Large packets in the network also result in reduced numbers of device interrupts and better network utilization in comparison with smaller packet sizes.¶
A first study [QUIC] involved performance enhancement of the QUIC protocol [RFC9000] using the Linux Generic Segmentation/Receive Offload (GSO/GRO) facility. GSO/GRO provides a robust (but non-standard) service very similar in nature to the IP parcel service described here, and its application has shown significant performance increases due to the increased transfer unit size between the operating system kernel and QUIC application.¶
A second study [I-D.templin-dtn-ltpfrag] showed that GSO/GRO also improved performance for the Licklider Transmission Protocol (LTP) [RFC5326] for small- to medium-sized segments. Historically, the NFS protocol also saw significant performance increases using larger (single-segment) UDP datagrams even when IP fragmentation is invoked, and LTP still follows this profile today. Moreover, LTP shows this (single-segment) performance increase profile extending to the largest possible segment size which suggests that additional performance gains may be possible using (multi-segment) IP parcels that exceed 65535 octets.¶
TCP also benefits from larger packet sizes, and efforts have investigated TCP performance using jumbograms internally with changes to the Linux GSO/GRO facilities [BIG-TCP]. The idea is to use the jumbo payload internally and to allow GSO and GRO to use buffer sizes larger than 65535 octets, but with the understanding that links that support jumbos natively are not yet widely available. Hence, IP parcels provide a packaging that can be considered in the near term under current deployment limitations.¶
The issue with sending large packets is that they are often lost at links with smaller Maximum Transmission Units (MTUs), and the resulting Packet Too Big (PTB) message may be lost somewhere in the path back to the original source. This "Path MTU black hole" condition can degrade performance unless robust path probing techniques are used; however, the best-case performance always occurs when no packets are lost due to size restrictions.¶
These considerations therefore motivate a design where the maximum segment size should be no larger than 65535 octets (minus headers), while parcels that carry the segments may themselves be significantly larger. Then, even if a middlebox needs to sub-divide the parcels into smaller sub-parcels to forward further toward the final destination, an important performance optimization for both the original source and final destination can be realized.¶
An analogy: when a consumer orders 50 small items from a major online retailer, the retailer does not ship the order in 50 separate small boxes. Instead, the retailer puts as many of the small boxes as possible into one or a few larger boxes (or parcels) then places the parcels on a semi-truck or airplane. The parcels arrive at a regional distribution center where they may be further redistributed into slightly smaller parcels that get delivered to the consumer. But most often, the consumer will only find one or a few parcels at his doorstep and not 50 individual boxes. This greatly reduces handling overhead for both the retailer and consumer.¶
IP parcel formation is invoked by an upper layer protocol (identified by the 5-tuple as above) when it emits a data buffer containing the concatenation of up to 64 segments. All non-final segments MUST be equal in length while the final segment MUST NOT be larger but MAY be smaller. Each non-final segment MUST be no larger than 65535 octets minus the length of the IP header plus extensions, minus the length of an additional IPv6 header in case an encapsulation middlebox is visited on the path (see: Section 5). The upper layer protocol then presents the buffer and non-final segment size to the IP layer which appends a single IP header (plus any extension headers) before presenting the parcel to either the adaptation layer or the outgoing network interface itself (see: Section 5).¶
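The segment rules above can be sketched as a small Python helper. This is an illustration only: the function name and the `header_overhead` parameter (standing in for the IP header, extensions, and any additional encapsulation header) are assumptions and not part of this specification.

```python
MAX_SEGMENTS = 64          # upper bound on segments per parcel, from the text
MAX_SEGMENT_SIZE = 65535   # ceiling before headers are subtracted


def build_parcel_buffer(segments, header_overhead=0):
    """Concatenate upper layer segments and return (buffer, segment_size).

    header_overhead is a hypothetical parameter: the number of header
    octets to subtract from the 65535-octet segment ceiling.
    """
    if not 1 <= len(segments) <= MAX_SEGMENTS:
        raise ValueError("a parcel carries between 1 and 64 segments")
    # Size of the non-final segments (or of the only segment for a singleton)
    seg_len = len(segments[0])
    if seg_len > MAX_SEGMENT_SIZE - header_overhead:
        raise ValueError("segment exceeds 65535 octets minus headers")
    # All non-final segments MUST be equal in length
    for seg in segments[:-1]:
        if len(seg) != seg_len:
            raise ValueError("non-final segments must be equal in length")
    # The final segment MUST NOT be larger, but MAY be smaller
    if len(segments[-1]) > seg_len:
        raise ValueError("final segment is larger than the others")
    return b"".join(segments), seg_len
```

The returned pair corresponds to the buffer and non-final segment size that the upper layer protocol hands to the IP layer.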
For IPv4, the IP layer prepares the parcel by appending an IPv4 header with a Jumbo Payload option formed as follows:¶
   +--------+--------+--------+--------+--------+--------+
   |00001011|00000110|       Jumbo Payload Length        |
   +--------+--------+--------+--------+--------+--------+¶
where option code is set to '11' and option length is set to '6' (which distinguishes option '11' from its former (deprecated) use by [RFC1063]). "Jumbo Payload Length" is a 32-bit unsigned integer value (in network byte order) set to the lengths of the IPv4 header plus all concatenated segments. The IP layer next sets the IPv4 header DF bit to 1, then sets the IPv4 header Total Length field to the length of the IPv4 header plus the length of the first segment only. Note that the IP layer can form true IPv4 jumbograms (as opposed to parcels) by instead setting the IPv4 header Total Length field to the length of the IPv4 header plus options (see: Section 9).¶
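As a rough sketch, the 6-octet IPv4 Jumbo Payload option can be encoded and decoded with Python's `struct` module. The function names are illustrative assumptions; only the field values come from the text above.

```python
import struct

IPV4_JUMBO_OPTION_TYPE = 11   # '00001011'
IPV4_JUMBO_OPTION_LEN = 6     # '00000110'


def pack_ipv4_jumbo_option(jumbo_payload_length):
    """Return the option as 6 octets in network byte order."""
    return struct.pack("!BBI", IPV4_JUMBO_OPTION_TYPE,
                       IPV4_JUMBO_OPTION_LEN, jumbo_payload_length)


def unpack_ipv4_jumbo_option(data):
    """Return the Jumbo Payload Length, or raise if the option is not recognized."""
    opt_type, opt_len, length = struct.unpack("!BBI", data[:6])
    # Length 6 is what distinguishes this option from the older use of code 11
    if opt_type != IPV4_JUMBO_OPTION_TYPE or opt_len != IPV4_JUMBO_OPTION_LEN:
        raise ValueError("not an IPv4 Jumbo Payload option")
    return length
```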
For IPv6, the IP layer forms a parcel by appending an IPv6 header with a Jumbo Payload option [RFC2675] the same as for IPv4 above where "Jumbo Payload Length" is set to the lengths of the IPv6 Hop-by-Hop Options header and any other extension headers present plus all concatenated segments. The IP layer next sets the IPv6 header Payload Length field to the lengths of the IPv6 Hop-by-Hop Options header and any other extension headers present plus the length of the first segment only. As with IPv4 the IP layer can form true IPv6 jumbograms (as opposed to parcels) by instead setting the IPv6 header Payload Length field to 0 (see: [RFC2675]).¶
An IP parcel therefore has the following structure:¶
   +--------+--------+--------+--------+
   |                                   |
   ~       Segment J (K octets)        ~
   |                                   |
   +--------+--------+--------+--------+
   ~                                   ~
   ~                                   ~
   +--------+--------+--------+--------+
   |                                   |
   ~       Segment 3 (L octets)        ~
   |                                   |
   +--------+--------+--------+--------+
   |                                   |
   ~       Segment 2 (L octets)        ~
   |                                   |
   +--------+--------+--------+--------+
   |                                   |
   ~       Segment 1 (L octets)        ~
   |                                   |
   +--------+--------+--------+--------+
   |     IP Header Plus Extensions     |
   ~    {Total, Payload} Length = M    ~
   |     Jumbo Payload Length = N      |
   +--------+--------+--------+--------+¶
where J is the total number of segments (between 1 and 64), L is the length of each non-final segment which MUST NOT be larger than 65535 octets (minus headers as above) and K is the length of the final segment which MUST NOT be larger than L. The values M and N are then set to the length of the IP header plus extensions for IPv4 or to the length of the extensions only for IPv6, then further calculated as follows:¶
   M = M + ((J-1) ? L : K)¶
   N = N + (((J-1) * L) + K)¶
Note: a "singleton" parcel is one that includes only the IP header plus extensions with a single segment of length K, while a "null" parcel is a singleton with K=0, i.e., a parcel consisting of only the IP header plus extensions with no octets beyond.¶
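The M and N values follow from the earlier field definitions: the {Total, Payload} Length covers the headers plus the first segment only, while the Jumbo Payload Length covers the headers plus all segments. A minimal Python sketch under that reading is shown below; the function name and the `header_len` parameter are assumptions (for IPv4 it would be the IPv4 header plus options, for IPv6 the extension headers only).

```python
def parcel_lengths(header_len, J, L, K):
    """Return (M, N) for a parcel of J segments.

    L is the non-final segment length and K the final segment length.
    For a singleton (J == 1) only K is used.
    """
    if not 1 <= J <= 64:
        raise ValueError("J must be between 1 and 64")
    if J > 1 and K > L:
        raise ValueError("final segment must not be larger than L")
    first_segment = L if J > 1 else K
    M = header_len + first_segment          # headers plus first segment only
    N = header_len + (J - 1) * L + K        # headers plus all segments
    return M, N
```

For example, with 28 header octets, three segments, L = 1000 and K = 400, this yields M = 1028 and N = 2428. A singleton yields M equal to N, and a null parcel yields both equal to the header length.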
The IP layer next presents the parcel to the outgoing network interface. For ordinary IP interfaces, the IP layer simply forwards the parcel over the underlying link the same as for any IP packet after which it may then be forwarded by any number of routers over additional parcel-capable IP links. If any next hop IP link in the path either does not support parcels or configures an MTU that is too small to transit the parcel without fragmentation, the router instead opens the parcel and forwards each enclosed segment as a separate IP packet (i.e., by appending a copy of the parcel's IP header to each segment but without including the Jumbo Payload option). Or, if the router does not recognize parcels at all, it drops the parcel and (for IPv6) may return an ICMP "Parameter Problem" message according to [RFC2675].¶
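The "open the parcel" step can be illustrated with the following Python sketch. It is deliberately simplified: `header` is assumed to be an already-rebuilt per-packet header (Jumbo Payload option removed), and the per-packet length and checksum fixups a real router would perform are omitted. The function name is hypothetical.

```python
def open_parcel(header, payload, first_seg_len):
    """Split a parcel payload into one packet per segment.

    first_seg_len is the length of the first segment, which a router can
    derive from the {Total, Payload} Length field minus the header length.
    """
    if first_seg_len <= 0:
        # Null parcel: nothing beyond the headers
        return [header]
    packets = []
    for offset in range(0, len(payload), first_seg_len):
        # The last slice may be shorter, matching a smaller final segment
        packets.append(header + payload[offset:offset + first_seg_len])
    return packets
```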
If the outgoing network interface is an OMNI interface [I-D.templin-6man-omni], the OMNI Adaptation Layer (OAL) of this First Hop Segment (FHS) OAL node forwards the parcel to the next OAL hop which may be either an OAL intermediate node or the Last Hop Segment (LHS) OAL node (which may also be the final destination itself). The OAL assigns a monotonically-incrementing (modulo 127) "Parcel ID" and subdivides the parcel into sub-parcels no larger than the maximum of the path MTU to the next hop or 65535 octets (minus the length of encapsulation headers) by determining the number of segments of length L that can fit into each sub-parcel under these size constraints. For example, if the OAL determines that a sub-parcel can contain 3 segments of length L, it creates sub-parcels with the first containing segments 1-3, the second containing segments 4-6, etc. and with the final containing any remaining segments. The OAL then appends an identical IP header plus extensions to each sub-parcel while resetting M and N in each according to the above equations with J set to 3 and K set to L for each non-final sub-parcel and with J set to the remaining number of segments for the final sub-parcel.¶
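The subdivision described above might be sketched as follows. The function name and the `max_payload` parameter (the room left for segments in one sub-parcel after the size limit and headers are accounted for) are illustrative assumptions.

```python
def subdivide_parcel(J, L, K, max_payload):
    """Return a list of (first_segment, segment_count, final_len) tuples.

    Segments are numbered from 1. final_len is the length of the last
    segment in that sub-parcel: L for non-final sub-parcels, K for the
    sub-parcel that carries the final segment of the original parcel.
    """
    if J == 1:
        return [(1, 1, K)]
    per_sub_parcel = max_payload // L   # whole segments of length L that fit
    if per_sub_parcel < 1:
        raise ValueError("not even one segment fits in a sub-parcel")
    sub_parcels = []
    start = 1
    while start <= J:
        count = min(per_sub_parcel, J - start + 1)
        holds_final = (start + count - 1 == J)
        sub_parcels.append((start, count, K if holds_final else L))
        start += count
    return sub_parcels
```

With 8 segments, L = 1000, K = 400 and room for 3500 octets of segments, this produces sub-parcels covering segments 1-3, 4-6 and 7-8, in line with the example in the text.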
The OAL next performs IP encapsulation on each sub-parcel with destination set to the next hop IP address then inserts an IPv6 Fragment Header after the IP encapsulation header, i.e., even if the encapsulation header is IPv4, even if no actual fragmentation is needed and/or even if the Jumbo Payload option is present. The OAL then assigns a randomly-initialized 32-bit Identification number that is monotonically-incremented for each consecutive sub-parcel, then performs IPv6 fragmentation over the sub-parcel if necessary to create fragments small enough to traverse the path to the next OAL hop while writing the Parcel ID and setting or clearing the "Parcel (P)" and "(More) Sub-Parcels (S)" bits in the Fragment Header of the first fragment (see: [I-D.templin-6man-fragrep]). (The OAL sets P to 1 for a parcel or to 0 for a non-parcel. When P is 1, the OAL next sets S to 1 for non-final sub-parcels or to 0 if the sub-parcel contains the final segment.) The OAL then forwards each IP encapsulated packet/fragment to the next OAL hop.¶
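The per-sub-parcel Fragment Header values can be illustrated with a short Python sketch. The dictionary layout, the function name, and the wraparound of Identification at 2^32 are assumptions made for illustration; the actual bit positions are defined in [I-D.templin-6man-fragrep].

```python
import secrets


def tag_sub_parcels(num_sub_parcels, parcel_id, first_identification=None):
    """Return the Identification, Parcel ID, P and S values for each sub-parcel."""
    if first_identification is None:
        # Randomly-initialized 32-bit Identification
        first_identification = secrets.randbits(32)
    tags = []
    for index in range(num_sub_parcels):
        is_final = (index == num_sub_parcels - 1)
        tags.append({
            # Incremented for each consecutive sub-parcel (wraparound assumed)
            "identification": (first_identification + index) % 2**32,
            "parcel_id": parcel_id,
            "P": 1,                      # this is a parcel, not an ordinary packet
            "S": 0 if is_final else 1,   # more sub-parcels follow unless final
        })
    return tags
```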
When the next OAL hop receives the encapsulated IP fragments or whole packets, it reassembles if necessary. If the P flag in the first fragment is 0, the next hop then processes the reassembled entity as an ordinary IP packet; otherwise it continues processing as a sub-parcel. If the next hop is an OAL intermediate node, it retains the sub-parcels along with their Parcel ID and Identification values for a brief time in hopes of re-combining with peer sub-parcels of the same original parcel identified by the 4-tuple consisting of the IP encapsulation source and destination, Identification and Parcel ID. The combining entails the concatenation of the segments included in sub-parcels with the same Parcel ID and with Identification values within 64 of one another to create a larger sub-parcel possibly even as large as the entire original parcel. Order of concatenation is not important, with the exception that the final sub-parcel (i.e., the one with S set to 0) must occur as the final concatenation before transmission. The OAL then appends a common IP header plus extensions to each re-combined sub-parcel while resetting M and N in each according to the above equations with J, K and L set accordingly.¶
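A simplified Python sketch of the re-combining step is shown below. It groups sub-parcels by encapsulation source, destination and Parcel ID and places the sub-parcel with S set to 0 last. The check that Identification values lie within 64 of one another, the retention timer, and the header rewrite are omitted, and the dictionary keys are hypothetical.

```python
from collections import defaultdict


def recombine_sub_parcels(sub_parcels):
    """Return {(src, dst, parcel_id): [segments...]} with the final sub-parcel last."""
    groups = defaultdict(list)
    for sub_parcel in sub_parcels:
        key = (sub_parcel["src"], sub_parcel["dst"], sub_parcel["parcel_id"])
        groups[key].append(sub_parcel)
    combined = {}
    for key, members in groups.items():
        # Stable sort: non-final sub-parcels (S == 1) keep their arrival
        # order, and the final sub-parcel (S == 0) moves to the end
        ordered = sorted(members, key=lambda sp: sp["S"] == 0)
        combined[key] = [seg for sp in ordered for seg in sp["segments"]]
    return combined
```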
This OAL intermediate node next forwards the re-combined sub-parcel(s) to the next hop toward the LHS OAL node using encapsulation the same as specified above. (The intermediate node MUST ensure that the S flag remains set to 0 in the sub-parcel that contains the final segment.) When the parcel or sub-parcels arrive at the LHS OAL node, the OAL re-combines them into the largest possible sub-parcels while honoring the S flag. If the LHS OAL node is also the final destination, it delivers the sub-parcels to upper layers which act on the enclosed 5-tuple information supplied by the original source. If the LHS OAL node is not the final destination, it instead forwards each sub-parcel the same as for an ordinary IP packet, as discussed above.¶
Note: while the LHS OAL node may be tempted to re-combine the sub-parcels of multiple different parcels with identical upper layer protocol 5-tuples and with non-final segments of identical length, this process could become complicated when the different parcels each have final segments of diverse lengths. Since this might interfere with any perceived performance advantages, the decision of whether and how to perform inter-parcel concatenation is an implementation matter.¶
Note: some IPv6 fragmentation and reassembly implementations may require a well-formed IPv6 header to perform their operations. When the encapsulation is based on IPv4, such implementations translate the encapsulation header into an IPv6 header with IPv4-Mapped IPv6 addresses before performing the fragmentation/reassembly operation, then restore the original IPv4 header before further processing.¶
To determine whether parcels are supported over at least a leading portion of the forward path toward the final destination, the original source can send a "Parcel Probe" IP parcel that contains an upper layer protocol probe segment (e.g., a data segment, an ICMP Echo Request message, etc.). The purpose of the probe is to elicit either a "Parcel Reply" or an ordinary upper layer protocol probe reply from the final destination.¶
If the original source receives either form of reply, it marks the path as "parcels supported" and ignores any ICMP [RFC0792][RFC4443] and/or Packet Too Big (PTB) messages [RFC1191][RFC8201] concerning the probe. If the original source instead receives no reply, it marks the path as "parcels not supported" and may regard any ICMP and/or PTB messages concerning the probe as indications of a possible middlebox restriction.¶
The original source can therefore send Parcel Probes in parallel with sending real data as ordinary IP packets. If the original source receives a probe reply, it can begin using IP parcels.¶
Parcel Probes extend the Jumbo Payload option to include a 4-octet "Path MTU" value into which routers write the minimum link MTU observed the same as described in [RFC1063][I-D.ietf-6man-mtu-option]. Parcel Probes can also include a probe segment to test for link restrictions the same as in [RFC4821][RFC8899].¶
The original source sends Parcel Probes unidirectionally in the forward path to the final destination to elicit a probe reply, since it may be the case that IP parcels are supported only in the forward path and not in the return path. The Parcel Probe may be dropped in the forward path by any node that does not recognize IP parcels, but a probe reply must not be dropped even if IP parcels are not recognized in the return path.¶
In order to support forward path probing and return path probe replies, the Jumbo Payload option in a Parcel Probe is extended as follows:¶
                     +--------+--------+
                     |  Type  | Length |
   +--------+--------+--------+--------+
   |       Jumbo Payload Length        |
   +--------+--------+--------+--------+
   |               PMTU                |
   +--------+--------+--------+--------+
   |               Nonce               |
   +--------+--------+--------+--------+¶
For IPv4, the original source sets Type to '00001011' and Length to '00001110' - this reuses the (obsoleted) IPv4 Probe MTU option originally defined in [RFC1063]. The original source then sets Jumbo Payload Length according to the length of the included probe segment, sets PMTU to the MTU of the directly-connected first-hop network, sets Nonce to a random 32-bit value and sends the probe. According to [RFC7126], middleboxes (i.e., routers, security gateways, firewalls, etc.) that do not observe this specification SHOULD drop IP packets that contain an IPv4 Probe MTU option. Middleboxes that observe this specification instead process it as an extended Jumbo Payload option according to the above format and compare the PMTU field with the MTUs of the inbound and outbound links for the probe. If either MTU is lower than the value in the PMTU field of the option, the middlebox sets the option value to the lower MTU and forwards the probe to the next hop.¶
For IPv6, the original source sets Type to '11000010' and Length to '00001100' - this provides a new form of the Jumbo Payload option originally defined in [RFC2675]. The original source then sets Jumbo Payload Length according to the length of the included probe segment, sets PMTU to the MTU of the directly-connected first-hop network, sets Nonce to a random 32-bit value and sends the probe. According to [RFC2675], middleboxes (i.e., routers, security gateways, firewalls, etc.) that do not observe this specification SHOULD drop packets with a non-zero IPv6 Payload Length that also include a Jumbo Payload option. Middleboxes that observe this specification instead process the option according to the above format and compare the PMTU field with the MTUs of the inbound and outbound links for the probe. If either MTU is lower than the value in the PMTU field of the option, the middlebox sets the option value to the lower MTU and forwards the probe to the next hop.¶
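The extended option and the middlebox PMTU update can be sketched in Python as follows. The function names are assumptions; the Type and Length values are those given in the text (Length 14 for IPv4, which counts the Type and Length octets, and 12 for IPv6, which does not).

```python
import struct

PROBE_OPTION_FORMAT = "!BBIII"   # Type, Length, Jumbo Payload Length, PMTU, Nonce


def pack_probe_option(ip_version, jumbo_payload_length, pmtu, nonce):
    """Return the 14-octet extended Jumbo Payload option for a Parcel Probe."""
    if ip_version == 4:
        opt_type, opt_len = 0x0B, 14    # '00001011', '00001110'
    elif ip_version == 6:
        opt_type, opt_len = 0xC2, 12    # '11000010', '00001100'
    else:
        raise ValueError("ip_version must be 4 or 6")
    return struct.pack(PROBE_OPTION_FORMAT, opt_type, opt_len,
                       jumbo_payload_length, pmtu, nonce)


def update_probe_pmtu(option, inbound_mtu, outbound_mtu):
    """Middlebox step: lower the PMTU field if either link MTU is smaller."""
    opt_type, opt_len, length, pmtu, nonce = struct.unpack(PROBE_OPTION_FORMAT, option)
    pmtu = min(pmtu, inbound_mtu, outbound_mtu)
    return struct.pack(PROBE_OPTION_FORMAT, opt_type, opt_len, length, pmtu, nonce)
```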
For both IPv4 and IPv6, if any next hop IP link in the path either does not support parcels or configures an MTU that is too small to transit the parcel without fragmentation, the router instead opens the parcel and forwards the enclosed segment as an ordinary IP packet (i.e., by appending a copy of the parcel's IP header to the segment but without including the Jumbo Payload option).¶
As a result of the above forwarding, the final destination may receive either an ordinary IP packet containing an upper layer protocol probe segment or a properly-formed Parcel Probe. In the former case, the destination returns an ordinary probe reply according to the upper layer protocol. In the latter case, the destination discards the included upper layer protocol probe segment and returns a Parcel Reply message.¶
The destination prepares a Parcel Reply consisting of an IP header of the same protocol version that appeared in the Parcel Probe with source and destination addresses reversed, with {Protocol, Next Header} set to the value '60' (i.e., "IPv6 Destination Option") and with an IPv6 Destination Option header with Next Header set to the value '59' (i.e., "IPv6 No Next Header") [RFC8200]. The destination next copies the extended Jumbo Payload option received in the Parcel Probe as the sole Destination Option (and for IPv4 resets Type to '11000010' and Length to '00001100') and includes no other octets beyond the end of the option. The destination finally sets the IP header {Total, Payload} Length field according to the length of the included Destination Option and returns the message to the source. (Since filtering middleboxes may drop IPv4 packets with Protocol '60' the destination should wrap an IPv4 Parcel Reply in UDP/IPv4 headers with the IPv4 source and destination addresses copied from the Parcel Reply and with UDP port numbers set to the UDP port number for OMNI [I-D.templin-6man-omni].)¶
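For the IPv6 case, a Parcel Reply could be assembled roughly as in the Python sketch below. The function name and the default Hop Limit of 64 are assumptions. The Hdr Ext Len value of 1 follows from the [RFC8200] rule that the field counts 8-octet units beyond the first 8 octets, since the 2-octet Destination Options header plus the 14-octet option total 16 octets.

```python
import ipaddress
import struct

NEXT_HEADER_DEST_OPTS = 60   # "IPv6 Destination Option"
NEXT_HEADER_NONE = 59        # "IPv6 No Next Header"


def build_ipv6_parcel_reply(probe_src, probe_dst, probe_option, hop_limit=64):
    """Return an IPv6 Parcel Reply that echoes the probe's extended option."""
    if len(probe_option) != 14:
        raise ValueError("extended Jumbo Payload option must be 14 octets")
    # Copy the option, forcing Type/Length to the IPv6 encoding
    option = bytes([0xC2, 12]) + probe_option[2:]
    # Destination Options header: Next Header, Hdr Ext Len, then the sole option
    dest_opts = struct.pack("!BB", NEXT_HEADER_NONE, 1) + option
    # Fixed IPv6 header: version 6, Payload Length, Next Header, Hop Limit
    header = struct.pack("!IHBB", 6 << 28, len(dest_opts),
                         NEXT_HEADER_DEST_OPTS, hop_limit)
    # Source and destination addresses are reversed relative to the probe
    header += ipaddress.IPv6Address(probe_dst).packed
    header += ipaddress.IPv6Address(probe_src).packed
    return header + dest_opts
```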
After sending a Parcel Probe the original source may therefore receive either an ordinary upper layer protocol probe reply or a properly-formed Parcel Reply (see above). In the former case, the original source discovers that IP parcels are supported over a leading portion of the path toward the final destination and that segments of the size indicated by the probe reply can reach the final destination. In the latter case, the original source matches the Parcel Reply Jumbo Payload Length and Nonce values with the values it sent in the Parcel Probe and discards the message if the values do not match. Otherwise, the original source discovers that IP parcels are supported over the entire path to the final destination and that both Parcels and segments of the size indicated by the PMTU value are supported over the entire path. (If the PMTU value is larger than 65535 octets, the maximum Parcel size may be set to this larger value, while the maximum segment size is limited to 65535 octets minus headers.)¶
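The Parcel Reply check at the original source might look like the following Python sketch. The function name, the returned dictionary, and the `header_len` parameter are assumptions; in particular, subtracting headers from the segment size when the PMTU is below 65535 is one reading of the text rather than a stated rule.

```python
def evaluate_parcel_reply(sent_length, sent_nonce,
                          reply_length, reply_nonce, reply_pmtu, header_len=0):
    """Return size limits learned from a Parcel Reply, or None to discard it."""
    # Discard the reply unless both echoed values match what was sent
    if reply_length != sent_length or reply_nonce != sent_nonce:
        return None
    return {
        # Parcels may be as large as the reported path MTU
        "max_parcel_size": reply_pmtu,
        # Segments stay capped at 65535 octets minus headers
        "max_segment_size": min(reply_pmtu, 65535) - header_len,
    }
```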
Note: In some environments, Parcel Replies may require an authentication signature encapsulation for added security (see: [I-D.templin-6man-omni]).¶
Each segment of a (multi-segment) IP parcel includes its own upper layer protocol integrity check. This allows for IP parcels to support much stronger integrity for the same amount of upper layer protocol data in comparison with an ordinary IP packet or Jumbogram containing only a single segment. The integrity checks must then be verified at the final destination, which accepts any segments with correct integrity while discarding any corrupted segments and counting them as a loss event.¶
IP parcels can range in length from as small as only the IP header sizes to as large as the IP headers plus (64 * (65535 minus headers)) octets. Although link layer integrity checks provide sufficient protection for contiguous data blocks up to approximately 9KB, reliance on the presence of link-layer integrity checks may not be possible over links such as tunnels. Moreover, the segment contents of a received parcel may arrive in an incomplete and/or rearranged order with respect to their original packaging.¶
For these reasons, the OAL at each hop of an OMNI link includes an integrity check when it performs IP fragmentation on a sub-parcel, with the integrity verified during reassembly at the next hop.¶
Section 3 of [RFC2675] provides a list of certain conditions to be considered as errors. In particular:¶
Implementations that obey this specification ignore these conditions and do not consider them as errors.¶
By defining a new IPv4 Jumbo Payload option, this document also implicitly enables an IPv4 jumbogram service defined as an IPv4 packet with Total Length set to the length of the IPv4 header plus extensions only, and with a Jumbo Payload option in the IPv4 extension headers. All other aspects of IPv4 jumbograms are the same as for IPv6 jumbograms [RFC2675].¶
Common widely-deployed implementations include facilities such as TCP Segmentation Offload (TSO) and Generic Segmentation/Receive Offload (GSO/GRO). These facilities provide a robust (but not standardized) service that has been shown to improve performance in many instances. Implementation of the IP parcel service is a work in progress.¶
The IANA is instructed to change the "MTUP - MTU Probe" entry in the 'ip option numbers' registry to the "JUMBO - IPv4 Jumbo Payload" option. The Copy and Class fields must both be set to 0, and the Number and Value fields must both be set to '11'. The reference must be changed to this document (RFCXXXX).¶
Original sources match the Jumbo Payload Length and Nonce values in received Parcel Replies with the Parcel Probes they send. If the values match, the Parcel Reply is likely an authentic response to the Parcel Probe. In environments where stronger authentication is necessary, the encapsulating authentication services of OMNI can be used [I-D.templin-6man-omni].¶
Communications networking security is necessary to preserve confidentiality, integrity and availability.¶
This work was inspired by ongoing AERO/OMNI/DTN investigations. The concepts were further motivated through discussions on the intarea and 6man lists.¶
A considerable body of work over recent years has produced useful "segmentation offload" facilities available in widely-deployed implementations.¶