Internet-Draft | Metadata Path | October 2023 |
Dunbar, et al. | Expires 21 April 2024 |
This draft describes a new Metadata Path Attribute and some Sub-TLVs for egress routers to advertise the Metadata about the attached edge services (ES). The Edge Service Metadata can be used by the ingress routers in the 5G Local Data Network to make path selections not only based on the routing cost but also the running environment of the edge services. The goal is to improve latency and performance for 5G edge services.¶
The extension enables an edge service at one specific location to be preferred over other instances with the same IP address (ANYCAST) for receiving data flows from a specific source, such as a specific User Equipment (UE).
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 21 April 2024.¶
Copyright (c) 2023 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
[CATS-Edge-Service] describes the 5G Edge Computing background and how BGP can be used to advertise the running status and environment of the directly attached 5G edge services. Besides the Radio Access, 5G [TS.23.501-3GPP] is characterized by having edge services closer to the Cell Towers, reachable via Local Data Networks (LDN). From an IP network perspective, the 5G LDN is a limited domain [RFC8799] with edge services a few hops away from the ingress nodes. Only selected UE services are considered 5G low latency Edge Services.
This document describes a new Metadata Path Attribute added to a BGP UPDATE message [RFC4271] for egress routers to advertise the Metadata about the directly attached edge services. The Edge Service Metadata in this document includes the site availability index, the site preference, and the service delay prediction index, which are further explained in Section 4.¶
Note: The proposed Edge Service Metadata are not intended for the best-effort services reachable via the public internet. The Edge Service Metadata can be used by the ingress routers to make path selections for selective low latency services based on not only the network distance but also the running environment of the edge cloud sites. The goal is to improve latency and performance for 5G ultra-low latency services.¶
The extension is targeted at a single domain with a Route Reflector (RR) controlling the propagation of the BGP UPDATE. The Edge Service Metadata is only attached to the services (routes) hosted in the 5G edge cloud sites, which are only a small subset of the services initiated from UEs. For example, it is not attached to routes used by UEs to access general internet sites.
The following conventions are used in this document.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
The goal of this Edge Service Metadata Path Attribute is for egress routers to propagate metrics about their running environment to ingress routers, so that the ingress routers can make path selections based not only on the routing cost but also on the running environment of the edge services. Many factors can impact the service delay in an edge data center, such as the computing service capability information, computing service states, and computing resource states of the corresponding edge site. Computing service capability information can record information about the computing power node or the initial deployment information for computing service initialization. Computing service states can include the number of service connections, service duration, and so on. Computing resource states can be detailed information on computing resources such as CPU/GPU, or an abstract metric derived from these detailed parameters to indicate the resource status of the edge site. Many more metrics about the running environment are being discussed in the CATS WG [draft-ldbc-cats-framework]. This document illustrates a few examples of Sub-TLVs of the metrics under the Edge Service Metadata Path Attribute: the Site Preference Index, the Capacity Availability Index, the Service Delay Prediction Index, and the Raw Load Measurement.
This section specifies how those Metadata impact the ingress node's path selections.¶
When an ingress router receives BGP updates for the same IP prefix from multiple egress routers, all these egress routers' loopback addresses are considered as the next hops for the IP prefix. For the selected low latency edge services, the ingress router BGP engine would call an Edge Service Management function that can select paths based on the Edge Service Metadata received. [CATS-Edge-Service] has an exemplary algorithm to compute the weighted path cost based on the Edge Service Metadata carried by the Sub-TLV(s) specified in this document.¶
Section 5 has the detailed description of the Edge Service Metadata influenced optimal path selection.¶
When the ingress router receives a packet and looks up the route in the FIB, it obtains the full path for the destination prefix and encapsulates the packet toward the optimal egress node.
For subsequent packets belonging to the same flow, the ingress router needs to forward them to the same egress router unless the selected egress router is no longer reachable. Keeping packets from one flow to the same egress router, a.k.a. Flow Affinity, is supported by many commercial routers. Most registered EC services have relatively short flows.¶
How Flow Affinity is implemented is out of scope for this document. Appendix A gives one example of how flow affinity can be achieved.
When a UE moves to a new 5G gNB that is anchored to the same UPF, the packets from the UE traverse the same ingress router. Path selection and forwarding behavior are the same as before.
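As a non-normative illustration of the flow affinity behavior described above (and not the example given in Appendix A), the following Python sketch pins each flow to the egress router chosen for its first packet and re-selects only when that egress router is no longer reachable. The 5-tuple flow key, the class name FlowAffinityTable, and the select_egress callback are assumptions introduced here for illustration; they are not defined by this document.

```python
from typing import Callable, Dict, Optional, Set, Tuple

# Hypothetical flow key: (source, destination, protocol, source port, destination port).
FlowKey = Tuple[str, str, int, int, int]


class FlowAffinityTable:
    """Keeps packets of one flow on the same egress router while it stays reachable."""

    def __init__(self, select_egress: Callable[[Set[str]], str]) -> None:
        # select_egress stands in for whatever path selection the ingress router runs.
        self._select = select_egress
        self._flows: Dict[FlowKey, str] = {}

    def egress_for(self, flow: FlowKey, reachable: Set[str]) -> Optional[str]:
        pinned = self._flows.get(flow)
        if pinned is not None and pinned in reachable:
            return pinned  # subsequent packets follow the earlier choice
        if not reachable:
            self._flows.pop(flow, None)
            return None  # no egress router is currently usable
        chosen = self._select(reachable)  # first packet, or pinned egress lost
        self._flows[flow] = chosen
        return chosen
```

In this sketch a flow stays on its original egress router even if a later selection would prefer a different one, which is the property the text above calls Flow Affinity. Entry aging is omitted for brevity.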
If the UE maintains the same IP address when anchored to a new UPF, the directly connected ingress router might use the information passed from a neighboring router to derive the optimal Next Hop for this route. The detailed algorithm is out of the scope of this document.¶
The Metadata Path Attribute is an optional transitive BGP Path attribute to carry metrics and metadata about the edge services attached to the egress router. The Metadata Path Attribute, to be assigned by IANA [RFC2042], consists of a set of Sub-TLVs, and each Sub-TLV contains information for specific metrics of the edge services.¶
Most BGP UPDATE messages do not include the Metadata Path Attribute. For the limited edge services that need to advertise the metadata about the services, the Metadata Path Attribute can be included in a BGP UPDATE message [RFC4271] together with other BGP Path Attributes [IANA-BGP-PARAMS], such as Extended Communities [RFC4360], NEXT_HOP, Tunnel Encapsulation Path Attribute [RFC9012], etc.
The BGP Metadata Path attribute MAY be attached to BGP IPv4/IPv6 Unicast prefixes, BGP Labeled IPv4/IPv6 prefixes [RFC8277], and IPv4/IPv6 Anycast prefixes [RFC4786]. In order to prevent distribution of the BGP Metadata Path Attribute beyond its intended scope of applicability, attribute filtering SHOULD be deployed to remove the BGP Metadata Path attribute at the administrative boundary.¶
A BGP speaker that advertises a path received from one of its neighbors SHOULD advertise the BGP Metadata Path attribute received with the path without modification, as long as the BGP Metadata Path attribute was acceptable. If the path did not come with a BGP Metadata Path attribute, the speaker MAY attach a BGP Metadata Path attribute to the path if configured to do so.
The Metadata Path Attribute MUST contain at least one metadata Sub-TLV. Multiple Metadata Sub-TLVs can be included in a Metadata Path Attribute in one BGP UPDATE message. The content of the Sub-TLVs present in the BGP Metadata Path attribute is determined by the configuration. When a BGP Speaker does not recognize some of the Sub-TLVs within a Metadata Path Attribute in a BGP UPDATE message, the BGP Speaker should forward the received BGP UPDATE message without any change if the attribute is marked as transitive. The domain ingress nodes SHOULD process the recognized Sub-TLVs carried by the Metadata Path Attribute and ignore the unrecognized Sub-TLVs. By default, a BGP speaker does not report unrecognized Sub-TLVs within a Metadata Path Attribute unless it is configured to send a notification to its management system. The ingress node should be configured with an algorithm to combine the recognized metrics carried by the Sub-TLVs within a Metadata Path Attribute of the received BGP UPDATE message.
The metrics Sub-TLVs included in the Metadata Path Attribute apply to all the address families carried in the NLRI field of the BGP UPDATE message [RFC4271]. For a multi-protocol BGP UPDATE message [RFC4760] [RFC7606], the metrics Sub-TLVs included in the Metadata Path Attribute apply to all the AFI/SAFI address families carried by the MP_REACH_NLRI.
All values in the Sub-TLVs are unsigned 32-bit integers.
This section specifies a set of metadata Sub-TLVs for the 5G edge services. A BGP speaker MUST NOT include multiple instances with the same type for the Sub-TLVs specified in this document in one Metadata Path Attribute. A BGP speaker SHOULD NOT include more than one Metadata Path Attribute in one BGP Update message.¶
A BGP UPDATE message that includes the Metadata Path Attribute does not change the BGP Error Handling procedure specified in [RFC7606]. Where more than one of the Sub-TLVs specified in this document is present in a Metadata Path Attribute, they are processed independently. If one of the Sub-TLVs has an invalid value, e.g., a value outside its specified range, the Sub-TLV with the invalid value is ignored by the BGP receiver. By default, no notification is required unless the receiver is configured to send a notification to its management system. All other Sub-TLVs within the Metadata Path Attribute that carry valid values MUST be processed.
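The independent processing described above can be sketched as follows. This is a non-normative Python illustration that assumes the Sub-TLVs have already been decoded into a mapping from Sub-Type to a single integer value. The Sub-Type numbers are those requested in the IANA section, the ranges 1-100 and 0-100 are those this document states for the Preference Index and the Capacity Availability Index, and the full unsigned 32-bit range for the Service Delay Prediction is an assumption because no narrower range is stated. The function name is hypothetical.

```python
from typing import Dict, List, Tuple

U32_MAX = 0xFFFFFFFF

# Sub-Type -> (lowest valid value, highest valid value).
VALID_RANGES: Dict[int, Tuple[int, int]] = {
    1: (1, 100),      # Site Preference Index
    2: (0, 100),      # Capacity (Site) Availability Index
    3: (0, U32_MAX),  # Service Delay Prediction (range is an assumption)
}


def accept_valid_values(sub_tlvs: Dict[int, int]) -> Tuple[Dict[int, int], List[int]]:
    """Return (values to process, Sub-Types ignored), checking each Sub-TLV on its own."""
    accepted: Dict[int, int] = {}
    ignored: List[int] = []
    for sub_type, value in sub_tlvs.items():
        bounds = VALID_RANGES.get(sub_type)
        if bounds is None or not bounds[0] <= value <= bounds[1]:
            ignored.append(sub_type)  # unknown type or out-of-range value: skip only this one
            continue
        accepted[sub_type] = value    # valid Sub-TLVs are still processed
    return accepted, ignored
```

For example, a Preference Index of 150 would be ignored while a Capacity Availability Index of 80 carried in the same attribute would still be processed.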
Different services might have different preference index values configured for the same site. For example, Service-A requires high computing power, Service-B requires high bandwidth among its microservices, and Service-C requires high-volume storage capacity. For a DC with relatively low storage capacity but high bisection bandwidth, the preference index value is higher for Service-B and lower for Service-C. The Site Preference Index can also be used to achieve stickiness for some services.
It is out of the scope of this document how the preference index is determined or configured.¶
The Preference Index Sub-TLV has the following format:¶
When the Preference Index value is outside the range of 1-100, the value carried in this Sub-TLV is ignored.¶
The Capacity Availability Index indicates whether an edge site, which can be a building, a floor, a pod, a row of server racks, etc., has full capacity, reduced capacity, or is completely out of service. The value ranges from 0 to 100 and is interpreted as a percentage: 100 indicates that the site is fully functional, 0 indicates that the site is entirely out of service, and 50 indicates that the site is 50% degraded.
Cloud Site/Pod failures and degradation include, but are not limited to, a site capacity degradation or an entire site going down for a variety of reasons, such as a fiber cut on the links connecting to the site or among pods, cooling failures, insufficient backup power, cyberattacks, too many changes outside of the maintenance window, etc. Fiber cuts are not uncommon within a Cloud site or between sites.
When those failure events happen, the edge (egress) router itself may still be running normally. Therefore, the ingress routers with paths to the egress router cannot use BFD to detect the failures.
When there is a failure occurring at an edge site (or a pod), many instances can be impacted. In addition, the routes (i.e., the IP addresses) in the site might not be aggregated nicely. Instead of many BGP UPDATE messages to the ingress routers for all the instances impacted, the egress router can send one single BGP UPDATE indicating the capacity availability of the site. The ingress routers can switch all or a portion of the instances that are associated with the site depending on how much the site is degraded.¶
The Capacity Availability Index Sub-TLV:¶
An egress router must attach the Site Capacity Availability Index Sub-TLV to the BGP UPDATE message for the registered low latency edge services so that the ingress routers can associate the site-reference identifier with the route in the routing table.
However, it is unnecessary to include the Site Capacity Availability Index for every BGP Update message if there is no change to the site-reference identifier or the Capacity Availability value for the service instances.¶
When an ingress router receives a BGP UPDATE message from Router-X with a prefix of the loopback for Router-X and the Metadata Path Attribute with the Capacity Availability Index Sub-TLV, the new capacity availability index value is applied to all routes that meet the following two constraints: a) they have Router-X as their next hop, and b) they are associated with the Site-ID. When there are failures or degradation at a site, the corresponding egress router can send one BGP UPDATE carrying the Site Capacity Availability Index with the egress router's loopback address.
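The two constraints above can be illustrated with a short, non-normative Python sketch. The Route record, its field names, the integer representation of the Site-ID, and the default value of 100 for routes that have not yet received an index are assumptions made here for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Route:
    prefix: str
    next_hop: str                     # loopback address of the egress router
    site_id: int                      # hypothetical representation of the Site-ID
    capacity_availability: int = 100  # assumed default: fully functional


def apply_site_capacity(routes: List[Route], egress: str, site_id: int, value: int) -> int:
    """Apply one site-level index to every matching route; return how many changed."""
    if not 0 <= value <= 100:
        return 0  # out-of-range value: the Sub-TLV is ignored
    updated = 0
    for route in routes:
        # a) the route has the advertising egress router as its next hop, and
        # b) the route is associated with the advertised Site-ID.
        if route.next_hop == egress and route.site_id == site_id:
            route.capacity_availability = value
            updated += 1
    return updated
```

A single UPDATE for the egress router's loopback therefore changes the index on many service routes at once, without one UPDATE per affected instance.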
It is desirable for an ingress router to select a site with the shortest processing time for an ultra-low latency service. But it is not easy to predict which site has "the fastest processing time" or "the shortest processing delay" for an incoming service request because:¶
Even though utilization measurements, like those below, are collected by most data centers, they cannot indicate which site has the shortest processing time. A service request might be processed faster on Site-A even if Site-A is overutilized.¶
The remaining available resource at a site is a more reasonable indication of the processing delay for future service requests.
The Service Delay Prediction Index is a value that predicts processing delays at the site for future service requests. The higher the value, the longer the predicted delay.
While out of scope, we assume there is an algorithm that can derive the Service Delay Prediction Index, which is then assigned to the egress router. When the Service Delay Prediction value is updated, which can be triggered by a change in available resources, etc., the egress router can attach the updated Service Delay Prediction value in a Sub-TLV under the Metadata Path Attribute of the BGP UPDATE message sent to the ingress routers.
When the detailed running status of the data centers is not exposed to the network operator, historic traffic patterns through the egress nodes can be utilized to predict the load on a specific service. For example, when the traffic volume to one service at one data center suddenly increases by a large percentage compared with the average over the past 24 hours, it is likely caused by a larger than normal demand for the service. When this happens, another data center with lower-than-average traffic volume for the same service might have a shorter processing time for the same service.
Here are some measurements that can be utilized to derive the Service Delay Prediction for a service ID:
The Service Delay Prediction Index can be derived from LoadIndex/24Hour-Average. A higher value means a longer delay prediction. The egress router can use the ServiceDelayPred Sub-TLV to inform the ingress routers of the delay prediction derived from the traffic pattern.
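As a non-normative sketch of the LoadIndex/24Hour-Average ratio above, the following Python function turns the ratio into an unsigned 32-bit integer, since all Sub-TLV values are unsigned 32-bit integers. The scaling factor of 100 (so that 100 means "at the 24-hour average"), the handling of a zero average, and the function name are assumptions made for illustration; this document does not define them.

```python
U32_MAX = 0xFFFFFFFF


def service_delay_prediction(load_index: float, avg_24h: float, scale: int = 100) -> int:
    """Scale LoadIndex / 24Hour-Average into an unsigned 32-bit integer."""
    if load_index < 0 or avg_24h < 0:
        raise ValueError("measurements must be non-negative")
    if avg_24h == 0:
        # Assumption: with no history, any current load is treated as worst case.
        return 0 if load_index == 0 else U32_MAX
    # A higher result means a longer predicted delay; clamp to the 32-bit range.
    return min(U32_MAX, round(scale * load_index / avg_24h))
```

With these assumptions, a site whose current load is 1.5 times its 24-hour average yields 150, and a site at half its average yields 50.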
Note: The proposed IP layer load measurement is only an estimate based on the amount of traffic through the egress router, which might not truly reflect the load of the servers attached to the egress routers. These measurements are listed here only for special deployments where they help the ingress routers select the optimal paths.
When ingress routers have embedded analytics tools that rely on raw measurements, it is useful for the egress router to send the raw measurements.
Raw Load Measurement Sub-TLV has the following format:¶
- Raw-Load-Measurement Sub-Type = 4 (specified in this document): Raw measurements of packets/bytes to/from the Edge Service address.
- The receiver nodes can compute the Service Delay Prediction for the Service based on the raw measurements sent from the egress node and preconfigured algorithms.¶
- Measurement Period: BGP Update period in Seconds or user-specified period.¶
As the service metrics and network delays are in different units, here is an exemplary algorithm for an ingress router to compare the cost to reach the service instances at Site-i or Site-j.¶
   Cost-i = min( w * (ServD-i * CP-j) / (ServD-j * CP-i) + (1-w) * (Pref-j * NetD-i) / (Pref-i * NetD-j) )
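The expression inside min() above can be evaluated as in the following non-normative Python sketch. The reading of the symbols is an assumption inferred from their names: ServD as the Service Delay Prediction Index, CP as the Capacity Availability Index, Pref as the Preference Index, NetD as the network delay to the site, and w as a weight between 0 and 1. The sketch computes only the pairwise expression for Site-i relative to Site-j and does not interpret the outer min(). The names SiteMetrics and relative_cost are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteMetrics:
    serv_d: float  # ServD: Service Delay Prediction Index (assumed meaning)
    cp: float      # CP: Capacity Availability Index (assumed meaning)
    pref: float    # Pref: Preference Index (assumed meaning)
    net_d: float   # NetD: network delay to the site (assumed meaning)


def relative_cost(site_i: SiteMetrics, site_j: SiteMetrics, w: float) -> float:
    """Weighted cost of Site-i relative to Site-j, following the expression above."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be between 0 and 1")
    denominators = (site_j.serv_d, site_i.cp, site_i.pref, site_j.net_d)
    if any(d <= 0 for d in denominators):
        # The ratio form is undefined when a denominator is zero.
        raise ValueError("denominator metrics must be positive")
    service_term = (site_i.serv_d * site_j.cp) / (site_j.serv_d * site_i.cp)
    network_term = (site_j.pref * site_i.net_d) / (site_i.pref * site_j.net_d)
    return w * service_term + (1 - w) * network_term
```

Because every term is a ratio, the different units of the service metrics and the network delay cancel out. Under this reading, two identical sites give a value of 1, and a value below 1 would suggest that Site-i is the cheaper of the two.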
When a set of service Metadata is converted to a simple metric, a decision process is determined by the metric semantics and deployment situations. The goal is to integrate the conventional network decision process with the service Metadata into a unified decision-making process for path selection.¶
When an ingress router receives BGP updates for the same IP address from multiple egress routers, all those egress routers are considered as the next hops for the IP address. For the selected services configured to be influenced by the Edge Service Metadata, the ingress router BGP Decision process [IDR-CUSTOM-DECISION] would trigger the Edge Service Management function to compute the weight to be applied to the route's next hop in the forwarding plane. The decision process is influenced by the Edge Service Metadata associated with the client routes, such as Capacity Availability Index, Site Preference, and Service Delay Prediction Index, in addition to the traditional BGP multipath computation algorithm, such as the Weight, Local preference, Origin, MED, etc., shown below:¶
When any of those metadata values goes to 0, the effect is the same as the routes becoming ineligible via the egress router that originates the metadata UPDATE. When a metadata value merely degrades, the egress router can still remain the optimal next hop, although this is less likely.
Suppose the destination address aa08::4450 can be reached by three next hops (R1, R2, R3). Further, suppose the local BGP Decision Process, based on the traditional network layer policies and metrics, identifies R1 as the optimal next hop for this destination (aa08::4450). If the Edge Service Metadata results in R2 being the optimal next hop for the prefix, the Forwarding Plane will have R2 as the next hop for the destination address aa08::4450.
The Edge Service Metadata influencing next hop selection is different from the metric (or weight) to the next hop. The metric to a next hop can impact many (sometimes tens of thousands of) routes that have the node as their next hop, whereas the Edge Service Metadata only impacts the optimal next hop selection for the subset of client routes that are identified as edge services.
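The behavior described above can be sketched in Python as follows. This is a non-normative illustration: the per-next-hop metadata_cost is a hypothetical, already-combined value (lower is better), only a Capacity Availability Index of 0 is modeled as making a next hop ineligible, and returning no next hop when every candidate is ineligible is an assumption rather than specified behavior.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class EdgeMetadata:
    capacity_availability: int  # 0..100; 0 means the site is out of service
    metadata_cost: float        # hypothetical combined cost, lower is better


def select_next_hop(conventional_best: str,
                    metadata: Dict[str, EdgeMetadata],
                    is_edge_service: bool) -> Optional[str]:
    """Pick the next hop for one route, overriding only edge service routes."""
    if not is_edge_service or not metadata:
        # Ordinary routes keep the result of the traditional decision process.
        return conventional_best
    eligible = {nh: m for nh, m in metadata.items() if m.capacity_availability > 0}
    if not eligible:
        return None  # assumption: no usable egress router for this service
    return min(eligible, key=lambda nh: eligible[nh].metadata_cost)
```

With R1 chosen by the traditional decision process, a degraded but usable R2 with the lowest combined cost, and R3 at availability 0, the sketch returns R2 for an edge service route and R1 for any other route.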
When the BGP custom decision [IDR-CUSTOM-DECISION] is used, the Edge Service Management function would have an algorithm to combine the Edge Service Metadata attributes with the custom decision to derive the optimal next hop for the Edge service routes.
Note: For a BGP UPDATE message that includes the Edge Service Metadata Path Attribute with the egress router's loopback prefix, the Site Capacity Availability Index value is applied to all the NLRIs with the Site-ID indicated in the Edge Service Metadata Path Attribute.
Service Metadata are only distributed to the relevant ingress nodes interested in the Service, which can be configured or automatically formed.¶
For each registered low-latency Service, BGP RT Constrained Distribution [RFC4684] can be used to form the Group interested in the Service. The "Service ID", an IP address prefix, is the Route Target. When an ingress router receives the first packet of a flow destined to a Service ID, the ingress router sends a BGP UPDATE that advertises the Route Target membership NLRI per [RFC4684]. The ingress router must assign a Timer to the Service ID, as the UE that uses the Service ID might move away. Upon receiving a packet destined for the Service ID, the ingress router must refresh the Timer. The ingress router must send a BGP Withdraw UPDATE for the Service ID upon expiration of the Timer.
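The Timer handling above can be sketched as follows. This non-normative Python illustration only tracks when a membership advertisement or a withdraw would be due; sending the BGP UPDATE itself is left to the caller. The class name, the timeout value, and the injectable clock are assumptions introduced for illustration, as this document does not specify a Timer duration.

```python
import time
from typing import Callable, Dict, List


class ServiceInterestTable:
    """Tracks which Service IDs this ingress router is currently interested in."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_s  # hypothetical Timer duration in seconds
        self._clock = clock
        self._expiry: Dict[str, float] = {}

    def on_packet(self, service_id: str) -> bool:
        """Refresh the Timer; return True when membership must be advertised."""
        first = service_id not in self._expiry
        self._expiry[service_id] = self._clock() + self._timeout
        return first

    def expire(self) -> List[str]:
        """Return the Service IDs whose Timer expired and that must be withdrawn."""
        now = self._clock()
        expired = sorted(s for s, t in self._expiry.items() if t <= now)
        for service_id in expired:
            del self._expiry[service_id]
        return expired
```

A caller would advertise the Route Target membership NLRI whenever on_packet() returns True and send the Withdraw UPDATE for every Service ID returned by expire().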
Because metric changes can affect path selection, a Minimum Interval for Metrics Change Advertisement is configured to limit the update frequency and avoid route oscillations. The default is 30 seconds.
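One possible way to enforce such a minimum interval is sketched below in Python. This is a non-normative illustration: the class name, the per-metric state, and the policy of suppressing unchanged values are assumptions. Only the 30-second default comes from the text above.

```python
import time
from typing import Callable, Optional


class MetricAdvertisementLimiter:
    """Allows a changed metric to be advertised at most once per minimum interval."""

    def __init__(self, min_interval_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._min_interval = min_interval_s
        self._clock = clock
        self._last_value: Optional[int] = None
        self._last_sent: Optional[float] = None

    def should_advertise(self, new_value: int) -> bool:
        now = self._clock()
        if new_value == self._last_value:
            return False  # nothing changed since the last advertisement
        if self._last_sent is not None and now - self._last_sent < self._min_interval:
            return False  # too soon; the caller retries with the latest value later
        self._last_value = new_value
        self._last_sent = now
        return True
```

A change that arrives inside the interval is not lost in this sketch as long as the caller offers the latest value again once the interval has elapsed.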
Significant load changes at EC data centers can be triggered by short-term gatherings of UEs, like conventions, lasting a few hours or days, which are too short to justify adjusting EC server capacities among DCs. Therefore, the load metrics change rate can be in the magnitude of hours or days.¶
The Metadata Path Attribute contains a sequence of Sub-TLVs. The Metadata Path Attribute's length determines the total number of octets for all the Sub-TLVs under the Metadata Path Attribute. The sum of the lengths of all the Sub-TLVs under the Metadata Path Attribute should equal the length of the Metadata Path Attribute. If this is not the case, the attribute should be considered malformed, and the "Treat-as-withdraw" procedure of [RFC7606] is applied.
If a Metadata Path attribute can be parsed correctly but contains a Sub-TLV whose type is not recognized by a particular BGP speaker, that BGP speaker MUST NOT consider the attribute to be malformed. Rather, it MUST interpret the attribute as if that Sub-TLV had not been present. If the route carrying the Metadata path attribute is propagated with the attribute, the unrecognized Sub-TLV remains in the attribute.¶
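The length check and the handling of unrecognized Sub-TLVs can be illustrated with the following non-normative Python sketch. The wire layout used here, a 1-octet Sub-Type followed by a 1-octet Length counting only the value octets, is an assumption made for illustration and may differ from the Sub-TLV formats defined in this document. The names parse_sub_tlvs and MalformedAttribute are hypothetical, and keeping the first instance of a repeated Sub-Type is likewise an assumption.

```python
from typing import Dict, List, Tuple

KNOWN_SUB_TYPES = {1, 2, 3, 4}  # Sub-Types requested in the IANA section


class MalformedAttribute(Exception):
    """Raised when the Sub-TLV lengths do not add up to the attribute length."""


def parse_sub_tlvs(attr_value: bytes) -> Tuple[Dict[int, bytes], List[int]]:
    """Return (recognized Sub-TLV values by type, unrecognized Sub-Types seen)."""
    recognized: Dict[int, bytes] = {}
    unrecognized: List[int] = []
    offset = 0
    while offset < len(attr_value):
        if len(attr_value) - offset < 2:
            raise MalformedAttribute("truncated Sub-TLV header")
        sub_type, length = attr_value[offset], attr_value[offset + 1]
        offset += 2
        if offset + length > len(attr_value):
            raise MalformedAttribute("Sub-TLV overruns the attribute")
        value = attr_value[offset:offset + length]
        offset += length
        if sub_type in KNOWN_SUB_TYPES:
            recognized.setdefault(sub_type, value)  # assumption: first instance wins
        else:
            unrecognized.append(sub_type)  # not malformed; treated as if absent
    return recognized, unrecognized
```

A caller would apply the "Treat-as-withdraw" procedure when MalformedAttribute is raised, and would propagate the original attribute octets unchanged so that unrecognized Sub-TLVs remain in the attribute.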
The Edge Service Metadata described in this document are only intended for propagation between the ingress and egress routers of one single BGP domain, i.e., the 5G Local Data Network, which is a limited domain with edge services a few hops away from the ingress nodes. Only selected UE services are considered 5G Edge Services. The 5G LDN is usually managed by one operator, even though the routers can be from different vendors.
The proposed Edge Service Metadata are advertised within the trusted domain of 5G LDN's ingress and egress routers. The ingress routers should not propagate the Edge Service Metadata to any nodes that are not within the trusted domain.¶
IANA is requested to assign a new path attribute from the "BGP Path Attributes" registry. The symbolic name of the attribute is "Metadata", and the reference is [This Document].¶
+=======+=========================+=================+
| Value | Description             | Reference       |
+=======+=========================+=================+
| TBD1  | Metadata Path Attribute | [this document] |
+-------+-------------------------+-----------------+
IANA is requested to create a new sub-registry under the Metadata Path Attribute registry as follows:¶
+==========+==========================+=================+
| Sub-Type | Description              | Reference       |
+==========+==========================+=================+
| 0        | reserved                 | [this document] |
+----------+--------------------------+-----------------+
| 1        | Site Preference Index    | [this document] |
+----------+--------------------------+-----------------+
| 2        | Site Availability Index  | [this document] |
+----------+--------------------------+-----------------+
| 3        | Service Delay Prediction | [this document] |
+----------+--------------------------+-----------------+
| 4        | Raw Load Measurement     | [this document] |
+----------+--------------------------+-----------------+
| 5-254    | unassigned               | [this document] |
+----------+--------------------------+-----------------+
| 255      | reserved                 | [this document] |
+----------+--------------------------+-----------------+