Internet-Draft | Avoid RPKI State in BGP | February 2024 |
Snijders, et al. | Expires 26 August 2024 | [Page] |
This document provides guidance to avoid carrying Resource Public Key Infrastructure (RPKI) derived Validation States in Transitive Border Gateway Protocol (BGP) Path Attributes. Annotating routes with attributes signaling validation state may flood needless BGP UPDATE messages through the global Internet routing system, when, for example, Route Origin Authorizations are issued, revoked, or RPKI-To-Router sessions are terminated.¶
Operators SHOULD ensure Validation States are not signalled in transitive BGP Path Attributes. Specifically, Operators SHOULD NOT group BGP routes by their Prefix Origin Validation state into distinct BGP Communities.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 26 August 2024.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
The Resource Public Key Infrastructure (RPKI) [RFC6480] allows for validating received routes, e.g., for their Route Origin Validation (ROV) state. Some operators and vendors suggest using distinct BGP Communities [RFC1997] [RFC8092] to annotate received routes with their validation state. The claim is that this practice is useful, as validation state can be signalled, e.g., to iBGP speakers, without requiring each iBGP speaker to perform its own route origin validation.¶
However, annotating a route with a transitive attribute means that a BGP UPDATE message has to be sent to each neighbor whenever that attribute changes. This means that when, for example, Route Origin Authorizations [RFC6482] are issued or revoked, or RPKI-To-Router [RFC8210] sessions are terminated, a BGP UPDATE message will be sent for a route that was previously annotated with a BGP Community. Furthermore, given that BGP Communities are a transitive attribute, this BGP UPDATE will have to propagate through the whole default-free zone (DFZ).¶
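The effect described above can be illustrated with a minimal sketch. It is a toy model rather than a BGP implementation: the `Route` class, the `annotate` and `needs_update` helpers, and the community values `64496:1000` and `64496:1001` are all hypothetical and chosen only for illustration. It shows that a change in ROV state alters the advertised attributes only when the state is encoded in a community.

```python
from dataclasses import dataclass

# Hypothetical community values, used only for this illustration.
COMMUNITY_VALID = "64496:1000"
COMMUNITY_NOT_FOUND = "64496:1001"


@dataclass(frozen=True)
class Route:
    """Toy model of the attributes advertised for one prefix."""
    prefix: str
    as_path: tuple
    communities: frozenset = frozenset()


def annotate(route, state, tag_with_community):
    """Return the route as it would be advertised for a given ROV state."""
    if not tag_with_community:
        # Validation state stays local; advertised attributes are untouched.
        return route
    community = COMMUNITY_VALID if state == "Valid" else COMMUNITY_NOT_FOUND
    return Route(route.prefix, route.as_path, route.communities | {community})


def needs_update(before, after):
    """A re-advertisement is needed when any advertised attribute changed."""
    return before != after


base = Route("192.0.2.0/24", (65537, 65538))

# A ROA is revoked: the ROV state moves from Valid to NotFound.
tagged_before = annotate(base, "Valid", tag_with_community=True)
tagged_after = annotate(base, "NotFound", tag_with_community=True)
local_before = annotate(base, "Valid", tag_with_community=False)
local_after = annotate(base, "NotFound", tag_with_community=False)

print(needs_update(tagged_before, tagged_after))  # True
print(needs_update(local_before, local_after))    # False
```

In this model, the tagged route has to be re-advertised although neither its prefix nor its AS path changed, while the untagged route does not.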
Hence, this document provides guidance to avoid carrying Resource Public Key Infrastructure (RPKI) [RFC6480] derived Validation States in Transitive Border Gateway Protocol (BGP) Path Attributes (Section 5 of [RFC4271]). Specifically, Operators SHOULD NOT group BGP routes by their Prefix Origin Validation state [RFC6811] into distinct BGP Communities [RFC1997] [RFC8092]. Not using BGP Communities to signal RPKI validation state prevents needless BGP UPDATE messages from being flooded through the global Internet routing system.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This document discusses signaling of RPKI validation state to BGP neighbors using transitive BGP attributes. At the time of writing, this pertains to the use of BGP Communities [RFC1997] [RFC8092] to signal RPKI ROV using ROAs. Note that this includes all operator-specific BGP Communities used to signal validation state, as well as any current or future documented well-known BGP Communities marking validation state, as described, e.g., for extended BGP Communities in [RFC8097].¶
However, beyond that, this document also applies to all current and future transitive BGP attributes that may be implicitly or explicitly used to signal validation state to neighbors. Similarly, it applies to all future validation mechanisms of the RPKI, e.g., ASPA [I-D.ietf-sidrops-aspa-profile] and any other future validation mechanism built upon the RPKI.¶
This section outlines the risks of signaling RPKI Validation State using BGP Communities. While the current description is specific to BGP Communities, the observations hold similarly for all transitive attributes that may be added to a route. Furthermore, we present data on the measured current impact of BGP Communities being used to signal RPKI Validation State.¶
Here, we describe examples of how a large number of RPKI ROV changes may occur in a short time, leading to a large number of BGP UPDATE messages being sent.¶
Large-scale ROA issuance should be a comparatively rare event for individual networks. However, several cases exist where issuance by individual operators, or (malicious) coordinated issuance of ROAs by multiple operators, may lead to high churn, triggering a continuous flow of BGP UPDATE messages caused by operators using transitive BGP attributes to signal RPKI validation state.¶
Specifically:¶
Large-scale ROA revocation should be a comparatively rare event for individual networks. However, several cases exist where revocations by individual operators, or (malicious) coordinated revocation of ROAs by multiple operators, may lead to high churn, triggering a continuous flow of BGP UPDATE messages caused by operators using transitive BGP attributes to signal RPKI validation state.¶
Specifically:¶
Similar to the issuance/revocation of routes, the validation pipeline of an operator may encounter issues. Issues may occur on the router side or on the validator side, with network connectivity issues having specific impact on either of the two.¶
While, in general, implementations should not have bugs, operators should not make mistakes, and the network should be reliable, this is usually not the case in practice. Instead, the worst case of a sudden and unexpected, yet unintentional, loss of validation state is an event that, however unlikely in a specific system, may and will happen. Hence, systems should be resilient in case of unexpected issues, and not further amplify issues by creating a BGP UPDATE storm.¶
Below, we provide examples of events for both categories that may lead to the validation state of routes in one or multiple routers of an operator changing from Valid to NotFound. This list serves illustrative purposes and does not claim completeness.¶
The following events may impact a validator's ability to provide validation information to routers:¶
The above non-exhaustive listing suggests that issues in general operations, CA operations, and RPKI cache implementations are simply unavoidable. Hence, Operators MUST plan and design accordingly.¶
For each change in the validation state of a route, an Autonomous System whose local routing policy sets a BGP Community based on the ROV-Valid validation state would need to send BGP UPDATE messages for roughly half the global Internet routing table if the validation state changes to ROV-NotFound. The reverse case holds for every new ROA created by an address space holder: a new BGP UPDATE would be generated, as the validation state would change to ROV-Valid.¶
Furthermore, adding attributes to routes increases their size and memory consumption in the RIB of BGP routers. Given the continuous growth of the global routing table, operators should, in general, be conservative regarding the additional information they add to routes.¶
The aforementioned scaling issues are not confined to singular UPDATE events. Instead, changes in validation state may lead to floods and/or cascades of BGP UPDATES throughout the Internet.¶
Flooding events are caused by an individual operator losing validation state. If that operator annotates validation state using BGP communities, the operator will send updates for all routes that changed from Valid to NotFound to its downstreams, as well as updates for routes received from downstreams to its upstreams.¶
Given that roughly half of the global Internet routing table, which holds close to 1,000,000 prefixes [CIDR_Report], is nowadays covered by RPKI ROAs [NIST], the convergence events following an RPKI service-affecting outage (Section 3.1) represent a significant burden. See [How-to-break] for an elaboration on this phenomenon.¶
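The order of magnitude involved can be made concrete with a rough estimate. The sketch below uses the figures given above (about 1,000,000 prefixes, roughly half covered by ROAs). The function name and the neighbor counts are assumptions made purely for illustration. It counts re-advertised prefixes rather than BGP UPDATE messages, since one message can carry several prefixes that share identical path attributes.

```python
TABLE_SIZE = 1_000_000   # approximate prefix count, from the text above
ROA_COVERED = 0.5        # roughly half of the table, from the text above


def readvertised_prefixes(table_size, covered_fraction, neighbors):
    """Estimate prefix re-advertisements when one AS loses validation state.

    Assumes every ROA-covered route changes from Valid to NotFound and is
    re-advertised once to each neighbor (an illustrative upper bound).
    """
    changed_routes = round(table_size * covered_fraction)
    return changed_routes * neighbors


# Neighbor counts are hypothetical examples.
for neighbors in (1, 10, 100):
    print(neighbors, readvertised_prefixes(TABLE_SIZE, ROA_COVERED, neighbors))
# 1 500000
# 10 5000000
# 100 50000000
```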
For events that are not specific to one operator, e.g., a malicious withdrawal of a ROA, loss of a major CA, or an unexpected downtime of a major centralized RTR service, events can also cascade for ASes annotating validation state using BGP Communities. Given that routers' view of the RPKI with RTR is only eventually consistent, update messages may cascade, i.e., one event affecting validation state may actually trigger multiple subsequent BGP UPDATE floods.¶
Assume, for example, that AS65536 is a downstream of AS65537 (both annotating validation state with BGP Communities and using a 300 second RTR cycle), and a centralized RTR service fails. In the example, AS65536 has their routers updated from that cache a second before the service went down, while AS65537 was due for a refresh a second thereafter.¶
This means that a second after the RTR service went down, AS65537 will trigger a BGP UPDATE flood down its cone. AS65536 will ingest and propagate these BGP UPDATES down its own cone as well.¶
When, roughly 300 seconds later, AS65536 fails to retrieve validation state as well, the community set by AS65536 will again change for ROA-covered routes, and it will again trigger a BGP UPDATE flood and propagate this down its cone.¶
Even if either or both of AS65536 and AS65537 use a cache after RTR expiry, the underlying issue would not change, assuming the RTR service downtime spans beyond the cache TTL. Assuming a 30-minute cache TTL, both ASes using a cache would only move the cascading event 30 minutes later. If only one of the two uses a cache, the two flood events get moved further apart. However, the overall issue of two independent floods due to one event remains.¶
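The timing of this example can be reproduced with a small sketch. It is a deliberately simplified model, not an implementation of the RTR timers: it assumes that an AS loses validation state at its first failed refresh, that a cache merely delays this loss by its TTL, and that the outage outlasts every cache. The helper names `flood_time` and `scenario` are hypothetical.

```python
RTR_CYCLE = 300        # seconds, from the example above
CACHE_TTL = 30 * 60    # seconds, from the example above
OUTAGE_START = 0       # the RTR service fails at t = 0 and stays down

# Last successful refresh of each AS, relative to the start of the outage.
LAST_REFRESH = {"AS65537": -299, "AS65536": -1}


def flood_time(last_refresh, cache_ttl=0):
    """Time at which an AS loses validation state and floods UPDATEs."""
    next_refresh = last_refresh + RTR_CYCLE
    # Skip refreshes that still succeeded before the outage began.
    while next_refresh < OUTAGE_START:
        next_refresh += RTR_CYCLE
    return next_refresh + cache_ttl


def scenario(cached_ases):
    """Flood time per AS, given the set of ASes that use a cache."""
    return {
        asn: flood_time(last, CACHE_TTL if asn in cached_ases else 0)
        for asn, last in LAST_REFRESH.items()
    }


for label, cached in [
    ("no cache", set()),
    ("both cache", {"AS65536", "AS65537"}),
    ("only AS65536 caches", {"AS65536"}),
]:
    times = scenario(cached)
    gap = abs(times["AS65536"] - times["AS65537"])
    print(label, times, "gap:", gap)
```

Under these assumptions, the two floods occur at 1 and 299 seconds without caches and at 1801 and 2099 seconds when both ASes cache. When only AS65536 caches, they occur at 1 and 2099 seconds. In every case a single outage produces two separate floods.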
In February 2024, a data-gathering initiative [Side-Effect] reported that between 8% and 10% of BGP updates seen on the Routing Information Service (RIS) contained well-known communities from large ISPs signaling either ROV-NotFound or ROV-Valid BGP Validation states. The study also demonstrated that the creation or removal of a ROA object triggered a chain of updates over a period of circa 1 hour following the change.¶
Such a high percentage of unneeded BGP updates constitutes a considerable level of noise, impacting the capacity of the global routing system while generating load on router CPUs and occupying more RAM than necessary. Keeping this information within a single autonomous system would help reduce the burden on the rest of the global routing platform, reducing workload and noise.¶
RTR has been developed to communicate validation information to routers. BGP Attributes are not signed, and provide no assurance against third parties adding them, apart from BGP Communities, ideally, being filtered at a network's edge. So, even in iBGP scenarios, their benefit in comparison to using RTR on all BGP speakers is limited.¶
For eBGP, given that they are not signed, they provide even less information to other parties, apart from introspection into an AS's internal validation mechanics. Crucially, they provide no actionable information for BGP neighbors. If an AS validates and enforces based on RPKI, Invalid routes should never be imported and, hence, never be sent to neighbors. Hence, the argument that adding validation state to communities enables, e.g., downstreams to filter RPKI Invalid routes is moot, as the only routes a downstream should see are NotFound and Valid. Furthermore, in any case, operators SHOULD run their own validation infrastructure and not rely on centralized services or attributes communicated by their neighbors. Everything else circumvents the purpose of RPKI.¶
As outlined in Section 3, signaling validation state with transitive attributes carries significant risks for the stability of the global routing ecosystem. Not signaling validation state, hence, has tangible benefits, specifically:¶
Hence, operators SHOULD NOT signal RPKI validation state using transitive BGP attributes.¶
The use of transitive attributes to signal RPKI validation state may enable attackers to cause notable route churn by issuing and withdrawing, e.g., ROAs for their prefixes. DFZ routers may not be equipped to handle churn in all directions at global scale, especially if said churn cascades or repeats periodically.¶
To prevent this, operators SHOULD NOT signal validation state to neighbors. Furthermore, validation state signaling SHOULD NOT be accepted from a neighbor AS. Instead, the validation state of a received announcement has only local scope due to issues such as scope of trust and RPKI synchrony.¶
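As a minimal sketch of this guidance, an import policy might look as follows. It is not vendor configuration, and it assumes that the operator enforces ROV by rejecting Invalid routes. The community values and the `import_route` helper are hypothetical. The policy acts only on the locally computed validation state, discards any validation-state communities received from a neighbor, and attaches none of its own.

```python
# Hypothetical communities a neighbor might use to signal validation state.
NEIGHBOR_ROV_COMMUNITIES = {"64496:1000", "64496:1001", "64496:1002"}


def import_route(communities, local_rov_state):
    """Return the communities to keep, or None if the route is rejected.

    local_rov_state is the result of the operator's own validation
    ("Valid", "NotFound", or "Invalid"); it is never taken from a neighbor.
    """
    if local_rov_state == "Invalid":
        # Enforce locally: Invalid routes are not imported at all.
        return None
    # Ignore validation state signalled by the neighbor and add no new tag,
    # so a later change in ROV state does not alter advertised attributes.
    return set(communities) - NEIGHBOR_ROV_COMMUNITIES


received = {"64496:1000", "64496:42"}
print(import_route(received, "Valid"))     # {'64496:42'}
print(import_route(received, "NotFound"))  # {'64496:42'}
print(import_route(received, "Invalid"))   # None
```

Because the result is identical for Valid and NotFound, a route whose state changes between the two requires no re-advertisement.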
None.¶
The authors would like to thank Aaron Groom and Wouter Prins for their helpful review of this document.¶