Internet-Draft | Babel RTT Extension | June 2023 |
Jonglez & Chroboczek | Expires 24 December 2023 | [Page] |
This document defines an extension to the Babel routing protocol that uses symmetric delay in metric computation and therefore makes it possible to prefer lower latency links to higher latency ones.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 24 December 2023.¶
Copyright (c) 2023 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
The Babel routing protocol [RFC8966] does not mandate a specific algorithm for computing metrics; existing implementations use a packet-loss based metric on wireless links and a simple hop-count metric on all other types of links. While this strategy works reasonably well in many networks, it fails to select reasonable routes in some topologies involving tunnels or VPNs.¶
Consider for example the following topology, with three routers A, B and D located in Paris and a fourth router located in Tokyo, connected through tunnels in a diamond topology.¶
+------------+ | A (Paris) +---------------+ +------------+ \ / \ / \ / \ +------------+ +------------+ | B (Paris) | | C (Tokyo) | +------------+ +------------+ \ / \ / \ / +------------+ / | D (Paris) +---------------+ +------------+¶
When routing traffic from A to D, it is obviously preferable to use the local route through B, as this is likely to provide better service quality and lower monetary cost than the distant route through C. However, the existing implementations of Babel consider both routes as having the same metric, and will therefore route the traffic through C in roughly half the cases.¶
In this document, we specify an extension to the Babel routing protocol that enables precise measurement of the round-trip time (RTT) of a link, and allows its usage in metric computation. Since this causes a negative feedback loop, special care is needed to ensure that the resulting network is reasonably stable (Section 3).¶
We believe that this protocol may be useful in other situations than the one described above, such as when running Babel in a congested wireless mesh network or over a complex link layer that performs its own routing; the high granularity of the timestamps used (1ms) should make it possible to experiment with RTT-based metrics on this kind of link layers.¶
We assume that every Babel speaker maintains a local clock, that counts milliseconds from an arbitrary origin. We do not assume that clocks are synchronised: clocks local to distinct nodes need not share a common origin. The protocol will eventually recover if the clock is stepped, so clocks need not persist across node reboots.¶
Every Babel speaker maintains a Neighbour Table, described in Section 3.2.4 of [RFC8966]. This extesion extends every etry in the Neighbour Table with the following data:¶
Both values are initially undefined.¶
The RTT to a neighbour is estimated using an algorithm due to Mills [MILLS], originally developed for the HELLO routing protocol and later used in NTP [NTP].¶
A Babel speaker periodically sends Hello messages to its neighbours (Section 3.4.1 of [RFC8966]). Additionally, it ocasionally sends a set of IHU messages, at most one per neighbour (Section 3.4.2 of [RFC8966]).¶
In order to enable the computation of RTTs, a node A MUST include in every Hello that it sends a timestamp t1 (according to A's local clock). When a node B receives A's Hello equipped with a timestamp, it computes the time t1' at which the Hello was received (according to B's local clock). It then MUST record the value t1 in the Origin Timestamp field of the Neighbor Table entry corresponding to A, and the value t1' in the Receive Timestamp field of the Neighbour Table entry.¶
When B sends an IHU to A, it checks whether both timestamps are defined. If that is the case, then it MUST ensure that its IHU TLV is sent in a packet that also contains a timestamped Hello TLV (either a normally scheduled Hello, or an unscheduled Hello, see Section 3.4.1 of [RFC8966]). It MUST include in the IHU both the Origin Timestamp and the Receive Timestamp stored in the neighbor table.¶
This is illustrated in the followsing sequence diagram:¶
A B | | t1 + | |\ | | \ | | \ | Hello(t1) | \ | | \ | | \| | + t1' | | | | | | | + t2' | /| | / | | / | | / | Hello(t2') | / | IHU(t1, t1') |/ | t2 + | | | v v¶
Upon receiving B's packet, A MUST verify that it contains both a timestamped Hello and an IHU containing two timestamps. If that is the case, it computes the time t2 (according to its local clock) at which it was received. A then computes the value d = (t2 - t1) - (t2' - t1') (where all computations are done modulo 2^32), which is a measurement of the RTT between A and B.¶
This algorithm has a number of desirable properties. First, since there is no requirement that t1' and t2' be equal, the protocol remains asynchronous: the only change to Babel's message scheduling is the requirement that a packet containing an IHU also contains a Hello. Second, since it only ever computes differences of timestamps according to a single clock, it does not require synchronised clocks. Third, it requires very little additional state: a node only needs to store the two timestamps associated with the last hello received from each neighbour. Finally, since it only requires piggybacking one or two timestamps on each Hello and IHU packet, it makes efficient use of network resources.¶
In principle, this algorithm is inaccurate in the presence of clock drift (i.e. when A's and B's clocks are running at different frequencies). However, t2' - t1' is usually on the order of seconds, and significant clock drift is unlikely to happen at that time scale.¶
Timestamp values are a count of milliseconds stored as a 32-bit unsigned integer; thus, they wrap around every 50 days or so. What is more, a node may occasionally reboot and restart its clock at an arbitrary origin. For these reasons, very old timestamps or nonsensical timestamps MUST NOT be used to yield RTT samples.¶
We suggest the following algorithm to achieve that. When a node receives a packet containing a Hello and an IHU, it SHOULD compare the current local time t2 with the Origin Timestamp contained in the IHU; if the Origin Timestamp appears to be in the future, or if it is in the past by more than a time T (the value T=3 minutes is RECOMMENDED), then the timestamps are still recorded in the neighbour table, but SHOULD NOT be used for computation of an RTT sample.¶
Similary, the node compares the Hello's timestamp with the Receive Timestamp recorded in the Neighbour Table; if the Hello's timestamp appears to be older than the recorded timestamp, or if it appears to be more recent by an interval larger than the value T, then the packet SHOULD NOT be used for RTT computation.¶
The protocol described above yields a series of RTT samples. While these samples are fairly accurate, they are not directly usable as an input to the route selection procedure, for at least three reasons.¶
First of all, in the presence of bursty traffic, routers experience transient congestion, which causes occasional spikes in the measured RTT. Thus, the RTT signal is often noisy, and requires smoothing before it can be used for route selection.¶
Second, using the RTT signal for route selection gives rise to a negative feedback loop: when a route has a low RTT, it is deemed to be more desirable, which causes it to be used for more data traffic, which may lead to congestion, which in turn increases the RTT. Without some form of hysteresis, using RTT for route selection would lead to oscillations between parallel routes, which might lead to packet reordering and negatively affect upper-layer protocols (such as TCP).¶
Third, even in the absence of congestion, the RTT tends to exhibit some variation. If two parallel routes have their RTT oscillate around a common value, using the RTT as input to route selection will cause frequent routing oscillations, which, again, indicates the need for some form of hysteresis.¶
In this section, we describe an algorithm that integrates both smoothing and hysteresis and has been shown to behave well both in simulation and experimentally over the Internet [DELAY-BASED]. This algorithm is considered experimental, since we do not currently have a theoretical understanding of its behaviour, and in particular we do not know by how much it could be simplified without impairing its functionality.¶
The RTT samples provided by Mills algorithm are fairly accurate, but rather noisy: individual samples may be outliers and indicate a value much larger than the correct one. Thus, some smoothing needs to be applied first, in order to get rid of these outliers.¶
Our current implementation uses a simple exponential average, as described in [DELAY-BASED]. Other algorithms have also been considered, such as a moving average or a moving minimum.¶
The smoothed RTT value obtained in the previous step needs to be mapped to a link cost, suitable for input to the metric computation procedure (Section 3.5.2 of [RFC8966]). Obviously, the mapping should be monotonic (larger RTTs imply larger costs). In addition, in order to enhance stability (Section 3), the mapping should be bounded: above a certain RTT, all links are equally bad, and hence their costs no longer oscillate.¶
We currently use the following function for mapping RTTs to link costs, parameterised by three parameters rtt-min, rtt-max and max-rtt-penalty:¶
cost ^ | | | C + max-rtt-penalty | +--------------------------- | /. | / . | / . | / . | / . | / . | / . | / . | / . | / . C +------------+ . | . . | . . | . . | . . 0 +----------------------------------------------------> 0 rtt-min rtt-max RTT¶
For RTTs below rtt-min, the link cost is just the nominal cost of a single hop, C. Between rtt-min and rtt-max, the cost increases linearly; above rtt-max, the constant value max-rtt-penalty is added to the nominal cost.¶
Even after applying a bounded mapping from smoothed RTT to a cost value, the cost may fluctuate when a link's RTT is between rtt-min and rtt-max. This is effectively mitigated by using a robust hysteresis algorithm, such as the one described in Appendix A.3 of [RFC8966].¶
This protocol extension stores the data that it requires within sub-TLVs of Babel's Hello and IHU TLVs. As discussed in Appendix D of [RFC8966], implementations that do not understand this extension will silently ignore the sub-TLVs while parsing the rest of the TLVs that they contain. In effect, this extension supports building hybrid networks consisting of extended and unextended routers, and while such networks might suffer from sub-optimal routing, they will not suffer from blackholes or routing loops.¶
If a sub-TLV defined in this extension is longer than expected, the additional data is silently ignored. This provision is made in order to allow a future version of this protocol to extend the packet format with additional data, for example higher-precision timestamps or absolute timestamps.¶
This extension defines the Timestamp sub-TLV whose Type field has value 3. This sub-TLV can be contained within a Hello sub-TLV, in which case it carries a single timestamp, or within an IHU sub-TLV, in which case it carries two timestamps.¶
Timestamps are encoded as 32-bit unsigned integers, expressed in units of one microsecond, counting from an arbitrary origin. Timestamps wrap around every 4295 seconds, or slightly more than one hour.¶
When contained within a Hello TLV, the Timestamp sub-TLV has the following format:¶
0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Type = 3 | Length | Transmit timestamp | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+¶
Fields:¶
If the Length field is larger than the expected 4 octets, the sub-TLV MUST be processed normally and any extra data contained in this sub-TLV MUST be silently ignored.¶
When contained in an IHU TLV destined for node A, the Timestamp sub-TLV has the following format:¶
0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Type = 3 | Length | Origin timestamp | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | Receive timestamp | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+¶
Fields:¶
If the Length field is larger than the expected 8 octets, the sub-TLV MUST be processed normally and any extra data contained in this sub-TLV MUST be silently ignored.¶
IANA is instructed to add the following entry to the "Babel Sub-TLV Types" registry:¶
Type | Name | Reference |
---|---|---|
3 | Timestamp | (this document) |
This extension adds additional timestamping data to two of the TLVs sent by a Babel router. Since the timestamps use an arbitrary origin, they do not leak private data, such as a node's timezone. However, by broadcasting the value of a reasonably accurate local clock, they might make a node more susceptible to timing attacks.¶
The authors are indebted to Jean-Paul Smetz, who prompted the investigation that originally lead to this protocol. Toke Høyland-Jørgensen, Maria Matejka and Ondřej Zajiček provided helpful comments about a draft version of this document.¶