Network Working Group                                         R. Whittle
Internet-Draft                                          First Principles
Intended status: Experimental                          January 13, 2010
Expires: July 17, 2010


Ivip Mapping Database Fast Push
draft-whittle-ivip-db-fast-push-02.txt

Abstract

From the base of draft-whittle-ivip-arch-03 and later, this ID describes in greater detail Ivip's fast-push mapping distribution system. This accepts mapping changes from end-user networks or organizations they authorise to make these changes. The mapping changes are handled by RUAS (Root Update Authorisation System) companies who collectively run a small set of Launch servers and a global network of Replicator servers. Each second, the Launch servers send sets of packets with mapping updates to a larger number of Level 1 Replicators, each of which gets at least two feeds of these mapping updates from different Launch servers. Each Level 1 Replicator fans out the mapping changes to multiple Level 2 Replicators, which also receive at least two feeds from upstream Level 1 Replicators. In this way, within a fraction of a second, the mapping changes are fanned out securely and reliably to full database query servers (QSDs) in ISPs and some end-user networks all over the Net. Additionally, QSDs can download missing packets and snapshots of segments of the mapping database. A rough worst-case guess (WAG) of 4 billion mapping changes a year gives a raw data rate for IPv6 mapping changes of only 32kbps. TTR mobility only involves mapping changes if the MN moves a large distance, such as 1000km. Multihoming service restoration updates would be infrequent. Mapping changes for TE could be numerous, depending on cost. It is hard to imagine a scenario where mapping changes would present significant difficulties in terms of bandwidth or in terms of the capacity of QSDs to handle them.

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on July 17, 2010.

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the BSD License.



Table of Contents

1.  Introduction
    1.1.  Outline of the RUAS, Launch and Replicator systems
    1.2.  Assumptions
    1.3.  It may not be so daunting...
2.  Goals, Non-Goals and Challenges
    2.1.  Goals
    2.2.  Non-goals
    2.3.  Challenges
3.  Definition of Terms
    3.1.  SPI - Scalable PI space
        3.1.1.  Conventional global unicast address space
    3.2.  MAB - Mapped Address Block
    3.3.  UAB - User Address Block
    3.4.  Micronet
    3.5.  RUAS - Root Update Authorisation System
    3.6.  UAS - Update Authorisation System
    3.7.  UMUC - User Mapping Update Command
    3.8.  SUMUC - Signed User Mapping Update Command
    3.9.  MABUS - Update Stream specific to one MAB
    3.10.  Launch server
    3.11.  Replicator
    3.12.  QSD - Query Server with full Database
    3.13.  QSC - Query Server with Cache
4.  Update Authorities and User Interfaces
    4.1.  RUAS Outputs
        4.1.1.  Updates every second
        4.1.2.  MAB snapshots
        4.1.3.  Missing packet servers
    4.2.  Authentication of RUAS-generated data
        4.2.1.  Snapshot and missing packet files
        4.2.2.  Mapping updates
    4.3.  RUAS - UAS interconnection
5.  The Launch system
    5.1.  Phase 1 - collecting updates from RUASes
    5.2.  Phase 2 - checksum comparison
    5.3.  Phase 3 - identical update streams
6.  Replicators
    6.1.  Scaling limits
    6.2.  Managing Replicators
7.  Security Considerations
8.  IANA Considerations
9.  Informative References
    Author's Address





1.  Introduction

The aim of this I-D is to establish that Ivip's fast-push mapping distribution system (FMS) is practical and desirable for very large numbers of micronets (EIDs in LISP terminology) and very high rates of change of the mapping database. Please refer to [I-D.whittle-ivip-arch] for an explanation of Ivip in general. A glossary of Ivip and some general scalable routing terms and acronyms is provided in [I-D.whittle-ivip-glossary].

This is a revision of the 00 and 01 versions, with the only substantial change being a much lower estimate of the worst-case number of updates, with a correspondingly lower worst-case required bandwidth.

The most unusual and demanding part of Ivip's fast-push system is the network of "Replicator" servers which fan the mapping updates out to potentially hundreds of thousands of full database query servers (QSDs) at ISP and end-user network sites all over the world.




1.1.  Outline of the RUAS, Launch and Replicator systems

The largest part of the FMS consists of thousands (perhaps several hundred thousand in the long term future) of essentially identical "Replicator" servers. There may be other, better, approaches, but this ID describes the current design. This cross-linked, tree-like structure of Replicators in some ways resembles a tree of multicast routers. However, each Replicator receives at least two streams of identical mapping data, so it is much less likely to miss a packet from this stream than if it only received the packets from a single source.

The first level of Replicators (level 1) is driven by a small set of Launch servers, which are geographically and topologically diverse, but which work as a team to reliably send the mapping update packets to the Level 1 Replicators. The Launch servers gain this information, second-by-second, from a small number (ten to a few dozen at most) of RUASes (Root Update Authorisation Systems), each belonging to a different RUAS company.

At the first level, each Replicator receives two identical streams, over separate authenticated and encrypted links, from two different Launch servers in different geographical locations, over different physical long-distance links. The Launch system and perhaps the first level (1) of Replicators will probably be implemented with private network links, rather than relying on open Internet addresses which are subject to flooding attacks.

If a packet goes missing from one stream, it will probably be present in the second. As the packets arrive, the Replicator takes the first one from either stream and sends its contents out simultaneously on a larger number of similar links to the next level of Replicators. Consequently, the delay time for update information passing through a Replicator will be no more than a few to ten milliseconds, and is comparable to the delays imposed by a packet traversing a router.

In this way, each Replicator consumes two identical streams from geographically and topologically different sources, and fans the content of the streams out to some larger number of Replicators or QSDs at the next level. This number of output streams per Replicator may be in the tens to one hundred range, depending on the volume of updates. Initially, it would be quite high, when update rates are low - meaning that the initial global Replicator network could serve the growing number of QSDs with few levels of Replicators, and with each one fanning out updates to a large number of Replicators at the next level.
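
As a rough illustration (the specific numbers below are assumptions, not part of the design), the following Python sketch shows how few levels of Replicators are needed to reach a large population of QSDs when each level has a high amplification factor:

   # Illustrative arithmetic only: if each Replicator drives 20 output
   # streams and every downstream device takes 2 redundant feeds, each
   # level multiplies the number of devices reached by 20 / 2 = 10.

   def levels_needed(num_qsds, level1_replicators=10, outputs=20, feeds=2):
       amplification = outputs // feeds
       reached = level1_replicators
       levels = 1
       while reached < num_qsds:
           reached *= amplification
           levels += 1
       return levels, reached

   # levels_needed(200000) -> (6, 1000000): with these example figures,
   # six levels of Replicators comfortably cover 200,000 QSDs.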

After some number of levels of replication, determined by local conditions, the streams deliver the update information to a QSD. Ideally, each QSD will receive two streams from two geographically dispersed Replicators. These need not be at the same level, so the system is relatively flexible, and each Replicator will generally be sending a complete stream of packets.

The Launch system generates the stream as a variable number of packets on a regular schedule, such as every second. Data within each packet enables QSDs to authenticate the mapping information, and to request from remote servers any packets which did not arrive.

Snapshots of segments of the mapping database are taken regularly by each RUAS. Each snapshot contains a complete copy of the mapping of one MAB (Mapped Address Block) at a particular instant. At that point in time, a hash of the mapping data for this MAB is generated and within a few seconds is sent to all QSDs. This enables each QSD to verify that its copy of the mapping for this MAB is fully up-to-date.

During initialisation, and if an error is found in the local copy of the mapping for a particular MAB, the QSD downloads snapshots from HTTP servers provided by the RUAS companies. The QSD buffers all updates for the MAB which arrive after the snapshot and hash message. Once the snapshot is downloaded and unpacked into the QSD's copy of the mapping database, the buffered updates are applied and the database then contains an up-to-date copy of the mapping for this MAB. Updates are then applied as they arrive from the two or more upstream Replicators.
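
A minimal sketch of this initialisation sequence follows, in Python-style pseudocode; the replicator_feed and snapshot_server objects and their methods are hypothetical, not a defined API:

   # Sketch only: bringing one MAB's mapping up to date in a QSD.

   def initialise_mab(mab, replicator_feed, snapshot_server):
       buffered = []

       # Monitor the update stream until a snapshot announcement for
       # this MAB appears, then buffer every subsequent update.
       announcement = replicator_feed.wait_for_snapshot_announcement(mab)
       replicator_feed.subscribe(mab, buffered.append)

       # Download and unpack the announced snapshot (e.g. via HTTP).
       mapping = snapshot_server.download(mab, announcement.snapshot_time)

       # Apply the buffered updates in order; this MAB's part of the
       # mapping database is now up to date.
       for update in buffered:
           mapping.apply(update)

       # From here on, updates from the two or more upstream
       # Replicators are applied as they arrive.
       replicator_feed.subscribe(mab, mapping.apply)
       return mapping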




1.2.  Assumptions

For the purposes of this discussion, it is assumed there will be a single global Ivip system, with multiple organisations being responsible for the management of the various blocks of address space which are managed with Ivip.

It would also be possible for an organisation to establish an Ivip-like system, without reference to any IETF RFCs, and to conduct a business renting out address space in small, flexible, chunks, with portability and multihoming via any ISP who provides the requisite, relatively simple, ETRs. The most likely such scenario is one or more independent Ivip-like systems operated by different companies, primarily for supporting TTR mobility [TTR Mobility], but also usable for portability, multihoming and inbound Traffic Engineering for non-mobile end-user networks.

For simplicity, this ID assumes that Ivip development will be coordinated into a single global system, as DNS is, following appropriate IETF engineering work and administrative decisions in RIRs and other relevant organisations. A development timeframe of 2010 to ca. 2014 is assumed, with widespread deployment being achieved later in the decade, for IPv4 at least.

The IPv4 FMS is identical in principle to the IPv6 FMS. The server software which implements the Replicators will probably remain as two separate items, but a single server could run them both, independently, and so be both an IPv4 and IPv6 Replicator. Each RUAS would have both IPv4 and IPv6 sections, with separate outputs of mapping data. The Launch servers for IPv4 would be physically different and independent of those for IPv6.

In addition to the global fast push database update distribution system discussed in this ID, Ivip also involves Query Servers sending "notifications" to ITRs which recently requested mapping for a micronet whose mapping has just changed. This is a second form of push - on a local scale - and is outlined in [I-D.whittle-ivip-arch].

This ID concentrates on IPv4, since the future map-encap scheme is more urgently required for IPv4 than for IPv6. In principle, the same arrangements will apply for IPv6, with a different and more verbose data format than the 12 or so bytes required for each IPv4 mapping update. It may make sense to defer finalisation of any future IPv6 map-encap scheme until substantial operational experience has been gained with the IPv4 scheme.




1.3.  It may not be so daunting...

Ivip documentation is written with a preference for detailed discussion over terseness. So Ivip IDs may appear rather daunting at first. Hopefully these IDs will be clearly understandable, and the reader will recognise that this scalable routing solution is a momentous development, requiring detailed consideration. Ivip goes beyond the formal RRG requirements of providing portability (the only way of allowing free choice of alternative ISPs), multihoming and inbound traffic engineering, by also providing, with TTR mobility, a global mobility system for both IPv4 and IPv6. While no mapping changes are required unless the Mobile Node moves a large distance, such as 1000km or more, it is important that the Ivip FMS be able to scale to very large numbers of updates and cope with mapping databases for up to 10^10 micronets.

This ID focuses on handling billions of micronets and potentially thousands or tens of thousands of updates a second. These data-rates may sound high today, but domestic customers are already downloading full quality video in real-time. By the time such large levels of adoption arise, the bandwidth needed for these will not be a significant obstacle.

Also, during initial deployment, the demands on the fast push system will be far lighter than those anticipated below, so the system might initially be somewhat simpler. In the initial stages of introduction, there may be little need to deploy dedicated servers for the "Replicator" functions, since the volume of updates may be so light as to make it practical to run this software on existing servers, such as nameservers.

Furthermore, in the early years of introduction, when there are hundreds of thousands or a few million micronets, the low level of update packets (compared to the highest imaginable levels contemplated below) should enable each Replicator to fan out to many more next-level Replicators than would be possible when hundreds of millions or billions of micronets are handled by the system. This would mean fewer levels of Replicators, fewer Replicators and generally faster delivery of the mapping information than would be possible with current technology if the system was handling billions of micronets.

So this ID explores how the FMS would be structured in the most demanding future scenarios which can be realistically expected. Building the initial FMS for trials and early services won't be as daunting as it may look from the diagrams and discussions below.




2.  Goals, Non-Goals and Challenges




2.1.  Goals

The overall goal of the fast push system is to enable end-users, who manage the mapping of their one or more micronets of address space, to securely, reliably and easily communicate their mapping change command to some organisation with which they have a business relationship, so that that change will be propagated to every QSD as soon as possible.

"As soon as possible" means typical delay times of a few seconds, ideally zero seconds, but in practice probably four to five seconds. (Most of this delay is in the RUAS and Launch systems, which could be optimised in the future to process the updates much faster than this, without affecting the much larger Replicator system.

"Reliably" means that in the great majority of cases, the QSDs receive every mapping change as expected and that in the relatively rare event of this being impossible due to packet loss, that the QSD can recover from this situation within one or at the most two seconds by requesting a copy of the packet from a remote HTTP server provided by the RUAS company whose mapping update packet was lost.

Reliability also involves robustness against DoS attacks. This can never be completely protected against for any device on the open Internet, since its link(s) can easily be flooded by packets sent from botnets etc. A workaround for DoS attacks would be to run the first few levels of Replicators via global private network links. These levels would be owned and operated by the RUAS companies working together. This would enable reliable feeds to hundreds or perhaps a thousand or so Replicators all over the Net, which would mean that a DoS attack against a small number of Replicators could only affect a smaller portion of the total system.

"Securely" means that each QSD which receives the updates will be able to instantly verify that the updates are genuine, rather than the result of an attacker who might, for instance, send forged packets to that device or to some other part of the fast push system. The data format for the mapping update packets is TBD. It is possible that each packet's contents could be signed by the RUAS which originated it. In the present design, the use of DTLS RFC 4347 links between each Launch Servers, Replicators and QSDs is assumed to provide sufficient security. The data format needs to provide for open-ended extensions in the future and to support authentication at the time of reception.

The mapping change command, as sent by the end-user, or by some other organisation or device which has the end-user's credentials, would involve the length of the micronet being checked to ensure it is the same as the currently configured length of the micronet which starts at that location. The end-user's command might be part of an encrypted exchange involving a challenge-response protocol and the end-user's private key. Alternatively, an encrypted link could be used, such as via HTTPS, and a conventional username and password given as part of the command.

The end-user would previously have communicated directly or indirectly with their RUAS to configure their total assigned address space into one or more micronets. This ID concentrates on the changes of ETR address for existing micronets, but the mapping change packets will also contain information about how existing micronets have been deleted and replaced by other micronets, smaller or larger and with different start and end-points.

RUASes and the multiple servers of the Launch system are few in number and will be administered carefully, so this ID does not consider automated aids to their management and debugging. However, the Replicators will be numerous and operated by a wide range of organisations. Future work will concern maximising the degree to which the Replicator system can be robustly and easily managed, rather than requiring a great deal of manual configuration etc.

In order to debug the way the Ivip system is used, such as transient erroneous or malicious mapping updates which cause packets to be tunnelled to addresses where they are not welcome, there will need to be a system which monitors all mapping changes and keeps a lasting record of them. Then, aggrieved parties can search such a system for the address on which they received the unwanted packets, and so determine the micronet involved. This will enable the aggrieved party to complain to the RUAS which is responsible for that micronet. This "mapping history" function could be performed by one or multiple separate systems, each simply taking a feed from the Replicator system.




2.2.  Non-goals

Apart from checking the ETR address against any specific exclusion lists (such as specific prefixes, private RFC 1918 and multicast space) and to ensure it is not part of a Mapped Address Block (MAB - a BGP advertised prefix containing micronets), the entire Ivip system takes no interest in whether there is a device at that address, whether the address is advertised in BGP, whether there is or was an ETR at that address, whether the ETR is reachable or whether the ETR can deliver packets to the micronet's destination device.

These are all matters which fall under the responsibility of the end-user network whose micronet this ETR address is for.

It is not a goal of the system to keep mapping changes secret from any party. This would be impossible. Therefore, it cannot be a goal of this or probably any core-edge elimination scheme that in a mobile setting, the movement of an individual's device could not be inferred by anyone who monitors the mapping updates. However, the mapping only concerns the currently active TTR. MNs can still use a TTR no matter where they are physically connected, and using a TTR hundreds or even thousands of km distant will probably present no serious difficulties due to path-length or lost packets. So mapping changes need not indicate much, or anything, about the physical location of the MN.

Replicators perform a best-effort copying of mapping update packets. They do not store these packets for any appreciable time or attempt to request a packet in the sequence which is missing from their two or more input streams.




2.3.  Challenges

There are obvious challenges in building a global network which is distributed, to avoid any single point of failure, whilst also being highly reliable, coordinated and secure. For this network to propagate information from one of many input points to a very large number (potentially millions) of endpoints, with very low levels of loss, is a further challenge on the open Internet.

The Launch system and the level 1 and perhaps level 2 Replicators could operate over private network links. However, the final levels of the Replicator system - those which drive the QSDs - need to operate on the open Internet, as do the end-users' methods of interaction with the RUASes, directly or indirectly.

The closest existing technology to what is required may be Reliable Multicast, but this is optimised for long block lengths. This technology should be considered in greater depth as an alternative to what is proposed here, but the rest of this ID is based on the assumption that novel techniques are required.




3.  Definition of Terms




3.1.  SPI - Scalable PI space

Once Ivip is operational, a growing subset of the global unicast addresses will be handled by ITRs tunnelling the packets to an ETR, which delivers the packets to the destination. This subset is used by end-user networks and provides portability, multihoming and inbound traffic engineering in a manner which is highly scalable - that is, which does not overly burden DFZ routers.

SPI space is "mapped" by Ivip, and this mapping system can divide it into smaller sections than is possible with BGP in the DFZ, where the granularity is limited to 256 IPv4 addresses (a /24) by a widely enforced convention on the lengths of routes which are accepted.

The granularity with which Ivip maps SPI space - dividing it into micronets (described below) is single IP addresses for IPv4, and /64 prefixes for IPv6.




3.1.1.  Conventional global unicast address space

This is global unicast address space as it is used today. With Ivip, this will be a subset of the full unicast space - the part which is not used for SPI space. The LISP term for this is "RLOC" space.




3.2.  MAB - Mapped Address Block

A MAB is a BGP advertised prefix which is used as SPI space. DITRs (Default ITRs in the DFZ) all over the Net advertise this prefix, tunnelling the packets to ETRs according to the current mapping for the destination address of each packet.

A MAB could, in principle, be as large as a /8. Larger MABs are preferred in general, because each one burdens the BGP system with only a single advertisement, but includes the SPI space of potentially hundreds of thousands of end-user networks. However, for reasons discussed below - including load sharing between ITRs and ease of initially loading snapshots of the mapping database - it may be best if MABs are more typically in the /12 to /17 range for IPv4.




3.3.  UAB - User Address Block

Each MAB typically contains address space which has been assigned by some means to many (perhaps tens of thousands) separate end-users. A UAB is a contiguous range of addresses within a MAB which is assigned to one end-user. UABs are important divisions for the RUAS company, but UABs are not specifically mentioned or needed in the mapping update packets handled by Launch servers and Replicators. Nor are UABs relevant to the operation of QSDs, QSCs (caching query servers), ITRs or ETRs.

A MAB could be assigned entirely to one end-user - as might be the case if the end-user converted a prefix of theirs which was previously conventional PI space to be managed as SPI space by the Ivip system. Generally speaking, MABs are ideally large (short prefixes) and each contains space for multiple end-users. Generally, MABs are owned or at least administered by MAB companies, who rent SPI space to end-user networks.

An end-user might have multiple UABs in a MAB, UABs in multiple MABs from the same company or UABs in MABs from multiple MAB companies. For simplicity, this ID assumes each end-user has a single UAB. UABs are specified by starting address and length, in the units mentioned above: IPv4 addresses or IPv6 /64s. While a MAB is always on power-of-two boundaries of these units, since it is a prefix advertised in the DFZ, UABs and micronets have arbitrary starting points and lengths - they are not at all constrained by binary "prefix" boundaries.




3.4.  Micronet

Following Bill Herrin's suggestion, the term "micronet" refers to a range of SPI space for which all addresses have the same mapping. In LISP, these are known as EID prefixes. In Ivip, a micronet need not be on binary boundaries - it is specified by a starting address and a length, in units of single IPv4 addresses or IPv6 /64 prefixes.

An end-user could use their entire UAB as a single micronet, or they could split it into as many micronets as they wish, and change these divisions dynamically.

Any micronet which is mapped to zero (its ETR address is 0.0.0.0 in IPv4) will cause ITRs to drop any packets addressed to this micronet. A micronet can be defined within the whole or part of a contiguous range of address space which is currently mapped to zero, by the fast push mapping distribution system carrying an update message specifying the new micronet's starting address, its length, and a non-zero address for its mapping. (Future work: decide exactly what instructions are needed and which sequences of operations are allowable for making new micronets in place of existing ones.)
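
As an illustration of this rule, the following Python sketch (using a hypothetical, simplified data model) checks that a proposed new micronet lies entirely within space currently mapped to zero:

   # Sketch only, with an assumed data model: a MAB's mapping is a list
   # of (start, length, etr) entries, where etr == 0 means packets to
   # that range are dropped.  Units are IPv4 addresses or IPv6 /64s.

   def overlaps_nonzero(mapping, start, length):
       """True if [start, start+length) overlaps a non-zero-mapped range."""
       end = start + length
       for (m_start, m_length, m_etr) in mapping:
           if m_etr != 0 and start < m_start + m_length and m_start < end:
               return True
       return False

   def create_micronet(mapping, start, length, etr):
       # A new micronet needs a non-zero ETR address and must sit wholly
       # within zero-mapped space.  (Checking that it also lies inside
       # the end-user's UAB is omitted here.)
       if etr == 0 or overlaps_nonzero(mapping, start, length):
           raise ValueError("new micronet must lie in zero-mapped space")
       mapping.append((start, length, etr))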




3.5.  RUAS - Root Update Authorisation System

Multiple RUASes collectively generate the total stream of mapping update messages. Each RUAS is responsible for one or more MABs. There may be a dozen to a few dozen RUASes. (More RUAS companies would be good for competition and innovation, but would create some difficulties for the Launch servers, which must reach agreement on which updates to send from all these RUASes.) Each RUAS either receives mapping updates directly from end-user networks (or their appointed Multihoming Monitoring companies) - or may receive these indirectly via intermediate organisations, each of which runs a UAS.




3.6.  UAS - Update Authorisation System

A UAS is the system of an organisation which accepts mapping change commands from end-users, and conveys them directly - or perhaps indirectly via another UAS - to the RUAS which handles the relevant MAB. An RUAS which accepts mapping update commands from end-users does so via its own UAS system.

A UAS accepts upstream input from end-users and/or other UASes. It generates output to downstream RUASes and/or other UASes. One UAS may have relationships with multiple RUASes. A MAB may be assigned to an RUAS and control of parts of this may be delegated to multiple UASes. A single UAS may work only with a single RUAS, or with multiple and perhaps all RUASes.

Whether the MAB itself is administratively assigned (by an RIR, or some national Internet Registry) to the UAS or to the RUAS is not important in a technical sense. End-users will choose their address space carefully, according to which RUAS (and any UASes) it depends upon, because the reliability of this MAB's address space will forever be dependent on these organisations.

The number of RUASes will be limited to enable them to efficiently and reliably work together with their jointly operated system of Launch servers to create a single stream of updates for the entire Ivip system. The ability of companies with UASes to act as agents for RUAS companies and/or to have their own MABs which they contract a RUAS to handle the mapping for, will enable a large number of organisations to compete in the rental of SPI space.




3.7.  UMUC - User Mapping Update Command

A UMUC is whatever action the end-user performs on one or more different user-interfaces of whatever UAS they use to change the mapping of their one or more micronets. The system would also be able to tell the user the current mapping and also confirm that a requested change to the mapping was acceptable. In other words, the system lets end-user networks (and/or whichever Multihoming Monitoring company they contract to control the mapping of their micronets) "see" (server-to-human and server-to-server) how their UAB is broken into micronets and what ETR addresses those micronets are mapped to.

The system could also provide diagnostics such as testing the reachability of their network via one or more ETR addresses. The system would also enable trialling mapping changes and altered micronet boundaries without actually executing the changes - so the end-user network operators can manually test that their proposed changes are valid, before actually making them.

QSDs will only accept certain kinds of updates, and it is vital that the mapping updates are applied in the order they are sent - and that these updates are in themselves valid. For instance, it may be best (from the point of view of QSDs sending updates to their queriers, and therefore directly or indirectly to ITRs) for micronets to be mapped to an ETR address of 0.0.0.0 before being split or joined.

In addition to testing proposed changes for validity, the UAS system should be able to combine multiple updates into a single set, to be executed in order, but at the same time. The complete set would be sent on the FMS in a single second. For instance, mapping an 8-long micronet's ETR address to zero, and splitting it into three smaller micronets and then setting the ETR address of each.

When testing proposed changes, or deciding whether to accept changes which have been ordered with the end-user network's credentials, the UAS system would generate an error if the mapping was to a disallowed address - multicast, SPI space, private address space or some other prefixes to which the Ivip system does not support tunnelling packets. Similarly, an error would be generated if the end-user attempted to change the mapping for some address space outside their UAB, or if they defined a new micronet within that space with non-zero mapping, or which overlapped some addresses for which the mapping was currently non-zero.

For the sake of discussion, it will be assumed that all UMUCs have passed these validity sanity tests at the UAS and are for valid mapping addresses - so a UMUC is a successfully accepted update command from the end-user, or from some person or system with the end-user's credentials.

There could be many methods by which this command is communicated, including HTTPS web forms with username and password authentication. SSL sessions might be more suitable for automated mapping change systems, such as those of a Multihoming Monitoring company which the end-user authorises to control the mapping of some or all of their UAB.

In addition to authentication, the command takes the form of the starting address of the micronet, the length of the micronet, and a single ETR IP address to which this micronet's mapping will be changed.
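
The following Python sketch shows the essential content of a UMUC and the kind of sanity checks a UAS might apply before accepting it (the helper names, integer address units and the specific excluded prefixes are illustrative assumptions; the exact rules are for future work):

   # Sketch only: a UMUC carries a micronet starting address, a length
   # and a single ETR address, checked against the rules of section 3.7.

   import ipaddress

   EXCLUDED = [ipaddress.ip_network(p) for p in (
       "224.0.0.0/4",                     # multicast
       "10.0.0.0/8", "172.16.0.0/12",     # RFC 1918 private space
       "192.168.0.0/16",
   )]

   def validate_umuc(uab_start, uab_length,
                     micronet_start, micronet_length,
                     etr_address, mab_prefixes):
       """Raise ValueError if this mapping change must be rejected."""
       # The micronet must lie entirely within the end-user's UAB.
       if not (uab_start <= micronet_start and
               micronet_start + micronet_length <= uab_start + uab_length):
           raise ValueError("micronet outside the end-user's UAB")

       # The ETR address must not be in excluded space, and must not
       # itself be SPI space (i.e. inside any MAB).
       etr = ipaddress.ip_address(etr_address)
       if any(etr in net for net in EXCLUDED + list(mab_prefixes)):
           raise ValueError("disallowed ETR address")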




3.8.  SUMUC - Signed User Mapping Update Command

This is the information contained in a UMUC, signed by the UAS which accepted it from the user (or by some other UAS), being handed down the tree to another UAS or to the RUAS of the tree, so that the recipient UAS/RUAS can verify the signature and regard the UMUC as authoritative.




3.9.  MABUS - Update Stream specific to one MAB

This is a stream of data by which the real-time updates to the mapping data for any one MAB are conveyed. For the purposes of discussion, the RUASes and the Launch system are assumed to work in a synchronized fashion, generating a body of updates for each MAB once a second. (Probably the case of no updates will be codified specifically in the update stream, rather than just resulting in no mention of the MAB.)

Each RUAS will generate one MABUS for each of its MABs. So each second, the RUASes collectively generate a variable length body of update information for every MAB in the Ivip system. The MABUS includes mapping changes (altering ETR addresses of existing micronets), changes to micronet boundaries and snapshot messages (described above). The data format would be extensible for purposes not yet anticipated.

The contents of each MABUS may be digitally signed at some stage (before or after being broken up and assembled with other MABUSes into multiple packets). This would be done if the Ivip design involves the QSDs being able to authenticate all the mapping changes, snapshot messages etc. they receive for each MAB, such as via the public key of the RUAS which is responsible for this MAB.




3.10.  Launch server

A small number (such as 8) of widely dispersed Launch servers are operated by the RUASes and work together to generate, every second, multiple identical streams of packets to Replicators in the first level (1) of the Replicator system. Each Launch server receives its input in the previous second from the RUASes.




3.11.  Replicator

A cross-linked, tree-like system of Replicators forms a redundant, reliable, high-speed distribution system for delivering mapping updates to full database ITRs and Query Servers all over the Net.

Each Replicator receives one or more (typically two) streams of update packets from upstream Replicators or Launch servers. These two source streams should come from widely topologically separated sources, ideally over two separate physical links. For instance, a Replicator in Berlin might receive its update streams from London and Berlin, from two sources in Berlin which are in different ISP networks, or from any combination which minimises the likelihood that both sources will be disrupted by any one fault.

The Replicator identifies the packets in each input stream by a simple sequence number at the start of the payload, and another number indicating which second in time the packet belongs to. The Replicator uses data in the received packets to tell it how many packets to expect in each second. For each sequence number in a given second, the first packet to arrive with this sequence number has its data extracted from the DTLS packet, and this data is used to create a separate DTLS-protected packet for each of the 20 or so downstream Replicators on the next (numerically 1 greater) level.

In this way, unless the same numbered packet is lost from both input streams, each Replicator receives the full set of mapping update packets for this second, and sends them to tens or perhaps hundreds of downstream devices, which are other Replicators, or QSDs. 20 output streams is assumed in examples below. Since the recipient Replicators are assumed to receive two streams, each level of Replicators in these examples has an amplification factor of 10.
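
A minimal sketch of this behaviour follows (Python; the packet structure and link objects are assumptions, since the actual data format is TBD):

   # Sketch only: forward the first copy of each packet and ignore the
   # duplicate which normally arrives on the other input stream.

   class Replicator:
       def __init__(self, downstream_links):      # e.g. ~20 DTLS links
           self.downstream = downstream_links
           self.seen = set()                       # (second, seq) pairs sent

       def on_input_packet(self, second, seq, data):
           # Each payload carries the second it belongs to and a simple
           # sequence number; only the first copy is forwarded.
           if (second, seq) in self.seen:
               return
           self.seen.add((second, seq))
           for link in self.downstream:
               link.send(second, seq, data)        # re-wrapped in DTLS

       def forget_before(self, second):
           # Replicators keep no long-term state and never re-request
           # missing packets; old bookkeeping is simply discarded.
           self.seen = {(s, q) for (s, q) in self.seen if s >= second}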

The receive and send links use DTLS, which prevents an attacker from spoofing these packets and so altering the behavior of ITRs.

Replicators could be implemented in routers, but are probably best implemented in ordinary software on a GNU-Linux/BSD etc. COTS (Commercial Off The Shelf) server. Replicators do not cache information and need no hard drive storage. A server performing as a QSD could also operate as a Replicator.




3.12.  QSD - Query Server with full Database

QSDs get a full feed of updates from one or more Replicators. When they boot, they download individual snapshot files for each MAB in the Ivip system.

QSDs respond immediately to queries from nearby ITRs and from caching Query Servers (QSCs) - and send notifications to these if mapping data changes for a micronet which was the subject of a recent query.

QSDs have no routing or traffic handling functions. In a full-scale billion-plus micronet deployment they need a lot of memory, so the best way to implement a QSD is probably on an ordinary server with one or more gigabit Ethernet interfaces. No hard drive is required, except perhaps for logging purposes.




3.13.  QSC - Query Server with Cache

A QSC could be implemented in a router or more likely a COTS server. It does not route packets, and its memory and computational requirements are likely to be modest compared to those of a QSD. There is no need for a full feed of updates from the Replicator system. However, each QSC must be able to get mapping information from one or more upstream QSDs - or via upstream QSCs which themselves access upstream QSDs.

The easiest way to implement a QSC would be software on a modest server, which would only need a hard drive for logging purposes.




4.  Update Authorities and User Interfaces

This section is a detailed discussion of the fast push mapping distribution system itself, starting with the systems which accept commands from end-users (or their authorised representatives or systems) and prepare the information for the Launch system.

This is the early stage of an ambitious design, so a number of options are contemplated. This section of the system may not need IETF standardised protocols, since only a small number of organisations need to interact to make it work. The Replicators and the data format of mapping updates do need to be standardized. The purpose of exploring the RUAS and Launch server systems is to estimate the difficulty of constructing them - and hopefully to show that an approach like this is feasible and desirable. There may well be easier approaches than the ones explored here.

Probably the closest thing to them would be the large scale systems for managing DNS, such as for .com and other major TLDs. I don't know anything about these and people with experience in such systems could probably design the UAS, RUAS and perhaps Launch server systems better than I could.

The real-time nature of these systems of controlling ITR behavior has no precedent. Generally, the system should work on a continual basis. However, if there is a technical problem or the system is stopped for a few minutes to do an upgrade or whatever, the Internet is not going to grind to a halt. In that downtime, end-user networks which experience a multihoming failure will have to wait for their connectivity to be restored. Likewise, end-user networks which send mapping changes for inbound TE will have to wait. The effect on TTR mobility would be minor, since mapping changes are not required when the MN changes its physical connections, including when moving to an entirely different access network. The delay in mapping changes means that those few MNs which have chosen a new, closer, TTR will need to wait for traffic to be tunneled to that new TTR - meaning they will need to keep up the tunnel to the old, and now more distant, TTR for these minutes. Normally, with mapping changes getting to ITRs in a few seconds, the MN could terminate the tunnel to the old TTR within a few seconds of the ITRs beginning their tunneling to the new TTR.

The final authority to control mapping information is fully devolved to end-users, who, by means of a username and password or some other authentication method, are able to issue commands to define micronets within their UAB, and to map each micronet to any ETR address.

However, the physical authority to control the mapping of all Mapped space within a single MAB rests with a single RUAS. That RUAS may be acting for a UAS which administers a MAB. The RUAS may administer it - perhaps on behalf of another company - and may delegate control of parts of it to one or more UASes. The RUAS may have relationships directly with the end-users of this MAB, through its own UAS. Here we discuss the flow of information and trust between these various entities, in real-time, so that every second (for example; the actual time period will need to be carefully considered) each RUAS assembles a body of update information for each of its MABs.

In the diagrams below, each RUAS or UAS is depicted as a single entity. Each such entity acts as a single functional block, but will typically be implemented as a redundant system over several servers.




4.1.  RUAS Outputs




4.1.1.  Updates every second

Every second (or some other time-period not exceeding two or three seconds), for each MAB the RUAS is authoritative for, the RUAS generates a set of mapping updates, and works with other RUASes to integrate this into the next second's output from the Launch system.

As previously mentioned, these updates are primarily actual mapping updates for individual micronets within the MAB, but also contain occasional messages to the effect that a snapshot of this MAB's full mapping database has been made and is, or soon will be, available via various servers.




4.1.2.  MAB snapshots

Every few minutes (or some other time period, as chosen by the RUAS, but with some reasonable maximum defined by a BCP) the RUAS makes a copy of the complete mapping information for a MAB. Snapshots for each MAB are independent of each other, and so can be done with different frequencies.

The snapshot is in a format which needs to be standardized, so it can be downloaded and understood by any ITRD or QSD, now and in the future. This data format needs to be extensible to cover new kinds of mapping information and other functions not yet anticipated - which will be ignored by devices which are not capable of these functions.

The exact format for this is for future work, but for instance would begin with some identifying information about the MAB, a block defining that the following data concerns IPv4 micronet mapping information (and snapshot announcements), with the possibility of other blocks containing different kinds of data. Binary format would probably be best, and the file could then be compressed with gzip etc.

Each such file will be given a distinctive name, according to a standardised format, which indicates at least the MAB starting address and length, and the time of the snapshot.

The snapshot process will take a second or two to complete from the time it is initiated, and the resulting file will be copied to a number of servers, ideally located in a variety of locations around the Net.

Each such server would be run by the RUAS directly, or as part of all RUASes working together. The servers can probably be conventional HTTP servers, so that QSDs can download the snapshots when needed. There is scope for some careful design with DNS so that there is an automatic structure in the domain names of these servers, enabling an expandable system to be automatically used by QSDs without manual configuration.

These files will be publicly available, and need to be made available for somewhat longer than the cycle time of snapshots. So with a ten minute snapshot cycle, the previous snapshot should be available for a while - probably 10 minutes or so - after the new one is available.

Snapshots are downloaded by QSDs when they boot, and if they suffer a disruption in mapping updates which necessitates a reload of this part of the complete mapping database. To facilitate this, MABs should not be too large in terms of IPv4 addresses or IPv6 /64s - or at least should not contain too many micronets - which would make individual snapshot files excessively large.

At boot time, or when re-synching, the QSD will monitor the update streams for each MAB until a snapshot announcement is found. It will then buffer all subsequent updates and download the snapshot as soon as it is available. Once the snapshot has arrived, and been unpacked to RAM, the buffered updates are applied to it. Then, this MAB's part of the mapping database is up-to-date and the ITR can begin advertising this MAB, and therefore tunnelling all packets which are addressed to this MAB.

In order to reduce total path lengths for these file downloads, and likewise for retrieving missing packets from the same servers, it would be desirable if each QSD in a given location could access a nearby snapshot server. It may be desirable to have every snapshot of every MAB in a single server, or a single set of servers which are accessed by geographically close QSDs. Anycast is not a good technology for this, since file retrieval is best done via TCP sessions. The ITR system itself can't be used, to avoid circular dependencies - so the servers must be on conventional addresses. Likewise, any DNS servers involved in this server system need to be strictly on conventional addresses.

Each QSD needs to be configured with, or to automatically discover, two or more such servers - at least one of which is relatively close - so the data can be found despite one server being down.

From the point of view of a QSD seeking a snapshot for a given MAB of a particular RUAS, the address to request the file from could be made up from the RUAS identifier yyyy which is contained in the snapshot announcement (in the stream of mapping updates), concatenated with a locally configured "xxxxx" and "ipv4.ivipservers.net". In the event that this server was unavailable, one or more locally configured alternatives to this initial "xxxxx" value could be tried - including one or more for nearby countries.

The most significant 24 bits of the MAB's starting address (probably 48 bits for IPv6, assuming this is the granularity of BGP advertisements) would be transformed into a text string such as 150.101.072. A similar transformation of the precise time of the snapshot would result in a second text string, and these would be used to reliably identify the appropriate directory and file in the server.
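
A Python sketch of this name construction follows; the directory layout, timestamp format, file suffix and example values are illustrative assumptions, since the exact scheme is for future work:

   # Sketch only: building a snapshot URL from the RUAS identifier
   # ("yyyy", taken from the snapshot announcement), a locally
   # configured server label ("xxxxx") and "ipv4.ivipservers.net".

   import datetime
   import ipaddress

   def snapshot_url(ruas_id, local_label, mab_start, snapshot_time):
       host = "%s.%s.ipv4.ivipservers.net" % (ruas_id, local_label)

       # Most significant 24 bits of the MAB's starting address as a
       # zero-padded dotted string, e.g. "150.101.072".
       top24 = ipaddress.ip_address(mab_start).packed[:3]
       mab_dir = ".".join("%03d" % octet for octet in top24)

       time_str = snapshot_time.strftime("%Y%m%d-%H%M%S")
       return "http://%s/%s/%s.snapshot" % (host, mab_dir, time_str)

   # Hypothetical example:
   #   snapshot_url("ruasx", "au01", "150.101.72.0",
   #                datetime.datetime(2010, 1, 13, 0, 0))
   # -> "http://ruasx.au01.ipv4.ivipservers.net/150.101.072/
   #     20100113-000000.snapshot"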




4.1.3.  Missing packet servers

The cross-linked tree-structured Launch and Replicator systems should provide a robust method of delivering the complete set of MAB updates every second, to every ITRD and QSD. There may be more subtle and efficient methods than this somewhat brute-force approach, which involves typically a doubling of the amount of update traffic in the pursuit of robustness. However, the rate of updates will only be problematic by current standards at a date so far in the future that the technology of the day will render the task far less daunting than it would now be.

In the event that an ITRD or QSD misses one or more packets, it will be able to easily identify which are missing, due to the sequence numbers built into their payloads. These sequence numbers transform easily into an address from which the missing packet or packets can be retrieved, probably via HTTP, from one of the servers described previously which provide snapshots.
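
For illustration, a short Python sketch of how the gaps might be identified (the packet fields assumed here follow the description in section 3.11):

   # Sketch only: each payload is assumed to carry its sequence number
   # and the total number of packets sent for its second.

   def missing_sequence_numbers(received, total_for_second):
       """received: sequence numbers that arrived (from either stream)."""
       return sorted(set(range(total_for_second)) - set(received))

   # missing_sequence_numbers([0, 1, 3, 4], 6) -> [2, 5]
   # Each missing number, together with the second it belongs to, maps
   # to an HTTP request to one of the servers described in Section 4.1.2.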




4.2.  Authentication of RUAS-generated data

Careful consideration must be given to how QSDs can quickly and reliably ensure that the information they receive ostensibly from each RUAS is genuine. Perhaps the DTLS links to two upstream Replicators will be considered good enough. But that places too much trust in Replicators which are probably controlled by other organisations than the one running the QSD and its dependent ITRs. Being able to direct traffic to an attacker's site, by means of altering the mapping information in an ITR, is such a threat to security, and such an attractive proposition for attackers, that some kind of digital signing of the update packets themselves will almost certainly be required. At this early stage of development, the model is pretty simple.




4.2.1.  Snapshot and missing packet files

Each RUAS has a key pair and signs the MAB snapshot and missing packet files with its private key. QSDs can verify the signature with the RUAS's public key, subject to a PKI arrangement of certificates, or some other simpler arrangements.

Both these types of files are only handled occasionally, so the overhead in performing crypto operations is insignificant.




4.2.2.  Mapping updates

This principle does not apply to the update information contained in packets received from the Replicator system. It would be onerous to individually authenticate each packet, or each body of updates from each RUAS contained in potentially multiple packets, but at present I can't see an alternative. The system needs to be highly secure against attack, because even a second or two of an ITR mapping packets to the attacker's site constitutes an unacceptable breach.

At least two types of attack can be contemplated. Firstly, the attacker could send spoofed packets to a QSD or Replicator intending them to be received before the genuine packets. This would be essentially impossible with DTLS protection of the packets coming from the upstream Replicator. Secondly, the attacker could somehow gain control of the upstream replicators for a given QSD. The protocols can't protect against an attacker gaining control of a QSD, RUAS or UAS system. Neither is there any protection against an attacker who has obtained the credentials to send mapping changes for the victim's micronets.

The second attack - gaining control of Replicators outside the network of the QSD - is still credible. Internet communications are always vulnerable to attackers gaining control of a router. If we assume or somehow require that Replicators are as robust against general attack, and have their passwords as closely guarded, as DFZ routers - then perhaps the level of threat is so similar to the existing level that no further measures need to be taken.

Here is an exploration of possible attacks and defences. Today, to snoop on packets, divert packets and/or to perform man-in-the-middle attacks, an attacker needs to gain control of a router. As far as I know, this is not a serious problem in the DFZ or in ISP networks today - but nonetheless, SSH and SSL/TLS are routinely used for the many transactions which do need to be secure.

If we consider a QSD in the network of ISP-A, and assume the ITRs in this network, and any connected end-user networks, use this QSD, then the question is how protocols can protect the QSD's mapping data if ISP-A does everything right. (If ISP-A is sloppy with security, then protocols can't protect the QSD itself against being compromised.) Digital signing of every packet would work fine - but is expensive, and the signature wastes valuable space in each packet. Digital signing of data in multiple packets looks more attractive.

The reliance on two upstream Replicators, outside ISP-A's network, and presumably in networks of two other ISPs or transit providers, might appear to make the attacker's task more difficult. However, this is not the case. Even a transient success in altering the mappings would still be a security breach - and if a single compromised Replicator sent packets a little sooner than the uncompromised one, then the QSD would never notice that the packets which arrived second differed from the first ones. The first received would be used to alter the mapping and so control the behavior of ITRs. (Any such attack would probably require the QSD to download a snapshot for the affected MAB - so we must protect against such attacks also from a DoS perspective.)

It would be possible to have QSDs check the streams of packets received from both Replicators against each other. This would be inexpensive, but it would not really help, since the attacker could launch a flood of packets to temporarily disrupt some router which carries the packets from the non-compromised Replicator. The QSD has to operate entirely from one Replicator's packets if the other Replicator dies. So it seems that the attacker needs to gain control only of a single upstream Replicator to be successful.

An attacker gaining control of a Replicator one level above the immediate upstream replicator might also succeed, since by sending its packets a little earlier, its packets would be accepted by the next level and so sent to multiple QSDs.

It is not good enough to detect the forged mapping information after it has been used to update the mapping database. So it seems there is no alternative to signing the update packets themselves - or more likely the contents of multiple such packets as a single unit.

If the body of data to be signed was spread over 5 packets, then the QSD couldn't use any of this information if a single packet was missing. Therefore, perhaps the "missing packet" system could be simplified to work not on packets, but on entire blocks of data - the same size block which is signed.

Another approach would be to have the Launch system add one or more packets to the stream, containing an MD5 (or some better function) hash of either each packet, or each body of update information from each RUAS. This packet would be signed by some authority - the consortium formed by the RUAS companies which runs the Launch servers and first few levels of Replicators. It would be trivial to have a checksum for the entire second's worth of updates, but then a single missing packet would make it impossible to check the rest.

Perhaps each RUAS's set of updates can be broken into sections, such as packets or something typically bigger than packets, with hashes for each section enclosed in another packet, with that set of hashes signed by the RUAS.

The MD5 checksums could be sent twice, for robustness, and some care would be needed in deciding how much update information each one covers. A separate hash for every packet would be conceptually simple and enable individual packets to be accepted immediately, even if another packet was not received and so required a "missing packet" request. However, this would increase the number of hashes to transmit.

The current proposal is to have a hash for the updates for each MAB for which updates are received, which may be less than a packet, or perhaps more.
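
As a sketch of how a QSD might use such per-MAB hashes (Python; SHA-256 is shown rather than MD5, and the signature check is a placeholder for whatever PKI arrangement is eventually chosen):

   # Sketch only: verify one second's updates against a signed list of
   # per-MAB hashes before applying any of them.

   import hashlib

   def signature_is_valid(hash_list, signature, public_key):
       # Placeholder for an RSA/ECDSA verification over a canonical
       # serialisation of the hash list.
       raise NotImplementedError

   def verify_updates(per_mab_updates, hash_list, signature, public_key):
       """per_mab_updates: {mab: update bytes for that MAB};
       hash_list: {mab: expected digest}, signed as a single unit."""
       if not signature_is_valid(hash_list, signature, public_key):
           return False
       for mab, data in per_mab_updates.items():
           if hashlib.sha256(data).digest() != hash_list.get(mab):
               return False         # these updates must not be applied
       return True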

There are multiple ways of solving this problem. I doubt anyone would argue that it is so difficult as to warrant the abandonment of the entire fast-push, local query server concept. With more work later, I believe a satisfactory method can be found by which the QSD can ensure the updates are authentic before applying them.




4.3.  RUAS - UAS interconnection

This section depicts a single tree of delegated responsibility for the user control of mapping of one MAB. The Root UAS at the base of the tree is run by Company X - RUAS-X. RUAS-X could be authoritative for other MABs, and each such tree of delegation may have the same set of other UAS systems, or it could be different. Each delegation tree is separate from the delegation trees of other MABs, even if they look similar, because the tree includes specific subsets of the whole MAB address range as one of the defining characteristics of its branches and leaves.

The initial action which leads to the database being changed is a user generated (manually or by the user's equipment or by a system authorised by the user) UMUC (User Mapping Update Command).

For authorising and feeding UMUCs to the RUAS-X, there is a tree as depicted in Figure 1. Delegation of authority flows up the tree as the total address range of the MAB is split at each branching junction. This tree structure involves data, in the form of SUMUCs (Signed User Mapping Updated Commands) flowing down towards the root of the tree. (Data would also flow up the tree so each user-interface leaf could tell end-users what their current mapping was, could test their requests against constraints etc.) The idea is that RUAS-X could delegate control of one or more subsets of the MAB's total range of addresses to some other system, which in turn could delegate control to other systems. There would be no absolute limit on the height (usually called depth) of these hierarchies.

The servers which handle the end-user interaction need to be leaves of this tree structure, so as not to burden the RUAS-X database servers themselves with details of user interaction. This enables various companies to give different kinds of control for the mapping of the SPI space their branch of the tree controls. Figure 1 does not show RUAS-X having any user interface servers, but it could. The simplest arrangement would be the RUAS having simply a user-interface server and no tree of other UASes.

There would need to be IETF standardised methods by which some server could execute a UMUC with the user-interface servers of any of these UASes. This standardisation would be especially important for multihoming, because some reasonably trusted company could run an automated monitoring system, and have the credentials (username, password, key etc.) stored in their system so their system can change the mapping of one or more micronets the moment one link was detected to be faulty. It is vital that there be a standardised method by which all multihoming monitoring companies could send these mapping change commands (and queries about the current state of mapping) to UASes. Also, the company (such as X, Y or Z in Figure 1) which controls a particular range of the Mapped space may offer such a multihoming monitoring system itself.

The tree in this example controls an MAB with the address range 20.0.0.0 to 20.3.255.255. In this example, company X has been assigned by an RIR the entire range 20.0.0.0 to 20.3.255.255. Company X leases to Y a quarter of this: 20.1.0.0 to 20.1.255.255. These divisions are on binary boundaries, but they need not be. It would be just as possible for X to delegate to Y an arbitrary subset of the whole range, or the entire range - or just one IPv4 address or IPv6 /64.

X's Root Update Authorisation Server (RUAS) has a private key for signing all the MAB snapshot files it periodically creates and makes available. The same key would be used for signing the list of hashes which are used to authenticate the updates for each MAB, as mentioned previously.

In this example, company Y delegates control of some of its space to company Z, and Z has an end-user U, who needs to control the mapping of a UAB containing one or more micronets in Z's range.

Z has various interfaces by which U can do this, with its own arrangements for authentication, for monitoring a multihoming system and making changes automatically etc. Ideally there might be one or more automated, host-to-server, IETF-standardised protocols so all end users and their appointed multihoming monitoring companies could have standardised software for talking to whichever company's servers they use to control the mapping of their IP address(es).


           User-R   User-S  User-T  User-U       Multihoming
                 \        \      |       |       Monitoring
                  \        \     |       |       Inc.
                   \      .................     /
                    \----. Web interface   .---/
                         . other protocols .
                         . etc.            .
                          ....UAS-Z........
                                |
Other companies                 |
like Y and Z                    |
                     /-----<----/
|   |           \ | /
|   |            \|/
|   |           UAS-Y
\   |             |
 \  |  /----<-----/
  \ | /
   \|/
 RUAS-X    Root Update Authorisation Server company X
    | \
    |  \
    V   \->-[ Multiple web servers for MAB snapshot ]
    |       [ and missing packet files.             ]
    |
    |      Other RUASes like RUAS-X, each authoritative
    |      for mapping one or more MABs and producing
    |      regular MAB snapshots and update streams
    |      which are sent to all Query Servers.
     \
      \        |    |    |        /
       \       |    |    |       /
        \      |    |    |      /
         \     |    |    |     /
          \    |    |    |    /
           \   |    |    |   |
           |   |    |    |   |
           V   V    V    V   V
           |   |    |    |   |

         Each line depicts 8 streams of packets with
         identical payloads - one stream for each of
         the 8 Launch servers.


Figure 1: Delegation tree of UASes above one RUAS.

When user-U (or a device or system with user-U's credentials) changes the mapping of their micronet via a web interface, this is done via Z's website, with the user authenticating him-, her- or itself by whatever means Z requires. This causes UAS-Z to generate a signed copy of the update command (a SUMUC) and to send it to UAS-Y. A SUMUC may include multiple commands to be executed in order.

The simplest SUMUC would be a change to the ETR address of an existing micronet. This would consist of three items (assuming IPv4 for simplicity): the starting address of the micronet this update covers, the number of IP addresses covered by the micronet to be changed (>= 1) (or alternatively the last address of the micronet), and a new mapping value - a 32 bit ETR address. The SUMUC could also include a time in the future at which the update should be executed. In that case, it would be stored by RUAS-X and sent to the FMS at the appointed time.
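A minimal sketch of such a SUMUC payload follows. The field names, the fixed packing and the optional execution-time field are illustrative assumptions, not a defined wire format:

   # Sketch of the simplest SUMUC payload described above (IPv4 case).
   # Field names and the packing layout are illustrative only.
   import struct
   from dataclasses import dataclass
   from typing import Optional

   @dataclass
   class SimpleSumuc:
       micronet_start: int        # starting IPv4 address as a 32-bit integer
       micronet_length: int       # number of IP addresses covered (>= 1)
       new_etr_address: int       # new ETR address (0 means "zeroed")
       execute_at: Optional[int] = None   # optional future time (UNIX seconds)

       def pack(self) -> bytes:
           """Pack into 3 x 32-bit fields, plus an optional 32-bit
           execution time.  The real encoding is future work."""
           body = struct.pack("!III", self.micronet_start,
                              self.micronet_length, self.new_etr_address)
           if self.execute_at is not None:
               body += struct.pack("!I", self.execute_at)
           return body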

Mapping change commands would also include commands to join and split micronets. Sequences of these commands would be sent, in order, and the UAS should check their validity before putting them into a SUMUC. So a SUMUC consists of one or more mapping change commands concerning a particular micronet, or perhaps a set of micronets. The commands will be executed in order, but as if they all took effect at once.

If the SUMUC consists simply of changing a micronet's ETR address, including zeroing it, then this will be applied by every QSD and updates sent to any ITRs which need them. Multiple such changes together in the one SUMUC would cause the same effects, for multiple micronets. However, if the changes involved a sequence of changes affecting the same SPI addresses, the QSD will update its queriers (which could be ITRs or QSCs) to the final state of the mapping after the changes.

For instance, a sequence of changes could zero two micronets (set their ETR addresses to 0.0.0.0) and then join them into one micronet. The resulting micronet could then be split into five micronets and each one mapped to a different ETR address. The QSD may have a querier which is caching the mapping for the first original micronet, but not the other. It will send that querier updates which define the new mapping arrangements for exactly that range of SPI addresses which the original response covered. This avoids the ITR (or the QSC, if that is the querier) having to be told about a larger amount of SPI space than it was told about in the initial reply. As noted previously, these newly defined micronets, each of which will now be in the cache of the ITR or QSC, will be flushed from the cache at the same time as the originally cached micronet would have been.
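A minimal sketch of how a QSD might restrict such notifications to the SPI range a particular querier originally cached - the flat (start, length, ETR) representation of micronets and the function name are assumptions for illustration only:

   # Sketch: given the new micronet layout after a sequence of changes,
   # report to one querier only the portions that fall inside the SPI
   # range that querier originally cached.  Micronets are (start, length,
   # etr) tuples with addresses as integers.

   def updates_for_cached_range(new_micronets, cached_start, cached_length):
       cached_end = cached_start + cached_length      # exclusive
       clipped = []
       for start, length, etr in new_micronets:
           end = start + length                       # exclusive
           lo = max(start, cached_start)
           hi = min(end, cached_end)
           if lo < hi:                                # overlaps the cached range
               clipped.append((lo, hi - lo, etr))
       return clipped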

UAS-Y trusts this SUMUC because it can authenticate UAS-Z's signature. It strips off the signature and adds its own, before passing the SUMUC down to the next level: RUAS-X.

RUAS-X likewise has a copy of UAS-Y's public key, and within a fraction of a second of U initiating the UMUC, the master copy of this MAB's database in RUAS-X is altered accordingly. (This would be a distributed, redundant database system.)

Authority is delegated up the tree, because UAS-Y will only accept update commands if they are signed by one of its branch UASes, and for the particular address range that UAS has been authorised to control.
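The acceptance test at each UAS might look roughly like the following sketch. The names are assumptions, and the signature check is represented by a caller-supplied function, since the actual signature scheme is future work:

   # Sketch: a UAS accepts a SUMUC from a branch UAS only if (a) the
   # signature verifies against that branch's known public key and (b)
   # every micronet the SUMUC touches lies within the address range
   # delegated to that branch.

   def accept_sumuc(branch_id, sumuc_bytes, signature, commands,
                    delegations, verify_signature):
       """delegations maps branch_id -> (range_start, range_end) of SPI
       addresses that branch is authorised to control (end exclusive).
       commands is a list of (micronet_start, micronet_length, ...)
       tuples parsed from sumuc_bytes."""
       if branch_id not in delegations:
           return False
       if not verify_signature(branch_id, sumuc_bytes, signature):
           return False
       range_start, range_end = delegations[branch_id]
       for micronet_start, micronet_length, *rest in commands:
           if micronet_start < range_start:
               return False
           if micronet_start + micronet_length > range_end:
               return False
       return True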

User-U may have given their username and password etc. to Multihoming Monitoring Inc. so this company can monitor their multihoming links and change the mapping as soon as one link goes down. UAS-Z doesn't know or care who actually makes the change - as long as they can authenticate themselves for whatever micronet they want to change the mapping of. UAS-Z would keep an audit trail of all interactions such as with User-U or Multihoming Monitoring Inc.




5.  The Launch system

In this discussion 8 Launch servers will be assumed. The exact number could be varied over time. Initial introduction could no doubt be done with a simpler system, but the purpose of this discussion is to explore how the system could scale to very large numbers of micronets (billions) and large numbers of updates per second.

The exact logic of the Launch system remains to be determined. The following is a rough guide to how it might be done. I understand there are protocols for making distributed decisions robustly, even when not all participants are active or have the full information the others have.

The task of the Launch system is every cycle - in this example every second - to collate the update information from all the RUASes, agree on what has been collected, and then to generate multiple streams of packets containing that information, from multiple locations, to the widely geographically dispersed level 1 Replicators. Links between the Launch servers would best be done via private links to avoid packet flooding attacks. Likewise the links to level 1 Replicators.

Each Launch server has a link to every other Launch server, and every RUAS has a link to every Launch server. This may seem rather over-engineered, but the system will be robust in the event of failure of quite a few of these links, and the task at hand is a momentous one, deserving considerable effort to make it fast and reliable.

The exact details of how packets are handled, information combined into packets etc. remains for future work.

Each Launch server may be a single physical server, with a live backup at the same address, or a redundant cluster of servers which behaves as if it is one device.

While the Launch servers are sending out the update packets for one second, they are comparing notes about the updates to be sent in the next second and collecting updates to be sent in the second after that. Perhaps this one-second cycle will prove to be too ambitious, or the operations may be broken into four phases rather than three.
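One way of picturing the overlap of these phases is as a set of rotating per-second buffers, as in the following sketch (the class and field names are illustrative assumptions):

   # Sketch: in any one second a Launch server is sending cycle N,
   # agreeing (comparing hashes) on cycle N+1, and collecting cycle N+2.

   class LaunchCycle:
       def __init__(self):
           self.collecting = {}   # RUAS id -> updates arriving now (cycle N+2)
           self.agreeing = {}     # updates being hash-compared     (cycle N+1)
           self.sending = {}      # updates being streamed out      (cycle N)

       def tick(self):
           """Advance one second: what was agreed is now sent, what was
           collected is now agreed, and a fresh collection buffer opens."""
           self.sending = self.agreeing
           self.agreeing = self.collecting
           self.collecting = {}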




5.1.  Phase 1 - collecting updates from RUASes

In phase 1, all RUASes attempt to send their complete set of updates to every Launch server, where they are buffered in readiness for Phase 2. The Launch server authenticates this information by standard cryptographic means, based on the public key of each RUAS, or simply by using SSH as the communications protocol.

The contents of each RUAS's updates are then collected, and an MD5 (or some other hash algorithm) hash is created for each one.




5.2.  Phase 2 - checksum comparison

Each Launch server sends to every other Launch server its record of the hashes of the updates received from each RUAS.

This enables each Launch server to identify its state as one of the following:

   Normal - it received updates from every RUAS, and its hashes for those updates match the hashes reported by the other Launch servers.

   Invalid updates - one or more of the sets of updates it received does not match what the other Launch servers report having received.

   Missing updates - it did not receive updates from one or more RUASes from which other Launch servers did receive updates.

Each Launch server now sends a signed message to the other Launch servers, containing the state determined above: Normal, invalid updates or missing updates.

Those Launch servers which are in the Normal state count how many others are also in this state. If the number is at or above some "quorum" constant, say 4 in an 8-server system, then each such Launch server is ready to send the collected updates in phase 3. These Launch servers independently process the same update data into a series of packets, with sequence numbers which can easily be identified by the recipient devices - initially level 1 Replicators but ultimately QSDs. Those packets are stored, ready for transmission in phase 3.

Normally, all 8 Launch servers will receive the same information correctly, and so will participate in phase 3. The purpose of this constant is to ensure that there will not be a condition in which only one or two Launch servers participate in phase 3. The idea is that the updates will be launched into the Replicator network robustly, or not at all. Robustly means 4 or more of the 8 Launch servers all launch the same information, and the others launch nothing. If only 3, 2 or 1 Launch servers sent the information, or if some Launch servers sent different information from the others, then it is possible that some QSDs would not get the full set of updates.
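A sketch of this all-or-nothing quorum test, using the example constant of 4 out of 8 (the names and state labels are illustrative):

   # Sketch: each Launch server reports its state; only if at least
   # QUORUM of them (including itself) report Normal does a Normal
   # server go on to transmit in phase 3.

   QUORUM = 4

   def should_transmit(my_state, peer_states):
       """my_state and each entry of peer_states is one of
       'normal', 'invalid' or 'missing'."""
       if my_state != "normal":
           return False
       normal_count = 1 + sum(1 for s in peer_states if s == "normal")
       return normal_count >= QUORUM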

With further development work, it should be possible to fine-tune this system to adequately guard against single or multiple points of failure, but also to ensure that the system only sends out data when it can send from at least four (or some other constant number of) Launch servers. Careful analysis will be required to anticipate various failure modes. There is quite a lot of work in devising this, but it only needs to be done once, for this one set of Launch servers. Updates to the software can be done without much fuss - it is not like having to change the functionality of all QSDs.

RUASes monitor the output of the Launch system, and if a particular second's worth of updates are not sent, then the RUAS will send them again soon.

This raises some potential ordering difficulties, where one second contains a command to map a micronet to zero, and the next second contains a command to map part of it to some valid address. While these should be combined in the one second, if they were not, and the first second's updates were not sent, then the command in the following second would fail in the QSD, because it would be defining a new, smaller micronet within a micronet which was not at that time mapped to zero. If a QSD for some reason misses all or part of the updates for a given MAB, it needs to buffer subsequent updates until it can retrieve the missing packet(s), since updates must be applied in the correct order.
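A minimal sketch of such per-MAB buffering in a QSD, assuming packets carry per-MAB sequence numbers (the class and method names are illustrative):

   # Sketch: a QSD holds out-of-order updates for a MAB until the
   # missing packet(s) are retrieved, then applies everything in
   # sequence-number order.

   class MabUpdateBuffer:
       def __init__(self):
           self.next_seq = 0      # next sequence number to apply
           self.pending = {}      # seq -> update payload, waiting for gaps

       def receive(self, seq, payload, apply_update):
           self.pending[seq] = payload
           # Apply as many consecutive updates as are now available.
           while self.next_seq in self.pending:
               apply_update(self.pending.pop(self.next_seq))
               self.next_seq += 1

       def missing(self):
           """Sequence numbers to request from a missing-packet server."""
           if not self.pending:
               return []
           return [s for s in range(self.next_seq, max(self.pending))
                   if s not in self.pending]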

The above algorithm will need to be extended so that a flaky RUAS, which only transmits to a few Launch servers, will not cause the quorum test to fail - for instance because two Launch servers received its updates while the rest recognise that they did not.




5.3.  Phase 3 - identical update streams

Those Launch servers which have the full set of update data now send the packets they generated, in separate DTLS-protected streams, to level 1 Replicators. It would probably be best if the packets were sent in numeric sequence, with sending times chosen to spread the packets over the whole second. Exactly how many level 1 Replicators there are, and how many are driven by each Launch server, will be a matter for further work.
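For example, spreading the numbered packets evenly over the one-second cycle could be as simple as the following sketch (purely illustrative):

   # Sketch: evenly spaced send times for a second's worth of packets,
   # in sequence order.

   def send_times(cycle_start, num_packets, cycle_length=1.0):
       if num_packets == 0:
           return []
       interval = cycle_length / num_packets
       return [cycle_start + i * interval for i in range(num_packets)]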

The result in each cycle will be that either the full set of updates is sent out, robustly, by all or almost all Launch servers, or nothing is sent. Due to the cross-linked nature of the level 1 Replicators, each receiving at least two feeds from separate Launch servers, in all but the most pathological cases every level 1 Replicator will receive the full set of information and so send it on to the level 2 Replicators. Even if there is relatively high packet loss from some or many of these, and some broken links, all, or almost all, level 2 Replicators will receive a full set of packets. This pattern of redundancy, for a doubling in bandwidth used, continues all the way to the QSDs.




6.  Replicators

Further work is required to reach a more precise description of how the update information is placed in packets, and signed in such a way that QSDs can be sure they have received the correct information. If we assume that this problem can be solved, then the following description of the functionality of individual Replicators and the way they are arranged will lead to an understanding of how they will form a robust, packet amplifying, global network for delivering the output of the Launch system to a million or more QSDs.

(See "Figure 2 Tree of UASes above one RUAS".)

 \  |  /   }  Update information from end-users - directly
  \ V /    }  or indirectly - to one of a dozen or so RUASes.
   \|/
 RUAS-X ->--------------[snapshot & missing packet HTTP servers]
   /|\
  / V \       Streams of packets containing identical real-time
    |         mapping updates to the 8 Launch servers.
    |
\   \    |    /   /     Each of the 8 Launch servers gets a
 \   \   V   /   /      stream from each RUAS.
  \   \  |  /   /
<>[Launch server N]<>   The 8 Launch servers have links with each
     / / | \ \          other.  Each second, each one sends a set
    / /  V  \ \         of updated packets to 20 level 1
   / /   |   \ \        Replicators.   Each level 1 Replicator
         |              receives two streams, each from a
         |              different Launch server.
         \
          \         /   Even with packet losses and link failures,
           \       /    most of the 80 level 1 Replicators receive
  level 1   \     /     a complete set of update packets, each
         [Replicator]   second, which they each replicate to 20
           / / | \ \    level 2 Replicators.
          / /  V  \ \
         /  |  |  |  \  In this example, each Replicator consumes
                  |     two feeds from the upstream level, and
                  /     generates 20 feeds to Replicators in
                 /      the level below (numbered one above the
      \         /       current level).  So each level involves
       \       /        10 times the number of Replicators.
level 2 \     /
     [Replicator]       These figures might be typical of later
       / / | \ \        years with 10^9+ micronets and 100k+
      / /  V  \ \       of QSDs to drive.  In the first five or
     /  |  |  |  \      ten years, with fewer updates, the
    /   |  |  |   \     amplification ratio of each level could
   /    |  |  |    \    be much higher, with fewer levels.
  /     |  |  |     \
        |  |  |         Replicators are well-connected COTS
        |     |         servers at peering points and ISP data
        |     |         centers, though the 8000 Level 3 and
    [Levels 3 and 4]    80,000 Level 4 Replicators may be in
    [Replicators   ]    ISPs and larger end-user networks.
    \   |    \     /
     \  |     \   /     Up to 800k QSDs get two or more ideally
      \ |      \ /      identical full feeds of updates.
      QSD      QSD

Figure 2: Multiple levels of Replicators drive hundreds of thousands of QSDs.




6.1.  Scaling limits

The Replicator system is scalable to any size simply by adding Replicators. Assuming two input streams for each Replicator, N output streams gives an N/2 amplification of stream numbers per level. N could be quite high in the early years of introduction, when the number of micronets and updates is small by comparison with the design target of one to ten billion micronets, with accompanying update rates driven by their use for inbound TE for multihomed non-mobile end-user networks and by mobile devices selecting new TTRs.
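The fan-out arithmetic of Figure 2 can be reproduced with a few lines (the constants are those of the example in the figure; they are not fixed design parameters):

   # Sketch: with two input streams per Replicator and 20 outputs each,
   # every level multiplies the number of devices by 10.

   def devices_per_level(launch_servers=8, outputs_each=20,
                         inputs_each=2, levels=4):
       streams = launch_servers * outputs_each
       counts = []
       for _ in range(levels):
           devices = streams // inputs_each   # each device takes two feeds
           counts.append(devices)
           streams = devices * outputs_each
       qsds = streams // inputs_each
       return counts, qsds

   # devices_per_level() -> ([80, 800, 8000, 80000], 800000)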

First, a maximal IPv4 example will be considered. Assume a billion micronets, most of them for single IP addresses. Presumably most of these will be for individual end-users, at home or with mobile devices. The update rate will be relatively low for multihoming the home and office-based micronets.

The update rate due to inbound TE is impossible to predict. Being able to steer traffic dynamically to maximise utilization of multiple links is economically highly attractive. Market mechanisms will tend to set prices for updates which balance competing concerns. If the price is too low, there will be more of them and the Replicator system will need to be improved to cope with them - so the price would rise to either reduce the number, or pay for the upgrades.

It is possible that the RUASes collectively could set prices low enough to cover their costs and make a profit running their operation and many of the Replicators - with a very high volume of TE updates. If this grew to the point where those operating QSDs found they had to spend money upgrading their QSDs just to cope with the volume, then there would be the possibility that they could instead program their QSDs to ignore the most frequent updates which had patterns resembling TE updates.

Then, in order for the RUASes to be able to continue charging for these TE updates, the RUASes might need to pay QSD operators to accept such a high level of updates. This would probably be excessively expensive - so RUASes would be under strong pressure to limit the total rate of updates to a level the great majority of QSD operators are happy with. The price of updates will not deter their use for multihoming service restoration - and this would represent a small proportion of total updates. Higher prices per update would reduce the number for TE, in a highly elastic manner. Likewise, higher prices per update would cause mobile users (or more directly the TTR companies, who are paying for each update) not to change TTRs as often.

So overall, it is impossible to state with confidence what update rates might be expected.

Even with the entire Earth's population owning a mobile device with its own micronet, if we pick some figure, such as 1000 km, within which there is no significant benefit in choosing a closer TTR, then a WAG (Wild-Ass Guess) could be based on airline passenger numbers. If we assume that each such trip would be long enough to require a new TTR, then we would get some very approximate worst-case figure.

Statistics from the International Air Transport Association [IATA-2009] indicate that commercial airlines carried 2.271 billion passengers in 2008. I have not been able to find estimates for the number of people travelling large distances by road or train, but it is reasonable to assume these are relatively small compared to the numbers of airline passengers. Most travel by car and train involves trips short enough, with a return trip home, that there will be no need to use a closer TTR during the whole trip. Truck drivers crossing continents might be an exception, but the number of such trips would be small compared to the 2 billion airline passenger figure.

There could be growth in passenger numbers, and it is possible that on long trips the aircraft's satellite link would connect to several ground stations, with the MNs in the aircraft therefore (ideally) changing their mapping to a new TTR near each ground station. (This is explored in [TTR Mobility].) There are various ways of extrapolating these figures, such as with population growth. For simplicity, I will double the 2 billion figure and use this to roughly include all mapping changes due to multihoming service restoration and TE. So I have a WAG of 4 billion mapping changes a year.

This is about 128 updates a second.

The raw data for a change to an IPv6 micronet's ETR address is 32 bytes: 64 bits for the micronet's starting /64, another 64 bits for its length or end, and 128 bits for the ETR address. 128 of these a second is 4 kbytes a second - about 32 kbps. There would be peaks and troughs, and there could be peaks due to a major outage driving many end-user networks to switch ETRs for multihoming service restoration.
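The arithmetic behind these figures, for reference (a simple back-of-envelope calculation, not new data):

   # 4 billion mapping changes a year, 32 bytes per IPv6 ETR-address change.

   SECONDS_PER_YEAR = 365 * 24 * 3600            # 31,536,000
   changes_per_year = 4_000_000_000
   bytes_per_change = (64 + 64 + 128) // 8       # start /64, length/end, ETR

   changes_per_second = changes_per_year / SECONDS_PER_YEAR   # ~127
   raw_bits_per_second = changes_per_second * bytes_per_change * 8
   # ~32,500 bit/s - roughly the 32 kbps quoted above.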

If there were 5 or 10 billion mobile devices, each with a micronet, many of these would keep using the same TTR from one year to the next. There would be a mapping change when the micronet was assigned to a given handset, and then another when the handset was no longer used, or replaced by another. So there would also be a significant background level of administrative mapping changes with billions of micronets for mobile devices.

It is hard to imagine a scenario in which the update rate would require prohibitive volumes of data, even by today's standards, for any substantial ISP. The flow of update packets would be somewhat greater than this raw data rate due to the need for packing them into some kind of robust format, having hashes of them with digital signatures etc. The total amount of mapping data coming into an ISP would be 2 to 4 times this, due to the need for feeds from two or more Replicators. Still, by the time such high levels of adoption could occur, the bandwidth they require will surely not present a significant difficulty for any ISP, or for larger end-user networks which want to run their own ITRs and wish to have their own QSDs, rather than relying on the QSDs of their ISPs.




6.2.  Managing Replicators

Replicators should be easy to create and deploy. Any substantial server with the requisite software, in a suitable location, will do the job - but it should be well secured against attackers gaining root access. A successful system will require some mechanisms which ensure reliable operation with a minimal amount of configuration and ongoing management.

In the current model, each Replicator normally receives feeds from two upstream Replicators, and generates some number N of feeds for downstream devices. Each Replicator should be able to request and quickly gain a replacement feed from another upstream Replicator if one of those it is using becomes unavailable or unreliable.

This requires that Replicators in general be operating below capacity, so that when others in their level fail, they can take up the slack. This needs to be locally configured beforehand, with upstream Replicators of organisations which have agreed to provide the feeds, and with downstream Replicators of organisations who have requested them.
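A minimal sketch of such a replacement arrangement, with pre-configured backup upstreams and a simple silence timeout as the health test (all names and the timeout are illustrative assumptions):

   # Sketch: a Replicator normally takes two upstream feeds; when one
   # goes silent it requests a replacement feed from a pre-arranged
   # alternative upstream with spare capacity.
   import time

   class ReplicatorFeeds:
       def __init__(self, primary_upstreams, backup_upstreams, timeout=2.0):
           self.active = list(primary_upstreams)   # normally two feeds
           self.backups = list(backup_upstreams)   # pre-arranged alternatives
           self.last_seen = {u: time.time() for u in self.active}
           self.timeout = timeout

       def packet_received(self, upstream):
           self.last_seen[upstream] = time.time()

       def check_feeds(self, request_feed):
           """Replace any feed that has gone silent for longer than the
           timeout, if a backup upstream is available."""
           now = time.time()
           for upstream in list(self.active):
               silent = now - self.last_seen.get(upstream, 0) > self.timeout
               if silent and self.backups:
                   replacement = self.backups.pop(0)
                   request_feed(replacement)   # ask the new upstream for a feed
                   self.active.remove(upstream)
                   self.active.append(replacement)
                   self.last_seen[replacement] = now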

It is possible to imagine a sophisticated, distributed, management system for the Replicator network. This could be developed over time, since for initial deployment, considerable manual configuration and less automation would be acceptable.




7.  Security Considerations

This ID mentions some authentication and security problems and possible solutions to them, but full consideration of security can only occur when the architecture is fleshed out in greater detail.




8.  IANA Considerations

For future work.




9. Informative References

[I-D.whittle-ivip-arch] Whittle, R., “Ivip (Internet Vastly Improved Plumbing) Architecture,” draft-whittle-ivip-arch-03 (work in progress), January 2010.
[I-D.whittle-ivip-glossary] Whittle, R., “Glossary of some Ivip and scalable routing terms,” draft-whittle-ivip-glossary-00 (work in progress), January 2010.
[IATA-2009] International Air Transport Association, “Fact sheet: industry statistics,” September 2009.
[TTR Mobility] Whittle, R. and S. Russert, “TTR Mobility Extensions for Core-Edge Separation Solutions to the Internet's Routing Scaling Problem,” August 2008.



Author's Address

  Robin Whittle
  First Principles
Email:  rw@firstpr.com.au
URI:  http://www.firstpr.com.au/ip/ivip/