By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”
The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.
This Internet-Draft will expire on September 30, 2007.
Copyright © The IETF Trust (2007).
I propose a simple, low-cost, low-power, Static RAM (SRAM) based architecture for the Forwarding Information Base (FIB) function of transit and border routers in the Default Free Zone (DFZ) of the Internet. This will provide direct hardware forwarding irrespective of the size of the "global BGP routing table", within the current IPv4 convention of limiting advertised prefixes to no longer than /24. Routers with this or a similar architecture provide the only elegant hardware solution to the problem of route disaggregation, which is unavoidable due to increasing numbers of ISPs and Autonomous System (AS) end-users who need to advertise their prefixes on topologically diverse parts of the network, for purposes including multihoming and traffic engineering.
Router hardware limitations with respect to route disaggregation could also be eliminated for IPv6, by adding further SRAMs or, on a more limited basis, by using spare space in the SRAM which is required for IPv4. Two additional SRAMs and a reallocation of the existing 2000::/3 global unicast allocations to a smaller range - for instance 2000::/10 - would provide for Provider Independent (PI) /32 allocations to 4 million ISPs and multihomed end-users. Each /32 assignment could be advertised as up to eight /35 prefixes - each of which provides 8192 /48 user networks. A less disruptive alternative to reallocating existing IPv6 global unicast addresses would be to define a /10 prefix - inside or outside 2000::/3 - for new PI assignments to ISPs and AS end-users with the long-term assurance of rapid SRAM-based forwarding for prefixes as short as /35, without concern for route aggregation or network topology. It may be feasible, at first, to handle IPv6 without an additional SRAM chip. Unused space in the IPv4 chip (or two chips for larger routers) would map 2,097,152 prefixes - for instance to support PI assignments of /32 prefixes to 262,144 ISPs and AS end-users.
1. Introduction
2. Summary
3. Route Aggregation and FIB Technologies
3.1. Route aggregation
3.2. Route disaggregation
3.3. The Default Free Zone
3.4. Routing Information Base (RIB)
3.5. Forwarding Information Base (FIB)
3.6. Forwarding Equivalent Class (FEC)
3.7. Linear search list implementation of the FIB
3.8. Tree-structured implementation of the FIB
3.9. TCAM implementation of the FIB
3.9.1. TCAM devices
3.9.2. TCAM and SRAM produce FEC
3.9.3. TCAM power consumption and other problems
3.9.4. TCAM capacity is driven by route disaggregation
3.10. Proposed SRAM architecture for the FIB
4. The Crisis in Routing and Addressing
4.1. Scalability
4.2. Addressing, Topology and Rekhter's Law
4.3. IPv6 - Future Routing Swamp?
4.4. Costs, Benefits and IETF Policy
4.5. Multihoming is mandatory for ISPs and many users
4.6. Who, Where and extending the TCP/IP protocols
4.7. Moore's Law
4.8. Power consumption and heat dissipation
4.9. Incremental changes
5. SRAM-based FIB for IPv4
5.1. The SRAM chip
5.2. Encoding FEC
5.3. IPv4 usage and policy
5.4. BGP performance and stability
5.5. Alternative arrangements for IPv4
6. SRAM-based FIB and addressing changes for IPv6
6.1. Impractical with current global unicast allocations
6.2. Reallocating IPv6 global unicast addresses
6.3. Defining a separate IPv6 prefix for SRAM-based forwarding
6.4. Minimal memory space for IPv6
6.4.1. Using free space in the IPv4 SRAM
6.4.2. Using a smaller SRAM for IPv6
7. Security Considerations
8. IANA Considerations
9. Informative References
Author's Address
Intellectual Property and Copyright Statements
1. Introduction
The purpose of this Internet Draft is to argue that future designs of high-end routers used in the Internet's Default Free Zone (DFZ) be equipped with a Static RAM (SRAM) based hardware architecture to achieve single clock cycle Forwarding Information Base (FIB) classification of incoming packets. This architecture would be tailored to current global administrative arrangements regarding IPv4 address management and routing. An extension to this architecture for IPv6 would match a proposed compact reallocation of IPv6 global unicast addresses.
Based on a single 1.4 watt, USD$70, 72 Mbit SRAM, this system can classify 250M IPv4 packets/sec to be forwarded by one of 14 interfaces. By mapping destination address bits 31 to 8 to the SRAM address, the correct interface number can be read in 4ns, and can be different for each of the 14.6 million /24 prefixes in which the IPv4 address space could be separately advertised. This system can be extended with a second SRAM to routers with up to 510 interfaces.
I propose that by carefully coordinating global policies for address allocation and BGP prefix advertisement with a new standard of router hardware optimised for the Internet's DFZ, one of the major threats to Internet communications can be eliminated. This threat is the growth in what is often referred to as the "global BGP routing table", which is projected to overwhelm the hardware capabilities of transit and border routers in the next five or so years.
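As an illustration only, the direct mapping of address bits 31 to 8 onto a memory address can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the hardware design: the names `fib`, `install`, `lookup` and `NO_ROUTE` are hypothetical, FEC value 0 is assumed to mean "drop", and one byte is spent per /24 entry for simplicity where a real SRAM would pack several small entries into each word.

```python
# Illustrative model of a direct-indexed FIB: one small FEC value per /24.
# NO_ROUTE and the one-byte-per-entry layout are assumptions of this sketch.

NO_ROUTE = 0                      # hypothetical FEC code meaning "drop"
TABLE_SIZE = 1 << 24              # one entry for every possible /24

fib = bytearray(TABLE_SIZE)       # every entry starts as NO_ROUTE


def ip_to_int(dotted):
    """Convert dotted-quad text such as '192.0.2.1' to a 32-bit integer."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def install(prefix, length, fec):
    """Write `fec` into every /24 slot covered by prefix/length (length <= 24).

    Shorter prefixes must be installed before the longer prefixes they
    contain, so that the more specific entries overwrite the general ones.
    """
    if not 0 <= length <= 24:
        raise ValueError("this table only holds prefixes up to /24")
    count = 1 << (24 - length)                    # number of /24 slots covered
    first = (ip_to_int(prefix) >> 8) & ~(count - 1)
    fib[first:first + count] = bytes([fec]) * count


def lookup(dest):
    """A single memory read: address bits 31..8 select the entry."""
    return fib[ip_to_int(dest) >> 8]


# Usage: interface numbers 3 and 7 are invented for the example.
install("10.0.0.0", 8, 3)
install("10.1.2.0", 24, 7)
```

With these two entries, `lookup("10.1.2.99")` reads the more specific value 7, `lookup("10.200.0.1")` reads 3, and an address with no installed prefix reads `NO_ROUTE`. The cost of a lookup does not depend on how many prefixes have been installed, which is the property the proposal relies on.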
Part of the problem of routing table growth is the increased BGP traffic and the stability, memory and CPU utilisation problems this entails. The other part of the problem is that the current generation of routers will soon be unable to support full line-speed packet rates with a sufficiently large number of FIB entries. Efforts to constrain routing table growth are likely to fail because a growing number of ISPs and end-users can only achieve their robustness and performance imperatives by separately advertising prefixes for purposes including multihoming and traffic engineering.
This looming crisis was the subject of a two day IAB Routing and Addressing Workshop, held in Amsterdam, Netherlands, in October 2006 [I‑D.iab‑raws‑report] (Meyers, D., “Report from the IAB Workshop on Routing and Addressing,” February 2007.) [IAB‑RAWS‑website] (Meyers, D., “IAB Workshop on Routing and Addressing - resources and presentations,” December 2006.). The participants foresaw no solution to the problems. They were unable to see how routing efficiency could be maintained with the growing advertisement of prefixes at locations where network topology largely prevents them from being aggregated. Yet the needs of multihomed organisations for stable, provider independent (PI) IP addresses with multiple topologically diverse upstream connections necessarily result in many advertised prefixes not aggregating with neighbouring prefixes, except perhaps to distant routers.
I first review the problems related to routing, including hardware forwarding, routing table management in routers, the BGP protocol and how address allocation has to date been constrained by the need for route aggregation. I then describe the scope of the proposed upgrade to routers. I present a single-chip design for IPv4 for small routers - those with 14 or fewer interfaces - and discuss its expansion for larger routers. I then propose an IPv6 implementation with a particular set of constraints on IPv6 global unicast allocation. Following this is a discussion of some other options which might be considered now and for the future. Finally, I suggest a timeline by which the proposed changes to policy might be made to give router manufacturers and their customers confidence, firstly in the continued routability of the Internet and secondly in the capacity of a new generation of routers to handle the demands of growing Internet traffic for longer than the current typical 5 year lifespan. In Appendix 1 I discuss some low-level hardware details of how the proposed architecture could be implemented with currently available SRAMs.
2. Summary
To gather the main points of this Internet Draft near the beginning, here is a summary, stripped of most of the details and qualifications.
SRAM table-based FIBs were never considered practical because they couldn't handle the full range of router functionality. As a result, it is widely believed that the only way to route packets is with a small enough number of rules to fit into TCAM (Ternary Content Addressable Memory) or iterative tree-structured FIB systems. This has led to two decades of apparently absolute belief in the requirement for route aggregation - a position perhaps strengthened by various unsuccessful attempts to find alternatives by introducing new elements into the TCP/IP protocols.
There is a rapidly growing number of ISPs and end-users who absolutely require multihoming (of entire networks, not just individual nodes as SHIM6 is intended to provide) and who also strongly desire or need some traffic engineering capacity. These can only be provided by this increasing number of users having an increasing number of prefixes which they advertise in topologically diverse ways - which is completely at odds with the requirement of route aggregation.
The Internet routing system is a shared global resource - since all users depend upon it, burden it with traffic and pay for it indirectly. So there has been an increasing concern about the future of the Internet, manifesting in calls for greater pressure or constraints to be placed upon ISPs and end-users to curtail their multihoming and prefix-splitting ways. In the absence of any constraints to route disaggregation and assuming that router technology does not change in principle, then the capacity of routers in the DFZ to handle traffic in the future clearly depends on continued purchases of new routers. An attempt has been made to secure IPv6 routability by banning end-users from having Provider Independent (PI) addresses. This places the custodians of the Internet (RIRs and whoever acts to protect the interests of operators of routers in the DFZ) in opposition to the immediate interests of most large ISPs and other AS operators. Being the Fun-Police in the global Internet is a thankless - and probably futile - task.
Fortunately, modern SRAM chips - which are far beyond what could have been imagined when IPv4 or even IPv6 was designed - are a perfect fit for a fast, low-power, elegant, simple and easy-to-program hardware-based FIB system for existing IPv4 usage in the DFZ. An SRAM-based FIB needs to be an adjunct to existing TCAM etc. architectures, rather than replacing them, because these systems are capable of many important tasks the SRAM system can't perform. However, for most or all of the traffic of transit routers - and for the upstream traffic of border routers - the SRAM-based FIB is sufficient for almost all packets. An SRAM-based system needs to be able to map particular prefixes to be handled by the traditional TCAM etc. FIB architecture, for instance to handle packets which match a small number of longer prefixes.
It is difficult or impossible to imagine an easier hardware solution than this SRAM-based architecture for simple IP forwarding in the DFZ. It is faster and simpler than MPLS, and requires no changes to current IPv4 usage. The cost of implementing this or a similar approach for IPv4 will be far less than not implementing it and having to pay much higher prices for routers with power-hungry, over-complex TCAM or other types of FIB, scaled up for the larger routing tables of the near future. Nonetheless, border and transit routers will typically require some TCAM or other more flexible systems to cope with MPLS and/or the few prefixes which are longer than /24.
By prompting discussion of this architecture now, I hope to hasten its arrival for IPv4. I also suggest that this SRAM FIB architecture is the only way IPv6 traffic can be globally routed if and when it becomes widely adopted. So I argue that urgent consideration be given to a new, standardised, SRAM-based FIB architecture for routers in the DFZ and to new arrangements for IPv6 global unicast addresses. Initially, I discuss the re-allocation of IPv6 global unicast addresses to suit this architecture, but a less disruptive alternative is to define a specific subset of global unicast space which future routers will handle with SRAM-based forwarding, and so which can be assigned to ISPs and AS end-users without concern about route disaggregation.
Assuming BGP and FIB software can be made stable and responsive with much larger numbers of prefixes and updates, I propose that the new FIB architecture will enable a much more extensive splitting up of the IPv4 address space into freely advertisable and generally small PI prefixes. This would be anathema to route aggregation principles which prevail in the absence of an SRAM-based FIB architecture - but should enable a much more flexible and efficient use to be made of the IPv4 address space, while enabling ISPs and end-users to split and advertise their prefixes freely.
3. Route Aggregation and FIB Technologies
This section begins with basic principles which will be well understood by all readers who are familiar with the crisis. However, this section introduces a new approach to thinking about forwarding and therefore routing. This new approach is the basis of the SRAM FIB architecture and of my suggestion for aligning address allocation and BGP management policies with a router architecture optimised to solve the biggest problem the Internet faces today.
3.1. Route aggregation
In the absence of a hardware FIB architecture such as proposed here, it seems that all participants - ISPs, major end-users, protocol designers, address administrators and router manufacturers - will continue to act as if it is impractical to route packets on the global Internet unless the forces which lead to 'route disaggregation' are strictly controlled. Full route aggregation occurs when each topological branch of the Internet carries only IP addresses which are part of a given prefix, and where each other branch carries addresses within another non-overlapping prefix. This means that if there were four 'first level' branches from the notional 'centre' of the Internet, all the sub-branches of branch A would contain only IP addresses within a prefix 'a' and likewise all the sub-branches of branch B would contain only IP addresses within a second prefix 'b', which is not a subset of prefix 'a'. Then, a router at the junction of these four first-level branches could have a very simple routing table.
This centre router would have four interfaces, one connecting to each branch, which I will also name A, B, C and D. (An "Interface" in this context means an Ethernet, ADSL, SDH/SONET fibre connection etc. This is commonly referred to as a "port" of the router, but I will use "interface" to avoid confusion with "ports" in the context of TCP, UDP and SCTP.) When a packet arrives at any of the four interfaces, the routing table contains rules by which it should be forwarded to one of these four interfaces. Simply testing the packet's destination address against these four rules will lead to it either being forwarded to one of the four interfaces, or being dropped. (For simplicity, I am ignoring the fact that the router itself must have some IP addresses outside the four prefixes a, b, c and d.) This example of complete route aggregation leads to a tree-structured network, without redundant paths to cope with failure of links or routers.
3.2. Route disaggregation
An example of complete route disaggregation would be a situation in which 256 separate prefixes were allocated to users, and these users were connected to various branches and sub-branches, with no discernible pattern in the addresses of these prefixes with respect to their location in the network topology. In this instance, the router at the centre would need as many rules about forwarding as there were prefixes, except for the rare occasion in which two prefixes with adjacent address ranges could be accessed by the one interface. In that case, a single rule covering the total range of the two prefixes would work just as well as separate rules for each.
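The merging of two adjacent prefixes reached through the same interface can be sketched with Python's standard `ipaddress` module. The prefixes and interface names below are invented for the example, and the sketch assumes that no rule for one interface overlaps a rule for another; with overlapping rules a per-interface merge like this one could hide a more specific route.

```python
import ipaddress
from collections import defaultdict

# Hypothetical rule set: prefix -> outgoing interface name.
rules = {
    "198.51.100.0/25": "A",
    "198.51.100.128/25": "A",   # adjacent to the rule above, same interface
    "203.0.113.0/25": "B",
    "203.0.113.128/25": "C",    # adjacent, but reached by another interface
}


def aggregate(rules):
    """Merge adjacent prefixes that share an interface into shorter prefixes.

    Assumes rules for different interfaces never overlap.
    """
    by_interface = defaultdict(list)
    for prefix, interface in rules.items():
        by_interface[interface].append(ipaddress.ip_network(prefix))

    merged = {}
    for interface, networks in by_interface.items():
        # collapse_addresses joins adjacent networks where possible
        for network in ipaddress.collapse_addresses(networks):
            merged[str(network)] = interface
    return merged


aggregated = aggregate(rules)
```

Here the two /25 prefixes on interface A collapse into the single rule 198.51.100.0/24, while the two /25 prefixes on interfaces B and C cannot be merged and remain separate rules. With no pattern linking addresses to topology, most prefixes are in the second situation.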
3.3. The Default Free Zone
A router within a network which uses a small proportion of the Internet's address space needs only to maintain rules for the prefixes in that address space, with one final rule as the default to be followed when a packet's destination address does not match any of the other rules. The default rule forwards packets to whichever interface leads towards the "rest of the Internet". For instance, a border router which has a single connection to an ISP on interface D, has a series of rules for each of the network's prefixes, followed by the final default rule: to direct all non-matching packets to the ISP, where they will be routed to their destinations on the rest of the Internet. An internal router will have its default rule to forward packets to whichever interface is the best route to the border router.
There are two primary types of router which are considered to be in the "Default Free Zone", by virtue of them not being able to rely on a single default rule to forward all the packets which do not match any one of a relatively short list of rules. The first type is a border router with two links to two or more separate ISPs (or peering points, or routers of other systems) where both links carry outgoing packets to the rest of the Internet. Such a border router needs rules for every globally advertised BGP prefix, because the best path for some prefixes will be to one link and the best for the others will be to another. The second type of router in the DFZ is a "transit router", which is not serving a local network, but connects to two or more other routers handling general Internet traffic. As with the multihomed (two links to ISPs) border router, the transit router needs to participate in the global BGP routing system and maintain separate rules for which interface to forward packets to, for each of the tens of thousands of advertised prefixes.
The proposed SRAM-based FIB architecture is only required for routers in the DFZ. Internal routers and those with a single outgoing link do not need an RIB and FIB with separate entries for each advertised BGP prefix, so conventional router architectures are perfectly adequate. Nonetheless, it is possible that economies of scale and the desire for flexibility may result in the SRAM-based approach becoming standard in all high-end routers, including those which are not initially deployed in the DFZ.
3.4. Routing Information Base (RIB)
The Routing Information Base (RIB) is the body of data maintained by each router which contains these rules. The RIB's rules may be manually configured, or some mixture of manual and automatically generated rules. For our discussion of the problems faced by routers in the Internet's DFZ, most of the rules are generated automatically by the router's software running a Border Gateway Protocol (BGP) agent which communicates with other similar routers, so that each router can decide the best interface to forward packets to, depending on their destination address. In this Internet Draft, we are only concerned with the external BGP (eBGP) interaction between transit routers and the border routers of Autonomous Systems (ASes). The border routers of some ASes also communicate via an internal BGP (iBGP) system.
For the purpose of this discussion, the action of "forwarding" refers to directing a packet to one of the interfaces, to dropping it, or perhaps to subjecting it to some other processing. For the main body of payload traffic (as distinct from administrative traffic) the action of forwarding must be performed very quickly. Because the major problem faced by transit and border routers is the task of simply forwarding packets, rather than doing anything more complex with them such as queuing them in the output interface, I will consider "forwarding" to be simply making the first, and usually the only, decision regarding the packet: which interface to send it to, if any.
Typically, the RIB is processed by software in the router to generate a simpler body of data more suitable for rapid classification of packets. This body of data is known as the 'Forwarding Information Base' (FIB), but I will also use this term to refer to the hardware and software which processes the packets according to this body of data.
Routers may use some combination of software, specialised hardware and software, or purely hardware (without any conventional CPU or software) to classify the packets regarding which interface they should be forwarded to. Originally, routers had a central design with relatively "dumb" interfaces. All high-end routers now place an FIB system on each interface. This is for several reasons. Firstly, the total data rate of the router exceeds that of any single FIB. Secondly, funnelling packets to a central point when some of them might be sent back to the ingress interface is inefficient. Thirdly, local processing in the interface enables the interface to decide, without any per-packet central involvement, which interface to send it to via the router's "backplane" or "switching fabric" - a fast, any-to-any interconnect between all the interfaces.
The RIB is traditionally structured as a list of prefixes, each of which has an associated body of data. One entry in the RIB may refer to an entire /8 prefix - which in IPv4 covers 16,777,216 IP addresses. Another may cover a /24 or longer prefix, covering 256 or fewer IPv4 addresses. This organisation reflects the way routing information is structured in BGP and in most other contexts.
The RIB of a DFZ router is used for more than simply generating an FIB. Firstly, the router uses the RIB to store some of the information it receives from its peers - other transit and border routers which participate in the global BGP system. Secondly, the RIB is used to generate BGP messages sent to these peers. Often, the RIB is processed to generate a similar, simplified and separate body of data on which the BGP outgoing messages are based.
3.5. Forwarding Information Base (FIB)
Traditionally, the FIB is structured similarly to the RIB - as a set of rules, each applying to a particular prefix. Where two rules A and B refer to prefixes a and b respectively, and where prefix b is a subnet of a, then the FIB must be structured so that packets whose destination address is within prefix b are subject to rule B rather than rule A. Rule A is applied to all packets within a but not within b. This algorithm of giving precedence to the routing rule with the "most specific" address match is known as "longest prefix match". For instance, rule A is for a subnet with addresses such as "0110 01xx" (where 'x' means the address bits can be 0 or 1). This is a prefix fixing 6 bits. Rule B is for addresses in the range "0110 010x" - a 7 bit long prefix.
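The following sketch applies "longest prefix match" to the two 8 bit example rules just given. It is illustrative only; the helper names `parse` and `longest_prefix_match` are hypothetical, and patterns are assumed to consist of fixed bits followed only by trailing 'x' characters, as prefixes always do.

```python
def parse(pattern):
    """Turn a pattern like '0110 01xx' into (value, mask, prefix_length)."""
    bits = pattern.replace(" ", "")
    length = len(bits.rstrip("x"))                  # number of fixed bits
    value = int(bits.replace("x", "0"), 2)
    mask = ((1 << length) - 1) << (len(bits) - length)
    return value, mask, length


# The two rules from the text: prefix b is a subnet of prefix a.
RULES = {"A": "0110 01xx", "B": "0110 010x"}


def longest_prefix_match(address_bits, rules):
    """Return the name of the matching rule with the most fixed bits."""
    address = int(address_bits.replace(" ", ""), 2)
    best_name, best_length = None, -1
    for name, pattern in rules.items():
        value, mask, length = parse(pattern)
        if address & mask == value and length > best_length:
            best_name, best_length = name, length
    return best_name
```

An address of "0110 0101" matches both patterns and is given to rule B, the more specific of the two. An address of "0110 0110" lies within a but not within b, so it is given to rule A. An address matching neither pattern returns `None`.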
3.6. Forwarding Equivalent Class (FEC)
For simplicity, in what follows, I will use the term "FEC" to refer to a numeric value which the router interface needs to compute rapidly for each incoming packet. This value controls whether the packet will be dropped, forwarded out the same interface, forwarded to another interface or subjected to further analysis and processing. In practice there may be other aspects of FEC which can be derived from other attributes of the packet, such as the DiffServ Code Points which are used to select which output queue the packet is sent to on the output interface. However, for this discussion, I will consider "FEC" to be simply a binary number created by the input interface's FIB, based solely on the packet's destination address.
3.7. Linear search list implementation of the FIB
The simplest software approach to forwarding involves an iterative, linear search through the FIB's list of rules, comparing each rule with the packet's destination address. In this approach, the FIB's rules are either the same as those in the RIB or are a somewhat simplified version, such as by combining two rules with the same forwarding outcome (the same decision to drop, process or deliver to a particular interface) which have adjacent and aggregatable address ranges. For instance, if the same rule applies to "0010 000x" and "0010 001x" then this can be replaced by a single rule for "0010 00xx". Likewise "0010 001x" and "0010 0xxx" can be combined into "0010 0xxx" if their rules have the same forwarding outcome, because the longer prefix lies entirely within the shorter one.
While the order of rules in the RIB may not be important, in the FIB the order must follow the "longest prefix match" principle. Any longer prefix must appear before the shorter prefix which encompasses its address range. For instance Rule B in the previous example, with its longer prefix "b", must be found by the search algorithm before it finds rule A.
Each rule contains a number which directs the router to forward the packet to a particular interface, to drop it, or to subject it to further processing. There is no other absolute requirement about the ordering of the rules, but shorter processing times would be achieved by placing those rules at the front of the FIB which match the largest proportion of packets in the current traffic environment. This linear search algorithm could also be implemented in hardware, or by a micro-programmed processor specifically designed for the task.
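A minimal sketch of such a linear search list follows, assuming 8 bit addresses as in the earlier examples. The FEC numbers 1 to 3 and the function names are invented for illustration. Sorting by descending prefix length is one simple way to guarantee that every longer prefix precedes the shorter prefix which encompasses it; it is stricter than necessary, since unrelated prefixes could appear in any order.

```python
def compile_rule(pattern, fec):
    """Convert ('0110 010x', fec) into (prefix_length, value, mask, fec)."""
    bits = pattern.replace(" ", "")
    length = len(bits.rstrip("x"))
    value = int(bits.replace("x", "0"), 2)
    mask = ((1 << length) - 1) << (len(bits) - length)
    return (length, value, mask, fec)


def build_fib(rib):
    """Order the rules so that longer prefixes are searched first."""
    compiled = [compile_rule(pattern, fec) for pattern, fec in rib]
    return sorted(compiled, key=lambda rule: rule[0], reverse=True)


def classify(fib, address, default=None):
    """Linear search: the first rule that matches decides the FEC."""
    for _length, value, mask, fec in fib:
        if address & mask == value:
            return fec
    return default


# Hypothetical RIB in arbitrary order: (pattern, FEC number).
rib = [("0110 01xx", 1), ("0110 010x", 2), ("0010 00xx", 3)]
fib = build_fib(rib)
```

With this list, the address 0110 0101 is classified by the 7 bit rule (FEC 2) even though the 6 bit rule also matches it, because the longer prefix is reached first. The search time grows with the number of rules, which is why this approach is only suitable for short lists.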
3.8. Tree-structured implementation of the FIB
Where the number of rules exceeds a few dozen, it would typically be faster to find the correct rule for each packet by structuring the rules in a tree-like manner in memory, so software or dedicated hardware could locate the correct rule in a limited number of iterated cycles. For instance, an algorithm might first select one of two first level branches in the FIB depending on the state of bit 31 of an IPv4 address. Then it selects between two second level branches from whichever first level branch it chose in the first cycle. This process continues, potentially for 32 cycles, until it finds a leaf - a node in the tree which is the longest prefix match for this address, and so has no further branches leading from it.
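The bit-by-bit descent described above can be sketched as a plain binary tree. This is an illustration, not an optimised structure such as those used in real routers: the class and function names are hypothetical, the FEC values are invented, and the lookup remembers the last rule it passed on the way down so that it still returns the longest match when the descent ends at a node without a rule.

```python
class Node:
    """One node of a binary tree: a branch for bit 0, a branch for bit 1."""
    __slots__ = ("children", "fec")

    def __init__(self):
        self.children = [None, None]
        self.fec = None               # set only where a prefix ends


def insert(root, prefix_bits, fec):
    """Add a rule; prefix_bits is a string such as '00001010' for 10.0.0.0/8."""
    node = root
    for bit in prefix_bits:
        index = int(bit)
        if node.children[index] is None:
            node.children[index] = Node()
        node = node.children[index]
    node.fec = fec


def trie_lookup(root, address, width=32):
    """Walk from bit 31 downwards, one branch per cycle, keeping the best match."""
    node, best = root, root.fec
    for position in range(width - 1, -1, -1):
        node = node.children[(address >> position) & 1]
        if node is None:
            break                     # no deeper branch: stop descending
        if node.fec is not None:
            best = node.fec           # a longer matching prefix was found
    return best


root = Node()
insert(root, "00001010", 1)                 # 10.0.0.0/8  -> FEC 1 (invented)
insert(root, "0000101000000001", 2)         # 10.1.0.0/16 -> FEC 2 (invented)
```

A lookup of 10.1.2.3 (`0x0A010203`) passes the /8 node and then reaches the /16 node, so it returns 2 after 16 branch decisions. A lookup of 10.2.3.4 (`0x0A020304`) leaves the stored path after the /8 node and returns 1. The number of cycles depends on the length of the matched prefix, in contrast to a single memory read.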
This is an onerous task with IPv4 in the DFZ, because a significant number of packets need to be matched to prefixes 15 to 24 bits long. The average length of longest prefixes matched would depend on the specific location of the router and the type of traffic. Processing speed is boosted firstly by switching initially on the most significant 8 bits, since all routing rules have prefixes at least this long, and by more sophisticated approaches to the tree structure. For instance some chains of such a tree have many levels and no branches, which increases the time to reach the end node and the storage requirements. A "Patricia trie" is an improvement on the standard binary radix tree which solves this problem. Detailed explanations of routing and forwarding approaches can be found at Pankaj Gupta's site [Gupta] (Gupta, P., “Pankaj Gupta's thesis and other material on routing and forwarding,” August 2006.) - in particular Chapter 2 of his thesis. Trees can also be made with more than two branches per node. For instance a 16-way branch handles 4 address bits per node traversal operation, potentially reducing the search time, but this raises problems with memory storage efficiency.
3.9. TCAM implementation of the FIB
In high-end routers, the most common technique for classifying incoming packets is dedicated hardware based on Ternary Content Addressable Memory (TCAM) chips. "Ternary" means that each functional cell of the TCAM has three states: "match 0", "match 1" and "don't care". TCAM is always used in routers, rather than the simple "CAM", in which each functional cell has only two states: "match 0" and "match 1". Nonetheless, the term "TCAM" is sometimes loosely shortened to "CAM" in discussions about routers. Ethernet switches need to match every bit of a 48 bit MAC address, so they use true CAM (Content Addressable Memory), but routers need to be more flexible, and be able to ignore the state of many bits.
While TCAM and some highly optimised iterative techniques are the fastest approaches for the very broad general purpose nature of router functionality, they do not scale easily - or perhaps at all - to handling millions of prefixes, each with a potentially different "Forwarding Equivalent Class" (FEC).
3.9.1. TCAM devices
The large, fast, TCAM chips needed for high-end routers are exotic, complex, power-hungry devices. Data is written into the memory cells (usually Static RAM flip-flops) by the CPU of the router or of the interface card on which the FIB resides. There is more to a complete FIB than one or more TCAMs, but in this explanation we will consider the use of a single TCAM and a second, conventional, SRAM chip, solely for determining the FEC of each incoming packet. I will describe a TCAM and its associated SRAM in some detail, because this technology is the most likely one to be scaled up in order to cope with the routing table explosion, unless a direct SRAM-based technique such as I am proposing is employed.
I will describe a simple, imaginary, TCAM with 32 addresses and 8 data bits. Each "cell" consists of two flip-flops, and all the cells in our example have previously been written by the router's CPU to implement the currently required rules for classifying packets to create the proper FEC for each one. The TCAM is the first and most demanding part of the process. This example FIB can contain up to 32 rules, and in this example will be working from an 8 bit destination address of a packet. The 8 "data" input signals enter at the top of the device, and each one is split into two lines which run vertically to the bottom edge. For instance, for bit 0, there is a true bit 0 line and an inverted bit 0 line. So there are 16 lines which may change state every time some new "data" is presented to the chip. The 32 "addresses" are implemented as 32 horizontal rows. Each intersection of an "address" row and a pair of "data" lines contains two memory cells, as just described, and two comparators. The outputs of all the 16 comparators in an "address" row can pull down a horizontal line I will call the "match" line for this address.
Each pair of flip-flops and their associated comparators implements the ternary comparison function. When both flip-flops are low, the cell does not care about the state of the true and inverted data lines which pass downwards across it. When the left flip-flop is set to "1", its comparator will pull down the match line if the true data line is low. Similarly, when the right flip-flop is set to "1", its comparator will pull down the match line if the inverted data line is low. Further details can be found in [Taylor‑Spitznagel] (Taylor, D. and E. Spitznagel, “On using content addressable memory for packet classification,” March 2005.), where power dissipation figures of 20 to 30 watts are quoted for 2002 technology TCAMs of 18Mbit capacity.
In our example, the first address row at the top - address 31 - has its cells set to match the following pattern of address bits, where "x" means "don't care": "0110 1xxx". (In these examples, the most significant bit is on the left.) Address 30 is set to match "0001 111x" and address 29 is set to match "100x xxxx". It can easily be seen how a TCAM can, in a single clock cycle, compare the address bits of an incoming packet, which are driven to the "data" pins of the TCAM, with the rules which have previously been stored inside the device.
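The behaviour of one ternary cell, and of a whole row of cells sharing a match line, can be modelled as follows. This is a logical sketch of the description above, not a circuit: the function names and the `ENCODE` table are hypothetical, and signal levels are represented as the integers 0 (low) and 1 (high).

```python
# Flip-flop settings (left, right) for each stored symbol, following the text:
# both low = don't care, left set = match 1, right set = match 0.
ENCODE = {"1": (1, 0), "0": (0, 1), "x": (0, 0)}


def cell_pulls_down(left, right, data_bit):
    """Return True if this cell pulls the row's match line low."""
    true_line = data_bit
    inverted_line = 1 - data_bit
    return (left == 1 and true_line == 0) or (right == 1 and inverted_line == 0)


def match_line_high(pattern, key_bits):
    """A row matches only if none of its cells pulls the match line down."""
    pattern = pattern.replace(" ", "")
    key_bits = key_bits.replace(" ", "")
    for symbol, bit in zip(pattern, key_bits):
        left, right = ENCODE[symbol]
        if cell_pulls_down(left, right, int(bit)):
            return False
    return True
```

For example, a row storing "100x xxxx" leaves its match line high for the key "1000 1101", because its three programmed cells agree with the key and its five "don't care" cells never pull the line down. A row storing "0110 1xxx" pulls its match line low for the same key at the first bit.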
Along the right edge of the example device, the 32 match lines enter a priority encoder, which is a simple arrangement of logic gates so that a 5 bit binary number emerges from output pins, corresponding to the highest numbered match line which remains high. (The term "pin" refers to physical electrical connections of an integrated circuit, despite most large chips now being in packages which use solder balls rather than pins.)
At the start of the cycle, all match lines and all the true and inverted "data" lines are pulled high. When the true and inverted data lines are set according to the packet's address, none, one or multiple match lines will remain high. In this example, the packet's address bits are "1000 1101" so the match line of address row 29 remains high. Other lower numbered match lines may be high as well, but the priority encoder ignores them. The TCAM chip produces an output (on its five "address" pins) of the binary number for "29".
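As an illustration only, the matching and priority-encoding behaviour of this example TCAM can be modelled in a few lines of Python. This is a sketch of the logical function, not of the circuit; the names parse_rule and tcam_lookup are hypothetical and only the three example rows given above are populated.

```python
# Toy model of the example TCAM: 8 bit keys, rows addressed 0 to 31.
# A rule is a string of '0', '1' and 'x' (don't care), most significant bit first.

def parse_rule(pattern):
    """Return (value, care_mask) for a pattern such as '100x xxxx'."""
    value = 0
    care = 0
    for ch in pattern.replace(" ", ""):
        value <<= 1
        care <<= 1
        if ch in "01":
            care |= 1          # this bit position is compared
            value |= int(ch)   # and must equal this bit
    return value, care

def tcam_lookup(rows, key):
    """Return the highest numbered row whose rule matches key, or None."""
    matches = [addr for addr, (value, care) in rows.items()
               if (key & care) == value]
    # The priority encoder reports only the highest numbered match line.
    return max(matches) if matches else None

rows = {
    31: parse_rule("0110 1xxx"),
    30: parse_rule("0001 111x"),
    29: parse_rule("100x xxxx"),
}

print(tcam_lookup(rows, 0b10001101))  # 29
```

In hardware all rows are compared simultaneously; the loop here only stands in for that parallel comparison.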
There is some specialised terminology for TCAMs. They are sometimes marketed as "network search engines". When performing their comparison function, the input, to the "data" pins in this example, is known as the "key" and the output is called the "address". A recent paper describing TCAM usage in IP packet classification, with a particular emphasis on optimising the speed of rewriting the cells when a routing withdrawal or addition occurs, is: Gesan Wang and Nian-Feng Tzeng 2006 [Wang‑Tzeng] (Wang, G. and N. Tzeng, “TCAM-Based Forwarding Engine with Minimum Independent Prefix Set (MIPS) for Fast Updating,” February 2006.).
In the example, the TCAM output 29 does not tell the router which interface to send the packet to. This information is stored in a standard SRAM chip, which has its address inputs driven by the output of the TCAM. The router's software has previously written, to each location in the SRAM, the correct FEC data for each rule in the TCAM.
More complex processing can be achieved by extending this architecture to involve analysing a packet with one set of TCAM rules, with the data read out from the SRAM determining whether the packet will be matched against further rules, or whether the result of the previous operation contains sufficient information to determine the packet's fate. In this way, complex multi-cycle programs of analysis can be performed on packets.
A TCAM is a sophisticated, flexible, massively parallel comparison system. TCAM chips are relatively exotic devices, since they are primarily used only in routers and networking equipment. Reasons for their high power consumption include that each "bit" really consists of two flip-flops and two comparators, and that in every comparison cycle, all the match lines are precharged high, after which most or all of them will be pulled low.
TCAMs' power consumption and limited capacity are significant problems. The devices are often partitioned so that only certain sections are active for a particular "search" cycle. Modern devices, such as the Renesas R8A20211BG, operate at high rates, such as 133M "searches" per second for a 72 bit key with 262144 address rows. No power consumption figures are available for this device [Renesas] (Renesas, “R8A20210BG data sheet,” February 2005.). Its sample cost in 2005 was about USD$175.
Another major problem is that updates to the routing table often require significant rewriting of the contents of the TCAM and the SRAM, since rules may be added and deleted. The order of rules is crucial, since the order determines which of multiple true "match" lines will be recognised by the priority encoder. When a change to the RIB occurs, implementing the required change in the FIB (in this case consisting of the data in the TCAM and the SRAM) may involve many rules being rewritten to other locations in order to fit the newly structured list of rules into the available space. While the data is being rewritten, the FIB cannot be used to classify packets, so packets may be dropped. Devavrat Shah and Pankaj Gupta, in 2001, considered optimisations for the way data is structured to improve upon occasional worst-case rewrites involving 64k locations, at 50MHz cycle time, which would hold up packet processing for 1.2ms [Shah‑Gupta] (Shah, D. and P. Gupta, “Fast incremental updates on Ternary-CAMs for routing lookups and packet classification,” January 2001.). The SRAM would require the same demanding pattern of writes by the router's CPU when large numbers of rules are moved in the TCAM.
TCAMs are a necessary part of most router architectures. However, when an ISP or end-user adds another prefix to the global routing table and when (as is often the case) the prefix is advertised in a location such that from the point of view of a subset of transit and border routers, packets addressed to this prefix must be forwarded to a different interface from those addressed to the neighbouring prefixes, then TCAM-based routers in this subset will need to use another address line of their TCAMs.
This adds directly to the cost of thousands of routers, and directly contributes to their energy consumption. TCAM memory is often difficult or impossible to expand. In order to ensure routers can handle whatever expansion of the "global BGP routing table" may occur in their projected service life of 5 or more years, network operators typically need to pay up front for this expensive hardware in every interface, and for the router software to manage it.
If no single TCAM chip can store the required number of rules, they may be used sequentially or in parallel arrays. Both these approaches slow down processing and add power consumption.
I propose a simple implementation of the FIB, which involves the router's CPU processing the RIB to create a 4 to 9 bit word (depending on the number of interfaces the router has) for the FEC of every prefix of a certain size. For IPv4, this is the /24 prefix. 2^24 of these must be calculated and stored in one or more SRAM chips. Then, destination address bits 31 to 8 are used to drive the address pins of the SRAM, in read mode, with the output being the FEC for this packet. There are no complex algorithms for ordering rules, or requirements to move other rules as new rules are added. The most common update, which is either the withdrawal of a /24 or a change in its FEC, involves a single write operation. The hardware multiplexes access to the address lines of the SRAM so that CPU write cycles can be interspersed with read cycles for packet classification. Appendix 1 contains further low level details of how this might be done.
Here I describe a practical approach to implementing this architecture with currently available memory chips. There may be other approaches, including extending existing router architectures with additional SRAM to achieve the same functionality. In practice, this architectural block would be part of the larger FIB function, with packets first being handled by the SRAM system. Address ranges for packets which require further processing by TCAM or other techniques will have the SRAM data for the /24 prefixes in each range set to a value which selects this further processing.
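The lookup and update operations just described can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware design: it stores one byte per /24 for clarity rather than packing 4 or 9 bit values, and the names lookup and set_prefix are hypothetical.

```python
import ipaddress

NUM_SLASH24 = 1 << 24            # one entry for every IPv4 /24
fib = bytearray(NUM_SLASH24)     # all entries start at 0

def lookup(fib, dest_addr):
    """Return the FEC for a 32 bit destination address.

    The top 24 bits (bits 31 to 8) index the table directly.
    """
    return fib[dest_addr >> 8]

def set_prefix(fib, prefix_addr, prefix_len, fec):
    """Write fec to every /24 covered by the prefix; return the write count."""
    if not 0 <= prefix_len <= 24:
        raise ValueError("only prefixes of /24 or shorter are stored here")
    count = 1 << (24 - prefix_len)
    first = (prefix_addr >> 8) & ~(count - 1)   # align to the prefix boundary
    fib[first:first + count] = bytes([fec]) * count
    return count

def addr(text):
    """Convert dotted-quad text to a 32 bit integer."""
    return int(ipaddress.IPv4Address(text))

set_prefix(fib, addr("10.0.0.0"), 8, 5)    # a /8 touches 65536 entries
set_prefix(fib, addr("10.1.2.0"), 24, 7)   # a /24 touches exactly one
print(lookup(fib, addr("10.1.2.99")), lookup(fib, addr("10.200.3.4")))
```

A more specific prefix simply overwrites the entries of the covering prefix, so in this model the RIB processing must write shorter prefixes before longer ones, or recompute the affected entries when a longer prefix is withdrawn.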
The SRAM would be driven by hard-wired, FPGA or micro-coded systems which firstly determine the nature of the packet, such as IPv4, IPv6 or MPLS. Although it would be possible to map the 1M MPLS label space into unused parts of an SRAM which is used for IPv4, MPLS requires other functions and data storage, including the storage of a 20 bit new label value to write to the packet, and possibly information about how to prioritise it in one of the potentially multiple output queues in the interface which sends it to the next hop. In this section, I will consider only IPv4 packets.
Certain conditions must be met for the system to be effective. Firstly, the vast majority of packets - essentially all user traffic packets - must have their FEC defined in a single cycle of this SRAM-based FIB. Secondly, the system must be able to cope reliably, but not necessarily as quickly, with the smaller number of packets which need to be matched to prefixes longer than /24.
The third requirement is a fast, simple method of updating the FIB when the RIB changes. The SRAM design provides this, unless a very short prefix is changed, which would require writing thousands or hundreds of thousands of locations. Fortunately, while such an extensive rewrite of the locations covered by this prefix was taking place, it could be interspersed with accesses for packet classification. During this time, some of the /24 prefixes within the larger prefix being updated would have the old FEC value and others would have the new. This would result in some packets being sent to the wrong interface, but it is the interface they were previously sent to. During the rewrite, there is no impact on packets outside the range being changed. Changing an entire /8 would take 64k cycles, which might take a few milliseconds. Worst-case TCAM updates may take this long, but the TCAM generally cannot be used while its contents are being rewritten.
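The number of locations touched by an update follows directly from the prefix length. The sketch below assumes one write per /24 and, purely for a lower bound, back-to-back writes at the 4ns cycle time quoted elsewhere in this document; writes_for_prefix is a hypothetical helper name.

```python
CYCLE_NS = 4  # assumed write cycle time, with no interleaved read cycles

def writes_for_prefix(prefix_len):
    """Number of /24 entries covered by a prefix of /24 or shorter."""
    if not 0 <= prefix_len <= 24:
        raise ValueError("prefix length must be between 0 and 24")
    return 1 << (24 - prefix_len)

for plen in (24, 16, 8):
    n = writes_for_prefix(plen)
    print(f"/{plen}: {n} writes, at least {n * CYCLE_NS / 1000:.3f} microseconds")
```

For a /8 this gives 65536 writes and roughly 0.26 milliseconds if nothing else uses the SRAM. Interspersing the writes with read cycles for packet classification stretches the elapsed time accordingly.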
In terms of simplicity, low power consumption and compact size, the SRAM approach could only be bettered, perhaps, by use of less expensive DRAM. However, DRAM cycle times are much longer than those of SRAMs, which can typically produce their read results in a nanosecond or so, and complete a read or write cycle in 4 nanoseconds. Whereas TCAMs and iterative search approaches are complex, need many optimisations and so are the subject of much academic research, there is little to write about using a simple SRAM chip except that it is a straightforward engineering solution which is easy to understand and program.
The greatest single benefit of the SRAM approach is that its performance is optimal no matter how many rules are contained in the RIB. This includes the worst-case situation of complete route disaggregation, in which packets addressed to every successive /24 are forwarded to a different interface. Implementing this architecture in all the Internet's DFZ routers would remove all hardware-based pressure to achieve route aggregation.
The sections below discuss specific hardware and BGP policy proposals for both IPv4 and IPv6.
The report and presentations from the October 2006 IAB Routing and Addressing Workshop are the best reference for the problem I am addressing [I‑D.iab‑raws‑report] (Meyers, D., “Report from the IAB Workshop on Routing and Addressing,” February 2007.) [IAB‑RAWS‑website] (Meyers, D., “IAB Workshop on Routing and Addressing - resources and presentations,” December 2006.). Below, I quote some of the key statements of the report and presentations.
From the report: "While several scalability features of the routing and addressing systems were discussed, most related to the size of the DFZ routing table (frequently referred to as the Routing Information Base, or RIB) and its implications. Those implications included (but were not limited to) the sizes of the DFZ RIB and FIB (the Forwarding Information Base), the cost of recomputing the FIB, concerns about the BGP convergence times in the presence of growing RIB and FIB sizes, and the costs and power (and hence heat dissipation) properties of the hardware needed to route traffic in the core of the Internet."
Yakov Rekhter's "Rekhter's Law" was cited as one of the fundamental assumptions underlying the scalability of routing systems: "Addressing can follow topology or topology can follow addressing. Choose one." I can find no mention of new hardware FIB designs which are not impacted by the route disaggregation which "Rekhter's Law" is intended to prevent. However, this assumption of the apparent futility of hoping for such an approach is noted in the following paragraph:
"A refinement to Rekhter's Law, then, is that for a routing system to scale, the locator part of IP address must be assigned in such a way that it is congruent with the Internet's topology. However, as identifiers are typically assigned based upon organizational (not topological) structure and have stability as a desirable property, a 'natural incongruence' arises. As a result, it is difficult (if not impossible) to make a single number space serve both purposes efficiently. Of course this conclusion assumes, as mentioned above, that no effective 'non-topological routing system' exists."
The purpose of this Internet Draft is to suggest that a simple hardware forwarding system does exist which is not impacted by address assignments which have no correlation with topology.
Regarding IPv6: "The primary issue with IPv6 deployment was that, in the absence of a scalable routing strategy, IPv6 has the potential to exacerbate today's problems simply by the virtue of its much larger address space." and "Thus the opportunity exists to create a "swamp" (unaggregatable address space) that can be many orders of magnitude larger than what we faced with IPv4."
Regarding the impact of activities such as multihoming by ISPs and end-users on the costs of purchasing and running transit and border routers, "the workshop participants felt that the costs and benefits in today's routing system are misaligned. While the IETF does not typically consider the "business model" impacts various technology choices directly, many participants felt that perhaps the time has come to review that philosophy." The high cost of renumbering an end-user network was acknowledged together with the observation that "no strong disincentive exists to discourage the increasing use of Provider Independent address space".
Multihoming for end-user organisations was recognised as being "in some circumstances, mandatory due to contract or law." Uses of Traffic Engineering were recognised as being mandatory - for goals including load balancing, low-cost path selection, maintaining peering agreements and for ensuring that packets must follow, or not follow, certain paths. There is also a statement to the effect that ARIN has been allocating Provider Independent IPv6 /48 prefixes for end users, but my understanding of the policy statements at the ARIN site is that this is only for "infrastructure providers", such as Internet exchanges.
Section 2.2 of the report discusses how the two-layer domain name / IP address split has become overloaded with functions, since the IP address which rightfully specifies "where" (in contrast to the DNS text name's "who") is no longer a direct function of "where" the node is within the network's topology. There is a review of various approaches to inserting a third layer into IP protocols, where a middle level, reasonably stable "locator" is mapped in real-time to one or more relatively unstable "identifiers", enabling higher level protocols to function as usual while the physical location of nodes changes.
The first part of David Wheeler's aphorism is cited: "There is no problem in computer science that cannot be solved by an extra level of indirection,", but not the second: "but that usually will create another problem."
There was considerable debate about the ability of Moore's Law to keep up with the growth in the size of the global BGP routing table. Concerns were raised about the chips used in high-end routers being low volume devices with very high design costs, which only benefited marginally from the spectacular leading edge of semiconductor development which is focused on mass market CPUs and memory.
The report's paragraphs on heating and power appear in full below:
"Transistors consume power both when idle ("leakage current") and when switching. The smaller the transistors, the larger the leakage current. The overall power consumption is not linear with the density increase. Thus, as the need for more powerful routers increases, cooling technology grows more taxed. At present, the existing air cooling system is starting to be a limiting factor for scaling high-performance routers.
"A key metric for system evaluation is now the unit of forwarding bandwidth per Watt-- [(Mb/s)/W]. About 60% of the power goes to the forwarding engine circuits, with the rest divided between the memories, route processors, and interconnect. Using parallelization to achieve higher bandwidths can aggravate the situation, due to increased power and cooling demands.
"[Editor's note: Many in the community have commented that heat power utilization and the attendant heat dissipation, along with size limitations of fabrication processes are the current limiting factors.]"
I note that a 2006 report [Gartner] (Mingay, S., “The IT Industry Is Part of the Climate Change and Sustainability Problem,” November 2006.) states "Gartner roughly estimates that during operation, today's servers and PCs account for about 0.75% of global carbon dioxide emissions (based on direct power consumption, not including cooling)."
In Section 8, Criteria for Solution Development, the report states: "In the routing system itself, the solutions must allow incremental changes from the current operational Internet. The solutions should be backward compatible with the routing protocols in use today, including BGP, OSPF, IS-IS, and others, possibly with incremental enhancements. The data path should support IPv4 and IPv6."
I believe this SRAM-based FIB proposal for IPv4 meets all these criteria. For IPv6, some changes will be necessary. I initially discuss a complete reallocation of current global unicast space into a smaller prefix. However, a more attractive alternative may be to define a smaller prefix for SRAM-based forwarding, which still has a vast addressing capacity, either inside or outside the currently allocated 2000::/3 prefix. Provider Independent use of addresses in that prefix, and of all IPv4 addresses, could then proceed as long as the longest advertised prefixes remained within the defined limits: /24 for IPv4 and perhaps /35 for IPv6.
For both IPv4 and IPv6, the success of these proposals depends on all transit routers and multihomed border routers being able to handle the increased number of BGP advertised routes (without any requirement for route aggregation) which the proposal enables and encourages. I expect this would occur over three to five years, as part of the natural replacement cycle of routers, provided that a new standard for routers can be devised, based on stable, agreed, administrative limits to the bits which vary in the inter-AS user traffic of the Internet. The creation of such a standard, the growing deployment of SRAM-FIB-equipped routers and the freedom from constraints regarding route aggregation would lead to an increased rate of growth in the total number of advertised prefixes, above that which would occur without these proposals. This would hasten the pressure to deploy new routers, as older ones with conventional FIB technologies reach the limits of their capacity. These proposals are disruptive in the sense that they would accelerate growth in the number of advertised routes, requiring earlier replacement of non-SRAM-equipped routers - but this is still an incremental change.
I expect the SRAM IPv4 FIB approach would naturally be implemented by router manufacturers without any work prompted by this Internet Draft. My aim is to bring this development forward for both IPv4 and IPv6 in a standardised manner, by prompting discussion and hopefully agreement on the BGP and address assignment policies for both protocols. I envisage the IPv6 solution involving either a compact reallocation of the global unicast address space before it is more widely used, or the long-term growth of most IPv6 traffic occurring between addresses within a standardised prefix for which the next generation of routers will perform SRAM-based forwarding.
In this section, I propose a hardware design for an IPv4 FIB function, based on a specific already mass-produced Static RAM (SRAM) device, the 4ns cycle time, 8M x 9 bit, Samsung K7R640982M [Samsung SRAM] (Samsung, “72Mb QDRII data sheets,” February 2006.). While I will not discuss all technical details of this device, I will provide enough to enable readers to envisage the physical implementation of the system I am proposing. There are a number of other devices, including those of other manufacturers, which could provide the same functions, but this particular device is currently the best for explaining the design. My aim is to show that this solution is practical and elegant. By the time the system is built into routers, there are likely to be further choices in how to implement it.
The K7R640982M is part of a "Quad Data Rate II" (QDRII) family of devices. The electrical and physical specifications for this family are standardised by the Quad Data Rate Consortium [QDR Consortium] (QDR Consortium, “QDR Consortium,” March 2007.), of which Samsung is a member. None of the other members - Cypress, Renesas, IDT and NEC - currently make a device with the K7R640982M's features: 72 megabits with 9 bit wide data inputs and outputs. Other family members, such as with 18 bit inputs and outputs, could be used for the FIB system, but the 9 bit device is probably most convenient.
The SRAM measures 17 x 15 x 1.3mm when soldered flat to the printed circuit board via its 165 solder balls. The price I was quoted by Samsung, in February 2007, for sample quantities was USD$70. The maximum power dissipation is 1.45 watts, but actual power dissipation is likely to be less, since even with a 40Gbps stream of packets, the device will not be running at its maximum 250 million memory cycles per second. A single such device holds a 4 bit FEC value for every unique IPv4 /24 prefix.
The device has 9 data in and 9 data out pins. This separation of input and output pins is convenient since the router's CPU only needs to write (except for memory test purposes) and the FIB function only needs to read. I will present the device as if it has 23 address pins, but in fact it has 22, corresponding to A22 through A1. A0 is generated internally, in each cycle, and two 9 bit bytes are read on every 4ns read cycle. I will not discuss the straightforward low-level arrangements for using this 9 bit plus 9 bit read or write memory cycle and will portray it as a simple "8M x 9 bit" SRAM chip, with 23 address lines, reading or writing 9 bits of data.
The address pins of the SRAM system need to be driven by either the FIB hardware - by bits 31 through 8 of an IPv4 packet's destination address - or by an address presented by the router's CPU. At first, I will describe the FIB read cycle for a router with up to 14 interfaces. The FIB function is implemented identically on every interface, with the same data being written to each SRAM, assuming there is no need for different interfaces to forward IPv4 packets differently. Relatively straightforward hardware detects the IPv4 packet, switches its address bits to the SRAM, collects either the low 4 bits of the 9 bit read operation or the high 4 bits (one bit is unused in this design) and then uses those four bits to determine where to switch the packet to for forwarding. Thus, for each of the 16,777,216 /24 prefixes, the SRAM reads out a specific value of FEC.
Table 1 shows an example of the meanings of the values of a 4 bit FEC.
Table 1: Example of functions of 4 bit FEC values

| FEC value | Action |
|---|---|
| 0 | Drop packet |
| 1 | Analyse packet by other means |
| 2 | Forward packet to interface 0 |
| 3 | Forward packet to interface 1 |
| ... | ... |
| 14 | Forward packet to interface 12 |
| 15 | Forward packet to interface 13 |
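The mapping in Table 1 amounts to two reserved values followed by an offset of two. A small illustrative decoder, with the hypothetical name decode_fec and action labels chosen only for this sketch, might look like this:

```python
def decode_fec(fec):
    """Map a 4 bit FEC value to an (action, interface) pair as in Table 1."""
    if not 0 <= fec <= 15:
        raise ValueError("a 4 bit FEC value must be between 0 and 15")
    if fec == 0:
        return ("drop", None)
    if fec == 1:
        return ("analyse", None)       # hand over to TCAM or other techniques
    return ("forward", fec - 2)        # values 2 to 15 select interfaces 0 to 13

print(decode_fec(2), decode_fec(15))
```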
Where the router has more than 14 interfaces, or where there is a requirement to use more of these values to select alternative methods of processing, the next obvious option with currently available memory chips is to use 9 bits. This provides for up to 510 output interfaces and requires two of the currently available 8M x 9 bit chips, with a little hardware to select one or the other as the active device.
It may be attractive to fix the meaning of "2" to be "forward the packet by this interface", but that would require different data to be written into the SRAMs of each of the router's interfaces. Probably, the writing would be done by the interface's CPU rather than a central CPU, to allow full flexibility.
For a "small" router - one with 14 or fewer interfaces - A31 to A9 of the packet's destination address drive the SRAM itself (A22 through A0) and A8 of the destination address is used by the hardware to select the high or low 4 bit nybble from the 9 read data pins. For a large router, two chips would be needed, in parallel, with packet address bit A8 selecting one chip or the other to perform a read operation. This yields a 9 bit FEC value.
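The address split for the small router can be sketched as below. The text does not say which nybble corresponds to which value of A8, so this sketch assumes that A8 = 0 selects the low nybble. Since the ninth data bit is unused, each SRAM word is modelled as one byte. All function names are hypothetical.

```python
SRAM_WORDS = 1 << 23               # the simplified "8M x 9 bit" view of the chip
sram = bytearray(SRAM_WORDS)       # two 4 bit FEC values per word

def sram_address(dest_addr):
    """Destination address bits 31 to 9 become SRAM address bits 22 to 0."""
    return dest_addr >> 9

def nybble_select(dest_addr):
    """Destination address bit 8 chooses the nybble (assumed: 0 = low, 1 = high)."""
    return (dest_addr >> 8) & 1

def read_fec(sram, dest_addr):
    word = sram[sram_address(dest_addr)]
    return (word >> (4 * nybble_select(dest_addr))) & 0xF

def write_fec(sram, dest_addr, fec):
    """Update one /24 entry, preserving the other nybble in the same word."""
    index = sram_address(dest_addr)
    shift = 4 * nybble_select(dest_addr)
    keep = ~(0xF << shift) & 0xFF
    sram[index] = (sram[index] & keep) | ((fec & 0xF) << shift)

write_fec(sram, 0xC0000200, 3)     # 192.0.2.0/24, where A8 = 0
write_fec(sram, 0xC0000300, 9)     # 192.0.3.0/24, where A8 = 1, same SRAM word
print(read_fec(sram, 0xC000024D), read_fec(sram, 0xC0000301))
```

In this model a write is a read-modify-write of the shared word. The real device's paired 9 bit transfers, which this document sets aside, would determine how that is done in practice.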
If every packet was known to be an IPv4 packet and all prefixes in the routing table were known to be /24 or shorter, then the above arrangement would be sufficient. In practice, some elaborations would be necessary.
Firstly, hardware would probably detect packets addressed to 224.0.0.0/4 (multicast) or 240.0.0.0/4 (reserved), although the latter range is a candidate for global routing in the future.
Secondly, the router needs to be able to cope quickly with the main volume of AS to AS traffic, which will be forwarded according to prefixes of /24 to /8, while also being capable of correctly forwarding packets which match longer prefixes. This can be achieved by making the SRAM FIB the first stage for all IPv4 packets, with a packet being sent to the TCAM section if its FEC result is 1, meaning "Analyse packet by other means". For instance, if any /24 prefix includes a longer prefix which has a different FEC from the rest of the /24, then all packets addressed to this /24's address range should be analysed by the conventional packet classification system, based on TCAM or other techniques.
The routing tables for transit routers (and, I assume, multihomed border routers), such as that prepared daily by Geoff Huston [GIH BGP prefixes] (Huston, G., “Geoff Huston's BGP prefixes.txt,” March 2007.), have a small number of prefixes longer than /24. These prefixes are analysed and listed separately, for each prefix length, at [RW BGP prefixes analysis] (Whittle, R., “Probing the density of ping-responsive-hosts in each /8 IPv4 prefix and in different sizes of BGP advertised prefix,” March 2007.). These are presumably routes for connecting to routers themselves, rather than for handling Internet user traffic. The small number of routes of this nature and the relatively small traffic volumes, primarily BGP updates, carried on these routes, should not tax the storage capacity or the speed of TCAM or of other approaches to forwarding.
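One way the router's CPU might decide the stored value for a single /24 is sketched below. It is an illustration under assumptions, not a prescribed algorithm: routes are given as (network, FEC) pairs, a /24 with no covering route is given the value 0, and the name fec_for_slash24 is hypothetical.

```python
import ipaddress

DROP = 0
ANALYSE = 1   # "Analyse packet by other means"

def fec_for_slash24(slash24, routes):
    """Return the SRAM value for one /24, given (IPv4Network, fec) routes."""
    covering = [(net, fec) for net, fec in routes
                if net.prefixlen <= 24 and slash24.subnet_of(net)]
    inside = [fec for net, fec in routes
              if net.prefixlen > 24 and net.subnet_of(slash24)]
    # The longest covering prefix of /24 or shorter gives the base FEC.
    base = max(covering, key=lambda r: r[0].prefixlen)[1] if covering else DROP
    # Any longer prefix with a different FEC sends the whole /24 to stage two.
    if any(fec != base for fec in inside):
        return ANALYSE
    return base

net = ipaddress.ip_network
routes = [
    (net("198.51.100.0/22"), 4),
    (net("198.51.100.0/24"), 6),
    (net("198.51.101.128/25"), 9),
    (net("198.51.102.0/26"), 4),
]
for third_octet in (100, 101, 102, 103):
    block = net(f"198.51.{third_octet}.0/24")
    print(block, fec_for_slash24(block, routes))
```

Here 198.51.101.0/24 is marked for further analysis because its /25 has a different FEC, while 198.51.102.0/24 keeps the direct value because its /26 forwards the same way as the covering /22.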
This proposal requires no change in IPv4 usage or in the current policy of accepting /24 prefixes into the global BGP routing system and rejecting longer prefixes. As far as I know, this is not a formal policy, but it is widely adopted by network operators. (There is provision for RIRs requiring limits on the size of prefixes added to routing tables in section 4.5 of [RFC3177] (IAB and IESG, “IAB/IESG Recommendations on IPv6 Address Allocations to Sites,” September 2001.).) Ideally, to give router manufacturers and purchasers confidence, the proposed SRAM FIB approach, or its functional equivalent, would be standardised together with some kind of standard or agreement that this /24 limit would be retained for a long time, such as fifteen years or more.
This proposal would be more attractive if it was accompanied by broad agreement about how IPv4 addresses are to be allocated to ISPs and other users with Autonomous Systems, particularly regarding what, if any, expectations there would be regarding how the ISPs and other users would split and separately advertise their address space. The proposal is intended to facilitate an explosion in the number of IPv4 BGP routes, to enable a greater number of users to make more efficient use of their address space. However, this will only be practical after a number of years in which the SRAM-equipped routers replace those which cannot handle the growing number of RIB entries.
If this proposal is implemented, and the expected growth of advertised BGP prefixes occurs, all participating routers will be required to handle a much greater number of routes and routing changes. It may be expected that improved router CPU speed and memory capacity will be able to cope with this, with suitable planning. However it cannot be assured that the global BGP system will remain stable and responsive enough under this increased load. The volume of data transacted as part of the BGP protocols and the time delays in transferring it are unlikely to be prohibitive. However, the time it takes for the increasing number of routers to collectively settle after a change in advertised prefixes is likely to be a major challenge.
Perhaps the regular /24 boundaries of the new hardware architecture, and the likely increased advertisement of /24 prefixes, will enable BGP communication to be optimised in terms of network efficiency or to achieve faster convergence and greater stability.
If this proposal is implemented, there will probably need to be changes to the BGP protocol and to administrative standards for advertising prefixes (there are few, if any, at present) in order to cope with the greater tasks imposed on the global BGP routing system. At present, a great deal of the BGP protocol traffic concerns long prefixes (small numbers of addresses) which change relatively often. This is documented in the "Adds and Wdls per Prefix Length" section of Geoff Huston's CIDR report [GIH CIDR Report] (Huston, G., “CIDR Report,” March 2007.).
If there were disincentives to such rapid changes in advertised routes - implemented in the configuration of routers, in RIR guidelines and rules, and/or in the BGP protocol itself - then the changes which are advertised would be more constrained to those which are "most necessary". This value judgement needs to be made according to the interests of the majority of operators of DFZ routers, who directly bear the burden of each BGP change. An example of BGP changes which might be deemed unreasonable in this framework is the multiple changes per day generated by an ISP who uses BGP multihoming to achieve traffic engineering, for instance for load balancing as traffic moves from a business-hours pattern to a residential pattern.
Disincentives could be imposed on the ISPs and other AS users who make changes which are deemed to be excessive. For instance, if a user often changed how they advertised a prefix - whether deliberately or due to instability in their network, then beyond some kind of limit, this prefix would have longer periods of unreachability due to some or many routers giving this prefix's changes a low priority. I understand that this is already implemented within many routers, primarily to improve stability by not propagating fluctuating changes too rapidly.
For the sake of discussion, if the proposed changes are accepted as desirable, we might ask how they could be improved or extended, either initially or at some time in the future.
The most obvious option would be to double the SRAM requirement in each interface of each DFZ router and extend the BGP prefix length limit to /25. This would have advantages, including enabling a finer use of IP address space, such as allowing ISPs and AS users who only require a handful of IP addresses to be multihomed to do so with smaller allocations of address space.
It is my impression [RW BGP prefixes analysis] (Whittle, R., “Probing the density of ping-responsive-hosts in each /8 IPv4 prefix and in different sizes of BGP advertised prefix,” March 2007.) that there is so much unused, or very sparsely used, IPv4 address space at present that the changes proposed in this Internet Draft would be sufficient to enable much better use of address space. Assuming IPv4 is widely used in the decades to come, a long-range plan might be made in the future to allow advertisement of /25 prefixes - once the cost and power dissipation of the required SRAM is much lower than it is today.
Current IPv6 address management policy [IPv6‑Policies] (IANA, “IPv6 Allocation and Assignment Policy,” June 2005.) [RFC3177] (IAB and IESG, “IAB/IESG Recommendations on IPv6 Address Allocations to Sites,” September 2001.) [RFC4291] (Hinden, R. and S. Deering, “IP Version 6 Addressing Architecture,” February 2006.) provides for allocation of global unicast addresses within the prefix 2000::/3, which fixes the most significant three bits of the address to "001". In general, a /48 prefix will be assigned to each end user. 45 bits - 124 to 80 inclusive - vary in this scheme. If the longest prefix admitted to the global BGP IPv6 routing system is a /32 (as, I think, is current practice) then this still requires routers to classify packets based on 29 bits - bits 124 to 96 inclusive.
In principle an SRAM system could be used to map these 29 bits directly to four or nine bits of FEC data. However, for a small router (one with up to 14 interfaces) this would require 32 72Mb SRAM chips. This is impractical in terms of cost, space and power consumption, unless perhaps the one set could be used for the entire router, rather than on each interface. I think it is unlikely that this amount of RAM would be practical to install in each interface of tens of thousands of routers in the next ten years - or perhaps at any time in the future.
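The chip count above can be checked with a little arithmetic. The following is only a sketch: it assumes a 4-bit FEC and treats each 72 Mbit chip as 16,777,216 four-bit words, as this draft does elsewhere, and the function name is hypothetical.

```python
# A 72 Mbit chip treated as 16,777,216 four-bit words (figure from this draft).
WORDS_PER_72MBIT_CHIP = 2 ** 24

def chips_for_flat_table(index_bits: int) -> int:
    """72 Mbit chips needed to map 2**index_bits prefixes to a 4-bit FEC."""
    entries = 2 ** index_bits
    return -(-entries // WORDS_PER_72MBIT_CHIP)  # ceiling division

# 29 variable bits: 2000::/3 mapped at /32 granularity.
print(chips_for_flat_table(29))
# 24 variable bits: the IPv4 case, mapped at /24 granularity.
print(chips_for_flat_table(24))
```

Each extra index bit doubles the chip count, which is why narrowing the range of addresses that must be mapped matters so much.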
This section suggests a complete reallocation of the current 2000::/3 prefix of global unicast addresses. The following section suggests a less disruptive and more attractive alternative: maintaining current allocations, by defining a small prefix to be handled with SRAM-based forwarding, either within or outside 2000::/3. Further sections below discuss the use of smaller amounts of memory, initially, for the IPv6 SRAM-based FIB function, including the use of no extra memory chips by applying 2 million spare locations in the SRAM which is needed for IPv4.
Assuming the /32 limit is maintained, an SRAM-based FIB architecture would be practical if it could be known that all global unicast addresses would fall within a smaller range than is currently the case. A router would first process an IPv6 packet to determine whether its destination address was within the restricted range covered by the SRAM system, and if so use that system to determine its FEC. As with the IPv4 proposal, one value from the SRAM would cause the packet to be dropped, another would cause it to be processed by conventional (TCAM etc.) techniques and the rest of the possible values would specify which interface the packet should be forwarded to.
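The lookup just described might be sketched as follows. This is an illustration under stated assumptions, not a definitive design: the restricted range (2000::/10), the /35 granularity, the FEC values chosen for "drop" and "conventional", and all names are hypothetical, and a dictionary stands in for the SRAM.

```python
import ipaddress
from typing import Dict, Optional, Tuple

SRAM_NET = ipaddress.IPv6Network("2000::/10")  # assumed SRAM-mapped range
GRANULARITY = 35                               # assumed mapped prefix length
FEC_DROP = 0          # assumed encoding: drop the packet
FEC_CONVENTIONAL = 1  # assumed encoding: hand over to TCAM etc.

def sram_index(dst: ipaddress.IPv6Address) -> int:
    """Bits between the /10 and /35 boundaries, used as the table index."""
    index_bits = GRANULARITY - SRAM_NET.prefixlen
    return (int(dst) >> (128 - GRANULARITY)) & ((1 << index_bits) - 1)

def classify(dst_text: str, table: Dict[int, int]) -> Tuple[str, Optional[int]]:
    """Return (action, interface) for a destination address."""
    dst = ipaddress.IPv6Address(dst_text)
    if dst not in SRAM_NET:
        # Outside the restricted range: not covered by the SRAM system.
        return ("conventional", None)
    fec = table.get(sram_index(dst), FEC_DROP)  # unmapped prefixes are dropped
    if fec == FEC_DROP:
        return ("drop", None)
    if fec == FEC_CONVENTIONAL:
        return ("conventional", None)
    return ("forward", fec - 2)  # remaining 14 values name interfaces 0..13

table = {1: 5}  # the /35 with index 1 (2000:0:2000::/35) uses FEC value 5
print(classify("2000:0:2000::1", table))
print(classify("2400::1", table))
```

With a 4-bit FEC, reserving two values leaves 14 for interfaces, which matches the "small router" limit used in this draft.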
There are several options for RAM size and the range within which all global unicast addresses must be constrained to. I propose the decision be based on ensuring a long-lasting standard, with hardware implementation costs which are practical in the short term. In the event that this space becomes overly restrictive at some time in the future, such as in one or two decades, a decision could be made to double the address space and therefore the SRAM requirements for routers.
One option is to devote two 72Mb SRAMs for small routers and four for larger routers - those with between 15 and 510 interfaces - to the IPv6 global unicast FIB function. If IPv6 global unicast addresses were reallocated to 2000::/10, this would provide direct hardware support for 33,554,432 /35 prefixes, each of which provides 8192 /48 user networks. This scheme would support 4,194,304 /32 allocations, one to each ISP or AS end-user, each of which could be broken into eight independently advertisable /35 prefixes.
To halve the amount of RAM, several changes could be made to this scheme. One is halving the total global unicast space so that 2,097,152 /32 allocations could be made. This might be a reasonable choice, if it could be shown that this would probably cover demand for 15 years or so. Another approach would be to map /34 prefixes, rather than /35. This reduces the maximum number of separately advertisable subnets in each /32 from eight to four. Larger organisations would then be required to use more than one /32, for instance to robustly multihome more than one site.
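The prefix counts in these options all follow from differences in prefix length. As a rough check, with a hypothetical helper name:

```python
def subprefixes(outer_len: int, inner_len: int) -> int:
    """Number of /inner_len prefixes contained in one /outer_len prefix."""
    return 2 ** (inner_len - outer_len)

# A /10 mapped at /35 granularity (two 72 Mbit chips on a small router).
print(subprefixes(10, 35))   # SRAM locations needed
print(subprefixes(10, 32))   # /32 allocations available
print(subprefixes(32, 35))   # advertisable /35s per /32
print(subprefixes(35, 48))   # /48 user networks per /35

# Two ways to halve the RAM: halve the space (/11), or map /34s instead.
print(subprefixes(11, 35), subprefixes(11, 32))
print(subprefixes(10, 34), subprefixes(32, 34))
```

Both halving options need 16,777,216 locations. The first gives up half of the /32 allocations, and the second gives up half of the separately advertisable subnets in each /32.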
Reallocating currently allocated IPv6 global unicast addresses to a range 1/128 the size of their current spread raises some difficulties. Firstly, it would require almost all current users to renumber their networks. However, RIRs have long insisted that all users, other than large ISPs, should renumber their networks whenever they change their connection to the IPv6 Internet - so it does not seem unreasonable to expect the RIRs and ISPs to undergo a once-only renumbering at this early stage of IPv6 adoption.
The second change will need to be in the minds of users and administrators, who formerly saw the vast spaces of IPv6 as an asset. The goal of address aggregation was seen as being somewhat easier to achieve with a vast and uncluttered address space, because even a tiny fraction of the total space provides billions of public IP addresses. The trouble with this approach is that it precludes the use of SRAM based FIB techniques, leaving IPv6 routing to costly, power-hungry, unwieldy techniques such as TCAMs.
By a stroke of luck, the number of bits in play in IPv4 addressing neatly matches mid-2000s SRAM capacities. The extra 7 or so bits in flux within the current IPv6 allocations preclude the use of SRAM - the only cost-effective, power-efficient, hardware routing technology which currently seems to be available. IPv6's current address spread in bits 124 to 118 might therefore be considered harmful to the long-term routability of the Internet. Fortunately, even if these bits are fixed to zero to enable the SRAM FIB architecture, the vast capabilities of bits 0 to 92 should not result in shortages of IP addresses or inflexibility in usage for many decades, or perhaps forever.
As a less disruptive alternative to the above suggestion, it may be desirable to continue all existing global unicast allocations (to RIRs) and assignments (from RIRs to ISPs) and to define a particular prefix, such as a /10, either within or outside 2000::/3, for which future routers will use SRAM-based forwarding.
The long-term goal would remain the same: for most or all global IPv6 traffic to be handled by future routers with SRAM-based techniques, so that the space could be split and advertised down to a defined granularity, such as /35 prefixes, with fast SRAM-based forwarding and no requirement for route aggregation. In the discussion which follows, the "/10" refers to the prefix for which future high-end routers will use SRAM-based forwarding, to a granularity of (for instance) /35. A final decision on the size of both prefixes would require consideration of long-term IPv6 development goals and of the costs of SRAM and the other changes which would be required in routers.
As with the previous suggestion of complete reallocation, the definition of the /10 would need to result from agreement involving router manufacturers, network operators, RIRs and the IETF/IANA. However, since no reallocation is required, consensus may be much easier to achieve. RIRs would be able to assign address space within the /10 without regard to network topology or route aggregation. This means that space within the /10 could be allocated to RIRs without such concerns. It would be no concern if neighbouring /32 prefixes were allocated to ISPs and AS end-users in France, Nigeria, New Zealand and Ukraine. This would enable smaller allocations to RIRs, or enable RIRs to assign space to ISPs and AS end-users from a common pool.
Prefixes within the new /10 could be assigned to any ISP or end-user organisation with an AS number, on the basis that both ISPs and AS end-users could, in the long term (when suitable SRAM-FIB routers become ubiquitous), advertise their space down to /35 prefixes with complete freedom in terms of network topology. This does not necessarily mean that they would be encouraged or allowed to rapidly change these advertisements, because the load and stability problems such changes place on the global BGP routing system remain a burden on the entire network.
If the /10 was located within 2000::/3, then the administrative changes would involve a subset of the space currently being allocated to RIRs. Non-SRAM-based (hereafter referred to as "conventional") routers would handle packets addressed to the /10 using conventional TCAM etc. techniques. In the future, when the number of advertised prefixes within this /10 grows to exceed the capacity of conventional routers, then conventional routers which are still being used as transit routers or as multihomed border routers will not be able to handle all the traffic addressed to the /10. Given the slow pace of IPv6 adoption, this could be quite some time in the future - enough time for SRAM-based routers to be very widely deployed. I assume that irrespective of any initiatives resulting from this Internet Draft, SRAM-based FIB architectures will be widely implemented in high-end routers in the 2010 to 2015 timeframe, since this is the simplest, most "future-proof" method of handling the continual growth in the number of IPv4 BGP routes. It is possible that this expected change to router architecture for IPv4 would be extended to cover IPv6, probably with one or two extra SRAM chips for this purpose. (Below, I discuss handling IPv6, initially, in spare space inside the IPv4 SRAM.) However, this would probably only occur if there was a formal agreement or technical standard defining a small subset of IPv6 space where SRAM-based forwarding and freedom from route aggregation concerns would apply.
If an industry-wide agreement was reached about SRAM-based forwarding for IPv4 and IPv6 in the next few years, then a /10 inside the current 2000::/3 would be minimally disruptive. If the /10 was defined outside 2000::/3, then this would more clearly distinguish the new space, with its need for SRAM-based routers to be widely deployed once users advertise so many routes in this space that conventional routers can no longer cope. I expect conventional routers would have no difficulty handling global unicast traffic outside the current 2000::/3 prefix. At most, I imagine a firmware or configuration change would be required.
The above proposals for IPv6 are intended to cope with a very large number of ISPs and AS end-users. The amount of memory required for SRAM-based IPv4 forwarding depends on only two factors: the granularity - which I suggest will remain fixed at /24 - and the width of bits required to specify the FEC for the number of interfaces in each particular router. With IPv4, the granularity and total address range have already been defined.
Despite IPv6's long pedigree, and the destruction of end-to-end connectivity caused by the widespread deployment of NAT firewalls in IPv4, there remain profound doubts about whether IPv6 will ever be practical for most existing Internet users, whether it will be profitable for ISPs and whether its promised benefits will be worth the effort. (Perhaps, IPv6 will be so poorly adopted, and IPv4 can be kept workable for long enough, that a whole new approach to networking will be developed - with conceptual and functional improvements which would make a wholesale changeover worthwhile.) NATs will never be eliminated from IPv4 usage, and an increasing number of protocols are being equipped with their own, or separate, NAT traversal capabilities. The IT industry has a long history of failing to make non-backward-compatible transitions. I believe that SRAM-based forwarding in DFZ routers, combined with the freedom this brings for splitting and advertising smaller chunks of address space, will enable a vastly more efficient use of IPv4 address space. This may extend IPv4's useful life for decades, so there may be a relatively slow growth in IPv6 traffic volumes and number of BGP advertised prefixes.
When discussing the prospect of building significant additional hardware into each interface of every future DFZ router, it may be argued that the ten-year outlook for widespread IPv6 adoption is too uncertain to justify substantial expense. If the installation of a second 72 Mbit SRAM chip on every router interface (for "small" routers with up to 14 interfaces) - in addition to the chip required for IPv4 - could not be justified for foreseeable IPv6 traffic requirements, there is an alternative involving no extra expense. The previous suggestions for a /10 to provide 2^25 separately mapped /35 prefixes assume the use of two 72 Mbit SRAM chips per router interface. This is arguably overkill for the next decade or so, but I suggested this in the knowledge that IPv4 SRAM requirements are essentially fixed, and that it would be desirable to set a technical standard for IPv6 which would last for several decades.
The two chip IPv6 system (meaning 3 chips per interface in total for "small" routers and 6 for routers with more than 14 interfaces) applies each chip as if it had 24 address lines and 4 data lines. So each chip maps each of 16,777,216 prefixes to 16 possible values of FEC. Two chips for IPv6 could map, for instance, 33,554,432 /35s - providing 4,194,304 /32 prefixes for ISPs and AS end-users.
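A software model may make this addressing concrete. The sketch below is only illustrative: the class and function names are hypothetical, packing two 4-bit words per byte is a convenience of the model, and selecting the chip with the top bit of the 25-bit index is an assumption rather than something this draft specifies.

```python
class NibbleSram:
    """Model of one chip treated as 2**address_bits words of 4 bits each."""

    def __init__(self, address_bits: int = 24) -> None:
        self.words = 1 << address_bits
        self._cells = bytearray(self.words // 2)  # two 4-bit words per byte

    def write(self, address: int, fec: int) -> None:
        if not 0 <= fec < 16:
            raise ValueError("FEC must fit in 4 bits")
        shift = 4 * (address & 1)
        cell = self._cells[address >> 1] & ~(0xF << shift) & 0xFF
        self._cells[address >> 1] = cell | (fec << shift)

    def read(self, address: int) -> int:
        shift = 4 * (address & 1)
        return (self._cells[address >> 1] >> shift) & 0xF

def locate(index25: int) -> tuple:
    """Split a 25-bit /35 index into (chip number, 24-bit chip address)."""
    return index25 >> 24, index25 & 0xFFFFFF

chip = NibbleSram(address_bits=4)  # tiny instance, for demonstration only
chip.write(3, 9)
chip.write(2, 6)
print(chip.read(3), chip.read(2))
print(locate((1 << 24) + 5))
```

The lookup itself is a single memory read, with no comparison or search step, which is the property the proposal relies on.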
In the event that a second SRAM cannot be justified for IPv6, here is a proposal to apply the unused space in the IPv4 chip. Since IPv4 addresses above 224.0.0.0 are never likely to be used for traffic, this leaves 1/8 of the SRAM unused. This means there are 2,097,152 memory locations which can be applied to IPv6 without any additional hardware cost.
One way of using this is to retain the /35 granularity of the "two additional chips" proposal, with eight /35s for every /32 assignment to an ISP or AS end-user. This provides for 262,144 such /32 assignments within a /14 prefix. While this is 1/16 the amount of address space of the first, two-chip, proposal, it could be argued that it is an adequate number of freely advertisable prefixes to support IPv6 for the lifetime of the next generation of routers. (I expect that these proposals will mean that the current 5 year or less lifetime of high end routers will be extended to 10 or so years, as long as they are handling about the same traffic volumes.) An alternative to that just described is to fix the IPv6 SRAM-based forwarding granularity at /34, providing four /34s per /32, with 524,288 PI /32s contained in an initial /13 prefix for SRAM-based forwarding.
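One way the spare locations could be addressed is sketched below for the /35-in-a-/14 option. The draft does not specify this mapping, so the scheme shown, which places IPv6 entries directly in the locations that IPv4 /24s at 224.0.0.0 and above would occupy, is an assumption, as are the placement of the /14 at 2000::/14 and all names.

```python
import ipaddress

SLOTS = 2 ** 24                    # one SRAM location per IPv4 /24
SPARE_BASE = 224 << 16             # location of 224.0.0.0/24
SPARE_SLOTS = SLOTS - SPARE_BASE   # 2,097,152 locations unused by IPv4
IPV6_NET = ipaddress.IPv6Network("2000::/14")  # hypothetical placement
GRANULARITY = 35

def ipv4_slot(addr: str) -> int:
    """SRAM location for an IPv4 destination: its top 24 bits."""
    return int(ipaddress.IPv4Address(addr)) >> 8

def ipv6_slot(addr: str) -> int:
    """SRAM location for an IPv6 destination inside the mapped /14."""
    dst = ipaddress.IPv6Address(addr)
    if dst not in IPV6_NET:
        raise ValueError("address is outside the SRAM-mapped /14")
    # The 21 bits between the /14 and /35 boundaries select the location.
    offset = (int(dst) >> (128 - GRANULARITY)) & (SPARE_SLOTS - 1)
    return SPARE_BASE + offset

print(SPARE_SLOTS, SPARE_SLOTS // 8)    # /35 locations, /32 assignments
print(hex(ipv4_slot("223.255.255.255")), hex(ipv6_slot("2000::")))
```

Because the 21 index bits of a /35 within a /14 exactly fill the top eighth of the 24-bit location space, IPv4 and IPv6 entries never collide in this arrangement.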
This "spare-space, 2 million /35s" proposal would probably not involve reallocating current global unicast address space into the relatively small /14. It would make most sense with a /14 established either inside or beyond 2000::/3. Wherever the /14 is established, the space above it should be available for future SRAM-based IPv6 forwarding, once the exhaustion of that /14 space prompts the production of a new generation of routers with greater IPv6 capabilities - with either two 72 Mbit chips, or a single 144 Mbit chip. (These chips are part of the Quad Data Rate architecture, but are not yet in production.)
Another alternative to providing IPv6 SRAM-based forwarding at minimal expense might be to define a hardware architecture and addressing plan based on one, or even two, 72 Mbit chips ultimately being dedicated to IPv6 forwarding, but in the initial years of deployment to populate the boards of routers with less expensive 36 Mbit or 18 Mbit chips. These are surface-mount chips which are soldered in place. Unless they were mounted on plug-in boards (which are costly, bulky and introduce electrical and reliability concerns) it is not practical to design routers with plug-in SRAM for this FIB function. These chips are all "pin-compatible", so the circuit board design of the router's interfaces remains the same. The actual amount of memory installed is determined at the printed circuit board assembly stage.
An 18 Mbit chip would map 4,194,304 prefixes, and could be combined with the 2,097,152 prefixes which can be mapped using the spare space in the IPv4 SRAM chip. Thus an 18 Mbit chip, together with the spare space, would provide 6,291,456 /35s, for instance to provide eight /35s for 786,432 ISPs and AS end-users, each using a /32.
It would be straightforward to devise an addressing plan with several stages of expansion for IPv6. Stage 0 would be for SRAM mapping of 2,097,152 /35 prefixes (or whatever granularity was chosen) - using spare space in the IPv4 SRAM. Stage 1 would use an 18 Mbit SRAM (or a 36 Mbit SRAM for larger routers) to map a total of 6,291,456 prefixes. Stage 2 would use a 36 Mbit SRAM to map 10,485,760 prefixes, and so on. By the time this was necessary, chip costs would probably be lower still and a single 144 Mbit chip would likely be installed, mapping 16,777,216 prefixes, or 18,874,368 if the spare space in the IPv4 half of the chip was also used.
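The stage capacities can be tabulated with a short calculation. This is a sketch only: it takes the figure of 4,194,304 prefixes per 18 Mbit of dedicated SRAM and scales it linearly, assumes the small-router (4-bit FEC) case and /35 granularity, and uses hypothetical names. The 72 Mbit row corresponds to the IPv6 half of a single 144 Mbit chip.

```python
SPARE_SLOTS = 2 ** 21        # spare locations in the IPv4 SRAM
SLOTS_PER_18_MBIT = 2 ** 22  # an 18 Mbit chip maps 4,194,304 prefixes

def ipv6_slots(dedicated_mbit: int) -> int:
    """Total /35 locations: spare IPv4 space plus dedicated IPv6 SRAM."""
    return SPARE_SLOTS + (dedicated_mbit // 18) * SLOTS_PER_18_MBIT

for mbit in (0, 18, 36, 72):
    total = ipv6_slots(mbit)
    # Columns: dedicated Mbit, /35 prefixes mapped, /32 assignments of eight /35s.
    print(mbit, total, total // 8)
```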
Routers could be designed with a place for an IPv6 SRAM, beside the 72 Mbit SRAM for IPv4. This half postage stamp sized space of 18 x 16mm would remain empty until such time as new routers were expected to need to cope with traffic addressed to beyond the range of Stage 0. It may seem strange to base the architecture and administration of the Internet on the physical characteristics and costs of particular memory chips, but I believe that unless this is done, the Internet will become unroutable except through inefficient and uncoordinated application of expensive electronics, with the costs being passed on to all Internet users.
None.
If this proposal, or something like it, is adopted for IPv6, then significant changes will need to be made to the IPv6 Address Allocation and Assignment Policy [IPv6‑Policies] (IANA, “IPv6 Allocation and Assignment Policy,” June 2005.). Section 3.4, concerning address aggregation, would no longer apply to the /10 (for instance) prefix for which routers will perform SRAM-based forwarding.
These proposals for an SRAM-based FIB architecture for IPv4 may not require any changes to Internet usage or IANA standards. However, once implemented globally, the requirement to distribute addresses hierarchically to facilitate routing scalability, as expressed in section 1.2 of [RFC2050] (Hubbard, K., Kosters, M., Conrad, D., Karrenberg, D., and J. Postel, “INTERNET REGISTRY IP ALLOCATION GUIDELINES,” November 1996.) would no longer apply. This RFC, in November 1996, anticipated improved routing technologies in the future: "In the event that routing or router technology develops to the point that adequate routing aggregation can be achieved by other means or that routers can deal with larger routing and more dynamic tables, it may be appropriate to review these constraints."
Robin Whittle
First Principles
Email: rw@firstpr.com.au
URI: http://www.firstpr.com.au/ip/
Copyright © The IETF Trust (2007).
This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided on an “AS IS” basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.
Funding for the RFC Editor function is provided by the IETF Administrative Support Activity (IASA).