Network Working Group                                             Y. Nir
Internet-Draft                                               Check Point
Intended status: Standards Track                          April 02, 2008
Expires: October 4, 2008


A Quick Crash Detection Method for IKE
draft-nir-ike-qcd-00.txt

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on October 4, 2008.

Abstract

This document describes an extension to the IKEv2 protocol that allows for faster crash recovery using a saved token.

When an IPsec tunnel between two IKEv2 implementations is disconnected due to a restart of one peer, it can take as much as several minutes for the other peer to discover that the reboot has occurred, thus delaying recovery. In this text we propose an extension to the protocol that allows for recovery within a few seconds of the reboot.



Table of Contents

1.  Introduction
    1.1.  Conventions Used in This Document
2.  RFC 4306 Crash Recovery
3.  Protocol Outline
4.  Formats and Exchanges
    4.1.  Notification Format
    4.2.  Authentication Exchange
    4.3.  Informational Exchange
5.  Token Generation and Verification
    5.1.  A Stateful Method of Token Generation
    5.2.  A Stateless Method of Token Generation
    5.3.  Token Lifetime
6.  Backup Gateways
7.  Alternative Solutions
    7.1.  Why not Save the Entire IKE SA
    7.2.  Initiating a new IKE SA
8.  Interaction with IFARE
9.  Operational Considerations
10.  Security Considerations
11.  IANA Considerations
12.  Acknowledgements
13.  Change Log
    13.1.  Changes from draft-nir-qcr-00
14.  References
    14.1.  Normative References
    14.2.  Informative References
Author's Address
Intellectual Property and Copyright Statements





1.  Introduction

IKEv2, as described in [RFC4306], has a method for recovering from a reboot of one peer. As long as traffic flows in both directions, the rebooted peer should re-establish the tunnels immediately. However, in many cases the rebooted peer is a VPN gateway that protects only servers, or else the non-rebooted peer has a dynamic IP address. In such cases, the rebooted peer will not re-establish the tunnels.

Section 2 describes the current procedure and explains why crash recovery can take up to several minutes. The method proposed here is to send a token in the IKE_AUTH exchange that establishes the tunnel. That token can be maintained on the peer in some kind of persistent storage, such as a disk or a database, and can be used to delete the IKE SA on the non-rebooted peer after a crash. Deleting the IKE SA results in a quick re-establishment of the IPsec tunnel.




1.1.  Conventions Used in This Document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].




2.  RFC 4306 Crash Recovery

When one peer reboots, the other peer does not get any notification, so IPsec traffic can still flow. The rebooted peer will not be able to decrypt it, however, and the only remedy is to send an unprotected INVALID_SPI notification as described in section 3.10.1 of [RFC4306]. That section also describes the processing of such a notification: "If this Informational Message is sent outside the context of an IKE_SA, it should be used by the recipient only as a "hint" that something might be wrong (because it could easily be forged)."

Since the INVALID_SPI can only be used as a hint, the non-rebooted peer has to determine whether the IPsec SA, and indeed the parent IKE SA, are still valid. The method of doing this is described in section 2.4 of [RFC4306]. This method, called "liveness check", involves sending a protected empty INFORMATIONAL message and awaiting a response. This procedure is sometimes referred to as "Dead Peer Detection" or DPD.

Section 2.4 does not mandate how many times the INFORMATIONAL message should be retransmitted, or for how long, but does recommend the following: "It is suggested that messages be retransmitted at least a dozen times over a period of at least several minutes before giving up on an SA". Clearly, implementations differ, but all will take a significant amount of time.




3.  Protocol Outline

Supporting implementations will send a notification, called a "QCD token", as described in Section 4.1, in the last packets of the IKE_AUTH exchange. These are the final request and final response that contain the AUTH payloads. The generation of these tokens is a local matter for implementations, but considerations are described in Section 5. Implementations that send such a token will be called "token makers".

A supporting implementation receiving such a token SHOULD store it in such a way that it will survive a reboot. If the implementation is part of a configuration where there is a backup gateway as described in Section 6 (such configurations are often referred to as high-availability), then the persistent storage module SHOULD be accessible to all implementations within the configuration. An implementation supporting this part of the protocol will be called a "token taker".

When a token taker receives a protected IKE request message with unknown IKE SPIs, it MUST scan its saved token store. If a token matching the IKE SPIs is found, it SHOULD be sent to the requesting peer in an unprotected IKE message as described in Section 4.3.
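The token taker behavior above can be sketched as follows. This is only an illustration: the class name, the table layout, and the use of SQLite as the persistent storage are assumptions of the sketch, not requirements of this document.

```python
import sqlite3

class TokenStore:
    """Token taker storage keyed by the IKE SPI pair (hypothetical schema)."""

    def __init__(self, path):
        # A file path lets the store survive a reboot; ":memory:" does not.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tokens ("
            "spi_i BLOB, spi_r BLOB, token BLOB, PRIMARY KEY (spi_i, spi_r))")
        self.db.commit()

    def save(self, spi_i, spi_r, token):
        # Called when a QCD token arrives in the IKE_AUTH exchange.
        self.db.execute("INSERT OR REPLACE INTO tokens VALUES (?, ?, ?)",
                        (spi_i, spi_r, token))
        self.db.commit()

    def delete(self, spi_i, spi_r):
        # Called when the IKE SA is deleted or expires.
        self.db.execute("DELETE FROM tokens WHERE spi_i = ? AND spi_r = ?",
                        (spi_i, spi_r))
        self.db.commit()

    def lookup(self, spi_i, spi_r):
        row = self.db.execute(
            "SELECT token FROM tokens WHERE spi_i = ? AND spi_r = ?",
            (spi_i, spi_r)).fetchone()
        return row[0] if row else None


def handle_unknown_ike_sa(store, spi_i, spi_r):
    """Return the token to send unprotected, or None if no token is stored."""
    return store.lookup(spi_i, spi_r)
```

When the lookup returns None, the implementation would presumably fall back to the ordinary RFC 4306 handling of a request with unknown SPIs.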

When a token maker receives the QCD token in an unprotected notification, it MUST verify that the TOKEN_SECRET_DATA field is associated with the IKE SPIs in the IKE_SPI fields of the IKE packet. If the verification fails, it SHOULD log the event. If it succeeds, it MUST delete the IKE SA associated with the IKE_SPI fields, and all dependent child SAs. This event MAY also be logged. The token maker MUST accept such tokens from any address, so as to allow different kinds of high-availability configuration of the token taker.
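A minimal sketch of the token maker processing described above follows. The SA table layout (a dictionary keyed by the SPI pair, holding the issued token and the child SAs) and the function name are assumptions made for illustration. The source address is deliberately not checked, since tokens are accepted from any address.

```python
import hmac

def on_unprotected_qcd_token(sa_table, spi_i, spi_r, received_token, log=print):
    """Token maker side: delete the IKE SA only if the token matches.

    sa_table maps (SPI-I, SPI-R) to a dict with the 'token' this peer issued
    and its 'child_sas'; this layout is an assumption of the sketch.
    Returns True when the SA was deleted (an empty response should follow).
    """
    sa = sa_table.get((spi_i, spi_r))
    # compare_digest avoids leaking how many leading octets matched.
    if sa is None or not hmac.compare_digest(sa["token"], received_token):
        log("QCD token verification failed for SPIs %s/%s"
            % (spi_i.hex(), spi_r.hex()))
        return False              # no response is sent for an invalid token
    sa["child_sas"].clear()       # dependent child SAs are removed first
    del sa_table[(spi_i, spi_r)]  # then the IKE SA itself
    return True
```

Because the SA is removed on the first valid token, a replayed copy of the same notification finds no matching SA and is simply rejected.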

A supporting implementation MAY immediately create new SAs using an Initial exchange, or it may wait for subsequent traffic to trigger the creation of new SAs.

There is ongoing work on IKEv2 Session Resumption [resumption]. See Section 8 for a short discussion about this protocol's interaction with session resumption.




4.  Formats and Exchanges




4.1.  Notification Format

The notification payload called "QCD token" is formatted as follows:

                           1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      ! Next Payload  !C!  RESERVED   !         Payload Length        !
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      !  Protocol ID  !   SPI Size    ! QCD Token Notify Message Type !
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      !                                                               !
      ~                       TOKEN_SECRET_DATA                       ~
      !                                                               !
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
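
As an illustration of this layout, the following sketch packs and parses the notification. The notify message type value is a placeholder, since no value has been assigned. The default Protocol ID and the SPI Size of zero (meaning no SPI field precedes TOKEN_SECRET_DATA) are assumptions of the sketch, as are the function names.

```python
import struct

QCD_TOKEN_NOTIFY_TYPE = 65000  # hypothetical placeholder, not an assigned value

# Next Payload, C/RESERVED, Payload Length, Protocol ID, SPI Size, Message Type
_HEADER = struct.Struct("!BBHBBH")  # 8 octets before TOKEN_SECRET_DATA

def encode_qcd_token(token, next_payload=0, critical=False, protocol_id=0):
    """Pack a QCD token notification with SPI Size 0 (assumed)."""
    flags = 0x80 if critical else 0x00   # C bit is the top bit of octet 2
    length = _HEADER.size + len(token)   # Payload Length includes the header
    header = _HEADER.pack(next_payload, flags, length, protocol_id, 0,
                          QCD_TOKEN_NOTIFY_TYPE)
    return header + token

def decode_qcd_token(payload):
    """Return TOKEN_SECRET_DATA, or raise ValueError on a malformed payload."""
    if len(payload) < _HEADER.size:
        raise ValueError("payload shorter than notification header")
    _next, _flags, length, _proto, spi_size, msg_type = _HEADER.unpack_from(payload)
    if length != len(payload) or spi_size != 0 or msg_type != QCD_TOKEN_NOTIFY_TYPE:
        raise ValueError("not a well-formed QCD token notification")
    return payload[_HEADER.size:]
```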




4.2.  Authentication Exchange

For clarity, only the EAP version of an AUTH exchange will be presented here. The non-EAP version is very similar. The figure below is based on appendix A.3 of [RFC4718].

   first request       --> IDi,
                           [N(INITIAL_CONTACT)],
                           [[N(HTTP_CERT_LOOKUP_SUPPORTED)], CERTREQ+],
                           [IDr],
                           [CP(CFG_REQUEST)],
                           [N(IPCOMP_SUPPORTED)+],
                           [N(USE_TRANSPORT_MODE)],
                           [N(ESP_TFC_PADDING_NOT_SUPPORTED)],
                           [N(NON_FIRST_FRAGMENTS_ALSO)],
                           SA, TSi, TSr,
                           [V+]

   first response      <-- IDr, [CERT+], AUTH,
                           EAP,
                           [V+]

                     / --> EAP
   repeat 1..N times |
                     \ <-- EAP

   last request        --> AUTH,
                           [N(QCD_TOKEN)]

   last response       <-- AUTH,
                           [N(QCD_TOKEN)],
                           [CP(CFG_REPLY)],
                           [N(IPCOMP_SUPPORTED)],
                           [N(USE_TRANSPORT_MODE)],
                           [N(ESP_TFC_PADDING_NOT_SUPPORTED)],
                           [N(NON_FIRST_FRAGMENTS_ALSO)],
                           SA, TSi, TSr,
                           [N(ADDITIONAL_TS_POSSIBLE)],
                           [V+]

Note that the QCD_TOKEN notification is marked as optional because it is not required by this specification that every implementation be both token maker and token taker. If only one peer sends the QCD token, then a reboot of the other peer will not be recoverable by this method. This may be acceptable if traffic typically originates from the other peer.

In any case, the lack of a QCD_TOKEN notification MUST NOT be taken as an indication that the peer does not support this standard. Conversely, if a peer does not understand this notification, it will simply ignore it. Therefore a peer MAY send this notification freely, even if it does not know whether the other side supports it.




4.3.  Informational Exchange

This informational exchange is unprotected, and is sent in response to a protected IKE request that uses an unknown IKE SA.

            request             --> N(QCD_TOKEN)

            response            <--

The QCD_TOKEN is the only notification in the request. Similar to the description in section 2.21 of [RFC4306], the IKE SPI and message ID fields in the packet headers are taken from the protected IKE request.

If the QCD_TOKEN verifies successfully, an empty response MUST be sent. If the QCD_TOKEN cannot be validated, a response SHOULD NOT be sent. Section 5 defines token verification.




5.  Token Generation and Verification

No token generation method is mandated by this document. Two methods are documented in Section 5.1 and Section 5.2, but they only serve as examples.

The following lists the requirements from a token generation mechanism:




5.1.  A Stateful Method of Token Generation

This describes a stateful method of generating a token:




5.2.  A Stateless Method of Token Generation

This describes a stateless method of generating a token.
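
Purely as an illustration of what "stateless" can mean here, and not as the method this section specifies, a token could be derived from the IKE SPIs and a long local secret, so that the token maker needs no per-SA token storage. The use of HMAC-SHA256, the 32-octet secret length, and the function names below are all assumptions of this sketch; QCD_SECRET is the name used in Section 10.

```python
import hashlib
import hmac
import secrets

# Assumption: a long random secret known only to the token maker.
QCD_SECRET = secrets.token_bytes(32)

def make_token(spi_i, spi_r, secret=QCD_SECRET):
    """Derive TOKEN_SECRET_DATA from the SPI pair (hypothetical construction)."""
    return hmac.new(secret, spi_i + spi_r, hashlib.sha256).digest()

def verify_token(spi_i, spi_r, token, secret=QCD_SECRET):
    """Recompute the token from the packet's SPIs and compare in constant time."""
    return hmac.compare_digest(make_token(spi_i, spi_r, secret), token)
```

With such a construction, a token for one SPI pair reveals nothing useful about tokens for other pairs as long as the secret stays unknown.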




5.3.  Token Lifetime

The token is associated with a single IKE SA, and SHOULD be deleted when the SA is deleted or expires. More formally, the token is associated with the pair (SPI-I, SPI-R).




6.  Backup Gateways

Making crash recovery quick is important, but since rebooting a gateway takes a non-zero amount of time, many implementations choose to have a stand-by gateway ready to take over as soon as the primary gateway fails for any reason.

If such a configuration is available, it is RECOMMENDED that the persistent storage be shared between the primary and backup gateway. This makes crash recovery available immediately. This recommendation is especially useful if the primary and backup gateway either share an external IP address or reside on the same LAN. If they are geographically remote, this may be less practical.




7.  Alternative Solutions




7.1.  Why not Save the Entire IKE SA

IKEv2 does not assume the existence of a persistent storage module. If we are adding such a module, why not use it to save the entire IKE SA across reboots, nullifying the need for a crash recovery procedure?

There are several reasons why we believe that this is not a good idea:

  1. A token is only 16-256 octets, and is much more compact than all the data needed to store an IKE SA.
  2. A token is valid for the life of an IKE SA. An IKE SA state is updated whenever a message is sent, because of the requirement to maintain the sequence of message IDs. It may not be acceptable to update the persistent storage whenever an IKE message is sent.
  3. A reboot is usually an unpredictable event, and as such, we cannot know how long it will last. By the time the machine has rebooted, the peer may have attempted some type of protected exchange (liveness check, create-child-SA or delete), timed out, and deleted the SA. It is far better to reboot without SAs and with only a token for quick recovery.



7.2.  Initiating a new IKE SA

Instead of sending a QCD token, we could have the rebooted implementation start an Initial exchange with the peer, including the INITIAL_CONTACT notification. This would have the same effect, instructing the peer to erase the old IKE SA, as well as establishing a new IKE SA with fewer rounds.

The disadvantage here is that in IKEv2 an authentication exchange MUST have a piggy-backed Child SA set up. Since our use case is such that the rebooted implementation does not have traffic flowing to the peer, there are no good selectors for such a Child SA.

Additionally, when authentication is asymmetric, such as when EAP is used, it is not possible for the rebooted implementation to initiate IKE.




8.  Interaction with IFARE

IFARE, specified in [resumption], proposes to make setting up a new IKE SA consume fewer computing resources. This is particularly useful in the case of a remote access gateway that has many tunnels. A failure of such a gateway would require all these many remote access clients to establish an IKE SA either with the rebooted gateway or with a backup gateway. This tunnel re-establishment should occur within a short period of time, creating a burden on the remote access gateway. IFARE addresses this problem by having the clients store an encrypted derivative of the IKE SA for quick re-establishment.

What IFARE does not address is the problem of detecting that the peer gateway has failed. A failed gateway may go undetected for an unbounded amount of time, because IPsec does not have packet acknowledgement. Before establishing a new IKE SA using IFARE, a client MUST ascertain that the gateway has indeed failed. This could be done using either a liveness check (as in RFC 4306) or using the QCD tokens described in this document.

A remote access client conforming to both specifications will generate QCD tokens, and store the IFARE state, if provided by the gateway. A remote access gateway conforming to both specifications will store the QCD token sent from the client. When the gateway reboots, the client will discover this in either of two ways:

  1. The client performs regular liveness checks, or else the time for some other IKE exchange has come. If the gateway does not finish rebooting in time, the IKE exchange times out after several minutes. In this case QCD does not help.
  2. Either the primary gateway or a backup gateway (see Section 6) is ready and sends a QCD token to the client. In that case the client will quickly re-establish the IPsec tunnel, either with the rebooted primary gateway, with the backup gateway as described in this document, or with another gateway as described in [resumption].

The full combined protocol looks like this:

     Initiator                Responder
     -----------              -----------
    HDR, SAi1, KEi, Ni  -->

                        <--    HDR, SAr1, KEr, Nr, [CERTREQ]

    HDR, SK {IDi, [CERT,]
    [CERTREQ,] [IDr,]
    AUTH, N(QCD_TOKEN),
    SAi2, TSi, TSr,
    N(TICKET_REQUEST)}  -->
                        <--    HDR, SK {IDr, [CERT,] AUTH, SAr2, TSi,
                               TSr, N(TICKET_OPAQUE)
                               [,N(TICKET_GATEWAY_LIST)]}

             ---- Reboot -----

    HDR, {}             -->
                        <--  HDR, N(QCD_TOKEN)

    HDR, Ni, N(TICKET_OPAQUE),
    [N+,], SK {IDi, [IDr,]
    SAi2, TSi, TSr,
    [CP(CFG_REQUEST)]}  -->
                        <--  HDR, SK {IDr, Nr, SAr2, [TSi, TSr],
                             [CP(CFG_REPLY)]}





9.  Operational Considerations

To support the "token taker" part of this standard, an implementation needs to have access to a persistent storage module. This could be an internal hard disk, a local or remote database application, or any other method that persists across reboots. This storage module and the data links between the storage module and the IKE module must meet the performance requirements of the IKE module. The storage module MUST support insertion and deletion rates equal to peak IKE SA setup rates and it SHOULD support query rates that are fast enough.

See Section 10 for security considerations for this storage mechanism.

Throughout this document, we have referred to reboot time alternately as the time at which the implementation crashes and the time at which it is ready to process IPsec packets and IKE exchanges. Depending on the hardware and software platforms and the cause of the reboot, rebooting may take anywhere from a few seconds to several minutes. If the implementation is down for a long time, the benefits of this protocol extension are reduced. For this reason, critical systems should implement backup gateways as described in Section 6. Note that the lower-case "should" in the previous sentence is intentional, as we do not specify this in the sense of RFC 2119.

Implementing the "token taker" side of QCD makes sense for IKE implementations where protected connections originate from the peer, such as inter-domain VPNs and remote access gateways. Implementing the "token maker" side of QCD makes sense for IKE implementations that originate protected connections, such as inter-domain VPNs and remote access clients.

To clarify the requirements:

In order to limit the effects of DoS attacks, an implementation SHOULD limit the rate of queries into the token storage so as not to overload it. If an excessive number of IKE requests protected with unknown IKE SPIs arrive, the IKE module SHOULD revert to the behavior described in section 2.21 of [RFC4306] and either send an INVALID_IKE_SPI notification, or ignore them entirely.
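
One possible way to bound the query rate is a token bucket placed in front of the token storage, sketched below. The class name, the parameters, and the choice of a token bucket are assumptions for illustration; this document does not prescribe a particular rate-limiting algorithm or particular limits.

```python
import time

class QueryRateLimiter:
    """Token-bucket limiter for lookups into the QCD token storage."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)   # sustained lookups per second
        self.capacity = float(burst)      # maximum burst of lookups
        self.available = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True if a storage lookup may be performed now."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.available = min(self.capacity,
                             self.available + (now - self.last) * self.rate)
        self.last = now
        if self.available >= 1.0:
            self.available -= 1.0
            return True
        return False  # caller reverts to INVALID_IKE_SPI or ignores the request
```

A caller would consult allow() before each storage query for a request with unknown IKE SPIs, and skip the query when it returns False.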




10.  Security Considerations

Tokens MUST be hard to guess. This is critical, because if an attacker can guess the token associated with the IKE SA, she can tear down the IKE SA and associated tunnels at will. When the token is delivered in the IKE_AUTH exchange, it is encrypted. When it is sent back in an informational exchange it is not encrypted, but that is the last use of that token.

An aggregation of some tokens generated by one peer together with the related IKE SPIs MUST NOT give an attacker the ability to guess other tokens. Specifically, if one peer does not properly secure the QCD tokens and an attacker gains access to them, this attacker MUST NOT be able to guess other tokens generated by the same peer. This is the reason that the QCD_SECRET in Section 5.2 needs to be long.

The persistent storage MUST be protected from access by other parties. Anyone gaining access to the contents of the storage will be able to delete all the IKE SAs described in it.

The tokens associated with expired and deleted IKE SAs MUST be deleted from the storage, so that a future compromise of the storage does not reveal enough tokens to facilitate an attack against the QCD tokens.

The QCD token is sent by the rebooted peer in an unprotected message. Such a message is subject to modification, deletion and replay by an attacker. However, these attacks will not compromise the security of either side. Modification is meaningless because a modified token is simply an invalid token. Deletion will only cause the protocol not to work, resulting in a delay in tunnel re-establishment as described in Section 2. Replay is also meaningless, because the IKE SA has been deleted after the first transmission.




11.  IANA Considerations

IANA is requested to assign a notify message type from the error types range (43-8191) of the "IKEv2 Notify Message Types" registry with name "QUICK_CRASH_DETECTION".




12.  Acknowledgements

We would like to thank Hannes Tschofenig and Yaron Sheffer for their comments about IFARE.




13.  Change Log

This section lists all changes in this document.

NOTE TO RFC EDITOR : Please remove this section in the final RFC




13.1.  Changes from draft-nir-qcr-00




14.  References




14.1. Normative References

[RFC2119] Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” BCP 14, RFC 2119, March 1997.
[RFC4306] Kaufman, C., “Internet Key Exchange (IKEv2) Protocol,” RFC 4306, December 2005.
[RFC4718] Eronen, P. and P. Hoffman, “IKEv2 Clarifications and Implementation Guidelines,” RFC 4718, October 2006.



14.2. Informative References

[resumption] Sheffer, Y., Tschofenig, H., Dondeti, L., and V. Narayanan, “IPsec Gateway Failover Protocol,” draft-sheffer-ipsec-failover-03 (work in progress), March 2008.



Author's Address

  Yoav Nir
  Check Point Software Technologies Ltd.
  5 Hasolelim st.
  Tel Aviv 67897
  Israel
Email:  ynir@checkpoint.com



Full Copyright Statement

Intellectual Property