Internet-Draft | Abbreviated-Title | March 2022 |
Wang, et al. | Expires 8 September 2022 | [Page] |
NVMe over Fabrics defines a common architecture for carrying the NVMe block storage protocol over a range of storage networking fabrics, such as Ethernet, Fibre Channel and InfiniBand. On Ethernet-based networks, RDMA or TCP can be used to transport NVMe, but the network management mechanisms are rudimentary and fault detection is weak.¶
This document defines the architecture of the Ethernet-based NVMe control optimization technology, including service processes between hosts, storage devices and network switches, and fast fault-aware switchover.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 8 September 2022.¶
Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
For a long time, key storage applications with high performance requirements have mainly relied on FC networks. As transmission rates have increased, the storage medium has evolved from HDDs to solid-state storage, and the protocol has evolved from SCSI to NVMe. The emergence of NVMe technologies brings new opportunities.¶
Ethernet-based NVMe is the implementation of NVMe over Fabrics that best fits NVMe semantics. It surpasses FC in terms of performance, cost and network management, and it represents the development trend of future high-speed storage networks. Ethernet-based NVMe has been defined by NVM Express. The specification in this document optimizes network control in terms of ease of use, maintainability, and reliability, making Ethernet-based NVMe better suited to the high reliability requirements of key applications.¶
[ODCC-2020-05016] defines the basic specifications for NVMe over RoCEv2, and this document draws on that definition.¶
NoF : NVMe over Fabrics¶
FC : Fibre Channel¶
NVMe : Non-Volatile Memory Express¶
An Ethernet-based NVMe network mainly includes three types of roles: an initiator (referred to as a host), a switch, and a target (referred to as a storage device). Initiators and targets are also referred to as endpoint devices. Hosts and storage devices use the Ethernet-based NVMe protocol to transmit data over the network to provide high-performance storage services.¶
              +--+      +--+
Host          |H1|      |H2|
(Initiator)   +-,+      +_.+
               | `',  _-` |
               |   _-`    |
               | _-`  `', |
Ethernet     +----+    +----+
Network      | SW |    | SW |
             +---,+    +_.--+
               |  `', _-` |
               |    `',   |
               | _-`  `', |
Storage       +-`+      +`'+
(Target)      |S1|      |S2|
              +--+      +--+

         Figure 1 : Basic Model¶
This is the basic model for small-scale storage access networks. Hosts and storage devices are dual-homed to different switches.¶
After a host or a storage device is connected to a switch, it registers its information with the switch and obtains the registration information of other hosts and storage devices from that switch.¶
              +--+     +--+        +--+     +--+
Host          |H1|     |H2|        |H3|     |H4|
(Initiator)   +/-+     +-,+        +.-+     +/-+
               |        | '.      ,-`|       |
               |        |    `',     |       |
               |        | ,-`    '.  |       |
             +-\--+    +--`-+    +`'--+    +-\--+
             | SW |    | SW |    | SW |    | SW |
             +--,-+    +---,,    +,.--+    +-.--+
                 `.          `'.,`         .`
                   `.   _,-'`    ``'.,   .`
Ethernet          +--'`+             +`-`-+
Network           | SW |             | SW |
                  +--,,+             +,.,-+
                   .`    `'., ,.-``      ',
                 .`        _,-'`           `.
             +--`-+    +--'`+    `'---+    +-`'-+
             | SW |    | SW |    | SW |    | SW |
             +-.,-+    +-..-+    +-.,-+    +-_.-+
              | '.     ,-` |      | `.,     .' |
              |    `',     |      |     '.`    |
              | ,-`    '.  |      | ,-`   `',  |
Storage      +-`+        `'\+    +-`+        +`'+
(Target)     |S1|        |S2|    |S3|        |S4|
             +--+        +--+    +--+        +--+

                  Figure 2 : CLOS Model¶
This model applies to relatively large-scale storage access networks.¶
Hosts and storage nodes connect to different switch nodes and register with them. Each switch needs to flood the registration information it receives locally to the other switch nodes on the network.¶
The Ethernet-based NVMe network consists of storage devices, hosts and switches.¶
As the server side, storage devices provide storage access services for hosts. When a storage device is connected to a switch, it must register its storage service information with the switch and periodically re-announce it so that the information remains valid.¶
If the storage device is interested in information about other storage devices or hosts in the storage network, it may also receive notifications of such information from the switch.¶
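The periodic registration can be read as a lease that the switch ages out when refreshes stop arriving. The following is a minimal Python sketch under that assumption; the class name `LeaseTable`, the 30-second hold time, and the use of plain numeric timestamps are all hypothetical and are not defined by this document.

```python
class LeaseTable:
    """Switch-side view of which registrations are still valid (illustrative only)."""

    def __init__(self, hold_time=30.0):
        # hold_time is a hypothetical validity window, in seconds.
        self.hold_time = hold_time
        self.last_seen = {}  # device name -> time of the last registration

    def refresh(self, device, now):
        """Record an initial or periodic registration message from a device."""
        self.last_seen[device] = now

    def expire(self, now):
        """Drop and return the devices whose registration was not refreshed in time."""
        stale = sorted(d for d, t in self.last_seen.items()
                       if now - t > self.hold_time)
        for device in stale:
            del self.last_seen[device]
        return stale
```

In this sketch a device that keeps re-registering within the hold time stays valid, while a silent device is removed on the next call to `expire` and could then be reported to subscribers as no longer available.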
+-------+                  +------+
|Storage|                  |Switch|
+-------+                  +------+
    |      Register Msg       |
    | ----------------------->|
    |                         |
    |    Notification Msg     |
    | <-----------------------|
    |                         |
    |                         |

      Figure 3 : Storage Device¶
The host is the client of the storage device. When a host connects to a switch, it needs to register its host information with the switch and periodically publish it.¶
As the client side, a host needs to quickly obtain the service status of the storage devices that serve it. When the host receives a notification message from the switch indicating that a storage device has come online, the host may establish a connection to that storage device. When the host receives a notification message from the switch indicating that a storage device is faulty, the host needs to quickly disconnect from it and attempt to establish a connection to a redundant storage device.¶
+-------+                  +------+
| HOST  |                  |Switch|
+-------+                  +------+
    |      Register Msg       |
    | ----------------------->|
    |                         |
    |    Notification Msg     |
    | <-----------------------|
    |                         |
    |                         |

       Figure 4 : Host Device¶
Switches manage the registration information of the hosts and storage devices, and monitor the network status. Switches will synchronize this information to the other switches in the network.¶
+------+                  +------+
|Switch|                  |Switch|
+------+                  +------+
   |    Information Sync     |
   | ----------------------->|
   |                         |
   |                         |
   |                         |

     Figure 5 : Network Device¶
On an FCoE network, users can control access between nodes through zones, improving network security. Zones provide inter-domain isolation and intra-domain communication.¶
On the Ethernet-based NVMe network, an equivalent of FC zones is also needed to isolate and control services between storage devices and hosts. On this network, IP addresses are used as the unique identifiers of hosts and storage devices, and domains are attributes of IP addresses. Hosts and storage devices in the same domain can access each other. Hosts and storage devices in different domains are isolated. Each IP address needs to be assigned to one or more domains. There is also a default domain: if no isolation is required, the IP addresses of the hosts and storage devices belong to the default domain. A domain may also be called a zone.¶
_,.---.,, ,,.--.,, .'` `'., .'` `'. ,-` ,' `\ / +--------+ ,' \ +--------+`. .' |StorageA| / `, |StorageB| \ / +---,----+ / \ +-_.-----+ \ / `., / ,_-` \ ' '/ _-\ , | |`', _-` | | / / +-`-`--+ \ \ | | |Switch| | | | | +- .-,,+ | | | | ,'` | '. | | | |-` | `',| | | .'| | |., | , ,-` \ | / ', / | +-----`-+ | +---\---+ | +-`'----+ | , | HostA | \ | HostB | / | HostC | ` \ +-------+ \+-------+ ` +-------+ / \ \ / / `. \ ' / \ `, ,' ` `. Zone1 `. Zone2 ,' `'., _.-` '., _.'` `'''--''` `''--''` Figure 6 : Zone Management¶
As shown in the figure above, HostA and StorageA belong to Zone1, HostC and StorageB belong to Zone2, and HostB belongs to Zone1 and Zone2.¶
StorageA can be accessed by HostA but not by HostC. StorageB can be accessed by HostC but not by HostA. Because HostB belongs to both Zone1 and Zone2, HostB can access StorageA in Zone1 and StorageB in Zone2.¶
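The zone rule above amounts to a set-intersection test on the domains assigned to two IP addresses. The Python sketch below illustrates it with the membership from Figure 6; the addresses (taken from the documentation ranges), the table layout, the name `DEFAULT_ZONE`, and the choice to place unlisted addresses in the default domain are assumptions made for illustration only.

```python
from ipaddress import ip_address

DEFAULT_ZONE = "default"  # hypothetical name for the default domain

# Hypothetical zone table mirroring Figure 6 (documentation addresses).
ZONES = {
    ip_address("192.0.2.1"): {"Zone1"},           # HostA
    ip_address("192.0.2.2"): {"Zone1", "Zone2"},  # HostB
    ip_address("192.0.2.3"): {"Zone2"},           # HostC
    ip_address("192.0.2.11"): {"Zone1"},          # StorageA
    ip_address("192.0.2.12"): {"Zone2"},          # StorageB
}


def zones_of(addr):
    """Return the zones of an address; unlisted addresses use the default zone."""
    return ZONES.get(ip_address(addr), {DEFAULT_ZONE})


def can_access(a, b):
    """Two endpoints may communicate if they share at least one zone."""
    return bool(zones_of(a) & zones_of(b))
```

With this table, `can_access("192.0.2.2", "192.0.2.11")` and `can_access("192.0.2.2", "192.0.2.12")` both hold for HostB, while `can_access("192.0.2.3", "192.0.2.11")` does not hold for HostC and StorageA.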
The NoF network uses standard Ethernet technology, and the typical deployment model is the CLOS architecture. Network deployments typically use current IP technologies. For example, OSPF is usually deployed as the underlay protocol.¶
Hosts and storage devices are connected to the Ethernet network. The administrator assigns access IP addresses to the hosts and storage devices. In most scenarios, routes to these addresses can be advertised through the underlay protocol. In addition, after hosts and storage devices go online, they need to register their information with the switches. It is recommended that registration messages be carried using LLDP.¶
The registration information includes the IP address type, whether to subscribe to host or storage device information changes, device role, service protocol type and version number, protocol service port number, protocol identifier, etc.¶
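As an illustration of how the registration fields listed above might be grouped, the following Python sketch models one registration as a record. This document does not define an encoding, so every field name, type, and example value here is hypothetical; in particular, the address type is derived from the address itself rather than carried as a separate field.

```python
from dataclasses import dataclass
from enum import Enum
from ipaddress import ip_address


class Role(Enum):
    HOST = "host"
    STORAGE = "storage"


@dataclass(frozen=True)
class Registration:
    """One endpoint registration (field names are illustrative assumptions)."""
    address: str      # access IP address of the endpoint
    role: Role        # device role
    subscribe: bool   # whether to subscribe to host/storage information changes
    protocol: str     # service protocol type, e.g. "tcp" (hypothetical value)
    version: int      # service protocol version number
    port: int         # protocol service port number
    protocol_id: str  # protocol identifier (opaque in this sketch)

    def __post_init__(self):
        ip_address(self.address)  # raises ValueError for a malformed address
        if not 0 < self.port < 65536:
            raise ValueError("port out of range")

    @property
    def address_type(self):
        """IP address type derived from the address."""
        return f"IPv{ip_address(self.address).version}"
```

For example, `Registration("192.0.2.11", Role.STORAGE, True, "tcp", 1, 4420, "example-id")` describes a storage device that also subscribes to changes, and its `address_type` is `"IPv4"`.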
The switch receives and saves the registration information of hosts and storage devices. For a host or storage device that subscribes to host and storage device information changes, the switch also needs to advertise the collected registration information to that subscriber. The advertised information includes the device status, the reason for the device status change, and device attachment information. When advertising subscribed information, the switch must ensure that a node receives only the registration information of the domains to which it belongs. It is recommended to use a new protocol to implement this notification message.¶
Users assign domains to hosts and storage devices. The domain information must be available to all access switches on the entire storage network. The domain information can be configured on each access switch. It can also be configured on some switches and then synchronized to all other access switches throughout the storage network.¶
In addition, the registration information of local hosts and storage devices stored on each access switch needs to be synchronized across the entire switch network so that hosts and storage devices under other access switches can obtain it.¶
The synchronized host and storage device information is application-layer information. A new protocol should be defined to implement the information synchronization.¶
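The domain restriction on notifications can be sketched as a filter applied before a status change is delivered to subscribers. The Python example below is only an illustration: the class `SwitchRegistry`, its method names, the status strings, and the notification layout are hypothetical, device attachment information is left out, and notifications are generated only when a status changes.

```python
class SwitchRegistry:
    """Access-switch view of registrations and subscribers (illustrative only)."""

    def __init__(self, zones):
        self.zones = zones        # device name -> set of zone names
        self.status = {}          # device name -> "online" or "faulty"
        self.subscribers = set()  # devices that asked for change notifications

    def register(self, device, subscribe=False):
        """Store a registration and return the notifications it triggers."""
        self.status[device] = "online"
        if subscribe:
            self.subscribers.add(device)
        return self._notify(device, "online", "registered")

    def report_fault(self, device, reason):
        """Mark a device faulty and return the notifications it triggers."""
        self.status[device] = "faulty"
        return self._notify(device, "faulty", reason)

    def _notify(self, device, status, reason):
        # Only subscribers that share a zone with the device are notified.
        message = {"device": device, "status": status, "reason": reason}
        return [(sub, message) for sub in sorted(self.subscribers)
                if sub != device and self.zones[sub] & self.zones[device]]
```

With HostA in Zone1, HostC in Zone2, and both subscribed, registering StorageA in Zone1 yields a single notification addressed to HostA, and a later fault on StorageA is likewise reported to HostA only.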
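Since the synchronization protocol is left to be defined, the following Python sketch shows only the general idea of flooding a locally learned record so that every switch receives it once. The function name `flood`, the adjacency-map representation, and the example topology (two leaf switches and two spines) are assumptions for illustration and do not describe a specified mechanism.

```python
from collections import deque


def flood(topology, origin):
    """Return the switches that learn a record flooded from origin,
    in the order they first receive it (breadth-first)."""
    seen = {origin}
    order = []
    queue = deque([origin])
    while queue:
        switch = queue.popleft()
        for neighbor in topology[switch]:
            if neighbor not in seen:  # suppress duplicate deliveries
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical topology: TOR1 and TOR3 each connect to both spines.
TOPOLOGY = {
    "TOR1": ["Spine1", "Spine2"],
    "TOR3": ["Spine1", "Spine2"],
    "Spine1": ["TOR1", "TOR3"],
    "Spine2": ["TOR1", "TOR3"],
}
```

Here a record registered at TOR1 reaches both spines first and then TOR3, and TOR3 accepts it only once even though two spines forward it.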
+-------+           +----+      +------+      +----+      +-------+
| HOST  |-----------|TOR1|------|Spine1|------|TOR3|------|Storage|
+---/---+           +-/--+      +--/---+      +-/--+      +---/---+
    |---------------->|            |            |<------------|
    |  Register Msg   |----------->|<-----------| Register Msg|
    |                 |<-----------|----------->|             |
    |<----------------| Info Sync  | Info Sync  |             |
    |Notification Msg |            |            |             |
    |                 |            |            |             |

               Figure 7 : Information Advertisement¶
When a storage device is faulty, the access switch detects the fault and floods a fault notification across the network. After receiving the notification, a host that subscribes to the storage device can switch to another storage device. The switchover is performed on the host side; the network side needs to notify the host of the fault quickly.¶
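The host-side switchover can be sketched as a small state machine driven by the notifications received from the switch. The Python example below is a sketch under stated assumptions: the class `HostPaths`, the status strings "online" and "faulty", and the preference-ordered candidate list are hypothetical, and the actual connection setup and teardown are out of scope.

```python
class HostPaths:
    """Tracks the redundant storage devices usable by one host (illustrative only)."""

    def __init__(self, candidates):
        self.candidates = list(candidates)  # storage devices in preference order
        self.online = set()                 # devices currently reported online
        self.active = None                  # device the host is connected to

    def on_notification(self, device, status):
        """Apply one switch notification and return the active device, if any."""
        if device not in self.candidates:
            return self.active
        if status == "online":
            self.online.add(device)
        else:
            self.online.discard(device)
            if self.active == device:
                self.active = None  # disconnect from the faulty device
        if self.active is None:
            # Connect to the most preferred device that is still online.
            self.active = next(
                (d for d in self.candidates if d in self.online), None)
        return self.active
```

In this sketch a host with candidates S1 and S2 stays on S1 while both are online, moves to S2 when a fault on S1 is reported, and ends up with no active device if S2 is then reported faulty as well.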
When a host is faulty, the access switch detects the fault and floods the fault on the network. Hosts and storage devices determine whether to subscribe to the fault status of a specified host based on the implementation.¶
When an access link is faulty, the access switch detects the fault and floods a fault notification across the network. After receiving the notification, a host that subscribes to the affected storage device can switch to another storage device.¶
BFD or other fast detection technologies can be used to accelerate fault detection.¶
ECMP or redundant link protection is usually deployed to protect against network link failures.¶
When multiple links fail on the network side, the switch network may be split. In each of the two resulting partitions, hosts receive the corresponding notifications and act on the storage devices accordingly.¶
The fault is equivalent to a network link fault or an access link fault or both.¶
This document makes no request of IANA.¶