Internet-Draft: SINC Architecture, October 2022
Lou, et al. (Expires 26 April 2023)
This memo introduces "Signaling In-Network Computing operations" (SINC), a mechanism that enables in-packet signaling of in-network computing operations for specific scenarios such as NetReduce, NetDistributedLock, and NetSequencer. In particular, this solution allows computation parameters to be flexibly communicated alongside the packets' payload, signaling to in-network SINC-enabled devices the computing operations to be performed.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 26 April 2023.¶
Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
According to the original design, the Internet performs just "store and forward" of packets and leaves more complex operations to the end-points. However, new emerging applications could benefit from in-network packet processing to improve overall system efficiency ([GOBATTO], [ZENG]).¶
The formation of the COIN Research Group [COIN] in the IRTF encourages the exploration of this emerging technology and its impact on the Internet architecture. The "Use Cases for In-Network Computing" draft [I-D.irtf-coinrg-use-cases] introduces use cases that demonstrate how real applications can benefit from COIN and shows the essential requirements demanded by COIN applications.¶
Recent research has shown that having network devices undertake some computing tasks can greatly improve network and application performance in scenarios like on-path data aggregation [NetReduce], key-value (K-V) caching [NetLock], and strong consistency [GTM]. These implementations are mainly based on programmable network devices, using P4 or other languages. Given such heterogeneity of scenarios, it is desirable to have a generic and flexible protocol, applicable to many use cases, that explicitly signals the computing operation to be performed by network devices, enabling easier deployment of these research results.¶
This document specifies a signaling architecture for in-network computing operations. The computing functions are hosted on network devices, which can be perceived as network SINC service instances.¶
It focuses on the design of the data plane; the control plane will be described in a separate draft. Service Function Chaining (SFC) [RFC7665] is used as a running example of how to tunnel the SINC header to the in-network device and implement the desired in-network computation. Nevertheless, the mechanism can be adapted to other transport protocols, such as Remote Direct Memory Access (RDMA) [ROCEv2], but such adaptation is out of the scope of this document.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] and [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This document uses the terms defined in [RFC7498], [RFC7665], and [RFC8300]. This document assumes that the reader is familiar with the Service Function Chaining architecture.¶
Hereafter, a few relevant use cases are described, namely NetReduce, NetDistributedLock, and NetSequencer, in order to help the reader understand the requirements for a general framework. Such a framework should be generic enough to accommodate a large variety of use cases beyond the ones described in this document.¶
Over the last decade, the rapid development of Deep Neural Networks (DNNs) has greatly improved the performance of many Artificial Intelligence (AI) applications, such as computer vision and natural language processing. However, DNN training is a computation-intensive and time-consuming task whose cost has grown exponentially over the past 10 years (the required computation doubles every 3.4 months [OPENAI]). Scale-up techniques that concentrate on the computing capability of a single device cannot keep pace. Distributed DNN training approaches with synchronous data parallelism, like Parameter Server and All-Reduce, are commonly employed in practice; these, on the other hand, become increasingly network-bound workloads, since communication becomes a bottleneck at scale ([PARAHUB], [MGWFBP]).¶
Compared with host-oriented solutions, in-network aggregation approaches like SwitchML [SwitchML] and SHARP [SHARP] can potentially reduce by nearly half the bandwidth needed for data aggregation, by offloading gradient aggregation from the host to the network switch. The SwitchML solution uses UDP for network transport; the system relies solely on application-layer logic to trigger retransmission after packet loss, which adds extra latency and reduces training performance. The SHARP solution, by contrast, uses Remote Direct Memory Access (RDMA) to provide reliable transmission [ROCEv2]. As the InfiniBand (IB) technology requires specific hardware support, this solution is not very cost-effective. NetReduce [NetReduce] does not depend on dedicated hardware and provides a general in-network aggregation solution suitable for Ethernet networks.¶
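To make the aggregation primitive these systems offload more concrete, the following Python sketch models the switch-side logic in simplified form: gradients from N workers are summed element-wise, and the aggregate is released once every worker has contributed. This is a minimal illustration of the general idea under assumed semantics, not the NetReduce or SwitchML wire protocol; all names are hypothetical.¶

```python
# Minimal model of switch-side gradient aggregation (hypothetical,
# not the NetReduce wire protocol): each of n_workers sends a vector
# for a given slot; the switch sums them element-wise and releases
# the result once every worker has contributed.

class AggregationSlot:
    def __init__(self, n_workers: int, width: int):
        self.n_workers = n_workers
        self.acc = [0.0] * width      # running element-wise sum
        self.seen = set()             # workers that have contributed

    def add(self, worker_id: int, values: list[float]) -> list[float] | None:
        if worker_id in self.seen:    # duplicate (e.g. a retransmission)
            return None
        self.seen.add(worker_id)
        for i, v in enumerate(values):
            self.acc[i] += v
        if len(self.seen) == self.n_workers:
            return self.acc           # aggregate complete: send back to workers
        return None

slot = AggregationSlot(n_workers=3, width=4)
slot.add(0, [1.0, 2.0, 3.0, 4.0])
slot.add(1, [1.0, 1.0, 1.0, 1.0])
print(slot.add(2, [0.0, 0.0, 0.0, 1.0]))  # [2.0, 3.0, 4.0, 6.0]
```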
In the majority of distributed systems, the lock primitive is a widely used concurrency control mechanism. Large distributed systems commonly have a dedicated lock manager through which nodes compete to gain read and/or write permissions on a resource. The lock manager is often abstracted as Compare-And-Swap (CAS) or Fetch-and-Add (FA) operations.¶
The lock manager typically runs on a server, where performance is limited by the speed of disk I/O transactions. When the load increases, for instance in the case of database transactions processed on a single node, the lock manager becomes a major performance bottleneck, consuming nearly 75% of transaction time [OLTP]. Multi-node distributed lock processing adds the communication latency between nodes, which makes the performance even worse. Therefore, offloading the lock manager function from the server to the network switch might be a much better choice, as the switch is capable of managing the lock function efficiently. Meanwhile, it frees the server for other computation tasks.¶
The test results in NetLock [NetLock] show that a lock manager running on a switch is able to answer 100 million requests per second, nearly 10 times more than what a lock server can do.¶
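To illustrate the CAS/FA abstraction, the sketch below models an exclusive lock built on compare-and-swap and a shared (reader) lock built on fetch-and-add. It is an illustrative model of the semantics only; the encoding and names are hypothetical and not taken from NetLock.¶

```python
# Illustrative model of lock primitives expressed as CAS/FA operations
# (hypothetical encoding, not the NetLock protocol).

locks: dict[int, int] = {}  # lock_id -> value (0 = free)

def cas(lock_id: int, expected: int, new: int) -> int:
    """Compare-and-swap: write `new` only if the current value equals
    `expected`; always return the old value."""
    old = locks.get(lock_id, 0)
    if old == expected:
        locks[lock_id] = new
    return old

def fa(lock_id: int, delta: int) -> int:
    """Fetch-and-add: add `delta` and return the old value."""
    old = locks.get(lock_id, 0)
    locks[lock_id] = old + delta
    return old

# Exclusive lock: CAS from 0 (free) to the holder's id.
HOST_A = 42
print("host A got exclusive lock:", cas(1, expected=0, new=HOST_A) == 0)  # True
print("host B got exclusive lock:", cas(1, expected=0, new=43) == 0)      # False

# Shared lock: FA counts readers; a writer requires the count to be 0.
fa(lock_id=2, delta=1)   # first reader enters
fa(lock_id=2, delta=1)   # second reader enters
print("readers on lock 2:", locks[2])  # 2
```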
Transaction managers are centralized solutions to the consistency problem for distributed transactions, such as the GTM in Postgres-XL ([GTM], [CALVIN]). However, as centralized modules, transaction managers have become a bottleneck in large-scale, high-performance distributed systems.¶
The work in [HPRDMA] introduces a server-based networked sequencer, a kind of task manager that assigns monotonically increasing sequence numbers to transactions. In [HPRDMA], the authors show that the maximum throughput is 122 million requests per second (Mrps), at the cost of an increased average latency.¶
This bounded throughput impacts the scalability of distributed systems. The authors also test the bottlenecks of various optimization methods, including CPU, DMA bandwidth, and PCIe RTT, which are introduced by the CPU-centric architecture.¶
For a programmable switch, a sequencer is a rather simple operation to implement, and the pipeline architecture can avoid such bottlenecks. It is therefore worth implementing a switch-based sequencer, with performance goals of hundreds of Mrps and latency on the order of microseconds.¶
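Conceptually, a sequencer reduces to a single fetch-and-add on a per-switch counter, as in this minimal sketch (an illustration in Python rather than a switch pipeline program; on real hardware the counter would live in a stateful register):¶

```python
# A sequencer is a single fetch-and-add on a counter
# (conceptual sketch; a programmable switch would keep the counter
# in a stateful register and execute this step in the pipeline).

class Sequencer:
    def __init__(self) -> None:
        self.counter = 0

    def next(self) -> int:
        seq = self.counter      # fetch
        self.counter += 1       # add
        return seq              # monotonically increasing

seq = Sequencer()
print([seq.next() for _ in range(3)])  # [0, 1, 2]
```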
The COIN use case draft [I-D.irtf-coinrg-use-cases] illustrates some general requirements for scenarios like in-network control and distributed AI, to which the aforementioned use cases belong. One of the requirements defined in [I-D.irtf-coinrg-use-cases] is that any in-network computing system must provide means to specify the constraints for placing execution logic in certain logical execution points (and their associated physical locations). In the cases of NetReduce, NetDistributedLock, and NetSequencer, the data aggregation, lock management, and sequence number generation functions, respectively, can be offloaded onto the network switch.¶
We can see that those functions are based on a few "simple" and "generic" operators, as shown in Table 1. Programmable switches are capable of performing those basic operations by executing one or more operators without impacting forwarding performance ([NetChain], [ERIS]).¶
Use Case | Operation | Description |
---|---|---|
NetReduce | Sum value (SUM) | The network device sums the collected parameters together and outputs the resulting value. |
NetLock | Compare-And-Swap or Fetch-and-Add (CAS or FA) | By comparing the request value with the status of its own lock, the network device reports whether the host has acquired the lock. Through CAS and FA, hosts can implement shared and exclusive locks. |
NetSequencer | Fetch-and-Add (FA) | The network device offers a counter service and provides a monotonically increasing sequence number for the host. |
This section describes the various elements and functional modules in the SINC system and explains how they work together.¶
The SINC computing protocol and extensions are designed for limited domains, such as data center networks, rather than for use across the Internet. The requirements and semantics are specifically limited, as defined in the previous sections.¶
The main deployment model is to place SINC-capable switches/routers on the data path so that they take over part of the data computing operations during data transmission. For instance, in the case of NetLock, Top-of-Rack switches can be equipped with SINC capabilities to manage I/O locks. In the case of NetReduce, SINC-capable switches can be deployed at a central point through which all data has to pass, to achieve on-path aggregation/reduction.¶
Figure 1 shows the architecture of a SINC network. In the computing service chain, a host sends out packets containing data operations to be executed in the network. The data operation description should be carried in the packet itself by using the SINC header.¶
Once the packet is in the SINC domain, it includes a SINC header, so that SINC-enabled switches and routers have access to that header and can perform the desired operation directly on the in-network device. Note that hosts can also be SINC-enabled, in which case the proxies are not necessary.¶
The SINC header has a fixed length of 16 octets and is appended right after the Service Path Header; it carries the data operation information used by on-path in-switch SFs.¶
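As an illustration of how a fixed 16-octet header can be encoded and decoded, the following minimal sketch assumes a hypothetical field layout (operation code, flags, data offset, group ID, and one parameter). The authoritative field layout is the one defined in the header-format section of this document; the layout below exists only to show the fixed-length encoding pattern.¶

```python
import struct

# Hypothetical 16-octet SINC header layout, for illustration only;
# the normative layout is defined in this document's header section.
#   op (1 octet) | flags (1 octet) | data offset (2 octets)
#   group ID (4 octets) | parameter (8 octets)
SINC_FMT = "!BBHIQ"                      # network byte order, 16 octets
assert struct.calcsize(SINC_FMT) == 16

def pack_sinc(op: int, flags: int, offset: int, group: int, param: int) -> bytes:
    return struct.pack(SINC_FMT, op, flags, offset, group, param)

def unpack_sinc(hdr: bytes) -> tuple:
    return struct.unpack(SINC_FMT, hdr)

hdr = pack_sinc(op=3, flags=0, offset=12, group=7, param=100)
print(len(hdr), unpack_sinc(hdr))        # 16 (3, 0, 12, 7, 100)
```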
As previously stated, Service Function Chaining (SFC) [RFC7665] is used as a running example on how to tunnel the SINC header to the in-network device and implement the desired in-network computation.¶
Figure 3 shows the architecture of a SFC-based SINC network. In the computing service chain, a host sends out packets containing data operations to be executed in the network. The data operation description should be carried in the packet itself by using a SINC-specific NSH encapsulation.¶
Once the SINC packet is in the SFC domain, the Service Function Forwarder (SFF) [RFC7665] is responsible for forwarding packets to one or more connected service functions according to information carried in the SFC encapsulation. The Service Function (SF) [RFC7665] is responsible for implementing data operations.¶
As shown in Figure 3, the SFC proxy, the SFF, and the SINC switch/router containing the SFF and SF are used.¶
The SFC proxy is required to support SFC-unaware hosts: it encapsulates their packets with the correct NSH header and SINC context header and forwards them to the correct SFF. The SFF forwards packets based on the Service Path Header (SPH), as specified in [RFC8300]. SFC-unaware hosts can only add the SINC information in the payload after the transport-layer encapsulation.¶
The SFC proxy needs to associate packets with a group and, hence, with a specific operation to be performed in-network. For TCP and UDP packets, the five-tuple is sufficient for flow identification. For RoCEv2 packets, the destination port number is set to 4791 to indicate the InfiniBand Base Transport Header (IB BTH) and therefore cannot be used for flow identification. Instead, a combination of the source IP address, the destination IP address, and the Destination Queue Pair number [ROCEv2] should be used for flow identification.¶
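The classification rule above can be expressed as a small key-extraction function: TCP/UDP flows key on the five-tuple, while RoCEv2 flows (UDP destination port 4791) key on source IP, destination IP, and the Destination Queue Pair number from the IB BTH. The sketch below assumes a pre-parsed packet with illustrative field names.¶

```python
ROCEV2_UDP_PORT = 4791  # fixed port indicating an IB BTH follows

def flow_key(pkt: dict) -> tuple:
    """Return the identifier the SFC proxy uses to map a packet to a
    group. `pkt` is a pre-parsed packet; field names are illustrative."""
    if pkt["proto"] == "UDP" and pkt["dst_port"] == ROCEV2_UDP_PORT:
        # RoCEv2: the UDP destination port is always 4791, so key on
        # the Destination Queue Pair number from the IB BTH instead.
        return (pkt["src_ip"], pkt["dst_ip"], pkt["dest_qp"])
    # Plain TCP/UDP: the classic five-tuple is sufficient.
    return (pkt["src_ip"], pkt["dst_ip"], pkt["proto"],
            pkt["src_port"], pkt["dst_port"])

print(flow_key({"proto": "UDP", "dst_port": 4791,
                "src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "dest_qp": 17}))
print(flow_key({"proto": "TCP", "src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
                "src_port": 5000, "dst_port": 80}))
```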
For packets from SFC-unaware hosts that require a SINC operation, the ingress SFC proxy copies the SINC information into a SINC context header and sets the Data Offset value accordingly (see Section 7).¶
Based on the Group ID, the Service Path Identifier (SPI) is matched and the NSH-based header is built. With the SFC encapsulation in place, the SINC packet is forwarded to the SFF.¶
The egress SFC proxy removes the NSH header, including the SINC context header, before forwarding the packets to their destination.¶
With the standardized context header, the SFs can be decoupled from the transport-layer encapsulation. The SFs perform the data operation as defined in the headers, update the original payload with the results, and forward the packets to the next hop.¶
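Putting these steps together, an SF's per-packet behavior reduces to: read the operation from the (already parsed) context header, apply it to the operand carried in the payload, and write the result back before forwarding. The sketch below illustrates this under assumed semantics; the operand encoding and names are hypothetical.¶

```python
import struct

# Hedged sketch of the SF step: apply the signaled operation to the
# payload operand and update the payload with the result.
# The 8-octet operand encoding and all names are hypothetical.

def sf_process(op: str, payload: bytes, state: dict, key: int) -> bytes:
    (value,) = struct.unpack("!Q", payload[:8])   # operand from payload
    if op == "SUM":                               # e.g. NetReduce-style
        state[key] = state.get(key, 0) + value
        result = state[key]
    elif op == "FA":                              # e.g. NetSequencer-style
        result = state.get(key, 0)
        state[key] = result + value
    else:
        raise ValueError(f"unsupported operator {op}")
    # Update the original payload with the result; keep the rest intact.
    return struct.pack("!Q", result) + payload[8:]

state: dict[int, int] = {}
print(sf_process("FA", struct.pack("!Q", 1), state, key=9))  # sequence 0
print(sf_process("FA", struct.pack("!Q", 1), state, key=9))  # sequence 1
```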
This section defines the SINC header fields as part of the NSH [RFC8300] encapsulation for SFC [RFC7665].¶
The SINC NSH header is essentially another type of NSH MD header. The SINC NSH encapsulation uses the NSH Metadata (MD) fixed-length context headers to carry the data operation information. Please refer to NSH [RFC8300] for a detailed description of the SFC base header. This draft suggests that the base header specify MD Type = 0x4, allowing a fixed-length context header to immediately follow the Service Path Header.¶
Following the NSH Base Header there is the Service Path Header, shown in Figure 5, as defined in [RFC8300].¶
Stacking the previously shown headers yields the complete SINC NSH header: the NSH Base Header, the NSH Service Path Header, and the SINC Header, shown together in Figure 6.¶
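The complete stack can be assembled as in the sketch below, which packs an RFC 8300 Base Header (with Length = 6 four-octet words: base + Service Path Header + 16-octet context), the Service Path Header, and a SINC context header. MD Type 0x4 follows the suggestion above (pending IANA assignment); the SINC context bytes are left as a placeholder, since this sketch does not assume a particular field layout.¶

```python
import struct

MD_TYPE_SINC = 0x4   # MD Type suggested by this draft (pending IANA)

def nsh_base(ttl: int, length_words: int, md_type: int, next_proto: int) -> bytes:
    """RFC 8300 Base Header: Ver=0, O=0, U=0, TTL (6 bits),
    Length (6 bits, in 4-octet words), 4 unassigned bits,
    MD Type (4 bits), Next Protocol (8 bits)."""
    first16 = ((ttl & 0x3F) << 6) | (length_words & 0x3F)  # Ver/O/U are 0
    second16 = ((md_type & 0xF) << 8) | (next_proto & 0xFF)
    return struct.pack("!HH", first16, second16)

def service_path_header(spi: int, si: int) -> bytes:
    """RFC 8300 Service Path Header: SPI (24 bits), SI (8 bits)."""
    return struct.pack("!I", ((spi & 0xFFFFFF) << 8) | (si & 0xFF))

sinc_ctx = bytes(16)  # placeholder for the 16-octet SINC context header

# 4 (base) + 4 (service path) + 16 (SINC context) = 24 octets = 6 words
hdr = (nsh_base(ttl=63, length_words=6, md_type=MD_TYPE_SINC, next_proto=0x01)
       + service_path_header(spi=100, si=255)
       + sinc_ctx)
print(len(hdr))  # 24
```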
This section describes the SINC system workflow, focusing on elements and key information changes through the workflow. Since SINC's use-cases will use a programmable switch to host the SF, it is assumed that both SFF and SF are colocated on the same switch, as shown in Figure 7.¶
For the sake of clarity, a simple example with one sender (Host A) and one Receiver (Host B) is provided. Packet processing goes through the following steps:¶
A SINC network needs to deploy and control the whole life-cycle of a computing task: it should be able to manage the task from initialization to completion and give support to the computing tasks. The detailed design of the control plane will be discussed in a separate document.¶
In-network computing exposes computing data to network devices, which inevitably raises security and privacy considerations. The security problems faced by in-network computing include, but are not limited to:¶
This document assumes that the deployment is done in a trusted environment, for example, a data center network or a private network.¶
A detailed security analysis will be provided in future revisions of this memo.¶
This document defines a new NSH fixed-length context header. As such, IANA is requested to add the entry depicted in Table 2 to the "NSH MD Types" sub-registry of the "Network Service Header (NSH) Parameters" registry. [Note to RFC Editor: If IANA assigns a different value, the authors will update the document accordingly]¶
MD Type | Description | Reference |
---|---|---|
0x4 | NSH SINC MD Header | [This Document] |
Dirk Trossen's feedback was of great help in improving this document.¶
Computing tasks and applications are becoming increasingly complex, a complexity largely driven by model extension. If complex computing tasks are offloaded directly onto network devices, the universality of those devices is reduced. Complex models can, however, be disassembled into basic calculation operations, such as addition, subtraction, and maximum. Therefore, a more appropriate offloading method is to disassemble complex tasks into basic computing operations.¶
The COIN network needs to provide a general abstraction framework for computing capabilities. Applications, management, and computing network nodes can then negotiate and allocate computing resources according to the abstracted capabilities. For each calculation operation, such as addition, subtraction, and maximization, a corresponding definition should exist in the abstract scheme, and the abstraction should be realizable. Abstracting computing capabilities means that network nodes must produce the same output given the same input and operation; a sketch of such a uniform operator interface follows the table below.¶
OpName | Operation Explanation |
---|---|
MAX | Maximum value of several parameters |
MIN | Minimum value of several parameters |
SUM | Sum of several parameters |
PROD | Product of several parameters |
LAND | Logical AND |
BAND | Bit-wise AND |
LOR | Logical OR |
BOR | Bit-wise OR |
LXOR | Logical XOR |
BXOR | Bit-wise XOR |
WRITE | Write the value associated with a key |
READ | Read the value associated with a key |
DELETE | Delete the value associated with a key |
CAS | Compare and swap: compare the value of the key with the expected old value; if they are the same, swap in the new value. Return the old key value. |
CAADD | Compare and add: compare the value of the key with the expected value; if they are the same, add the add-value to the key value. Return the old key value. |
CASUB | Compare and subtract: compare the value of the key with the expected value; if they are the same, subtract the sub-value from the key value. Return the old key value. |
FA | Fetch and add: fetch the value according to the key and add the add-value to it. Return the old key value. |
FASUB | Fetch and subtract: fetch the value according to the key and subtract the sub-value from it. Return the old key value. |
FAOR | Fetch and OR: fetch the value according to the key and apply a logical OR with the parameter. Return the old key value. |
FAADD | Fetch and ADD: fetch the value according to the key and add the parameter to it. Return the old key value. |
FANAND | Fetch and NAND: fetch the value according to the key and apply a logical NAND with the parameter. Return the old key value. |
FAXOR | Fetch and XOR: fetch the value according to the key and apply a logical XOR with the parameter. Return the old key value. |
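As a concrete reading of the table, the sketch below implements a handful of the listed operators over a key-value store behind a uniform (operator, key, argument) interface that always returns the old value, which is the kind of uniform abstraction argued for above. It is illustrative only; operator naming, argument conventions, and the return rule for non-fetching operators are assumptions, not definitions from this document.¶

```python
# Illustrative dispatch over a subset of the operators in the table.
# The uniform (op, key, argument) -> old value interface is the kind
# of abstraction the appendix argues for; encodings are hypothetical.

store: dict[int, int] = {}

def apply_op(op: str, key: int, arg: int = 0, expected: int = 0) -> int:
    old = store.get(key, 0)
    if op == "READ":
        pass                             # nothing to change; return old
    elif op == "WRITE":
        store[key] = arg
    elif op == "DELETE":
        store.pop(key, None)
    elif op == "FA":                     # fetch-and-add
        store[key] = old + arg
    elif op == "FASUB":                  # fetch-and-subtract
        store[key] = old - arg
    elif op == "FAXOR":                  # fetch-and-xor
        store[key] = old ^ arg
    elif op == "CAS":                    # compare-and-swap
        if old == expected:
            store[key] = arg
    elif op == "CAADD":                  # compare-and-add
        if old == expected:
            store[key] = old + arg
    else:
        raise ValueError(f"unsupported operator {op}")
    return old                           # every operator returns the old value

apply_op("WRITE", key=1, arg=10)
print(apply_op("FA", key=1, arg=5))                # 10; store[1] is now 15
print(apply_op("CAS", key=1, arg=0, expected=15))  # 15; swapped to 0
print(store[1])                                    # 0
```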
Defining an appropriate abstract model of computing capabilities is helpful for interoperability between computing devices, and it is also a necessary condition for the application and practice of in-network computing technology. Most existing papers focus on a single computing task and propose corresponding private protocols. The lack of unified protocols makes equipment complex and unstable, and it prevents in-network computing research from being decomposed into independent tasks. For example, researchers who study hardware prefer to focus on optimizing the processing efficiency of a single operator in the device but are less well placed to design the message protocol that carries that operator. The computing capability abstraction model should support a variety of operators, including the possibility of operator extension.¶