Internet-Draft | C-SCORE | June 2023 |
Joung, et al. | Expires 26 December 2023 | [Page] |
This document specifies the framework and the operational procedure for deterministic networks with work conserving packet schedulers that guarantees end-to-end latency bounds to flows. Schedulers in core nodes of the network do not need to maintain flow states. Instead, the entrance node of a flow marks an ideal service completion time, called Finish Time (FT), of a packet in the packet header. The subsequent core nodes update FT by adding a delay factor, which is a function of the flow and upstream nodes. The packets in the queue are served in the ascending order of the FT. This technique is called fair queuing. The result is that flows are isolated from each other almost perfectly. The latency bound of a flow depends only on the flow's intrinsic parameters, except the maximum packet length among other flows sharing each output link with the flow.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 26 December 2023.¶
Copyright (c) 2023 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
There are emerging applications that require both latency and jitter bounds in large-scale networks. One of the keys to the latency and jitter performance of a network is the packet scheduling technique. Our objective is to devise a scheduler that can completely isolate a flow from other flows sharing a link. An ideal flow isolation scheduler would provide service to a flow, as if there is no other flow in an imaginary link whose capacity is equal to the allocated service rate to the flow. In this case the latency upper bound is a function of the parameters, which depend only on the flow.¶
In large-scale networks, end-nodes join and leave, and a large number of flows are dynamically generated and terminated. Achieving satisfactory deterministic performance in such environments would be challenging. The current Internet, which has adopted the DiffServ architecture, has the problem of the burst accumulation and the cyclic dependency, which is mainly due to FIFO queuing and strict priority scheduling. Cyclic dependency is defined as a situation wherein the graph of interference between flow paths has cycles [THOMAS]. The existence of such cyclic dependencies makes the proof of determinism a much more challenging issue and can lead to system instability, that is, unbounded delays [ANDREWS][BOUILLARD].¶
A class of schedulers called Fair Queuing (FQ) limits interference between flows to the degree of maximum packet size. Packetized generalized processor sharing (PGPS) and weighted fair queuing (WFQ) are representative examples of FQ [PAREKH]. In FQ, the ideal service completion time, called Finish Time (FT), of a packet is obtained from an imaginary system, which can provide the ideal flow isolation. Packets are served in an increasing order of the FT. Then the service of the packet is completed approximately the same as the FT. When this technique is applied, the worst delay at each node is proportional to the maximum packet length among flows sharing the link. However, the FT of the previous packet within a flow has to be remembered for the calculation of the current packet's FT. This information can be seen as the flow state. The complexity of managing such information for a large number of flows is a problem, so it is not usually adopted in practice.¶
This document specifies a framework for approximating the FQ scheduler in core nodes. It also specifies the operational procedure based on this framework. The entrance node in a network generates FT for a packet and records it in the packet. A core node, based on these records, updates FT by adding a delay factor that is a function of parameters of nodes and the flow. The proposed framework is called work conserving stateless core fair queuing (C-SCORE). C-SCORE is work conserving and has the property that, for a certain choice of the delay factor, the expression for end-to-end (E2E) latency bound can be found. This E2E latency bound function is the same as that of a network with stateful FQ schedulers in all the nodes.¶
The key component of the C-SCORE is the packet state that is carried as metadata. The C-SCORE does not need to maintain flow states at core nodes, yet it works as one of the FQ schedulers, which is known to provide the best flow isolation performance. The metadata to be carried in the packet header is simple and can be updated during the stay in the queue or before joining the queue.¶
In this document, strict time-synchronization among network nodes and slot scheduling are avoided. These are not easily achievable, especially over a large area network or across multiple Deterministic Networking (DetNet) domains. The asynchronous solution suggested in this document can provide satisfactory latency bounds without complex pre-computation and configuration for slot scheduling, and without hardware support usually necessary for time synchronization.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Generalized processor sharing (GPS) suggested a paradigm for a fair service for flows as fluid. Packetized GPS (PGPS), which implemented GPS in the realistic packet-based environment, played a pioneering role in this type of packet-based schedulers [PAREKH]. PGPS determines the service order of packets in ascending order of the FT derived by the following equation.¶
F(p) = max{F(p-1), V(A(p))}+L(p)/r, (1)¶
where p and p-1 are the pth and (p-1)th packets of a flow, F(p) is the FT, A(p) is the arrival time, L(p) is the length of the packet p, and r is the service rate allocated to the flow. Note that the index for the flow i is omitted. V(t) is called the virtual time function [PAREKH] or the system potential [STILIADIS-RPS] and is a value representing the current system progress at time t. If the backlogged flows almost fill the link capacity, then the system slowly progresses in terms of a flow's view, and the virtual time increases slowly. If there is only a handful of backlogged flows, then the virtual time increases with a higher rate. This behavior of the virtual time function prevents an unfair situation in which flows that entered late have relatively small FTs thus receive services earlier for a considerable period of time, compared to existing flows.¶
F(p) represents the time that an ideal fluid system would complete its service of packet p. In a real packetized FQ system, the packets are served in the increasing order of the FT. The FT can be calculated at the moment the packet arrives in the node, and the value of FT is attached to the packet before the packet is stored in the buffer. In general, there is a queue for each flow, the queues are managed by FIFO manner, and the scheduler serves the queue having the HoQ packet with the smallest FT. Alternatively, it is possible to put all packets in one queue and insert them in the middle of the queue according to the value of the FT. This implementation requires a push in first out (PIFO) queue. The point of (1) is that, in the worst case, when all flows are active and the link is fully used, a flow is served at an interval of L(p)/r. At the same time, by using the work conserving scheduler, any excessive link resources are shared among the flows. How fairly it is shared is the main difference between various FQ schedulers.¶
In order to obtain F(p) in (1), the FT of the previous packet of the flow, F(p-1), must be remembered. When a packet is received, it is necessary to find out which flow it belongs to and find out the FT of the latest packet of the corresponding flow. F(p-1) of this latest packet is a value representative of the 'flow state'. The fact that such state information must be memorized and read means a considerable complexity at a core node managing millions of flows. This is the main reason why such FQ schedulers are not actually used on the Internet. In this document, we pay attention to the fact that the fair service time interval information between packets is already included in the FTs of the entrance node of a flow. Instead of deriving a new FT at each core node, we will specify a method of deriving the FT at downstream nodes by using the initial FT calculated at the entrance node.¶
On the other hand, calculating V(t) also involves the complexity of calculating the sum of r by tracking the flows currently being serviced in real time. It is a complex calculation. Therefore, instead of calculating V(t) accurately, methods for estimating it in a simple way have been suggested. Among them, Virtual clock (VC) [ZHANG] uses the current time t instead of V(t) to determine the FT. Self-clocked fair queuing [17] uses the FT of the recently serviced packet of another flow instead of V(t).¶
Stiliadis showed that this series of FQ schedulers can belong to the Packetized Rate Proportional Servers (PRPS) [STILIADIS-RPS]. For example, PGPS and VC are PRPS, while self-clocked fair queuing is not. It was proved that a network with all the nodes having one of these PRPSs guarantee the E2E latency for flow. Moreover, any PRPS scheduler yields the same E2E latency bound [STILIADIS-RPS].¶
In this document, we assume there are only two classes of traffic. The high priority traffic requires guarantee on latency upper bounds. All the other traffic is considered to be the low priority traffic, and be completely preempted by the high priority traffic. High priority (HP) traffic is our only focus.¶
All the flows conform to their traffic specification (TSpec) parameters. In other words, with the maximum burst size Bi and the arrival rate ri, the accumulated arrival from flow i in any arbitrary time interval (t2-t1) does not exceed Bi+(t2-t1)ri. An actual allocated service rate to a flow can be larger than or equal to the arrival rate of the flow. Total allocated service rate to all the flows in a node does not exceed the link capacity of the node. These assumptions make the resource reservation and the admission control mandatory.¶
In this document, the service rate is considered to be the same as the arrival rate ri. However, as it will be shown in (5) that, by adjusting the service rate to a flow, the E2E latency bound of the flow can be adjusted. Note that ri is used interchangeably with the symbol r.¶
A node, or equivalently a server, means an output port module of a switching device.¶
The entrance node for a flow is the node located at the edge of a network, from which the flow enters into the network. A core node for a flow is a node in the network, which is traversed by the flow and is not the entrance node. Note that a single node can be both an entrance node to a flow and a core node for another flow.¶
A packet is considered arrived or serviced; when its last bit has arrived or left the node, respectively.¶
Propagation delays are neglected for the simplicity of representation. However, it can be easily incorporated into the equations presented in this document, if necessary. This issue will be covered in Section 6.3.6.¶
FQ schedulers utilize the concept of FT that is used as the service order assigned to a packet. The packet with the minimum FT in a buffer is served first.¶
As an example, the VC scheduler [ZHANG] defines the FT to be¶
F(p) = max{F(p-1), A(p)} + L(p)/r, (2)¶
where (p-1) and p are consecutive packets of the flow under observation, A(p) is the arrival time of p, L(p) is the length of p, and r is the flow service rate. The flow index is omitted.¶
The key idea of the FQ is to calculate the service completion times of packets in an imaginary ideal fluid service model and use them as the service order in the real packet-based scheduler.¶
While having the excellent flow isolation property, the FQ needs to maintain the flow state, F(p-1). For every arriving packet, the flow it belongs to has to be identified and its previous packet's FT should be extracted. As the packet departs, the flow state, F(p), has to be updated as well.¶
We consider a framework for constructing FTs for packets at core nodes without flow states. In a core node, the following conditions on FTs SHOULD be met.¶
In essence, (2) has to be approximated in core nodes. There can be many possible solutions to meet these conditions. We describe a generic framework with requirements for constructing FTs in core nodes that meet the conditions, without flow state, in the following.¶
Definition: An active period for a flow is a maximal interval of time during a node busy period, over which the FT of the most recently arrived packet of the flow is greater than the virtual time (equivalently the system potential). Any other period is an inactive period for the flow.¶
Requirement 1: In the entrance node, it is REQUIRED to obtain the FTs with the following equation. 0 denotes the entrance node of the flow under observation.¶
F0(p) = max{F0(p-1), A0(p)}+L(p)/r.¶
Note that if the FTs are constructed according to the above equation, the fair distance of consecutive packets is maintained.¶
Requirement 2: In a core node h, it is REQUIRED to increase the FT of a packet by an amount, d(h-1)(p), that depends on the previous node and the packet.¶
Fh(p) = F(h-1)(p) + d(h-1)(p).¶
Requirement 3: It is REQUIRED that dh(p) is a non-decreasing function of p, within a flow active period.¶
Requirements 1, 2, and 3 specify how to construct the FT in a network. By these requirements Conditions C1), C2), and C3) are met. The following requirements 4 and 5 specify how the FT is used for scheduling.¶
Requirement 4: It is REQUIRED that a node provides service whenever there is a packet.¶
Requirement 5: It is REQUIRED that all packets waiting for service in a node are served in the ascending order of their FTs.¶
We call this framework the work conserving stateless core fair queuing (C-SCORE), which can be compared to the existing non-work conserving scheme [STOICA].¶
For C-SCORE to guarantee E2E latency bound, the dh(p) is RECOMMENDED to be defined as in the following.¶
dh(p) = SLh. (3)¶
The service latency of the flow at node h, denoted by SLh, is given as follows.¶
SLh = Lh/Rh + L/r, (4)¶
where Lh is the max packet length in the node h over all the flows that are transmitted from the output port under observation, Rh is the link capacity of the node h, and L is the max packet length in the flow. The service latency was first introduced in the concept of Latency-rate server model [STILIADIS-LRS], which can be interpreted as the worst delay from the arrival of the first packet of a new flow until its service completion.¶
Consider the worst case: Right before a new flow's first packet arrives at a node, the transmission of another packet with length Lh has just started. This packet takes the transmission delay of Lh/Rh. After the transmission of the packet with Lh, the flow under observation could take only the allocated share of the link, and the service of the packet under observation would be completed after L/r. Therefore, the packet has to wait, in the worst case, Lh/Rh + L/r.¶
The reason to add the service latency to F(h-1)(p) to get Fh(p) is to meet Condition C3) in a most conservative way without being too excessive. Intuitively, when every packet's FT is updated with the flow's own worst delay, then a packet that experienced the worst delay gets a favor. Thus its worst delay will not get any worse, while the delay differences among flows are reflected.¶
When dh(p) is decided by (3), then it can be proved that¶
Dh(p) <= (B-L)/r + SL0 + SL1 + ... + SLh, (5)¶
where Dh(p) is the latency experienced by p from the arrival at the node 0 to the departure from node h; B, L, and r are the max burst of, max packet length of, and allocated rate to the flow under observation that p belongs to, respectively [KAUR].¶
Note that the latency bound in (5) is the same to the network where every node has a stateful FQ scheduler, including VC. The parameters in the latency bound are all intrinsic to the flow, except Lh/Rh.¶
As a packet arrives at a core node, it carries metadata F(h-1)(p), d(h-1)(p), and L/r.¶
L/r should be kept in a packet in order to calculate dh(p) according to (3) and (4).¶
Fh(p) and dh(p) are dynamic and thus need to be updated at every hop.¶
Fh(p) can be obtained by a summation of the metadata F(h-1)(p) and d(h-1)(p).¶
dh(p) can be obtained by a summation of the metadata L/r and the node specific parameter Lh/Rh.¶
They can be updated during the time interval between the packet arrival to the switch (or router) and putting it into the queue in the output port.¶
Alternatively, Fh(p) can be pre-calculated at node h-1 with d(h-1)(p), which is the function of node h-1. In this case Fh(p) and L/r are the only metadata necessary. In Section 6.3.3 and Section 6.3.4, we follow this alternative approach.¶
The clear definition and format of metadata will be described in a later version.¶
A source requests E2E latency bound for a flow, specifying its arrival rate, maximum packet length, and maximum burst size. If the E2E latency bound can be met, the network admits the flow. The network reserves the links in the path such that the sum of allocated service rates to the flows does not exceed the link capacity.¶
The detailed operational procedure for such admission and reservation is out of scope of this document.¶
We assume that the packet length of p, L(p), is written in the packet header. An entrance node maintains the flow state, i.e. F0(p-1), L, and r. It also maintains a clock to identify A0(p). It also maintains the link information L0/R0 to calculate d0(p). Upon receiving or generating packet p, it obtains F0(p) = max{F0(p-1), A0(p)} + L(p)/r, and uses it as the FT in the entrance node. If the queue is not empty then it puts p in a sorted queue. It also obtains F1(p) = F0(p) + L0/R0 + L/r before or during p is in the queue. It records F1(p) and L/r in the packet as metadata for use in the next node 1. Finally it updates the flow state information F0(p-1) to F0(p).¶
A core node h maintains the link information Lh/Rh. As in an entrance node, Lh is a rather static value, but still can be changed over time. Upon receiving packet p, it retrieves metadata Fh(p) and L/r, and uses Fh(p) as the FT. It puts p in a sorted queue. It obtains F(h+1)(p) = Fh(p) + Lh/Rh + L/r and updates the packet metadata Fh(p) with F(h+1)(p) before or during p is in the queue.¶
Flow states still have to be maintained in entrance nodes. The notion of an entrance node, however, can be mitigated into various edge devices, including a source itself. FT of a packet is decided based on the maximum of F0(p-1) and A0(p); and L(p)/r. These parameters are flow-specific. There is no need to know any other external parameters. The arrival time of p to the network, A0(p), can be approximated by the generation time of p at the source. Then F0(p) is determined at the packet generation time and can be recorded in the packet. Therefore, we can simplify the proposed solution to a great degree, and can apply to any network with robustness and scalability.¶
As it has been stated in Section 5, we have assumed zero propagation delays between nodes. In reality, there are time differences between nodes, including the differences due to the propagation delays. This time difference can be defined as the difference between the service completion time measured at the upstream node and the arrival time measured at the current node. In other words,¶
td(h-1, h)(p) = Ah(p) - C(h-1)(p),¶
where td(h-1, h)(p) is the time difference between node h-1 and h, and C(h-1)(p) is the service completion time measured at node h-1, for packet p respectively.¶
FT does not need to be precise. It is used just to indicate the packet service order. Therefore, if we can assume that the propagation delay is constant and the clocks do not drift, then td(h-1, h)(p) can be simplified to a constant value, td(h-1, h). In this case the delay factor in (3) can be modified to be¶
dh(p) = SL(h-1)+td(h-1, h).¶
The E2E latency bound in (5) increases as much as the sum of propagation delays from node 0 to h.¶
Moreover, the time difference td(h-1, h) can be updated only once in a while, and signaled out-of-band.¶
Even with the clock differences and propagation delays, C-SCORE does not need global time synchronization.¶
There might be matters that require IANA considerations associated with metadata. If necessary, relevant text will be added in a later version.¶
This section will be described later.¶