Network service YANG modules [RFC8199] describe the configuration, state data, operations, and notifications of abstract representations of services implemented on one or multiple network elements.¶
Service orchestrators use network service YANG modules to infer the network-wide configuration and, therefore, the invocation of the appropriate device modules (Section 3 of [RFC8969]).
Knowing that a configuration is applied doesn't imply that the service is up and running as expected.
For instance, the service might be degraded because of a failure in the network, the quality of experience might be degraded, or a service function might be reachable at the IP level but fail to provide its intended function.
Thus, the network operator must monitor the service operational data at the same time as the configuration (Section 3.3 of [RFC8969]).
To support that task, the industry has been standardizing on telemetry to push network element performance information.¶
A network administrator needs to monitor their network and services as a whole, independently of the management protocols.
With different protocols come different data models, and different ways to model the same type of information.
When network administrators deal with multiple management protocols, the network management entities have to perform the difficult and time-consuming job of mapping data models,
e.g., mapping the model used for configuration to the model used for monitoring when separate models or protocols are used.
This problem is compounded by a large, disparate set of data sources (MIB modules, YANG models [RFC7950], IPFIX information elements [RFC7011], syslog plain text [RFC5424], TACACS+ [RFC8907], RADIUS [RFC2865], etc.).
In order to avoid this data model mapping, the industry converged on model-driven telemetry to stream the service operational data, reusing the YANG models used for configuration.
Model-driven telemetry greatly facilitates the notion of closed-loop automation whereby events/status from the network drive remediation changes back into the network.¶
However, it proves difficult for network operators to correlate the service degradation with the network root cause.
For example, an operator may ask "Why does my layer 3 virtual private network (L3VPN) fail to connect?" or "Why is this specific service not highly responsive?"
The reverse, i.e., which services are impacted when a network component fails or degrades, is also important for operators.
For example, "Which services are impacted when the optical power, measured in decibel-milliwatts (dBm), of this specific optic begins to degrade?",
"Which applications are impacted by an imbalance in this equal-cost multipath (ECMP) bundle?", or "Is that issue actually impacting any other customers?"
This task usually falls under the so-called "Service Impact Analysis" functional block.¶
In this document, we propose an architecture implementing Service Assurance for Intent-Based Networking (SAIN).
Intent-based approaches are often declarative, starting from a statement of "The service works as expected" and trying to enforce it.
However, some already defined services might have been designed using a different approach.
Aligned with Section 3.3 of [RFC7149], and instead of requiring a declarative intent as a starting point,
this architecture focuses on already defined services and tries to infer the meaning of "The service works as expected".
To do so, the architecture works from an assurance graph, deduced from the configuration pushed to the device for enabling the service instance.
If the SAIN orchestrator supports it, the service model (Section 2 of [RFC8309]) or the network model (Section 2.1 of [RFC8969]) can also be used to build the assurance graph.
In that case and if the service model includes the declarative intent as well, the SAIN orchestrator can rely on the declared intent instead of inferring it.
The assurance graph may also be explicitly completed to add an intent not exposed in the service model itself.¶
The assurance graph of a service is decomposed into components, which are then assured independently.
The root of the assurance graph represents the service to assure, and its children represent components identified as its direct dependencies; each component can have dependencies as well.
Components involved in the assurance graph of a service are called subservices.
The SAIN orchestrator automatically updates the assurance graph when services are modified.¶
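The decomposition described above can be pictured with a small data structure. The following Python sketch is illustrative only and is not part of the architecture: the `Subservice` class, the `collect_symptoms` function, and all node names and symptom strings are hypothetical and chosen for this example. It assumes an assurance graph is a directed acyclic graph whose root is the service instance and whose other nodes are subservices.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Subservice:
    """One node of an assurance graph (hypothetical structure)."""
    name: str
    dependencies: list[Subservice] = field(default_factory=list)
    symptoms: list[str] = field(default_factory=list)


def collect_symptoms(root: Subservice) -> list[tuple[str, str]]:
    """Return sorted (subservice name, symptom) pairs reachable from root."""
    found = []
    seen = set()          # a subservice may be shared by several parents
    stack = [root]
    while stack:
        node = stack.pop()
        if node.name in seen:
            continue
        seen.add(node.name)
        found.extend((node.name, symptom) for symptom in node.symptoms)
        stack.extend(node.dependencies)
    return sorted(found)


# Hypothetical service instance: an L3VPN that depends on two interfaces,
# each of which depends on the device hosting it.
pe1 = Subservice("device/PE1")
pe2 = Subservice("device/PE2", symptoms=["high CPU usage"])
if1 = Subservice("interface/PE1/eth0", [pe1], ["high input error rate"])
if2 = Subservice("interface/PE2/eth0", [pe2])
service = Subservice("service/l3vpn-a", [if1, if2])

for name, symptom in collect_symptoms(service):
    print(f"{name}: {symptom}")
```

Walking the graph from the root yields the symptoms of every subservice the service depends on, directly or indirectly, which is the list an operator would inspect when the service is degraded.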
When a service is degraded, the SAIN architecture will highlight where in the assurance graph to look, as opposed to going hop by hop to troubleshoot the issue.
More precisely, the SAIN architecture will associate with each service a list of symptoms originating from specific subservices, corresponding to components of the network.
These components are good candidates for explaining the source of a service degradation.
Not only can this architecture help to correlate service degradation with network root cause/symptoms, but it can deduce from the assurance graph the number and type of services impacted by a component degradation/failure.
This added value informs the operational team where to focus its attention for maximum return.
Indeed, the operational team should prioritize the degrading/failing components that impact the highest number of customers, especially those with Service-Level Agreement (SLA) contracts involving penalties in case of failure.¶
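The reverse lookup, from a degraded component to the services it impacts, can be sketched in the same spirit. The Python fragment below is a minimal illustration under stated assumptions, not a description of how a SAIN implementation works: the graphs of all services are assumed to be merged into one adjacency mapping, and the function names, node names, and ranking rule (most impacted services first) are hypothetical.

```python
def impacted_services(dependencies, roots, component):
    """Return the sorted service roots whose graph reaches `component`."""
    def reaches(node, seen):
        if node == component:
            return True
        if node in seen:
            return False
        seen.add(node)
        return any(reaches(dep, seen) for dep in dependencies.get(node, ()))

    return sorted(root for root in roots if reaches(root, set()))


def rank_by_impact(dependencies, roots, degraded):
    """Order degraded components by number of impacted services, highest first."""
    return sorted(
        degraded,
        key=lambda c: (-len(impacted_services(dependencies, roots, c)), c),
    )


# Hypothetical merged assurance graphs of two service instances.
dependencies = {
    "service/l3vpn-a": ["interface/PE1/eth0", "interface/PE2/eth0"],
    "service/l3vpn-b": ["interface/PE1/eth1"],
    "interface/PE1/eth0": ["device/PE1"],
    "interface/PE1/eth1": ["device/PE1"],
    "interface/PE2/eth0": ["device/PE2"],
}
roots = ["service/l3vpn-a", "service/l3vpn-b"]

print(impacted_services(dependencies, roots, "device/PE1"))
print(rank_by_impact(dependencies, roots, ["device/PE2", "device/PE1"]))
```

A real deployment would likely weight each service, for example by customer count or SLA penalty, rather than simply counting services; the count is used here only to keep the sketch short.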
This architecture provides the building blocks to assure both physical and virtual entities and is flexible with respect to services and subservices, (distributed) graphs, and components (Section 3.7).¶
The architecture presented in this document is completed by a set of YANG modules defined in a companion document [I-D.ietf-opsawg-service-assurance-yang].
These YANG modules formally define the interfaces between the various components of the architecture to foster interoperability.¶