Workgroup:
OPSAWG
Internet-Draft:
draft-feng-opsawg-incident-management-03
Published:
Intended Status:
Standards Track
Expires:
25 April 2024
Authors:
C. Feng, Ed.
Huawei
T. Hu
CMCC
LM. Contreras
Telefonica I+D
T. Graf
Swisscom
Q. Wu
Huawei
C. Yu
Huawei
N. Davis
Ciena

Incident Management for Network Services

Abstract

A network incident refers to an unexpected interruption of a network service, degradation of network service quality, or sub-health of a network service. Different data sources, including alarms, metrics, and other anomaly information, can be aggregated into a small number of network incidents by means of data correlation analysis and service impact analysis.

This document defines YANG modules for the network incident lifecycle management. The YANG modules are meant to provide a standard way to report, diagnose, and resolve network incidents for the sake of network service health and root cause analysis.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 25 April 2024.

Table of Contents

1. Introduction

[RFC8969] defines a framework for Automating Service and Network Management with YANG that covers the full life cycle of network management. A set of YANG data models has already been developed in the IETF for network performance monitoring and fault monitoring. For example, the YANG [RFC7950] data model for alarm management [RFC8632] defines a standard interface for alarm management, and the data model for network and VPN service performance monitoring [RFC9375] defines a standard interface for network performance management. In addition, the distributed tracing mechanism defined in [W3C-Trace-Context] can be used to analyze and debug operations, such as configuration transactions, across multiple distributed systems.

However, these YANG data models for network maintenance are based on specific data sources and manage alarm and performance metrics data separately, at different layers and in different management systems. In addition, the frequency and quantity of alarms and performance metrics reported to the Operations Support System (OSS) have increased dramatically (in many cases by multiple orders of magnitude) with the growth in service types and complexity, and they can overwhelm OSS platforms. Even with known dependency relations between faults, alarms, and events at each layer (e.g., packet layer or optical layer), traditional solutions such as data compression are time-consuming and labor-intensive and usually rely on the experience of maintenance engineers for data analysis, resulting in low processing efficiency, inaccurate root cause identification, and duplicated trouble tickets. It is also difficult to assess the impact of alarms, performance metrics, and other anomaly data on network services without knowing the relations across layers of the network topology or the relations with other network topology data.

To address these challenges, a network-wide, incident-centric solution is proposed that establishes dependency relations with both the network services and the network topology at different layers. It can be used not only within a specific layer of a specific domain but also across layers for multi-layer network troubleshooting. A network incident refers to an unexpected interruption of a network service, degradation of network service quality, or sub-health of a network service. Different data sources, including alarms, metrics, and other anomaly information, can be aggregated into a small number of incidents, irrespective of layer, by correlation analysis and service impact analysis. For example, when the protocols running over an interface fail because the interface goes down or a Service Level Objective (SLO) is violated, a large number of alarms may be reported to the upper-layer management system; these can be aggregated into one or a few incidents if network services are affected by the fault (e.g., L3VPN services that rely on the interface become unavailable [I-D.ietf-ippm-pam]). An incident may also be raised through the analysis of network performance metrics. For example, as described in SAIN [RFC9417], network services can be decomposed into sub-services and metrics are monitored for each sub-service; symptoms occur if services or sub-services are evaluated as unhealthy, and these symptoms may raise an incident when they cause degradation of the network services.

In addition, Artificial Intelligence (AI) and Machine Learning (ML) play an important role in processing large amounts of data with complex correlations. For example, neural network algorithms or hierarchical aggregation algorithms can be used to replace manual alarm correlation. Through online and offline learning, these algorithms can be continuously optimized to improve the efficiency of fault diagnosis.

This document defines the concepts, requirements, and architecture of incident management. It also defines YANG data models for incident lifecycle management, which improve troubleshooting efficiency, help ensure network service quality, and improve network automation [RFC8969].

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

Terms defined in [RFC8632] (e.g., alarm) are used in this document and are not redefined here.

The following terms are defined in this document:

Incident:

An unexpected interruption of a network service, degradation of network service quality, or sub-health of a network service.

Incident management:
Lifecycle management of incidents, including incident identification, reporting, acknowledgment, diagnosis, and resolution.
Incident management system:
An entity which implements incident management. It includes an incident management server and an incident management client.
Incident management server:
An entity which provides the functions of incident management. For example, it can detect an incident and perform incident diagnosis, resolution, prediction, etc.
Incident management client:
An entity which manages incidents. For example, it can receive incident notifications, query the information of incidents, and instruct the incident management server to diagnose or resolve incidents.

3. Sample Use Cases

3.1. Incident-Based Trouble Ticket Dispatching

Traditionally, trouble ticket dispatching is mostly based on the analysis of alarm data and involves operators' maintenance engineers. These maintenance engineers are able to detect that alarms at both end devices of a specific network tunnel, or at both the optical layer and the IP layer, are associated with the same network fault, and they can therefore correlate these alarms into the same trouble ticket. This process has a low degree of automation, and as the number of alarms grows, the human cost of network maintenance increases accordingly.

Some operators preconfigure whitelists and adopt coarse-grained data correlation rules for alarm management, which appears to improve fault management automation. However, some trouble tickets could be missed if the filtering conditions are too strict, while filtering conditions that are too loose can result in multiple trouble tickets being dispatched for the same network fault.

It is hard to achieve a good balance between network management automation and duplicated trouble tickets under these traditional working conditions. With the help of incident management, however, massive numbers of alarms can be aggregated into a few network incidents based on service impact analysis, which greatly reduces the number of trouble tickets and largely improves the efficiency of network troubleshooting, addressing the pain points of traditional trouble ticket dispatching.

3.2. Incident Derivation from L3VPN Service Unavailability

SLOs can be used to characterize the ability of a particular set of nodes to communicate according to certain measurable expectations [I-D.ietf-ippm-pam]. For example, an SLA might state that any given SLO applies to at least a certain percentage of packets, allowing a certain level of packet loss or packet delay threshold violations to take place. An SLA might establish a multi-tiered SLO of end-to-end latency as follows (the values are illustrative): 95% of packets delivered within 20 ms, 99% of packets within 30 ms, and 99.9% of packets within 50 ms.

This SLA information can be bound to two or more service attachment points defined in [RFC9408], so that the service orchestration layer can use these interfaces to commit the delivery of a service over a specific point-to-point or point-to-multipoint service topology. When a specific threshold level of an SLO is violated, a corresponding network incident, associated with, say, an L3VPN service, is derived.

3.3. Multi-layer Fault Demarcation

Currently, to accomplish fault isolation and fault localization, maintenance experts need to correlate topology data and service data together with a huge amount of alarm data at different layers (e.g., optical layer, packet layer). Sometimes cooperation from construction engineers working on site is required to make configuration changes on devices and further investigate the corresponding root cause. Sometimes cooperation between different operation teams is required to locate a fault at either the optical layer or the packet layer, which usually results in a long fault-locating time.

For example, when an IP layer link goes down, the fault might be located in the optical network (e.g., a common cable interruption), in the inter-layer link between the IP network equipment and the optical network equipment, or in a malfunctioning optical module. Maintenance experts need to identify the root cause alarm among a large number of alarms and then trace it through the network segment by segment. Next, site engineers perform tests at the source station to locate the interruption and identify the faulty optical exchange station. They then travel to that station to replace or splice fibers. During the whole process, multiple operation teams are needed both inside and outside the site.

With the help of incident management, the management system can automatically perform multi-layer fault demarcation, establish a multi-link optical power view, locate the faulty segment at each layer or on the multi-layer link, and eliminate the need for manual data correlation analysis. By cooperating with the Optical Time-Domain Reflectometer (OTDR) integrated in the network equipment, the target optical exchange station can be determined before any site visit, saving multiple site visits and time.

4. Incident Management Architecture


            +----------------------+-------------------+
            |                                          |
            |            Incident Management Client    |
            |                                          |
            |                                          |
            +------------+---------+---------+---------+
               ^         |         |         |
               |Incident |Incident |Incident |Incident
               |Report   |Ack      |Diagnose |Resolve
               |         |         |         |
               |         V         V         V
            +--+-------------------+---------+----------+
            |                                           |
            |                                           |
            |            Incident Management Server     |
            |                                           |
            |                                           |
            |                                           |
            |                                           |
            +-------------------------------+-----------+
                  ^       ^Abnormal         ^
                  |Alarm  |Operations       |Metrics
                  |Report |Report           |/Telemetry
                  |       |                 V
     +------------+-------+-----------------+----------------+
     |                                                       |
     |                     Network                           |
     |                                                       |
     +------------------------------------+------------------+

Figure 1: Incident Management Architecture

Figure 1 illustrates the incident management architecture. Two key components for the incident management are incident management client and incident management server.

The incident management server can be deployed in a network analytics platform or controller and provides functions such as incident identification, reporting, diagnosis, resolution, and querying for incident lifecycle management.

The incident management client can be deployed in the network OSS or other business systems of operators and invokes the functions provided by the incident management server to meet the business requirements of fault management.

A typical workflow of incident management is as follows:

*  The incident management server collects alarms, metrics, and other data from the network, identifies an incident through correlation and service impact analysis, and reports it to the incident management client.

*  The incident management client acknowledges the incident.

*  The client asks the server to diagnose the incident and locate the root causes (the server may also perform self-diagnosis).

*  The client resolves the incident, either by asking the server to resolve it or by invoking other functions (e.g., dispatching a trouble ticket). Once the fault is fixed, the server sets the incident status to 'cleared' and reports the update to the client.

4.1. Interworking with Alarm Management

   +-----------------------------+
   |         OSS                 |
   |+-------+      +-----------+ |
   ||alarm  |      | incident  | |
   ||handler|      |  client   | |
   |+-------+      +-----------+ |
   +---^---------------^---------+
       |               |
       |alarm          |incident
   +---|---------------|---------+
   |   |  controller   |         |
   |   |               |         |
   |+--+---++      +-----------+ |
   ||alarm  |      |           | |
   ||process+----->|  incident | |
   ||       |alarm |   server  | |
   |+------++      +-----------+ |
   |   ^              ^          |
   +---+--------------|----------+
       |alarm         | metrics/trace/etc.
       |              |
   +---+--------------+----------+
   |         network             |
   |                             |
   +-----------------------------+
Figure 2: Interworking with alarm management

The YANG data model for alarm management [RFC8632] defines a standard interface to manage the lifecycle of alarms. Alarms represent an undesirable state of network resources. The alarm data model also defines fields for root causes and impacted services, but a lower-layer system (mainly at the device level) may lack sufficient information to populate them, so alarms do not always indicate the status of services or the root causes. As described in [RFC8632], alarm management acts as a starting point for higher-level fault management. Incident management, by contrast, usually works at the network level, where enough information is available to perform correlation and service impact analysis. Alarms can thus serve as one of the data sources of incident management and may be aggregated into a small number of incidents by correlation analysis; the impacted network services and root causes may be determined during incident processing.

An incident also contains references to related alarms; if needed, users can query the alarm information through the alarm management interface [RFC8632]. In some cases, e.g., a cutover scenario, the incident server may use the alarm management interface [RFC8632] to shelve some alarms.
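
For illustration, the following NETCONF <edit-config> sketch shows how an incident server might shelve alarms during a cutover by using the alarm-shelving feature of [RFC8632]; the shelf name and the resource value are illustrative only.

   <rpc message-id="101"
        xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
     <edit-config>
      <target><running/></target>
      <config>
       <alarms xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms">
        <control>
         <alarm-shelving>
          <!-- shelve alarms from resources affected by the cutover -->
          <shelf>
           <name>cutover-maintenance</name>
           <resource>/if:interfaces/if:interface[if:name='eth0']</resource>
          </shelf>
         </alarm-shelving>
        </control>
       </alarms>
      </config>
     </edit-config>
   </rpc>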

Alarm management may keep its original process: alarms are reported from the network to the network controller or analytics system and then reported to the upper-layer system (e.g., OSS). The upper-layer system may store these alarms and provide the information for fault analysis (e.g., deeper analysis based on incidents).

Compared with alarm management, incident management provides not only incident reporting but also diagnosis and resolution functions, which makes it possible to support self-healing and may be helpful for single-domain closed-loop control.

Incident management is not a substitute for alarm management. Instead, they can work together to implement fault management.

4.2. Interworking with SAIN

          +----------------+
          | incident client|
          +----------------+
                  ^
                  |incident
          +-------+--------+
          |incident server |
          +----------------+
                  ^
                  |symptoms
          +-------+--------+
          |     SAIN       |
          |                |
          +----------------+
                  ^
                  |metrics
    +-------------+-------------+
    |                           |
    |         network           |
    |                           |
    +---------------------------+

Figure 3: Interworking with SAIN

SAIN [RFC9417] defines an architecture for network service assurance. A network service can be decomposed into sub-services, and metrics can be monitored for the sub-services. For example, a tunnel service can be decomposed into peer tunnel interface sub-services and an IP connectivity sub-service. If the metrics of a specific sub-service are evaluated as unhealthy, symptoms will be present. The incident server may identify an incident based on these symptoms and then report it to the upper-layer system. SAIN can therefore be one way to identify incidents: services, sub-services, and metrics can be preconfigured via the APIs defined by the service assurance YANG model [RFC9418], and an incident is reported when the symptoms match the incident conditions.

4.3. Relationship with RFC8969

[RFC8969] defines a framework for network automation using YANG. This framework breaks YANG modules down into three layers (the service layer, the network layer, and the device layer) and covers service deployment, service optimization/assurance, and service diagnosis. Incident management works at the network layer and aggregates alarms, metrics, and other information from the device layer, which helps provide service assurance. Incident diagnosis may also be one way of performing service diagnosis.

4.4. Relationship with Trace Context

The W3C defines a common trace context [W3C-Trace-Context] for distributed system tracing. [I-D.rogaglia-netconf-trace-ctx-extension] defines a NETCONF extension for [W3C-Trace-Context], and [I-D.quilbeuf-opsawg-configuration-tracing] defines a mechanism for configuration tracing. If errors occur while services are being deployed, distributed tracing makes it easy to identify them, and an incident should be reported.

5. Functional Interface Requirements between the Client and the Server

5.1. Incident Identification

        +--------------+
     +--|  Incident1   |
     |  +--+-----------+
     |     |  +-----------+
     |     +--+  alarm1   |
     |     |  +-----------+
     |     |
     |     |  +-----------+
     |     +--+  alarm2   |
     |     |  +-----------+
     |     |
     |     |  +-----------+
     |     +--+  alarm3   |
     |        +-----------+
     |  +--------------+
     +--|  Incident2   |
     |  +--+-----------+
     |     |  +-----------+
     |     +--+  metric1  |
     |     |  +-----------+
     |     |  +-----------+
     |     +--+  metric2  |
     |        +-----------+
     |
     |  +--------------+
     +--|  Incident3   |
        +--+-----------+
           |  +-----------+
           +--+ alarm1    |
           |  +-----------+
           |
           |  +-----------+
           +--| metric1   |
              +-----------+
Figure 4: Incident Identification

As described in Figure 4, multiple alarms, multiple metrics, or a mixture of both can be aggregated into an incident after analysis.

The incident management server MUST be capable of identifying incidents. Multiple alarms, metrics, and other information are reported to the incident server; the server must analyze them and find their correlations. If the correlations match the incident rules, an incident is identified and reported to the client. Service impact analysis is performed when an incident is identified, and the content of the incident is updated if impacted network services are detected.

AI/ML may be used to identify incidents. Expert systems and online learning can help AI identify the correlations of alarms, metrics, and other information by means of time-based correlation algorithms, topology-based correlation algorithms, etc. For example, if an interface goes down, many protocol alarms will be reported, and AI will consider these alarms to be correlated. Such correlations are put into a knowledge base, so that the incident can be identified faster the next time.

As mentioned above, SAIN is another way to implement incident identification. The observation timestamp defined in [I-D.tgraf-yang-push-observation-time] and the trace context defined in [W3C-Trace-Context] may also be helpful for incident identification.

                     +----------------------+
                     |                      |
                     |     Orchestrator     |
                     |                      |
                     +----+-----------------+
                          ^VPN A Unavailable
                          |
                      +---+----------------+
                      |                    |
                      |     Controller     |
                      |                    |
                      |                    |
                      +-+-+-+-----+--+-----+
                      ^ ^            ^
                  IGP | |Interface   |IGP Peer
                 Down | |Down        | Abnormal
                      | |            |
     VPN A            | |            |
    +-----------------+-+------------+------------------*
    | \  +---+       ++-++         +-+-+        +---+  /|
    |  \ |   |       |   |         |   |        |   | / |
    |   \|PE1+-------| P1+X--------|P2 +--------|PE2|/  |
    |    +---+       +---+         +---+        +---+   |
    +---------------------------------------------------+

Figure 5: Example 1 of Incident Identification

As described in Figure 5, VPN A is deployed from PE1 to PE2. If an interface of P1 goes down, many alarms are triggered, such as 'interface down' and 'IGP down' from P1 and 'IGP peer abnormal' from P2. These alarms are aggregated and analyzed by the controller, and the incident 'VPN A unavailable' is raised by the controller.
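
Assuming the controller implements the data model defined in Section 7, the incident-notification for this example might look like the following sketch (the node name, identity values, and incident-id are illustrative):

   <notification
       xmlns="urn:ietf:params:xml:ns:netconf:notification:1.0">
     <eventTime>2023-10-25T08:00:10Z</eventTime>
     <incident-notification
         xmlns="urn:ietf:params:xml:ns:yang:ietf-incident"
         xmlns:int="urn:ietf:params:xml:ns:yang:ietf-incident-types">
       <incident-id>3e8c5f6e-0f72-4a94-8b6a-2d0152f26a4b</incident-id>
       <csn>1</csn>
       <service-instance>vpn-a</service-instance>
       <name>VPN A unavailable</name>
       <type>fault</type>
       <domain>int:ip</domain>
       <priority>critical</priority>
       <status>raised</status>
       <ack-status>unacknowledged</ack-status>
       <category>int:network</category>
       <detail>VPN A is unavailable: interface down on P1.</detail>
       <sources>
         <source>
           <node>P1</node>
         </source>
       </sources>
       <time>2023-10-25T08:00:10Z</time>
     </incident-notification>
   </notification>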

                    +----------------------+
                    |                      |
                    |     Orchestrator     |
                    |                      |
                    +----+-----------------+
                         ^VPN A Degradation
                         |
                     +---+----------------+
                     |                    |
                     |     controller     |
                     |                    |
                     |                    |
                     +-+-+-+-----+--+-----+
                       ^            ^
                       |Packet      |Path Delay
                       |Loss        |
                       |            |
    VPN A              |            |
   +-------------------+------------+-------------------+
   | \  +---+       ++-++         +-+-+        +---+  / |
   |  \ |   |       |   |         |   |        |   | /  |
   |   \|PE1+-------|P1 +---------|P2 +--------|PE2|/   |
   |    +---+       +---+         +---+        +---+    |
   +----------------------------------------------------+

Figure 6: Example 2 of Incident Identification

As described in Figure 6, the controller collects network metrics from the network elements. If it finds that the packet loss at P1 and the path delay at P2 exceed their thresholds, the incident 'VPN A degradation' may be raised after analysis.

5.2. Incident Diagnosis

After an incident is reported to the incident management client, the client MAY diagnose the incident to determine the root cause. Some diagnosis operations may affect the running network services; the client can choose not to perform such an operation if it determines that the impact is not trivial. The incident management server can also perform self-diagnosis; however, self-diagnosis MUST NOT affect the running network services. Possible diagnosis methods include link reachability detection, link quality detection, alarm/log analysis, and short-term fine-grained monitoring of network quality metrics.

5.3. Incident Resolution

After the root cause is diagnosed, the client MAY resolve the incident. The client MAY choose to resolve the incident by invoking other functions (such as a routing calculation function or a configuration function), by dispatching a ticket, or by asking the server to resolve it. Generally, the client attempts to directly resolve the root cause. If the root cause cannot be resolved, an alternative solution SHOULD be used. For example, an incident caused by a physical component failure cannot be automatically resolved, but a standby link can be used to bypass the faulty component.

The incident server monitors the status of the incident. Once the faults are fixed, the server updates the status of the incident to 'cleared' and reports the updated incident to the client.

Incident resolution may affect the running network services. The client can choose not to perform such operations if it determines that the impact is not trivial.

6. Incident Data Model Concepts

6.1. Identifying the Incident Instance

An incident ID is used as the identifier of an incident instance; when an incident instance is identified, a new incident ID is created. The incident ID MUST be unique within the whole system.

6.2. The Incident Lifecycle

6.2.1. Incident Instance Lifecycle

From the incident instance perspective, an incident can have the following lifecycle states: 'raised', 'updated', and 'cleared'. When an incident is generated, the status is 'raised'. If the status changes after the incident is generated (for example, because of self-diagnosis, a diagnosis command issued by the client, or any other condition that changes the status without reaching 'cleared'), the status changes to 'updated'. When an incident is successfully resolved, the status changes to 'cleared'.

6.2.2. Operator Incident Lifecycle

From the operator perspective, the lifecycle of an incident instance includes 'acknowledged', 'diagnosed', and 'resolved'. When an incident instance is generated, the operator SHOULD acknowledge the incident. The operator then attempts to diagnose it (for example, to find the root cause and the affected components). Diagnosis is not mandatory; if the root cause and affected components are already known when the incident is generated, diagnosis is not required. After locating the root cause and affected components, the operator can try to resolve the incident.

7. Incident Data Model Design

7.1. Overview

There are two YANG modules in the model. The first module, "ietf-incident-types", provides common definitions such as incident domain, incident category, and incident priority. The second module, "ietf-incident", defines a technology-independent abstraction of the network incident construct for alarms, logs, performance metrics, etc. At the top of the "ietf-incident" module is the network incident. Network incidents are represented as a list indexed by "incident-id". Each network incident is associated with service instances, a domain, and sources. Under "sources" there are one or more source entries; each source corresponds to a node defined in the network topology model and to network resources in the network device, e.g., an interface. In addition, "ietf-incident" supports one general notification to report incident state changes and three RPCs to manage network incidents.

module: ietf-incident
  +--ro incidents
     +--ro incident* [incident-id]
        +--ro incident-id string
        +--ro csn? uint64
        +--ro service-instance* string
        +--ro name? string
        +--ro type? enumeration
        +--ro domain? identityref
        +--ro priority? int:incident-priority
        +--ro status? enumeration
        +--ro ack-status? enumeration
        +--ro category? identityref
        +--ro detail? string
        +--ro resolve-advice? string
        +--ro sources
        ...
        +--ro root-causes
        ...
        +--ro root-events
        ...
        +--ro events
        ...
        +--ro raise-time? yang:date-and-time
        +--ro occur-time? yang:date-and-time
        +--ro clear-time? yang:date-and-time
        +--ro ack-time? yang:date-and-time
        +--ro last-updated? yang:date-and-time
  rpcs:
    +---x incident-acknowledge
    ...
    +---x incident-diagnose
    ...
    +---x incident-resolve

  notifications:
    +---n incident-notification
       +--ro incident-id?
               -> /inc:incidents/inc:incident/inc:incident-id
       ...
       +--ro time? yang:date-and-time

7.2. Incident Notifications

  notifications:
    +---n incident-notification
       +--ro incident-id?
               -> /inc:incidents/inc:incident/inc:incident-id
       +--ro csn? uint64
       +--ro service-instance* string
       +--ro name? string
       +--ro type? enumeration
       +--ro domain? identityref
       +--ro priority? int:incident-priority
       +--ro status? enumeration
       +--ro ack-status? enumeration
       +--ro category? identityref
       +--ro detail? string
       +--ro resolve-advice? string
       +--ro sources
       |  +--ro source* [node]
       |     +--ro node -> /nw:networks/nw:network/nw:node/nw:node-id
       |     +--ro resource* [name]
       |        +--ro name al:resource
       +--ro root-causes
       |  +--ro root-cause* [node]
       |     +--ro node -> /nw:networks/nw:network/nw:node/nw:node-id
       |     +--ro resource* [name]
       |     |  +--ro name al:resource
       |     |  +--ro cause-name? string
       |     |  +--ro detail? string
       |     +--ro cause-name? string
       |     +--ro detail? string
       +--ro root-events
       |  +--ro root-event* [type event-id]
       |     +--ro type -> ../../../events/event/type
       |     +--ro event-id leafref
       +--ro events
       |  +--ro event* [type event-id]
       |     +--ro type enumeration
       |     +--ro event-id string
       |     +--ro (event-type-info)?
       |        +--:(alarm)
       |        |  +--ro alarm
       |        |     +--ro resource? leafref
       |        |     +--ro alarm-type-id? leafref
       |        |     +--ro alarm-type-qualifier? leafref
       |        +--:(notification)
       |        +--:(log)
       |        +--:(KPI)
       |        +--:(unknown)
       +--ro time? yang:date-and-time

A general notification, incident-notification, is defined here. When an incident instance is identified, this notification is sent. After the notification is generated, if the incident management server performs self-diagnosis, or the client uses the interfaces provided by the incident management server to deliver diagnosis and resolution actions, an updated notification is triggered; for example, the root cause objects and affected objects are updated. When an incident is successfully resolved, the status of the incident is set to 'cleared'.

7.3. Incident Acknowledge

    +---x incident-acknowledge
    |  +---w input
    |  |  +---w incident-id*
    |  |          -> /inc:incidents/inc:incident/inc:incident-id

After an incident is generated, updated, or cleared (in some scenarios where automatic diagnosis and resolution are supported, the status of an incident may be updated multiple times or the incident may even be resolved automatically), the operator needs to acknowledge the incident to ensure that the client is aware of it.

The incident-acknowledge rpc can acknowledge multiple incidents at a time.
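
The following NETCONF sketch acknowledges two incidents in one request; the incident-id values are illustrative.

   <rpc message-id="102"
        xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
     <incident-acknowledge
         xmlns="urn:ietf:params:xml:ns:yang:ietf-incident">
       <incident-id>3e8c5f6e-0f72-4a94-8b6a-2d0152f26a4b</incident-id>
       <incident-id>9b2f1c44-5d1e-4d0a-9a51-7c3d2e8f10aa</incident-id>
     </incident-acknowledge>
   </rpc>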

7.4. Incident Diagnose

    +---x incident-diagnose
    |  +---w input
    |  |  +---w incident-id*
    |  |          -> /inc:incidents/inc:incident/inc:incident-id

After an incident is generated, the incident-diagnose rpc can be used to diagnose the incident and locate the root causes. Diagnosis can rely on detection tasks such as BFD detection, flow detection, telemetry collection, short-term threshold alarms, configuration error checks, or test packet injection.

After the diagnosis is performed, an incident update notification is triggered to report the latest status of the incident.
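
For example, a client might trigger diagnosis of a single incident with the following sketch (the incident-id value is illustrative); the diagnosis result is then reported in a subsequent incident-notification.

   <rpc message-id="103"
        xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
     <incident-diagnose
         xmlns="urn:ietf:params:xml:ns:yang:ietf-incident">
       <incident-id>3e8c5f6e-0f72-4a94-8b6a-2d0152f26a4b</incident-id>
     </incident-diagnose>
   </rpc>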

7.5. Incident Resolution

    +---x incident-resolve
       +---w input
       |  +---w incident-id*
       |          -> /inc:incidents/inc:incident/inc:incident-id

After the root causes and impacts are determined, incident-resolve rpc can be used to resolve the incident (if the server can resolve it). How to resolve an incident instance is out of the scope of this document.

The incident-resolve rpc allows multiple incident instances to be resolved at a time. If an incident instance is successfully resolved, a notification is triggered to update the incident status to 'cleared'. If the incident content changes during this process, an update notification is triggered.
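
For example, a client might request resolution of an incident with the following sketch (the incident-id value is illustrative); the server answers with an <ok/> rpc-reply and later reports the outcome through an incident-notification with status 'cleared'.

   <rpc message-id="104"
        xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
     <incident-resolve
         xmlns="urn:ietf:params:xml:ns:yang:ietf-incident">
       <incident-id>3e8c5f6e-0f72-4a94-8b6a-2d0152f26a4b</incident-id>
     </incident-resolve>
   </rpc>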

8. Incident Management YANG Module

<CODE BEGINS>
       file="ietf-incident-types@2023-05-16.yang"
module ietf-incident-types {
  yang-version "1.1";
  namespace "urn:ietf:params:xml:ns:yang:ietf-incident-types";
  prefix "int";

  import ietf-network {
    prefix nw;
    reference
      "RFC 8345: A YANG Data Model for Network Topologies";
  }

  organization
    "IETF OPSAWG Working Group";
  contact
    "WG Web:   <https://datatracker.ietf.org/wg/opsawg/>;
     WG List:  <mailto:opsawg@ietf.org>
     Author:   Chong Feng  <mailto:frank.fengchong@huawei.com>
     Author:   Tong Hu  <mailto:hutong@cmhi.chinamobile.com>
     Author:   Luis Miguel Contreras Murillo <mailto:
               luismiguel.contrerasmurillo@telefonica.com>
     Author :  Thomas Graf   <mailto:thomas.graf@swisscom.com>
     Author :  Qin Wu   <mailto:bill.wu@huawei.com>
     Author:   Chaode Yu   <mailto:yuchaode@huawei.com>
     Author:   Nigel Davis   <mailto:ndavis@ciena.com>";

  description
    "This module defines the identities and typedefs for
     incident management.

     Copyright (c) 2022 IETF Trust and the persons identified as
     authors of the code.  All rights reserved.

     Redistribution and use in source and binary forms, with or
     without modification, is permitted pursuant to, and subject
     to the license terms contained in, the Revised BSD License
     set forth in Section 4.c of the IETF Trust's Legal Provisions
     Relating to IETF Documents
     (https://trustee.ietf.org/license-info).
     This version of this YANG module is part of RFC XXXX; see the
     RFC itself for full legal notices.  ";
  revision 2023-05-16 {
    description "initial version";
    reference "RFC XXXX: YANG module for incident management.";
  }
  //identities
  identity incident-domain {
    description "The abstract identity to indicate the domain of
                 an incident.";
  }
  identity single-domain {
    base incident-domain;
    description "single domain.";
  }
  identity access {
    base single-domain;
    description "access domain.";
  }
  identity ran {
    base access;
    description "radio access network domain.";
  }
  identity transport {
    base single-domain;
    description "transport domain.";
  }
  identity otn {
    base transport;
    description "optical transport network domain.";
  }
  identity ip {
    base transport;
    description "ip transport network domain.";
  }
  identity ptn {
    base ip;
    description "packet transport network domain.";
  }

  identity cross-domain {
    base incident-domain;
    description "cross domain.";
  }
  identity incident-category {
    description "The abstract identity for incident category.";
  }
  identity device {
    base incident-category;
    description "device category.";
  }
  identity power-environment {
    base device;
    description "power system category.";
  }
  identity device-hardware {
    base device;
    description "hardware of device category.";
  }
  identity device-software {
    base device;
    description "software of device category";
  }
  identity line-card {
    base device-hardware;
    description "line card category.";
  }
  identity maintenance {
    base incident-category;
    description "maintenance category.";
  }
  identity network {
    base incident-category;
    description "network category.";
  }
  identity protocol {
    base incident-category;
    description "protocol category.";
  }
  identity overlay {
    base incident-category;
    description "overlay category";
  }
  identity vm {
    base incident-category;
    description "vm category.";
  }

  //typedefs
  typedef incident-priority {
    type enumeration {
      enum critical {
        description "the incident MUST be handled immediately.";
      }
      enum high {
        description "the incident should be handled as soon as
                     possible.";
      }
      enum medium {
        description "network services are not affected, or the
                     services are slightly affected,but corrective
                     measures need to be taken.";
      }
      enum low {
        description "potential or imminent service-affecting
                     incidents are detected,but services are
                     not affected currently.";
      }
    }
    description "define the priority of incident.";
  }
  typedef node-ref {
    type leafref {
      path "/nw:networks/nw:network/nw:node/nw:node-id";
    }
    description "reference a network node.";
  }
}

<CODE ENDS>
<CODE BEGINS>
       file="ietf-incident@2023-05-16.yang"
module ietf-incident {
  yang-version 1.1;
  namespace "urn:ietf:params:xml:ns:yang:ietf-incident";
  prefix inc;
  import ietf-yang-types {
    prefix yang;
    reference
      "RFC 6991: Common YANG Data Types";
  }

  import ietf-alarms {
    prefix al;
    reference
      "RFC 8632: A YANG Data Model for Alarm Management";
  }

  import ietf-incident-types {
    prefix int;
    reference
      "draft-feng-opsawg-incident-management: Incident
       Management for Network Services";
  }
  organization
    "IETF OPSAWG Working Group";
  contact
    "WG Web:   <https://datatracker.ietf.org/wg/opsawg/>;
     WG List:  <mailto:opsawg@ietf.org>
     Author:   Chong Feng  <mailto:frank.fengchong@huawei.com>
     Author:   Tong Hu  <mailto:hutong@cmhi.chinamobile.com>
     Author:   Luis Miguel Contreras Murillo <mailto:
               luismiguel.contrerasmurillo@telefonica.com>
     Author :  Qin Wu   <mailto:bill.wu@huawei.com>
     Author:   Chaode Yu   <mailto:yuchaode@huawei.com>
     Author:   Nigel Davis   <mailto:ndavis@ciena.com>";

 description
    "This module defines the interfaces for incident management
     lifecycle.

     This module is intended for the following use cases:
     * incident lifecycle management:
       - incident report: report incident instance to client
                          when an incident instance is detected.
       - incident acknowledge: acknowledge an incident instance.
       - incident diagnose: diagnose an incident instance.
       - incident resolve: resolve an incident instance.

     Copyright (c) 2022 IETF Trust and the persons identified as
     authors of the code.  All rights reserved.

     Redistribution and use in source and binary forms, with or
     without modification, is permitted pursuant to, and subject
     to the license terms contained in, the Revised BSD License
     set forth in Section 4.c of the IETF Trust's Legal Provisions
     Relating to IETF Documents
     (https://trustee.ietf.org/license-info).
     This version of this YANG module is part of RFC XXXX; see the
     RFC itself for full legal notices.  ";
  revision 2023-05-16 {
    description "Moved identities and typedefs to an independent
                 YANG module; updated some definitions of the data
                 model.";
    reference "RFC XXXX: YANG module for incident management.";
  }
  revision 2023-03-13 {
    description "initial version";
    reference "RFC XXXX: YANG module for incident management.";
  }

  //groupings
  grouping resources-info {
    description "the grouping which defines the network
                 resources of a node.";
    leaf node {
      type int:node-ref;
      description "reference to a network node.";
    }
    list resource {
      key name;
      description "the resources of a network node.";
      leaf name {
         type al:resource;
         description "network resource name.";
      }
    }
  }

  grouping incident-time-info {
    description "the grouping defines incident time information.";
    leaf raise-time {
      type yang:date-and-time;
      description "the time when an incident instance is raised.";
    }
    leaf occur-time {
      type yang:date-and-time;
      description "the time when an incident instance occurs.
                   It's the occurrence time of the first event during
                   incident detection.";
    }
    leaf clear-time {
      type yang:date-and-time;
      description "the time when an incident instance is
                   resolved.";
    }
    leaf ack-time {
      type yang:date-and-time;
      description "the time when an incident instance is
                   acknowledged.";
    }
    leaf last-updated {
      type yang:date-and-time;
      description "the latest time when an incident instance is
                   updated";
    }
  }

  grouping incident-info {
    description "the grouping defines the information of an
                 incident.";
    leaf csn {
      type uint64;
      mandatory true;
      description "The sequence number of the incident instance.";
    }
    leaf-list service-instance {
      type string;
      description "the related network service instances of
                   the incident instance.";
    }
    leaf name {
      type string;
      mandatory true;
      description "the name of an incident.";
    }
    leaf type {
      type enumeration {
        enum fault {
          description "It indicates the type of the incident
                       is a fault, for example an interface
                       fails to work.";
        }
        enum potential-risk {
          description "It indicates the type of the incident
                       is a potential risk, for example high
                       CPU rate may cause a fault in the
                       future.";
        }
      }
      mandatory true;
      description "The type of an incident.";
    }
    leaf domain {
      type identityref {
        base int:incident-domain;
      }
      mandatory true;
      description "the domain of an incident.";
    }
    leaf priority {
      type int:incident-priority;
      mandatory true;
      description "the priority of an incident instance.";
    }

    leaf status {
      type enumeration {
        enum raised {
          description "an incident instance is raised.";
        }
        enum updated {
          description "the information of an incident instance
                       is updated.";
        }
        enum cleared {
          description "an incident is cleared.";
        }
      }
      default raised;
      description "The status of an incident instance.";
    }
    leaf ack-status {
      type enumeration {
        enum acknowledged {
          description "The incident has been acknowledged by user.";
        }
        enum unacknowledged {
          description "The incident hasn't been acknowledged.";
        }
      }
      default unacknowledged;
      description "the acknowledge status of an incident.";
    }
    leaf category {
      type identityref {
        base int:incident-category;
      }
      mandatory true;
      description "The category of an incident.";
    }
    leaf detail {
      type string;
      description "detail information of this incident.";
    }
    leaf resolve-advice {
      type string;
      description "The advice to resolve this incident.";
    }
    container sources {
      description "The source components.";
      list source {
        key node;
        uses resources-info;
        min-elements 1;
        description "The source components of incident.";
      }
    }

    container root-causes{
      description "The root cause objects.";
      list root-cause {
        key node;
        description "the root causes of incident.";
        grouping root-cause-info {
          description "The information of root cause.";
          leaf cause-name {
            type string;
            description "the name of cause";
          }
          leaf detail {
            type string;
            description "the detail information of the cause.";
          }
        }
        uses resources-info {
          augment resource {
            description "augment root cause information.";
            //if root cause object is a resource of a node
            uses root-cause-info;
          }
        }
        //if root cause object is a node
        uses root-cause-info;
      }
    }
    container root-events {
      description "the root events of the incident.";
      list root-event {
        key "type event-id";
        description "the root event of the incident.";
        leaf type {
          type leafref {
            path "../../../events/event/type";
          }
          description "the event type.";
        }
        leaf event-id {
          type leafref {
            path "../../../events/event[type = current()/../type]"
                +"/event-id";
          }
          description "the event identifier, such as uuid,
                       sequence number, etc.";
        }
      }
    }
    container events {
      description "related events.";
      list event {
        key "type event-id";
        description "related events.";
        leaf type {
          type enumeration {
            enum alarm {
              description "alarm type";
            }
            enum inform {
              description "inform type";
            }
            enum KPI {
              description "KPI type";
            }
            enum unknown {
              description "unknown type";
            }
          }
          description "event type.";
        }
        leaf event-id {
          type string;
          description "the event identifier, such as uuid,
                       sequence number, etc.";
        }
        choice event-type-info {
          description "event type information.";
          case alarm {
            when "type = 'alarm'";
            container alarm {
              description "alarm type event.";
              leaf resource {
                type leafref {
                  path "/al:alarms/al:alarm-list/al:alarm"
                      +"/al:resource";
                }
                description "network resource.";
                reference "RFC 8632: A YANG Data Model for Alarm
                           Management";
              }
              leaf alarm-type-id {
                type leafref {
                  path "/al:alarms/al:alarm-list/al:alarm"
                      +"[al:resource = current()/../resource]"
                      +"/al:alarm-type-id";
                }
                description "alarm type id";
                reference "RFC 8632: A YANG Data Model for Alarm
                            Management";
              }
              leaf alarm-type-qualifier {
                type leafref {
                  path "/al:alarms/al:alarm-list/al:alarm"
                      +"[al:resource = current()/../resource]"
                      +"[al:alarm-type-id = current()/.."
                      +"/alarm-type-id]/al:alarm-type-qualifier";
                }
                description "alarm type qualifier";
                reference "RFC 8632: A YANG Data Model for Alarm
                           Management";
              }
            }
          }
          case notification {
            //TODO
          }
          case log {
          //TODO
          }
          case KPI {
          //TODO
          }
          case unknown {
          //TODO
          }
        }
      }

    }

  }

  //data definitions
  container incidents {
    config false;
    description "the information of incidents.";
    list incident {
      key incident-id;
      description "the information of incident.";
      leaf incident-id {
        type string;
        description "the identifier of an incident instance.";
      }
      uses incident-info;
      uses incident-time-info;
    }
  }

  // notifications
  notification incident-notification {
    description "incident notification. It will be triggered when
                 the incident is raised, updated or cleared.";
    leaf incident-id {
      type leafref {
        path "/inc:incidents/inc:incident/inc:incident-id";
      }
      description "the identifier of an incident instance.";
    }
    uses incident-info;
    leaf time {
      type yang:date-and-time;
      description "occur time of an incident instance.";
    }
  }
  // rpcs
  rpc incident-acknowledge {
    description "This rpc can be used to acknowledge the specified
                 incidents.";
    input {
      leaf-list incident-id {
        type leafref {
          path "/inc:incidents/inc:incident/inc:incident-id";
        }
        description "the identifier of an incident instance.";
      }
    }
  }
  rpc incident-diagnose {
    description "This rpc can be used to diagnose the specified
                 incidents. The result of diagnosis will be reported
                 by incident notification.";
    input {
      leaf-list incident-id {
        type leafref {
          path "/inc:incidents/inc:incident/inc:incident-id";
        }
        description
          "the identifier of an incident instance.";
      }
    }
  }

  rpc incident-resolve {
    description "This rpc can be used to resolve the specified
                 incidents. The result of resolution will be reported
                 by incident notification.";
    input {
      leaf-list incident-id {
        type leafref {
          path "/inc:incidents/inc:incident/inc:incident-id";
        }
        description
          "the identifier of an incident instance.";
      }
    }
  }
}

<CODE ENDS>

9. IANA Considerations

9.1. The "IETF XML" Registry

This document registers two XML namespace URNs in the 'IETF XML Registry', following the format defined in [RFC3688].

URI: urn:ietf:params:xml:ns:yang:ietf-incident
Registrant Contact: The IESG.
XML: N/A; the requested URI is an XML namespace.

URI: urn:ietf:params:xml:ns:yang:ietf-incident-types
Registrant Contact: The IESG.
XML: N/A; the requested URI is an XML namespace.

9.2. The "YANG Module Names" Registry

This document registers two YANG module names in the 'YANG Module Names' registry, defined in [RFC6020].

name: ietf-incident
prefix: inc
namespace: urn:ietf:params:xml:ns:yang:ietf-incident
RFC: XXXX

name: ietf-incident-types
prefix: int
namespace: urn:ietf:params:xml:ns:yang:ietf-incident-types
RFC: XXXX
// RFC Ed.: replace XXXX and remove this comment

10. Security Considerations

The YANG modules specified in this document define a schema for data that is designed to be accessed via network management protocols such as NETCONF [RFC6241] or RESTCONF [RFC8040]. The lowest NETCONF layer is the secure transport layer, and the mandatory-to-implement secure transport is Secure Shell (SSH) [RFC6242]. The lowest RESTCONF layer is HTTPS, and the mandatory-to-implement secure transport is TLS [RFC8446].

The Network Configuration Access Control Model (NACM) [RFC8341] provides the means to restrict access for particular NETCONF or RESTCONF users to a preconfigured subset of all available NETCONF or RESTCONF protocol operations and content.

The "ietf-incident" module defines the "incidents" subtree as read-only ("config false"), and the "ietf-incident-types" module defines only identities and typedefs. These modules therefore do not define any writable, creatable, or deletable data nodes, and the usual concerns about write operations (e.g., edit-config) do not apply.

Some of the readable data nodes in these YANG modules may be considered sensitive or vulnerable in some network environments. It is thus important to control read access (e.g., via get, get-config, or notification) to these data nodes. These are the subtrees and data nodes and their sensitivity/vulnerability:

/incidents/incident: The incident list exposes the affected network service instances, the source and root cause network resources, and the related events. Unauthorized read access to this subtree could reveal the topology and the operational health of the operator's network.

Some of the RPC operations in this YANG module may be considered sensitive or vulnerable in some network environments. It is thus important to control access to these operations. These are the operations and their sensitivity/vulnerability:

incident-acknowledge, incident-diagnose, and incident-resolve: These operations change the state of incident instances and may trigger diagnosis or resolution actions in the network. Unauthorized use of these operations could disturb running network services or hide real incidents from operators.

11. Contributors

Aihua Guo
Futurewei Technologies
aihuaguo.ietf@gmail.com
Zhidong Yin
Huawei
yinzhidong@huawei.com
Guoxiang Liu
Huawei
liuguoxiang@huawei.com
Kaichun Wu
Huawei
wukaichun@huawei.com
Yanlei Zheng
China Unicom
zhengyanlei@chinaunicom.cn
Yunbin Xu
CAICT
xuyunbin@caict.ac.cn
Xing Zhao
CAICT
zhaoxing@caict.ac.cn

12. Acknowledgments

The authors would like to thank Mohamed Boucadair, Robert Wilton, Benoit Claise, Oscar Gonzalez de Dios, Mahesh Jethanandani, Balazs Lengyel, Bo Wu, Qiufang Ma, Haomian Zheng, YuanYao for their valuable comments and great input to this work.

13. References

13.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC3688]
Mealling, M., "The IETF XML Registry", BCP 81, RFC 3688, DOI 10.17487/RFC3688, January 2004, <https://www.rfc-editor.org/info/rfc3688>.
[RFC6020]
Bjorklund, M., Ed., "YANG - A Data Modeling Language for the Network Configuration Protocol (NETCONF)", RFC 6020, DOI 10.17487/RFC6020, October 2010, <https://www.rfc-editor.org/info/rfc6020>.
[RFC7950]
Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language", RFC 7950, DOI 10.17487/RFC7950, August 2016, <https://www.rfc-editor.org/info/rfc7950>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.

13.2. Informative References

[I-D.ietf-ippm-pam]
Mirsky, G., Halpern, J. M., Min, X., Clemm, A., Strassner, J., and J. François, "Precision Availability Metrics for Services Governed by Service Level Objectives (SLOs)", Work in Progress, Internet-Draft, draft-ietf-ippm-pam-08, <https://datatracker.ietf.org/doc/html/draft-ietf-ippm-pam-08>.
[I-D.quilbeuf-opsawg-configuration-tracing]
Quilbeuf, J., Claise, B., Graf, T., Lopez, D., and S. Qiong, "External Trace ID for Configuration Tracing", Work in Progress, Internet-Draft, draft-quilbeuf-opsawg-configuration-tracing-02, <https://datatracker.ietf.org/doc/html/draft-quilbeuf-opsawg-configuration-tracing-02>.
[I-D.rogaglia-netconf-trace-ctx-extension]
Gagliano, R., Larsson, K., and J. Lindblad, "NETCONF Extension to support Trace Context propagation", Work in Progress, Internet-Draft, draft-rogaglia-netconf-trace-ctx-extension-03, <https://datatracker.ietf.org/doc/html/draft-rogaglia-netconf-trace-ctx-extension-03>.
[I-D.tgraf-yang-push-observation-time]
Graf, T., Claise, B., and A. H. Feng, "Support of Network Observation Timestamping in YANG Notifications", Work in Progress, Internet-Draft, draft-tgraf-yang-push-observation-time-00, <https://datatracker.ietf.org/doc/html/draft-tgraf-yang-push-observation-time-00>.
[RFC8345]
Clemm, A., Medved, J., Varga, R., Bahadur, N., Ananthakrishnan, H., and X. Liu, "A YANG Data Model for Network Topologies", RFC 8345, DOI 10.17487/RFC8345, March 2018, <https://www.rfc-editor.org/info/rfc8345>.
[RFC8632]
Vallin, S. and M. Bjorklund, "A YANG Data Model for Alarm Management", RFC 8632, DOI 10.17487/RFC8632, September 2019, <https://www.rfc-editor.org/info/rfc8632>.
[RFC8969]
Wu, Q., Ed., Boucadair, M., Ed., Lopez, D., Xie, C., and L. Geng, "A Framework for Automating Service and Network Management with YANG", RFC 8969, DOI 10.17487/RFC8969, January 2021, <https://www.rfc-editor.org/info/rfc8969>.
[RFC9375]
Wu, B., Ed., Wu, Q., Ed., Boucadair, M., Ed., Gonzalez de Dios, O., and B. Wen, "A YANG Data Model for Network and VPN Service Performance Monitoring", RFC 9375, DOI 10.17487/RFC9375, April 2023, <https://www.rfc-editor.org/info/rfc9375>.
[RFC9408]
Boucadair, M., Ed., Gonzalez de Dios, O., Barguil, S., Wu, Q., and V. Lopez, "A YANG Network Data Model for Service Attachment Points (SAPs)", RFC 9408, DOI 10.17487/RFC9408, June 2023, <https://www.rfc-editor.org/info/rfc9408>.
[RFC9417]
Claise, B., Quilbeuf, J., Lopez, D., Voyer, D., and T. Arumugam, "Service Assurance for Intent-Based Networking Architecture", RFC 9417, DOI 10.17487/RFC9417, July 2023, <https://www.rfc-editor.org/info/rfc9417>.
[RFC9418]
Claise, B., Quilbeuf, J., Lucente, P., Fasano, P., and T. Arumugam, "A YANG Data Model for Service Assurance", RFC 9418, DOI 10.17487/RFC9418, July 2023, <https://www.rfc-editor.org/info/rfc9418>.
[W3C-Trace-Context]
W3C, "W3C Recommendation on Trace Context", 23 November 2021, <https://www.w3.org/TR/2021/REC-trace-context-1-20211123/>.

Appendix A. Changes between revisions

[[RFC editor: please remove this section before publication.]]

v02 - v03

v01 - v02

v00 - v01

Authors' Addresses

Chong Feng (editor)
Huawei
101 Software Avenue, Yuhua District
Nanjing
Jiangsu, 210012
China
Tong Hu
China Mobile (Hangzhou) Information Technology Co., Ltd
Building A01, 1600 Yuhangtang Road, Wuchang Street, Yuhang District
Hangzhou
ZheJiang, 311121
China
Luis Miguel Contreras Murillo
Telefonica I+D
Madrid
Spain
Thomas Graf
Swisscom
Binzring 17
CH-8045 Zurich
Switzerland
Qin Wu
Huawei
101 Software Avenue, Yuhua District
Nanjing
Jiangsu, 210012
China
Chaode Yu
Huawei
Nigel Davis
Ciena