Internet-Draft | Knowledge Graphs & Incident Management | August 2024 |
Tailhardat, et al. | Expires 20 February 2025 | [Page] |
Operational efficiency in incident management on telecom and computer networks requires correlating and interpreting large volumes of heterogeneous technical information. Knowledge graphs can provide a unified view of complex systems through shared vocabularies. YANG data models enable describing network configurations and automating their deployment. However, both approaches face challenges in vocabulary alignment and adoption, hindering knowledge capitalization and sharing on network designs and best practices. To address this, the concept of a meta-knowledge graph is introduced to leverage existing network infrastructure descriptions in YANG format and enable abstract reasoning on network behaviors. An experiment is proposed to assess the potential of the meta-knowledge graph in improving network quality and designs.¶
This note is to be removed before publishing as an RFC.¶
The latest revision of this draft can be found at https://genears.github.io/draft-tailhardat-nmop-incident-management-noria/draft-tailhardat-nmop-incident-management-noria.html. Status information for this document may be found at https://datatracker.ietf.org/doc/draft-tailhardat-nmop-incident-management-noria/.¶
Discussion of this document takes place on the Network Management Operations Working Group mailing list (mailto:nmop@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/nmop/. Subscribe at https://www.ietf.org/mailman/listinfo/nmop/.¶
Source for this draft and an issue tracker can be found at https://github.com/genears/draft-tailhardat-nmop-incident-management-noria.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 20 February 2025.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Incident management on telecom and computer networks, whether it is related to infrastructure or cybersecurity issues, requires the ability to simultaneously and quickly correlate and interpret a large number of heterogeneous technical information sources. Knowledge Graphs (KGs), by structuring heterogeneous data through shared vocabularies, enable providing a unified view of complex technical systems, their ecosystem, and the activities and operations related to them (see [I-D.marcas-nmop-knowledge-graph-yang] and [NORIA-O-2024]). Using such formal knowledge representation allows for a simplified interpretation of networks and their behavior, both for NetOps & SecOps teams and artificial intelligence (AI) algorithms (e.g. anomaly detection, root cause analysis, diagnostic aid, situation summarization), and paves the way, in line with the Network Digital Twin vision [I-D.irtf-nmrg-network-digital-twin-arch], for the development of tools for detecting and analyzing complex network incident situations through explainable, actionable, and shareable models (see [FOLIO-2018], [SLKG-2023], and [GPL-2024]).¶
However, despite potential benefits of using knowledge graphs, these are not mainstream yet in commercial network deployment systems and decision support systems (see [NORIA-UI-2024] for more on the decision support systems perspective). YANG is a widely used standard among operators for describing network configurations and automating their deployment. Using YANG representations in the form of a KG, as suggested in [I-D.marcas-nmop-knowledge-graph-yang], would minimize the effort required to adapt network management tools towards the unified vision and applications evoked above. The lack of alignment between various YANG models on key concepts (e.g. for describing network topology) is, however, hindering this evolution [I-D.boucadair-nmop-rfc3535-20years-later].¶
Furthermore, although [I-D.netana-nmop-network-anomaly-lifecycle] addresses the capitalization of incident management knowledge through a YANG model, it can be observed that the overall scope of YANG models does not naturally cover the description of the networks' ecosystem (e.g. physical equipment location, operator organization, supervision systems) or the description of network operations from an IT service management (ITSM) perspective (e.g. business processes and design rules used by the company, scheduled modification operations, remediation actions performed during incident handling). As a consequence, the continuous improvement of network quality & designs requires additional data cross-referencing operations to properly contextualize incidents and learn from remediation actions taken (e.g. analyzing intervention technicians' verbatim, comparing actions performed on similar incidents but occurring on different networks). As a result of these additional efforts of contextualization, the capitalization of knowledge typically remains confined at the level of each network operator. This, in turn, hinders the sharing of information within the community of researchers and system designers regarding failure modes and best practices to adopt, considering the concept of overall improvement of IT systems and the Internet.¶
Realizing an ITSM knowledge graph for network deployment, anomaly detection and risk management applications has been studied for several years in the Semantic Web community (i.e. knowledge representation and automated reasoning leveraging Web technologies such as [RDF], [RDFS], [OWL], and [SKOS]). Among other examples: the DevOpsInfra ontology [DevOpsInfra-2021] allows for describing sets of computing resources and how they are allocated for hosting services; the NORIA-O ontology [NORIA-UI-2024] allows for describing a network infrastructure & ecosystem, its events, diagnosis and repair actions performed during incident management. Assuming the continuous integration into a knowledge graph of data from ticketing systems, network monitoring solutions, and network configuration management databases, we remark that the resulting knowledge graph (Figure 1) implicitely holds the necessary information to (automatically) learn incident contexts (i.e. the network topology, its set of states and set of events prior to the incident) and remediation procedures (i.e. the set of actions and network configuration changes carried-out to resolve the incident).¶
By going a step further, we notice that a generic understanding of incident context can be extracted and shared among operators from knowledge graphs. Indeed, a knowledge graph, being an instantiation of shared vocabularies (e.g. RDFS/OWL ontologies and controlled vocabularies in SKOS syntax), sharing incident signatures can be done without revealing infrastructure details (e.g. hostname, IP address), but rather the abstract representation of the network (i.e. the class of the knowlegde graph entities and relationships, such as "server" or "router", and or "IPoWDM link").¶
The remainder of this document is organized as follows. Firstly, the concept of a meta-knowledge graph is introduced to leverage existing network infrastructure descriptions in YANG format and enable abstract reasoning on network behaviors. Secondly, an experiment is proposed to assess the potential of the meta-knowledge graph in improving network quality and designs. In addition to the main parts of the proposal, the document also covers data integration and data federation architectures in the Security Considerations section. This section specifically addresses the handling of event data streams and the provision of a unified view for different stakeholders.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
As this document covers the meta-knowledge graph concepts, and use cases, there is no specific security considerations.¶
However, as the concept of a meta-knowledge graph involves the construction of a multi-faceted graph (i.e. including network topologies, operational data, and service and client data), it poses the risk of simplifying access to network operational data and functions that fall outside the knowledge graph users' responsibility or that could facilitate the intervention of malicious individuals. To support the discussion on mitigating this risk, we suggest referring to Figure 6, which illustrates the concept of partial access to the meta-knowledge graph based on rights associated with each user group (UG) at the data domain level. We also recommend referring to [AMO-2012] for an example of implemention of access rights in a content management system that relies on Semantic Web models and technologies. This implementation uses the AMO ontology, which includes a set of classes and properties for annotating resources that require access control, as well as a base of inference rules that model the access management strategy to carry out.¶
This document has no IANA actions.¶
We would like to thank Benoit Claise for spontaneously seeking to include the work of the NORIA research project in the vision of the NMOP working group through direct contact.¶
We would also like to thank Fano Rampary for his initial analysis of the possibilities of defining a model conversion algebra for going from Yang data models to OWL ontologies.¶