Knowledge Graphs for Enhanced Cross-Operator Incident Management and Network Design

Internet-Draft	Knowledge Graphs & Incident Management	August 2024
Tailhardat, et al.	Expires 20 February 2025

Abstract

Operational efficiency in incident management on telecom and computer networks requires correlating and interpreting large volumes of heterogeneous technical information. Knowledge graphs can provide a unified view of complex systems through shared vocabularies. YANG data models enable describing network configurations and automating their deployment. However, both approaches face challenges in vocabulary alignment and adoption, hindering knowledge capitalization and sharing on network designs and best practices. To address this, the concept of a meta-knowledge graph is introduced to leverage existing network infrastructure descriptions in YANG format and enable abstract reasoning on network behaviors. An experiment is proposed to assess the potential of the meta-knowledge graph in improving network quality and designs.

1. Introduction

Incident management on telecom and computer networks, whether related to infrastructure or cybersecurity issues, requires quickly correlating and interpreting a large number of heterogeneous technical information sources. Knowledge Graphs (KGs) structure heterogeneous data through shared vocabularies and thereby provide a unified view of complex technical systems, their ecosystem, and the activities and operations related to them (see [I-D.marcas-nmop-knowledge-graph-yang] and [NORIA-O-2024]). Such a formal knowledge representation simplifies the interpretation of networks and their behavior, both for NetOps & SecOps teams and for artificial intelligence (AI) algorithms (e.g. anomaly detection, root cause analysis, diagnostic aid, situation summarization). In line with the Network Digital Twin vision [I-D.irtf-nmrg-network-digital-twin-arch], it also paves the way for tools that detect and analyze complex network incident situations through explainable, actionable, and shareable models (see [FOLIO-2018], [SLKG-2023], and [GPL-2024]).

However, despite their potential benefits, knowledge graphs are not yet mainstream in commercial network deployment systems and decision support systems (see [NORIA-UI-2024] for more on the decision support systems perspective). YANG is a widely used standard among operators for describing network configurations and automating their deployment. Representing YANG data in the form of a KG, as suggested in [I-D.marcas-nmop-knowledge-graph-yang], would minimize the effort required to adapt network management tools to the unified view and applications described above. The lack of alignment between YANG models on key concepts (e.g. for describing network topology) is, however, hindering this evolution [I-D.boucadair-nmop-rfc3535-20years-later].

Furthermore, although [I-D.netana-nmop-network-anomaly-lifecycle] addresses the capitalization of incident management knowledge through a YANG model, the overall scope of YANG models does not naturally cover the description of the networks' ecosystem (e.g. physical equipment location, operator organization, supervision systems) or the description of network operations from an IT service management (ITSM) perspective (e.g. business processes and design rules used by the company, scheduled modification operations, remediation actions performed during incident handling). As a consequence, the continuous improvement of network quality and designs requires additional data cross-referencing operations to properly contextualize incidents and learn from the remediation actions taken (e.g. analyzing field technicians' verbatim comments, comparing actions performed on similar incidents occurring on different networks). Because of this additional contextualization effort, the capitalization of knowledge typically remains confined to each network operator. This, in turn, hinders the sharing of information on failure modes and best practices within the community of researchers and system designers, and thereby limits the overall improvement of IT systems and the Internet.

Realizing an ITSM knowledge graph for network deployment, anomaly detection, and risk management applications has been studied for several years in the Semantic Web community (i.e. knowledge representation and automated reasoning leveraging Web technologies such as [RDF], [RDFS], [OWL], and [SKOS]). Among other examples, the DevOpsInfra ontology [DevOpsInfra-2021] allows for describing sets of computing resources and how they are allocated for hosting services, and the NORIA-O ontology [NORIA-O-2024] allows for describing a network infrastructure and its ecosystem, its events, and the diagnosis and repair actions performed during incident management. Assuming the continuous integration into a knowledge graph of data from ticketing systems, network monitoring solutions, and network configuration management databases, we remark that the resulting knowledge graph (Figure 1) implicitly holds the information necessary to (automatically) learn incident contexts (i.e. the network topology, its set of states, and the set of events prior to the incident) and remediation procedures (i.e. the set of actions and network configuration changes carried out to resolve the incident).

┌───Incident context────────────────────────────┐
│                 ┌────────────┐                │
│                 │skos:Concept│                │
│                 └─┬┬─────────┘                │
│                  <server>                     │
│                    ▲                          │
│                    │                          │
│                 resourceType                  │
│         ┌────────┐ │                          │      ┌─────────────┐
│         │Resource│ │                          │      │TroubleTicket│
│         └──────┬┬┘ │                          │      └─────┬┬──────┘
│                ││  │                          │            ││
│        <ne_2>──<ne_1>◄──troubleTicketRelatedResource──<incident_01>
│           │      │                            │            │
│           │      │                            │      problemCategory
│<ne_5>──<ne_4>────┼──<ne_3>────<log_2>         │            │
│           │      │    │                       │            ▼
│           │      │    │                       │       <packet-loss>
│       <log_3>    │  <ne_6>                    │            ││
│                  │                            │       ┌────┴┴──────┐
│     logOriginatingManagedObject               │       │skos:Concept│
│                  │                            │       └────────────┘
│                  ▼                            │
│               <log_1>──────┐                  │
│      ┌─────────┴┴┐     dcterms:type           │
│      │EventRecord│         │                  │
│      └───────────┘         ▼                  │
│                    <integrityViolation>       │
│                       ┌────┴┴──────┐          │
│                       │skos:Concept│          │
│                       └────────────┘          │
└───────────────────────────────────────────────┘

Figure 1: Learning an incident signature, seen as a classification model trained on the relationship between the incident context (i.e. a subgraph centered around a Resource entity concerned by a given TroubleTicket) and the problem class defined at the TroubleTicket entity level. Arrows denote object properties (owl:ObjectProperty); double-line edges denote class membership (rdf:type).
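As an illustration only, the pairing of an incident context with its problem class can be sketched in a few lines of Python. The triples below are loosely transcribed from Figure 1; the `linkedTo` predicate given to the unlabeled edges, the function names, and the `hops` parameter are assumptions made for this sketch and are not defined by NORIA-O or by this document. Edge directions are incidental here, since the traversal is undirected.

```python
from collections import deque

# Triples loosely transcribed from Figure 1 as (subject, predicate, object).
# "linkedTo" is a hypothetical predicate standing in for the unlabeled edges.
TRIPLES = {
    ("incident_01", "troubleTicketRelatedResource", "ne_1"),
    ("incident_01", "problemCategory", "packet-loss"),
    ("ne_1", "resourceType", "server"),
    ("ne_1", "linkedTo", "ne_2"),
    ("ne_2", "linkedTo", "ne_4"),
    ("ne_4", "linkedTo", "ne_5"),
    ("ne_4", "linkedTo", "ne_3"),
    ("ne_3", "linkedTo", "ne_6"),
    ("ne_1", "logOriginatingManagedObject", "log_1"),
    ("log_1", "dcterms:type", "integrityViolation"),
}

# Ticket-level predicates are kept out of the context so that the label
# does not leak into the features.
TICKET_PREDICATES = {"troubleTicketRelatedResource", "problemCategory"}


def incident_context(triples, ticket, hops=1):
    """Return the triples within `hops` undirected steps of the ticket's resources."""
    graph = [t for t in triples if t[1] not in TICKET_PREDICATES]
    seeds = {o for s, p, o in triples
             if s == ticket and p == "troubleTicketRelatedResource"}
    depth = {node: 0 for node in seeds}
    queue = deque(seeds)
    context = set()
    while queue:
        node = queue.popleft()
        if depth[node] == hops:
            continue  # do not expand beyond the requested radius
        for s, p, o in graph:
            if node in (s, o):
                context.add((s, p, o))
                other = o if node == s else s
                if other not in depth:
                    depth[other] = depth[node] + 1
                    queue.append(other)
    return context


def training_example(triples, ticket, hops=1):
    """Pair the incident context (features) with the ticket's problem class (label)."""
    labels = {o for s, p, o in triples
              if s == ticket and p == "problemCategory"}
    return incident_context(triples, ticket, hops), labels


context, labels = training_example(TRIPLES, "incident_01", hops=1)
```

With `hops=1`, the context holds only the triples that touch `<ne_1>`; increasing `hops` widens the neighborhood. A production system would more likely express the extraction as a query over an RDF store, and the choice of radius is a modeling decision that this document does not prescribe.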

Going a step further, we note that a generic understanding of incident contexts can be extracted from knowledge graphs and shared among operators. Indeed, because a knowledge graph is an instantiation of shared vocabularies (e.g. RDFS/OWL ontologies and controlled vocabularies in SKOS syntax), incident signatures can be shared without revealing infrastructure details (e.g. hostname, IP address), exposing only the abstract representation of the network (i.e. the classes of the knowledge graph entities and relationships, such as "server", "router", or "IPoWDM link").
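The abstraction step can be sketched as follows, under stated assumptions: every identifier in the sample data (hostnames, the `linkedTo` and `ipAddress` predicates, the documentation-range IP address) is hypothetical, and the lifting rule is one possible design rather than a method specified by this document. Each instance is replaced by its class, terms from shared controlled vocabularies are kept, and any triple that would still expose an operator-specific identifier or literal is dropped.

```python
from collections import Counter

# Hypothetical instance-level incident context; all identifiers are invented.
CONTEXT = [
    ("srv-db-17", "linkedTo", "paris-edge-01"),
    ("srv-db-17", "logOriginatingManagedObject", "evt-8841"),
    ("evt-8841", "dcterms:type", "integrityViolation"),
    ("srv-db-17", "ipAddress", "192.0.2.17"),
]

# Class of each instance (e.g. as obtained from rdf:type or resourceType).
TYPES = {
    "srv-db-17": "server",
    "paris-edge-01": "router",
    "evt-8841": "EventRecord",
}

# Terms from shared controlled vocabularies, safe to disclose as-is.
SHARED_CONCEPTS = {"integrityViolation", "packet-loss"}


def abstract_signature(context, types, shared_concepts):
    """Lift instance-level triples to class-level patterns and count them."""

    def lift(term):
        if term in types:
            return types[term]      # instance -> its class
        if term in shared_concepts:
            return term             # shared vocabulary term, kept unchanged
        return None                 # unknown term: could leak operator details

    signature = Counter()
    for s, p, o in context:
        lifted_s, lifted_o = lift(s), lift(o)
        if lifted_s is None or lifted_o is None:
            continue                # drop triples carrying private identifiers
        signature[(lifted_s, p, lifted_o)] += 1
    return signature


signature = abstract_signature(CONTEXT, TYPES, SHARED_CONCEPTS)
```

In this sketch the `ipAddress` triple is discarded because its object is neither a typed instance nor a shared concept, so the resulting signature mentions only classes such as "server" and "router". Whether such class-level patterns are abstract enough to prevent re-identification in practice is a separate question that the sketch does not address.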

The remainder of this document is organized as follows. First, the concept of a meta-knowledge graph is introduced to leverage existing network infrastructure descriptions in YANG format and enable abstract reasoning on network behaviors. Second, an experiment is proposed to assess the potential of the meta-knowledge graph in improving network quality and designs. In addition to these main parts, the Security Considerations section covers data integration and data federation architectures, specifically the handling of event data streams and the provision of a unified view for different stakeholders.

6. References

6.1. Normative References

[I-D.havel-nmop-digital-map-concept]: Havel, O., Claise, B., de Dios, O. G., and T. Graf, "Digital Map: Concept, Requirements, and Use Cases", Work in Progress, Internet-Draft, draft-havel-nmop-digital-map-concept-00, 4 July 2024, <https://datatracker.ietf.org/doc/html/draft-havel-nmop-digital-map-concept-00>.
[I-D.netana-nmop-network-anomaly-lifecycle]: Riccobene, V., Roberto, A., Graf, T., Du, W., and A. H. Feng, "Experiment: Network Anomaly Lifecycle", Work in Progress, Internet-Draft, draft-netana-nmop-network-anomaly-lifecycle-03, 8 July 2024, <https://datatracker.ietf.org/doc/html/draft-netana-nmop-network-anomaly-lifecycle-03>.
[RFC2119]: Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174]: Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.
[RFC8345]: Clemm, A., Medved, J., Varga, R., Bahadur, N., Ananthakrishnan, H., and X. Liu, "A YANG Data Model for Network Topologies", RFC 8345, DOI 10.17487/RFC8345, March 2018, <https://www.rfc-editor.org/rfc/rfc8345>.
[RFC9418]: Claise, B., Quilbeuf, J., Lucente, P., Fasano, P., and T. Arumugam, "A YANG Data Model for Service Assurance", RFC 9418, DOI 10.17487/RFC9418, July 2023, <https://www.rfc-editor.org/rfc/rfc9418>.

6.2. Informative References

[AMO-2012]: Buffa, M. and C. Faron-Zucker, "Ontology-Based Access Rights Management", 2012, <https://doi.org/10.1007/978-3-642-25838-1_3>.
[DevOpsInfra-2021]: Corcho, O., Chaves-Fraga, D., Toledo, J., Arenas-Guerrero, J., Badenes-Olmedo, C., Wang, M., Peng, H., Burrett, N., Mora, J., and P. Zhang, "A High-Level Ontology Network for ICT Infrastructures", 2021, <https://doi.org/10.1007/978-3-030-88361-4_26>.
[FLAGSM-2021]: Steenwinckel, B., Paepe, D. D., Hautte, S. V., Heyvaert, P., Bentefrit, M., Moens, P., Dimou, A., Bossche, B. V. D., Turck, F. D., Hoecke, S. V., and F. Ongenae, "FLAGS: A Methodology for Adaptive Anomaly Detection and Root Cause Analysis on Sensor Data Streams by Fusing Expert Knowledge with Machine Learning", 2021, <https://doi.org/10.1016/j.future.2020.10.015>.
[FOLIO-2018]: Steenwinckel, B., Heyvaert, P., Paepe, D. D., Janssens, O., Hautte, S. V., Dimou, A., Turck, F. D., Hoecke, S. V., and F. Ongenae, "Towards Adaptive Anomaly Detection and Root Cause Analysis by Automated Extraction of Knowledge from Risk Analyses", 2018, <https://www.ceur-ws.org/Vol-2213/paper2.pdf>.
[GPL-2024]: Tailhardat, L., Stach, B., Chabot, Y., and R. Troncy, "Graphameleon: Relational Learning and Anomaly Detection on Web Navigation Traces Captured as Knowledge Graphs", 2024, <https://doi.org/10.1145/3589335.3651447>.
[I-D.boucadair-nmop-rfc3535-20years-later]: Boucadair, M., Contreras, L. M., de Dios, O. G., Graf, T., and R. Rahman, "RFC 3535, 20 Years Later: An Update of Operators Requirements on Network Management Protocols and Modelling", Work in Progress, Internet-Draft, draft-boucadair-nmop-rfc3535-20years-later-04, 22 July 2024, <https://datatracker.ietf.org/doc/html/draft-boucadair-nmop-rfc3535-20years-later-04>.
[I-D.irtf-nmrg-network-digital-twin-arch]: Zhou, C., Yang, H., Duan, X., Lopez, D., Pastor, A., Wu, Q., Boucadair, M., and C. Jacquenet, "Network Digital Twin: Concepts and Reference Architecture", Work in Progress, Internet-Draft, draft-irtf-nmrg-network-digital-twin-arch-06, 7 July 2024, <https://datatracker.ietf.org/doc/html/draft-irtf-nmrg-network-digital-twin-arch-06>.
[I-D.marcas-nmop-knowledge-graph-yang]: Martinez-Casanueva, I. D. and L. C. Rodríguez, "Knowledge Graphs for YANG-based Network Management", Work in Progress, Internet-Draft, draft-marcas-nmop-knowledge-graph-yang-03, 5 July 2024, <https://datatracker.ietf.org/doc/html/draft-marcas-nmop-knowledge-graph-yang-03>.
[NORIA-O-2024]: Tailhardat, L., Troncy, R., and Y. Chabot, "NORIA-O: An Ontology for Anomaly Detection and Incident Management in ICT Systems", 2024, <https://doi.org/10.1007/978-3-031-60635-9_2>.
[NORIA-UI-2024]: Tailhardat, L., Chabot, Y., Py, A., and P. Guillemette, "NORIA UI: Efficient Incident Management on Large-Scale ICT Systems Represented as Knowledge Graphs", 2024, <https://doi.org/10.1145/3664476.3670438>.
[OWL]: W3C, "OWL 2 Web Ontology Language Document Overview (Second Edition)", December 2012, <https://www.w3.org/TR/owl2-overview/>.
[RDF]: W3C, "Resource Description Framework (RDF): Concepts and Abstract Syntax", February 2014, <https://www.w3.org/TR/rdf11-concepts/>.
[RDFS]: W3C, "RDF Schema 1.1", February 2014, <https://www.w3.org/TR/rdf-schema/>.
[SKOS]: W3C, "SKOS Simple Knowledge Organization System Reference", August 2009, <https://www.w3.org/TR/skos-reference/>.
[SLKG-2023]: Tailhardat, L., Troncy, R., and Y. Chabot, "Leveraging Knowledge Graphs For Classifying Incident Situations in ICT Systems", 2023, <https://doi.org/10.1145/3600160.3604991>.
[SPARQL11-FQ]: W3C, "SPARQL 1.1 Federated Query", March 2013, <https://www.w3.org/TR/sparql11-federated-query/>.
[SPARQL11-QL]: W3C, "SPARQL 1.1 Query Language", March 2013, <https://www.w3.org/TR/sparql11-query/>.
