| rfc9417.original | rfc9417.txt | |||
|---|---|---|---|---|
| OPSAWG B. Claise | Internet Engineering Task Force (IETF) B. Claise | |||
| Internet-Draft J. Quilbeuf | Request for Comments: 9417 J. Quilbeuf | |||
| Intended status: Informational Huawei | Category: Informational Huawei | |||
| Expires: 7 July 2023 D. Lopez | ISSN: 2070-1721 D. Lopez | |||
| Telefonica I+D | Telefonica I+D | |||
| D. Voyer | D. Voyer | |||
| Bell Canada | Bell Canada | |||
| T. Arumugam | T. Arumugam | |||
| Cisco Systems, Inc. | Consultant | |||
| 3 January 2023 | June 2023 | |||
| Service Assurance for Intent-based Networking Architecture | Service Assurance for Intent-Based Networking Architecture | |||
| draft-ietf-opsawg-service-assurance-architecture-13 | ||||
| Abstract | Abstract | |||
| This document describes an architecture that aims at assuring that | This document describes an architecture that provides some assurance | |||
| service instances are running as expected. As services rely upon | that service instances are running as expected. As services rely | |||
| multiple sub-services provided by a variety of elements including the | upon multiple subservices provided by a variety of elements, | |||
| underlying network devices and functions, getting the assurance of a | including the underlying network devices and functions, getting the | |||
| healthy service is only possible with a holistic view of all involved | assurance of a healthy service is only possible with a holistic view | |||
| elements. This architecture not only helps to correlate the service | of all involved elements. This architecture not only helps to | |||
| degradation with symptoms of a specific network component but also to | correlate the service degradation with symptoms of a specific network | |||
| list the services impacted by the failure or degradation of a | component but also lists the services impacted by the failure or | |||
| specific network component. | degradation of a specific network component. | |||
| Status of This Memo | Status of This Memo | |||
| This Internet-Draft is submitted in full conformance with the | This document is not an Internet Standards Track specification; it is | |||
| provisions of BCP 78 and BCP 79. | published for informational purposes. | |||
| Internet-Drafts are working documents of the Internet Engineering | ||||
| Task Force (IETF). Note that other groups may also distribute | ||||
| working documents as Internet-Drafts. The list of current Internet- | ||||
| Drafts is at https://datatracker.ietf.org/drafts/current/. | ||||
| Internet-Drafts are draft documents valid for a maximum of six months | This document is a product of the Internet Engineering Task Force | |||
| and may be updated, replaced, or obsoleted by other documents at any | (IETF). It represents the consensus of the IETF community. It has | |||
| time. It is inappropriate to use Internet-Drafts as reference | received public review and has been approved for publication by the | |||
| material or to cite them other than as "work in progress." | Internet Engineering Steering Group (IESG). Not all documents | |||
| approved by the IESG are candidates for any level of Internet | ||||
| Standard; see Section 2 of RFC 7841. | ||||
| This Internet-Draft will expire on 7 July 2023. | Information about the current status of this document, any errata, | |||
| and how to provide feedback on it may be obtained at | ||||
| https://www.rfc-editor.org/info/rfc9417. | ||||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2023 IETF Trust and the persons identified as the | Copyright (c) 2023 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents (https://trustee.ietf.org/ | Provisions Relating to IETF Documents | |||
| license-info) in effect on the date of publication of this document. | (https://trustee.ietf.org/license-info) in effect on the date of | |||
| Please review these documents carefully, as they describe your rights | publication of this document. Please review these documents | |||
| and restrictions with respect to this document. Code Components | carefully, as they describe your rights and restrictions with respect | |||
| extracted from this document must include Revised BSD License text as | to this document. Code Components extracted from this document must | |||
| described in Section 4.e of the Trust Legal Provisions and are | include Revised BSD License text as described in Section 4.e of the | |||
| provided without warranty as described in the Revised BSD License. | Trust Legal Provisions and are provided without warranty as described | |||
| in the Revised BSD License. | ||||
| Table of Contents | Table of Contents | |||
| 1. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 2 | 1. Introduction | |||
| 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | 2. Terminology | |||
| 3. A Functional Architecture . . . . . . . . . . . . . . . . . . 7 | 3. A Functional Architecture | |||
| 3.1. Translating a Service Instance Configuration into an | 3.1. Translating a Service Instance Configuration into an | |||
| Assurance Graph . . . . . . . . . . . . . . . . . . . . . 10 | Assurance Graph | |||
| 3.1.1. Circular Dependencies . . . . . . . . . . . . . . . . 12 | 3.1.1. Circular Dependencies | |||
| 3.2. Intent and Assurance Graph . . . . . . . . . . . . . . . 16 | 3.2. Intent and Assurance Graph | |||
| 3.3. Subservices . . . . . . . . . . . . . . . . . . . . . . . 17 | 3.3. Subservices | |||
| 3.4. Building the Expression Graph from the Assurance Graph . 18 | 3.4. Building the Expression Graph from the Assurance Graph | |||
| 3.5. Open Interfaces with YANG Modules . . . . . . . . . . . . 19 | 3.5. Open Interfaces with YANG Modules | |||
| 3.6. Handling Maintenance Windows . . . . . . . . . . . . . . 20 | 3.6. Handling Maintenance Windows | |||
| 3.7. Flexible Functional Architecture . . . . . . . . . . . . 21 | 3.7. Flexible Functional Architecture | |||
| 3.8. Time window for symptoms history . . . . . . . . . . . . 23 | 3.8. Time Window for Symptoms' History | |||
| 3.9. New Assurance Graph Generation . . . . . . . . . . . . . 23 | 3.9. New Assurance Graph Generation | |||
| 4. Security Considerations . . . . . . . . . . . . . . . . . . . 24 | 4. IANA Considerations | |||
| 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 25 | 5. Security Considerations | |||
| 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 25 | 6. References | |||
| 7. References . . . . . . . . . . . . . . . . . . . . . . . . . 25 | 6.1. Normative References | |||
| 7.1. Normative References . . . . . . . . . . . . . . . . . . 25 | 6.2. Informative References | |||
| 7.2. Informative References . . . . . . . . . . . . . . . . . 25 | Acknowledgements | |||
| Appendix A. Changes between revisions . . . . . . . . . . . . . 27 | Contributors | |||
| Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 28 | Authors' Addresses | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 28 | ||||
| 1. Terminology | ||||
| SAIN agent: A functional component that communicates with a device, a | ||||
| set of devices, or another agent to build an expression graph from a | ||||
| received assurance graph and perform the corresponding computation of | ||||
| the health status and symptoms. A SAIN agent might be running | ||||
| directly on the device it monitors. | ||||
| Assurance case: "An assurance case is a structured argument, | ||||
| supported by evidence, intended to justify that a system is | ||||
| acceptably assured relative to a concern (such as safety or security) | ||||
| in the intended operating environment" [Piovesan2017]. | ||||
| Service instance: A specific instance of a service. | ||||
| Intent: "A set of operational goals (that a network should meet) and | ||||
| outcomes (that a network is supposed to deliver), defined in a | ||||
| declarative manner without specifying how to achieve or implement | ||||
| them" [RFC9315]. | ||||
| Subservice: Part or functionality of the network system that can be | ||||
| independently assured as a single entity in assurance graph. | ||||
| Assurance graph: A Directed Acyclic Graph (DAG) representing the | ||||
| assurance case for one or several service instances. The nodes (also | ||||
| known as vertices in the context of DAG) are the service instances | ||||
| themselves and the subservices, the edges indicate a dependency | ||||
| relation. | ||||
| SAIN collector: A functional component that fetches or receives the | ||||
| computer-consumable output of the SAIN agent(s) and process it | ||||
| locally (including displaying it in a user-friendly form). | ||||
| DAG: Directed Acyclic Graph. | ||||
| ECMP: Equal Cost Multiple Paths | ||||
| Expression graph: A generic term for a DAG representing a computation | ||||
| in SAIN. More specific terms are: | ||||
| * Subservice expressions: Is an expression graph representing all | ||||
| the computations to execute for a subservice. | ||||
| * Service expressions: Is an expression graph representing all the | ||||
| computations to execute for a service instance, i.e., including | ||||
| the computations for all dependent subservices. | ||||
| * Global computation graph: Is an expression graph representing all | ||||
| the computations to execute for all service instances (i.e., all | ||||
| computations performed). | ||||
| Dependency: The directed relationship between subservice instances in | ||||
| the assurance graph. | ||||
| Metric: A piece of information retrieved from the network running the | ||||
| assured service. | ||||
| Metric engine: A functional component, part of the SAIN agent, that | ||||
| maps metrics to a list of candidate metric implementations depending | ||||
| on the network element. | ||||
| Metric implementation: Actual way of retrieving a metric from a | ||||
| network element. | ||||
| Network service YANG module: describes the characteristics of a | ||||
| service as agreed upon with consumers of that service [RFC8199]. | ||||
| Service orchestrator: Quoting RFC8199, "Network Service YANG Modules | ||||
| describe the characteristics of a service, as agreed upon with | ||||
| consumers of that service. That is, a service module does not expose | ||||
| the detailed configuration parameters of all participating network | ||||
| elements and features but describes an abstract model that allows | ||||
| instances of the service to be decomposed into instance data | ||||
| according to the Network Element YANG Modules of the participating | ||||
| network elements. The service-to-element decomposition is a separate | ||||
| process; the details depend on how the network operator chooses to | ||||
| realize the service. For the purpose of this document, the term | ||||
| "orchestrator" is used to describe a system implementing such a | ||||
| process." | ||||
| SAIN orchestrator: A functional component that is in charge of | ||||
| fetching the configuration specific to each service instance and | ||||
| converting it into an assurance graph. | ||||
| Health status: Score and symptoms indicating whether a service | ||||
| instance or a subservice is "healthy". A non-maximal score must | ||||
| always be explained by one or more symptoms. | ||||
| Health score: Integer ranging from 0 to 100 indicating the health of | ||||
| a subservice. A score of 0 means that the subservice is broken, a | ||||
| score of 100 means that the subservice in question is operating as | ||||
| expected. The special value -1 can be used to specify that no value | ||||
| could be computed for that health-score, for instance if some metric | ||||
| needed for that computation could not be collected. | ||||
| Strongly connected component: subset of a directed graph such that | ||||
| there is a (directed) path from any node of the subset to any other | ||||
| node. A DAG does not contain any strongly connected component. | ||||
| Symptom: Reason explaining why a service instance or a subservice is | ||||
| not completely healthy. | ||||
| 2. Introduction | 1. Introduction | |||
| Network service YANG modules [RFC8199] describe the configuration, | Network Service YANG Modules [RFC8199] describe the configuration, | |||
| state data, operations, and notifications of abstract representations | state data, operations, and notifications of abstract representations | |||
| of services implemented on one or multiple network elements. | of services implemented on one or multiple network elements. | |||
| Service orchestrators use Network service YANG modules that will | Service orchestrators use Network Service YANG Modules that will | |||
| infer network-wide configuration and, therefore the invocation of the | infer network-wide configuration and, therefore, the invocation of | |||
| appropriate device modules (Section 3 of [RFC8969]). Knowing that a | the appropriate device modules (Section 3 of [RFC8969]). Knowing | |||
| configuration is applied doesn't imply that the provisioned service | that a configuration is applied doesn't imply that the provisioned | |||
| instance is up and running as expected. For instance, the service | service instance is up and running as expected. For instance, the | |||
| might be degraded because of a failure in the network, the service | service might be degraded because of a failure in the network, the | |||
| quality may be degraded, or a service function may be reachable at | service quality may be degraded, or a service function may be | |||
| the IP level but not provide its intended function. Thus, the | reachable at the IP level but not provide its intended function. | |||
| network operator must monitor the service's operational data at the | Thus, the network operator must monitor the service's operational | |||
| same time as the configuration (Section 3.3 of [RFC8969]). To feed | data at the same time as the configuration (Section 3.3 of | |||
| that task, the industry has been standardizing on telemetry to push | [RFC8969]). To fuel that task, the industry has been standardizing | |||
| network element performance information (e.g., | on telemetry to push network element performance information (e.g., | |||
| [I-D.ietf-opsawg-yang-vpn-service-pm]). | [RFC9375]). | |||
| A network administrator needs to monitor their network and services | A network administrator needs to monitor its network and services as | |||
| as a whole, independently of the management protocols. With | a whole, independently of the management protocols. With different | |||
| different protocols come different data models, and different ways to | protocols come different data models and different ways to model the | |||
| model the same type of information. When network administrators deal | same type of information. When network administrators deal with | |||
| with multiple management protocols, the network management entities | multiple management protocols, the network management entities have | |||
| have to perform the difficult and time-consuming job of mapping data | to perform the difficult and time-consuming job of mapping data | |||
| models: e.g., the model used for configuration with the model used | models, e.g., the model used for configuration with the model used | |||
| for monitoring when separate models or protocols are used. This | for monitoring when separate models or protocols are used. This | |||
| problem is compounded by a large, disparate set of data sources (MIB | problem is compounded by a large, disparate set of data sources | |||
| modules, YANG models [RFC7950], IPFIX information elements [RFC7011], | (e.g., MIB modules, YANG data models [RFC7950], IP Flow Information | |||
| syslog plain text [RFC5424], TACACS+ [RFC8907], RADIUS [RFC2865], | Export (IPFIX) information elements [RFC7011], syslog plain text | |||
| etc.). In order to avoid this data model mapping, the industry | [RFC5424], Terminal Access Controller Access-Control System Plus | |||
| converged on model-driven telemetry to stream the service operational | (TACACS+) [RFC8907], RADIUS [RFC2865], etc.). In order to avoid this | |||
| data, reusing the YANG models used for configuration. Model-driven | data model mapping, the industry converged on model-driven telemetry | |||
| telemetry greatly facilitates the notion of closed-loop automation | to stream the service operational data, reusing the YANG data models | |||
| whereby events and updated operational state streamed from the | used for configuration. Model-driven telemetry greatly facilitates | |||
| network drive remediation changes back into the network. | the notion of closed-loop automation, whereby events and updated | |||
| operational states streamed from the network drive remediation change | ||||
| back into the network. | ||||
| However, it proves difficult for network operators to correlate the | However, it proves difficult for network operators to correlate the | |||
| service degradation with the network root cause. For example, "Why | service degradation with the network root cause, for example, "Why | |||
| does my layer 3 virtual private network (L3VPN) fail to connect?" or | does my layer 3 virtual private network (L3VPN) fail to connect?" or | |||
| "Why is this specific service not highly responsive?". The reverse, | "Why is this specific service not highly responsive?" The reverse, | |||
| i.e., which services are impacted when a network component fails or | i.e., which services are impacted when a network component fails or | |||
| degrades, is also important for operators. For example, "Which | degrades, is also important for operators, for example, "Which | |||
| services are impacted when this specific optic decibel milliwatt | services are impacted when this specific optic decibel milliwatt | |||
| (dBm) begins to degrade?", "Which applications are impacted by an | (dBm) begins to degrade?", "Which applications are impacted by an | |||
| imbalance in this equal cost multiple paths (ECMP) bundle?", or "Is | imbalance in this Equal-Cost Multipath (ECMP) bundle?", or "Is that | |||
| that issue actually impacting any other customers?". This task | issue actually impacting any other customers?" This task usually | |||
| usually falls under the so-called "Service Impact Analysis" | falls under the so-called "Service Impact Analysis" functional block. | |||
| functional block. | ||||
| In this document, we propose an architecture implementing Service | This document defines an architecture implementing Service Assurance | |||
| Assurance for Intent-Based Networking (SAIN). Intent-based | for Intent-based Networking (SAIN). Intent-based approaches are | |||
| approaches are often declarative, starting from a statement of "The | often declarative, starting from a statement of "The service works as | |||
| service works as expected" and trying to enforce it. However, some | expected" and trying to enforce it. However, some already-defined | |||
| already defined services might have been designed using a different | services might have been designed using a different approach. | |||
| approach. Aligned with Section 3.3 of [RFC7149], and instead of | Aligned with Section 3.3 of [RFC7149], and instead of requiring a | |||
| requiring a declarative intent as a starting point, this architecture | declarative intent as a starting point, this architecture focuses on | |||
| focuses on already defined services and tries to infer the meaning of | already-defined services and tries to infer the meaning of "The | |||
| "The service works as expected". To do so, the architecture works | service works as expected". To do so, the architecture works from an | |||
| from an assurance graph, deduced from the configuration pushed to the | assurance graph, deduced from the configuration pushed to the device | |||
| device for enabling the service instance. If the SAIN orchestrator | for enabling the service instance. If the SAIN orchestrator supports | |||
| supports it, the service model (Section 2 of [RFC8309]) or the | it, the service model (Section 2 of [RFC8309]) or the network model | |||
| network model (Section 2.1 of [RFC8969]) can also be used to build | (Section 2.1 of [RFC8969]) can also be used to build the assurance | |||
| the assurance graph. In that case and if the service model includes | graph. In that case and if the service model includes the | |||
| the declarative intent as well, the SAIN orchestrator can rely on the | declarative intent as well, the SAIN orchestrator can rely on the | |||
| declared intent instead of inferring it. The assurance graph may | declared intent instead of inferring it. The assurance graph may | |||
| also be explicitly completed to add an intent not exposed in the | also be explicitly completed to add an intent not exposed in the | |||
| service model itself. | service model itself. | |||
| The assurance graph of a service instance is decomposed into | The assurance graph of a service instance is decomposed into | |||
| components, which are then assured independently. The top of the | components, which are then assured independently. The top of the | |||
| assurance graph represents the service instance to assure, and its | assurance graph represents the service instance to assure, and its | |||
| children represent components identified as its direct dependencies; | children represent components identified as its direct dependencies; | |||
| each component can have dependencies as well. Components involved in | each component can have dependencies as well. Components involved in | |||
| the assurance graph of a service are called subservices. The SAIN | the assurance graph of a service are called subservices. The SAIN | |||
| orchestrator updates automatically the assurance graph when the | orchestrator updates the assurance graph automatically when the | |||
| service instance is modified. | service instance is modified. | |||
| When a service is degraded, the SAIN architecture will highlight | When a service is degraded, the SAIN architecture will highlight | |||
| where in the assurance service graph to look, as opposed to going hop | where in the assurance service graph to look, as opposed to going hop | |||
| by hop to troubleshoot the issue. More precisely, the SAIN | by hop to troubleshoot the issue. More precisely, the SAIN | |||
| architecture will associate to each service instance a list of | architecture will associate to each service instance a list of | |||
| symptoms originating from specific subservices, corresponding to | symptoms originating from specific subservices, corresponding to | |||
| components of the network. These components are good candidates for | components of the network. These components are good candidates for | |||
| explaining the source of a service degradation. Not only can this | explaining the source of a service degradation. Not only can this | |||
| architecture help to correlate service degradation with network root | architecture help to correlate service degradation with network root | |||
| cause/symptoms, but it can deduce from the assurance graph the list | cause/symptoms, but it can deduce from the assurance graph the list | |||
| of service instances impacted by a component degradation/failure. | of service instances impacted by a component degradation/failure. | |||
| This added value informs the operational team where to focus its | This added value informs the operational team where to focus its | |||
| attention for maximum return. Indeed, the operational team is likely | attention for maximum return. Indeed, the operational team is likely | |||
| to focus their priority on the degrading/failing components impacting | to focus their priority on the degrading/failing components impacting | |||
| the highest number of their customers, especially the ones with the | the highest number of their customers, especially the ones with the | |||
| SLA contracts involving penalties in case of failure. | Service-Level Agreement (SLA) contracts involving penalties in case | |||
| of failure. | ||||
| This architecture provides the building blocks to assure both | This architecture provides the building blocks to assure both | |||
| physical and virtual entities and is flexible with respect to | physical and virtual entities and is flexible with respect to | |||
| services and subservices, of (distributed) graphs, and of components | services and subservices of (distributed) graphs and components | |||
| (Section 3.7). | (Section 3.7). | |||
| The architecture presented in this document is implemented by a set | The architecture presented in this document is implemented by a set | |||
| of YANG modules defined in a companion document | of YANG modules defined in a companion document [RFC9418]. These | |||
| [I-D.ietf-opsawg-service-assurance-yang]. These YANG modules | YANG modules properly define the interfaces between the various | |||
| properly define the interfaces between the various components of the | components of the architecture to foster interoperability. | |||
| architecture in order to foster interoperability. | ||||
| 2. Terminology | ||||
| SAIN agent: A functional component that communicates with a device, | ||||
| a set of devices, or another agent to build an expression graph | ||||
| from a received assurance graph and perform the corresponding | ||||
| computation of the health status and symptoms. A SAIN agent might | ||||
| be running directly on the device it monitors. | ||||
| Assurance case: "An assurance case is a structured argument, | ||||
| supported by evidence, intended to justify that a system is | ||||
| acceptably assured relative to a concern (such as safety or | ||||
| security) in the intended operating environment" [Piovesan2017]. | ||||
| Service instance: A specific instance of a service. | ||||
| Intent: "A set of operational goals (that a network should meet) and | ||||
| outcomes (that a network is supposed to deliver) defined in a | ||||
| declarative manner without specifying how to achieve or implement | ||||
| them" [RFC9315]. | ||||
| Subservice: A part or functionality of the network system that can | ||||
| be independently assured as a single entity in an assurance graph. | ||||
| Assurance graph: A Directed Acyclic Graph (DAG) representing the | ||||
| assurance case for one or several service instances. The nodes | ||||
| (also known as vertices in the context of DAG) are the service | ||||
| instances themselves and the subservices; the edges indicate a | ||||
| dependency relation. | ||||
| SAIN collector: A functional component that fetches or receives the | ||||
| computer-consumable output of the SAIN agent(s) and processes it | ||||
| locally (including displaying it in a user-friendly form). | ||||
| DAG: Directed Acyclic Graph. | ||||
| ECMP: Equal-Cost Multipath. | ||||
| Expression graph: A generic term for a DAG representing a | ||||
| computation in SAIN. More specific terms are listed below: | ||||
| Subservice expressions: | ||||
| An expression graph representing all the computations to | ||||
| execute for a subservice. | ||||
| Service expressions: | ||||
| An expression graph representing all the computations to | ||||
| execute for a service instance, i.e., including the | ||||
| computations for all dependent subservices. | ||||
| Global computation graph: | ||||
| An expression graph representing all the computations to | ||||
| execute for all service instances (i.e., all computations | ||||
| performed). | ||||
| Dependency: The directed relationship between subservice instances | ||||
| in the assurance graph. | ||||
| Metric: A piece of information retrieved from the network running | ||||
| the assured service. | ||||
| Metric engine: A functional component, part of the SAIN agent, that | ||||
| maps metrics to a list of candidate metric implementations, | ||||
| depending on the network element. | ||||
| Metric implementation: The actual way of retrieving a metric from a | ||||
| network element. | ||||
| Network Service YANG Module: The characteristics of a service, as | ||||
| agreed upon with consumers of that service [RFC8199]. | ||||
| Service orchestrator: "Network Service YANG Modules describe the | ||||
| characteristics of a service, as agreed upon with consumers of | ||||
| that service. That is, a service module does not expose the | ||||
| detailed configuration parameters of all participating network | ||||
| elements and features but describes an abstract model that allows | ||||
| instances of the service to be decomposed into instance data | ||||
| according to the Network Element YANG Modules of the participating | ||||
| network elements. The service-to-element decomposition is a | ||||
| separate process; the details depend on how the network operator | ||||
| chooses to realize the service. For the purpose of this document, | ||||
| the term "orchestrator" is used to describe a system implementing | ||||
| such a process" [RFC8199]. | ||||
| SAIN orchestrator: A functional component that is in charge of | ||||
| fetching the configuration specific to each service instance and | ||||
| converting it into an assurance graph. | ||||
| Health status: The score and symptoms indicating whether a service | ||||
| instance or a subservice is "healthy". A non-maximal score must | ||||
| always be explained by one or more symptoms. | ||||
| Health score: An integer ranging from 0 to 100 that indicates the | ||||
| health of a subservice. A score of 0 means that the subservice is | ||||
| broken, a score of 100 means that the subservice in question is | ||||
| operating as expected, and the special value -1 can be used to | ||||
| specify that no value could be computed for that health score, for | ||||
| instance, if some metric needed for that computation could not be | ||||
| collected. | ||||
| Strongly connected component: A subset of a directed graph such that | ||||
| there is a (directed) path from any node of the subset to any | ||||
| other node. A DAG does not contain any strongly connected | ||||
| component. | ||||
| Symptom: A reason explaining why a service instance or a subservice | ||||
| is not completely healthy. | ||||
| 3. A Functional Architecture | 3. A Functional Architecture | |||
| The goal of SAIN is to assure that service instances are operating as | The goal of SAIN is to assure that service instances are operating as | |||
| expected (i.e., the observed service is matching the expected | expected (i.e., the observed service is matching the expected | |||
| service) and if not, to pinpoint what is wrong. More precisely, SAIN | service) and, if not, to pinpoint what is wrong. More precisely, | |||
| computes a score for each service instance and outputs symptoms | SAIN computes a score for each service instance and outputs symptoms | |||
| explaining that score. The only valid situation where no symptoms | explaining that score. The only valid situation where no symptoms | |||
| are returned is when the score is maximal, indicating that no issues | are returned is when the score is maximal, indicating that no issues | |||
| were detected for that service instance. The score augmented with | were detected for that service instance. The score augmented with | |||
| the symptoms is called the health status. The exact meaning of the | the symptoms is called the health status. The exact meaning of the | |||
| health score value is out of scope of this document. However the | health score value is out of scope of this document. However, the | |||
| following constraints should be followed: the higher the score, the | following constraints should be followed: the higher the score, the | |||
| better the service health is; the two extrema being 0 meaning the | better the service health is and the two extrema are 0 meaning the | |||
| service is completely broken and 100 meaning the service is | service is completely broken, and 100 meaning the service is | |||
| completely operational. | completely operational. | |||
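
As an illustration only, the constraints on the health status stated above can be captured in a small data type. This is a minimal sketch, not part of the architecture or of the companion YANG module: the names HealthStatus, MIN_SCORE, and MAX_SCORE are hypothetical, and the case where no score can be computed at all is deliberately left out.

```python
from dataclasses import dataclass, field

MIN_SCORE = 0    # service is completely broken
MAX_SCORE = 100  # service is completely operational


@dataclass
class HealthStatus:
    """Health score augmented with the symptoms explaining it (hypothetical type)."""

    score: int
    symptoms: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The score must lie between the two extrema.
        if not MIN_SCORE <= self.score <= MAX_SCORE:
            raise ValueError(f"score {self.score} is outside [{MIN_SCORE}, {MAX_SCORE}]")
        # No symptoms is only valid when the score is maximal.
        if self.score < MAX_SCORE and not self.symptoms:
            raise ValueError("a non-maximal score must be explained by at least one symptom")


healthy = HealthStatus(MAX_SCORE)
degraded = HealthStatus(60, ["Interface has high error rate"])
```

Rejecting a degraded score that carries no symptom at construction time is one possible design choice; an implementation could equally report such a status as an error of the assurance system itself.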
| The SAIN architecture is a generic architecture, which generates an | The SAIN architecture is a generic architecture, which generates an | |||
| assurance graph from service instance(s), as specified in | assurance graph from service instance(s), as specified in | |||
| Section 3.1). This architecture is applicable to multiple | Section 3.1. This architecture is applicable to not only multiple | |||
| environments (e.g. wireline, wireless), but also different domains | environments (e.g., wireline and wireless) but also different domains | |||
| (e.g. 5G network function virtualization (NFV) domain with a virtual | (e.g., 5G network function virtualization (NFV) domain with a virtual | |||
| infrastructure manager (VIM), etc.), and as already noted, for | infrastructure manager (VIM), etc.) and, as already noted, for | |||
| physical or virtual devices, as well as virtual functions. Thanks to | physical or virtual devices, as well as virtual functions. Thanks to | |||
| the distributed graph design principle, graphs from different | the distributed graph design principle, graphs from different | |||
| environments/orchestrator can be combined to obtain the graph of a | environments and orchestrators can be combined to obtain the graph of | |||
| service instance that spans over multiple domains. | a service instance that spans over multiple domains. | |||
| As an example of a service, let us consider a point-to-point level 2 | As an example of a service, let us consider a point-to-point layer 2 | |||
| virtual private network (L2VPN). [RFC8466] specifies the parameters | virtual private network (L2VPN). [RFC8466] specifies the parameters | |||
| for such a service. Examples of symptoms might be symptoms reported | for such a service. Examples of symptoms might be symptoms reported | |||
| by specific subservices "Interface has high error rate" or "Interface | by specific subservices, including "Interface has high error rate", | |||
| flapping", or "Device almost out of memory" as well as symptoms more | "Interface flapping", or "Device almost out of memory", as well as | |||
| specific to the service such as "Site disconnected from VPN". | symptoms more specific to the service (such as "Site disconnected | |||
| | from VPN"). | |||
| To compute the health status of an instance of such a service, the | To compute the health status of an instance of such a service, the | |||
| service definition is decomposed into an assurance graph formed by | service definition is decomposed into an assurance graph formed by | |||
| subservices linked through dependencies. Each subservice is then | subservices linked through dependencies. Each subservice is then | |||
| turned into an expression graph that details how to fetch metrics | turned into an expression graph that details how to fetch metrics | |||
| from the devices and compute the health status of the subservice. | from the devices and compute the health status of the subservice. | |||
| The subservice expressions are combined according to the dependencies | The subservice expressions are combined according to the dependencies | |||
| between the subservices in order to obtain the expression graph which | between the subservices in order to obtain the expression graph that | |||
| computes the health status of the service instance. | computes the health status of the service instance. | |||
| The overall SAIN architecture is presented in Figure 1. Based on the | The overall SAIN architecture is presented in Figure 1. Based on the | |||
| service configuration provided by the service orchestrator, the SAIN | service configuration provided by the service orchestrator, the SAIN | |||
| orchestrator decomposes the assurance graph. It then sends to the | orchestrator decomposes the assurance graph. It then sends to the | |||
| SAIN agents the assurance graph along with some other configuration | SAIN agents the assurance graph along with some other configuration | |||
| options. The SAIN agents are responsible for building the expression | options. The SAIN agents are responsible for building the expression | |||
| graph and computing the health statuses in a distributed manner. The | graph and computing the health statuses in a distributed manner. The | |||
| collector is in charge of collecting and displaying the current | collector is in charge of collecting and displaying the current | |||
| inferred health status of the service instances and subservices. The | inferred health status of the service instances and subservices. The | |||
| collector also detects changes in the assurance graph structures, for | collector also detects changes in the assurance graph structures | |||
| instance when a switchover from primary to backup path occurs, and | (e.g., an occurrence of a switchover from primary to backup path) and | |||
| forwards to the orchestrator, which reconfigures the agents. | forwards the information to the orchestrator, which reconfigures the | |||
| Finally, the automation loop is closed by having the SAIN collector | agents. Finally, the automation loop is closed by having the SAIN | |||
| providing feedback to the network/service orchestrator. | collector provide feedback to the network/service orchestrator. | |||
| In order to make agents, orchestrators and collectors from different | In order to make agents, orchestrators, and collectors from different | |||
| vendors interoperable, their interface is defined as a YANG model in | vendors interoperable, their interface is defined as a YANG module in | |||
| a companion document [I-D.ietf-opsawg-service-assurance-yang]. In | a companion document [RFC9418]. In Figure 1, the communications that | |||
| Figure 1, the communications that are normalized by this YANG model | are normalized by this YANG module are tagged with a "Y". The use of | |||
| are tagged with a "Y". The use of this YANG model is further | this YANG module is further explained in Section 3.5. | |||
| explained in Section 3.5. | ||||
| +-----------------+ | +-----------------+ | |||
| | Service | | | Service | | |||
| | Orchestrator |<----------------------+ | | Orchestrator |<----------------------+ | |||
| | | | | | | | | |||
| +-----------------+ | | +-----------------+ | | |||
| | ^ | | | ^ | | |||
| | | Network | | | | Network | | |||
| | | Service | Feedback | | | Service | Feedback | |||
| | | Instance | Loop | | | Instance | Loop | |||
| | | Configuration | | | | Configuration | | |||
| | | | | | | | | |||
| | V | | | V | | |||
| | +-----------------+ Graph +-------------------+ | | +-----------------+ Graph +-------------------+ | |||
| | | SAIN | updates | SAIN | | | | SAIN | Updates | SAIN | | |||
| | | Orchestrator |<--------| Collector | | | | Orchestrator |<--------| Collector | | |||
| | +-----------------+ +-------------------+ | | +-----------------+ +-------------------+ | |||
| | | ^ | | | ^ | |||
| | Y| Configuration | Health Status | | Y| Configuration | Health Status | |||
| | | (assurance graph) Y| (Score + Symptoms) | | | (Assurance Graph) Y| (Score + Symptoms) | |||
| | V | Streamed | | V | Streamed | |||
| | +-------------------+ | via Telemetry | | +-------------------+ | via Telemetry | |||
| | |+-------------------+ | | | |+-------------------+ | | |||
| | ||+-------------------+ | | | ||+-------------------+ | | |||
| | +|| SAIN |-----------+ | | +|| SAIN |-----------+ | |||
| | +| agent | | | +| Agent | | |||
| | +-------------------+ | | +-------------------+ | |||
| | ^ ^ ^ | | ^ ^ ^ | |||
| | | | | | | | | | | |||
| | | | | Metric Collection | | | | | Metric Collection | |||
| V V V V | V V V V | |||
| +-------------------------------------------------------------+ | +-------------------------------------------------------------+ | |||
| | (Network) System | | | (Network) System | | |||
| | | | | | | |||
| +-------------------------------------------------------------+ | +-------------------------------------------------------------+ | |||
| skipping to change at page 10, line 5 ¶ | skipping to change at line 407 ¶ | |||
| In order to produce the score assigned to a service instance, the | In order to produce the score assigned to a service instance, the | |||
| various involved components perform the following tasks: | various involved components perform the following tasks: | |||
| * Analyze the configuration pushed to the network device(s) for | * Analyze the configuration pushed to the network device(s) for | |||
| configuring the service instance. From there, determine which | configuring the service instance. From there, determine which | |||
| information (called a metric) must be collected from the device(s) | information (called a metric) must be collected from the device(s) | |||
| and which operations to apply to the metrics to compute the health | and which operations to apply to the metrics to compute the health | |||
| status. | status. | |||
| * Stream (via telemetry [RFC8641]) operational and config metric | * Stream (via telemetry, such as YANG-Push [RFC8641]) operational | |||
| values when possible, else continuously poll. | and config metric values when possible, else continuously poll. | |||
| * Continuously compute the health status of the service instances, | * Continuously compute the health status of the service instances | |||
| based on the metric values. | based on the metric values. | |||
| The SAIN architecture requires time synchronization, with Network | The SAIN architecture requires time synchronization, with the Network | |||
| Time Protocol (NTP) [RFC5905] as a candidate, between all elements: | Time Protocol (NTP) [RFC5905] as a candidate, between all elements: | |||
| monitored entities, SAIN agents, Service orchestrator, the SAIN | monitored entities, SAIN agents, service orchestrator, the SAIN | |||
| collector, as well as the SAIN orchestrator. This guarantees the | collector, as well as the SAIN orchestrator. This guarantees the | |||
| correlations of all symptoms in the system, correlated with the right | correlations of all symptoms in the system, correlated with the right | |||
| assurance graph version. | assurance graph version. | |||
| 3.1. Translating a Service Instance Configuration into an Assurance | 3.1. Translating a Service Instance Configuration into an Assurance | |||
| Graph | Graph | |||
| In order to structure the assurance of a service instance, the SAIN | In order to structure the assurance of a service instance, the SAIN | |||
| orchestrator decomposes the service instance into so-called | orchestrator decomposes the service instance into so-called | |||
| subservice instances. Each subservice instance focuses on a specific | subservice instances. Each subservice instance focuses on a specific | |||
| feature or subpart of the service. | feature or subpart of the service. | |||
| The decomposition into subservices is an important function of the | The decomposition into subservices is an important function of the | |||
| architecture, for the following reasons: | architecture for the following reasons: | |||
| * The result of this decomposition provides a relational picture of | * The result of this decomposition provides a relational picture of | |||
| a service instance, that can be represented as a graph (called | a service instance, which can be represented as a graph (called an | |||
| assurance graph) to the operator. | assurance graph) to the operator. | |||
| * Subservices provide a scope for particular expertise and thereby | * Subservices provide a scope for particular expertise and thereby | |||
| enable contribution from external experts. For instance, the | enable contribution from external experts. For instance, the | |||
| subservice dealing with the optics health should be reviewed and | subservice dealing with the optic's health should be reviewed and | |||
| extended by an expert in optical interfaces. | extended by an expert in optical interfaces. | |||
| * Subservices that are common to several service instances are | * Subservices that are common to several service instances are | |||
| reused for reducing the amount of computation needed. For | reused for reducing the amount of computation needed. For | |||
| instance, the subservice assuring a given interface is reused by | instance, the subservice assuring a given interface is reused by | |||
| any service instance relying on that interface. | any service instance relying on that interface. | |||
| The assurance graph of a service instance is a DAG representing the | The assurance graph of a service instance is a DAG representing the | |||
| structure of the assurance case for the service instance. The nodes | structure of the assurance case for the service instance. The nodes | |||
| of this graph are service instances or subservice instances. Each | of this graph are service instances or subservice instances. Each | |||
| edge of this graph indicates a dependency between the two nodes at | edge of this graph indicates a dependency between the two nodes at | |||
| its extremities: the service or subservice at the source of the edge | its extremities, i.e., the service or subservice at the source of the | |||
| depends on the service or subservice at the destination of the edge. | edge depends on the service or subservice at the destination of the | |||
| | edge. | |||
| Figure 2 depicts a simplistic example of the assurance graph for a | Figure 2 depicts a simplistic example of the assurance graph for a | |||
| tunnel service. The node at the top is the service instance, the | tunnel service. The node at the top is the service instance; the | |||
| nodes below are its dependencies. In the example, the tunnel service | nodes below are its dependencies. In the example, the tunnel service | |||
| instance depends on the "peer1" and "peer2" tunnel interfaces (the | instance depends on the "peer1" and "peer2" tunnel interfaces (the | |||
| tunnel interfaces created on the peer1 and peer2 devices, | tunnel interfaces created on the peer1 and peer2 devices, | |||
| respectively), which in turn depend on the respective physical | respectively), which in turn depend on the respective physical | |||
| interfaces, which finally depend on the respective "peer1" and | interfaces, which finally depend on the respective "peer1" and | |||
| "peer2" devices. The tunnel service instance also depends on the IP | "peer2" devices. The tunnel service instance also depends on the IP | |||
| connectivity that depends on the IS-IS routing protocol. | connectivity that depends on the IS-IS routing protocol. | |||
| +------------------+ | +------------------+ | |||
| | Tunnel | | | Tunnel | | |||
| skipping to change at page 12, line 7 ¶ | skipping to change at line 497 ¶ | |||
| +-------------+ +-------------+ | +-------------+ +-------------+ | |||
| | | | | | | | | | | |||
| | Peer1 | | Peer2 | | | Peer1 | | Peer2 | | |||
| | Device | | Device | | | Device | | Device | | |||
| +-------------+ +-------------+ | +-------------+ +-------------+ | |||
| Figure 2: Assurance Graph Example | Figure 2: Assurance Graph Example | |||
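
The tunnel example can be written down as a plain adjacency mapping, which also makes the two questions from the Abstract easy to answer: what a service depends on, and what is impacted by a failing component. This is a sketch under assumptions: the node identifiers are hypothetical labels derived from the prose description of Figure 2, and the helper functions are illustrative, not part of SAIN.

```python
# Each key depends on every node in its list (source -> destinations).
ASSURANCE_GRAPH = {
    "tunnel-service": [
        "peer1-tunnel-interface",
        "peer2-tunnel-interface",
        "ip-connectivity",
    ],
    "peer1-tunnel-interface": ["peer1-physical-interface"],
    "peer2-tunnel-interface": ["peer2-physical-interface"],
    "peer1-physical-interface": ["peer1-device"],
    "peer2-physical-interface": ["peer2-device"],
    "ip-connectivity": ["is-is-routing-protocol"],
    "peer1-device": [],
    "peer2-device": [],
    "is-is-routing-protocol": [],
}


def all_dependencies(graph, node):
    """Return every node reachable from `node`, i.e., its direct and indirect dependencies."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


def impacted_by(graph, failed):
    """Return every node that directly or indirectly depends on `failed`."""
    return {node for node in graph if failed in all_dependencies(graph, node)}
```

For instance, impacted_by(ASSURANCE_GRAPH, "peer1-device") returns the peer1 physical interface, the peer1 tunnel interface, and the tunnel service.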
| Depicting the assurance graph helps the operator to understand (and | Depicting the assurance graph helps the operator to understand (and | |||
| assert) the decomposition. The assurance graph shall be maintained | assert) the decomposition. The assurance graph shall be maintained | |||
| during normal operation with addition, modification and removal of | during normal operation with addition, modification, and removal of | |||
| service instances. A change in the network configuration or topology | service instances. A change in the network configuration or topology | |||
| shall automatically be reflected in the assurance graph. As a first | shall automatically be reflected in the assurance graph. As a first | |||
| example, a change of routing protocol from IS-IS to OSPF would change | example, a change of the routing protocol from IS-IS to OSPF would | |||
| the assurance graph accordingly. As a second example, assuming that | change the assurance graph accordingly. As a second example, assume | |||
| ECMP is in place for the source router for that specific tunnel; in | that the ECMP is in place for the source router for that specific | |||
| that case, multiple interfaces must now be monitored, on top of the | tunnel; in that case, multiple interfaces must now be monitored, in | |||
| monitoring the ECMP health itself. | addition to monitoring the ECMP health itself. | |||
| 3.1.1. Circular Dependencies | 3.1.1. Circular Dependencies | |||
| The edges of the assurance graph represent dependencies. An | The edges of the assurance graph represent dependencies. An | |||
| assurance graph is a DAG if and only if there are no circular | assurance graph is a DAG if and only if there are no circular | |||
| dependencies among the subservices, and every assurance graph should | dependencies among the subservices, and every assurance graph should | |||
| avoid circular dependencies. However, in some cases, circular | avoid circular dependencies. However, in some cases, circular | |||
| dependencies might appear in the assurance graph. | dependencies might appear in the assurance graph. | |||
| First, the assurance graph of a whole system is obtained by combining | First, the assurance graph of a whole system is obtained by combining | |||
| the assurance graph of every service running on that system. Here | the assurance graph of every service running on that system. Here, | |||
| combining means that two subservices having the same type and the | combining means that two subservices having the same type and the | |||
| same parameters are in fact the same subservice and thus a single | same parameters are in fact the same subservice and thus a single | |||
| node in the graph. For instance, the subservice of type "device" | node in the graph. For instance, the subservice of type "device" | |||
| with the only parameter (the device ID) set to "PE1" will appear only | with the only parameter (the device ID) set to "PE1" will appear only | |||
| once in the whole assurance graph even if several service instances | once in the whole assurance graph, even if several service instances | |||
| rely on that device. Now, if two engineers design assurance graphs | rely on that device. Now, if two engineers design assurance graphs | |||
| for two different services, and engineer A decides that an interface | for two different services, and Engineer A decides that an interface | |||
| depends on the link it is connected to, but engineer B decides that | depends on the link it is connected to, but Engineer B decides that | |||
| the link depends on the interface it is connected to, then when | the link depends on the interface it is connected to, then when | |||
| combining the two assurance graphs, we will have a circular | combining the two assurance graphs, we will have a circular | |||
| dependency interface -> link -> interface. | dependency interface -> link -> interface. | |||
| Another case possibly resulting in circular dependencies is when | Another case possibly resulting in circular dependencies is when | |||
| subservices are not properly identified. Assume that we want to | subservices are not properly identified. Assume that we want to | |||
| assure a cloud-based computing cluster that runs containers. We | assure a cloud-based computing cluster that runs containers. We | |||
| could represent the cluster by a subservice and the network service | could represent the cluster by a subservice and the network service | |||
| connecting containers on the cluster by another subservice. We will | connecting containers on the cluster by another subservice. We would | |||
| likely model that the network service depends on the cluster, because | likely model that as the network service depending on the cluster, | |||
| the network service runs in a container supported by the cluster. | because the network service runs in a container supported by the | |||
| Conversely, the cluster depends on the network service for | cluster. Conversely, the cluster depends on the network service for | |||
| connectivity between containers, which creates a circular dependency. | connectivity between containers, which creates a circular dependency. | |||
| A finer decomposition might distinguish between the resources for | A finer decomposition might distinguish between the resources for | |||
| executing containers (a part of our cluster subservice) and the | executing containers (a part of our cluster subservice) and the | |||
| communication between the containers (which could be modelled in the | communication between the containers (which could be modeled in the | |||
| same way as communication between routers). | same way as communication between routers). | |||
| In any case, it is likely that circular dependencies will show up in | In any case, it is likely that circular dependencies will show up in | |||
| the assurance graph. A first step would be to detect circular | the assurance graph. A first step would be to detect circular | |||
| dependencies as soon as possible in the SAIN architecture. Such a | dependencies as soon as possible in the SAIN architecture. Such a | |||
| detection could be carried out by the SAIN orchestrator. Whenever a | detection could be carried out by the SAIN orchestrator. Whenever a | |||
| circular dependency is detected, the newly added service would not be | circular dependency is detected, the newly added service would not be | |||
| monitored until more careful modelling or alignment between the | monitored until more careful modeling or alignment between the | |||
| different teams (engineer A and B) remove the circular dependency. | different teams (Engineers A and B) remove the circular dependency. | |||
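
The combination and detection steps described above could be sketched as follows. The sketch assumes that a subservice is identified by a (type, parameter) tuple, so that identical subservices from different graphs collapse into a single node. The function names and the example identifiers ("PE1:eth0", "PE1-PE2") are hypothetical.

```python
def merge_graphs(*graphs):
    """Combine assurance graphs; nodes with the same identity become one node."""
    merged = {}
    for graph in graphs:
        for node, deps in graph.items():
            merged.setdefault(node, set()).update(deps)
            for dep in deps:
                merged.setdefault(dep, set())
    return merged


def find_cycle(graph):
    """Return one circular dependency as a list of nodes, or None if the graph is a DAG."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for dep in sorted(graph.get(node, ())):
            state = color.get(dep, WHITE)
            if state == GREY:
                # `dep` is already on the current path: a cycle is closed.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(graph):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


interface = ("interface", "PE1:eth0")
link = ("link", "PE1-PE2")
graph_a = {interface: {link}}  # Engineer A: the interface depends on the link
graph_b = {link: {interface}}  # Engineer B: the link depends on the interface
cycle = find_cycle(merge_graphs(graph_a, graph_b))
```

Each graph is acyclic on its own; only the merged graph yields the cycle interface -> link -> interface, which is why the check is placed after the combination step.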
| As more elaborate solution we could consider a graph transformation: | As a more elaborate solution, we could consider a graph | |||
| | transformation: | |||
| * Decompose the graph into strongly connected components. | * Decompose the graph into strongly connected components. | |||
| * For each strongly connected component: | * For each strongly connected component: | |||
| - Remove all edges between nodes of the strongly connected | - remove all edges between nodes of the strongly connected | |||
| component | component; | |||
| - Add a new "synthetic" node for the strongly connected component | - add a new "synthetic" node for the strongly connected | |||
| | component; | |||
| - For each edge pointing to a node in the strongly connected | - for each edge pointing to a node in the strongly connected | |||
| component, change the destination to the "synthetic" node | component, change the destination to the "synthetic" node; and | |||
| - Add a dependency from the "synthetic" node to every node in the | - add a dependency from the "synthetic" node to every node in the | |||
| strongly connected component. | strongly connected component. | |||
| Such an algorithm would include all symptoms detected by any | Such an algorithm would include all symptoms detected by any | |||
| subservice in one of the strongly component and make it available to | subservice in one of the strongly connected components and make it | |||
| any subservice that depends on it. Figure 3 shows an example of such | available to any subservice that depends on it. Figure 3 shows an | |||
| a transformation. On the left-hand side, the nodes c, d, e and f | example of such a transformation. On the left-hand side, the nodes | |||
| form a strongly connected component. The status of node a should | c, d, e, and f form a strongly connected component. The status of | |||
| depend on the status of nodes c, d, e, f, g, and h, but this is hard | node a should depend on the status of nodes c, d, e, f, g, and h, but | |||
| to compute because of the circular dependency. On the right hand- | this is hard to compute because of the circular dependency. On the | |||
| side, a depends on all these nodes as well, but there the circular | right-hand side, node a depends on all these nodes as well, but the | |||
| dependency has been removed. | circular dependency has been removed. | |||
| +---+ +---+ | +---+ +---+ | +---+ +---+ | +---+ +---+ | |||
| | a | | b | | | a | | b | | | a | | b | | | a | | b | | |||
| +---+ +---+ | +---+ +---+ | +---+ +---+ | +---+ +---+ | |||
| | | | | | | | | | | | | |||
| v v | v v | v v | v v | |||
| +---+ +---+ | +------------+ | +---+ +---+ | +------------+ | |||
| | c |--->| d | | | synthetic | | | c |--->| d | | | synthetic | | |||
| +---+ +---+ | +------------+ | +---+ +---+ | +------------+ | |||
| ^ | | / | | \ | ^ | | / | | \ | |||
| skipping to change at page 14, line 28 ¶ | skipping to change at line 602 ¶ | |||
| +---+ +---+ | +---+ +---+ +---+ +---+ | +---+ +---+ | +---+ +---+ +---+ +---+ | |||
| | | | | | | | | | | | | |||
| v v | v v | v v | v v | |||
| +---+ +---+ | +---+ +---+ | +---+ +---+ | +---+ +---+ | |||
| | g | | h | | | g | | h | | | g | | h | | | g | | h | | |||
| +---+ +---+ | +---+ +---+ | +---+ +---+ | +---+ +---+ | |||
| Before After | Before After | |||
| Transformation Transformation | Transformation Transformation | |||
| Figure 3: Graph transformation | Figure 3: Graph Transformation | |||
| We consider a concrete example to illustrate this transformation. | We consider a concrete example to illustrate this transformation. | |||
| Let's assume that Engineer A is building an assurance graph dealing | Let's assume that Engineer A is building an assurance graph dealing | |||
| with IS-IS and Engineer B is building an assurance graph dealing with | with IS-IS and Engineer B is building an assurance graph dealing with | |||
| OSPF. The graph from Engineer A could contain the following: | OSPF. The graph from Engineer A could contain the following: | |||
| +------------+ | +------------+ | |||
| | IS-IS Link | | | IS-IS Link | | |||
| +------------+ | +------------+ | |||
| | | | | |||
| v | v | |||
| +------------+ | +------------+ | |||
| | Phys. Link | | | Phys. Link | | |||
| +------------+ | +------------+ | |||
| | | | | | | |||
| v v | v v | |||
| +-------------+ +-------------+ | +-------------+ +-------------+ | |||
| | Interface 1 | | Interface 2 | | | Interface 1 | | Interface 2 | | |||
| +-------------+ +-------------+ | +-------------+ +-------------+ | |||
| Figure 4: Fragment of assurance graph from Engineer A | Figure 4: Fragment of the Assurance Graph from Engineer A | |||
| The graph from Engineer B could contain the following: | The graph from Engineer B could contain the following: | |||
| +------------+ | +------------+ | |||
| | OSPF Link | | | OSPF Link | | |||
| +------------+ | +------------+ | |||
| | | | | | | | | |||
| v | v | v | v | |||
| +-------------+ | +-------------+ | +-------------+ | +-------------+ | |||
| | Interface 1 | | | Interface 2 | | | Interface 1 | | | Interface 2 | | |||
| +-------------+ | +-------------+ | +-------------+ | +-------------+ | |||
| | | | | | | | | |||
| v v v | v v v | |||
| +------------+ | +------------+ | |||
| | Phys. Link | | | Phys. Link | | |||
| +------------+ | +------------+ | |||
| Figure 5: Fragment of assurance graph from Engineer B | Figure 5: Fragment of the Assurance Graph from Engineer B | |||
| Each Interface subservice and the Physical Link subservice are common | The Interface subservices and the Physical Link subservice are common | |||
| to both fragments above. Each of these subservice appears only once | to both fragments above. Each of these subservices appear only once | |||
| in the graph merging the two fragments. Dependencies from both | in the graph merging the two fragments. Dependencies from both | |||
| fragments are included in the merged graph, resulting in a circular | fragments are included in the merged graph, resulting in a circular | |||
| dependency: | dependency: | |||
| +------------+ +------------+ | +------------+ +------------+ | |||
| | IS-IS Link | | OSPF Link |---+ | | IS-IS Link | | OSPF Link |---+ | |||
| +------------+ +------------+ | | +------------+ +------------+ | | |||
| | | | | | | | | | | |||
| | +-------- + | | | | +-------- + | | | |||
| v v | | | v v | | | |||
| skipping to change at page 15, line 46 ¶ | skipping to change at line 668 ¶ | |||
| | ^ | | | | | | ^ | | | | | |||
| | | +-------+ | | | | | | +-------+ | | | | |||
| v | v | v | | v | v | v | | |||
| +-------------+ +-------------+ | | +-------------+ +-------------+ | | |||
| | Interface 1 | | Interface 2 | | | | Interface 1 | | Interface 2 | | | |||
| +-------------+ +-------------+ | | +-------------+ +-------------+ | | |||
| ^ | | ^ | | |||
| | | | | | | |||
| +------------------------------+ | +------------------------------+ | |||
| Figure 6: Merging graphs from A and B | Figure 6: Merging Graphs from Engineers A and B | |||
| The solution presented above would result in graph looking as | The solution presented above would result in a graph looking as | |||
| follows, where a new "synthetic" node is included. Using that | follows, where a new "synthetic" node is included. Using that | |||
| transformation, all dependencies are indirectly satisfied for the | transformation, all dependencies are indirectly satisfied for the | |||
| nodes outside the circular dependency, in the sense that both IS-IS | nodes outside the circular dependency, in the sense that both IS-IS | |||
| and OSPF links have indirect dependencies to the two interfaces and | and OSPF links have indirect dependencies to the two interfaces and | |||
| the link. However, the dependencies between the link and the | the link. However, the dependencies between the link and the | |||
| interfaces are lost as they were causing the circular dependency. | interfaces are lost since they were causing the circular dependency. | |||
| +------------+ +------------+ | +------------+ +------------+ | |||
| | IS-IS Link | | OSPF Link | | | IS-IS Link | | OSPF Link | | |||
| +------------+ +------------+ | +------------+ +------------+ | |||
| | | | | | | |||
| v v | v v | |||
| +------------+ | +------------+ | |||
| | synthetic | | | synthetic | | |||
| +------------+ | +------------+ | |||
| | | | | |||
| +-----------+-------------+ | +-----------+-------------+ | |||
| | | | | | | | | |||
| v v v | v v v | |||
| +-------------+ +------------+ +-------------+ | +-------------+ +------------+ +-------------+ | |||
| | Interface 1 | | Phys. Link | | Interface 2 | | | Interface 1 | | Phys. Link | | Interface 2 | | |||
| +-------------+ +------------+ +-------------+ | +-------------+ +------------+ +-------------+ | |||
| Figure 7: Removing circular dependencies after merging graphs | Figure 7: Removing Circular Dependencies after Merging Graphs | |||
| from A and B | from Engineers A and B | |||
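
The graph transformation listed earlier in this section can be sketched in a few lines. This is an illustrative sketch, not a normative algorithm. It uses Tarjan's algorithm to find the strongly connected components, which is an assumption since the text does not prescribe one. It only transforms components with more than one node, and the naming of the synthetic nodes ("synthetic-0", ...) is hypothetical. Every node is assumed to appear as a key of the mapping.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; returns a list of node sets."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for dep in graph[node]:
            if dep not in index:
                visit(dep)
                lowlink[node] = min(lowlink[node], lowlink[dep])
            elif dep in on_stack:
                lowlink[node] = min(lowlink[node], index[dep])
        if lowlink[node] == index[node]:
            # `node` is the root of a component: pop its members off the stack.
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components


def remove_circular_dependencies(graph):
    """Replace each multi-node strongly connected component by a synthetic node."""
    result = {node: set(deps) for node, deps in graph.items()}
    cyclic = [c for c in strongly_connected_components(graph) if len(c) > 1]
    for number, component in enumerate(cyclic):
        synthetic = f"synthetic-{number}"
        # Remove all edges between nodes of the component.
        for node in component:
            result[node] -= component
        # Redirect edges pointing into the component to the synthetic node.
        for node in list(result):
            if node not in component and result[node] & component:
                result[node] = (result[node] - component) | {synthetic}
        # Add the synthetic node, depending on every node of the component.
        result[synthetic] = set(component)
    return result


merged = {
    "IS-IS Link": {"Phys. Link"},
    "OSPF Link": {"Interface 1", "Interface 2"},
    "Phys. Link": {"Interface 1", "Interface 2"},
    "Interface 1": {"Phys. Link"},
    "Interface 2": {"Phys. Link"},
}
transformed = remove_circular_dependencies(merged)
```

Applied to the merged graph of Figure 6, the result has the shape of Figure 7: both the IS-IS and OSPF links depend only on the synthetic node, which in turn depends on the two interfaces and the physical link.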
| 3.2. Intent and Assurance Graph | 3.2. Intent and Assurance Graph | |||
| The SAIN orchestrator analyzes the configuration of a service | The SAIN orchestrator analyzes the configuration of a service | |||
| instance to: | instance to do the following: | |||
| * Try to capture the intent of the service instance, i.e., what is | * Try to capture the intent of the service instance, i.e., What is | |||
| the service instance trying to achieve. At least, this requires | the service instance trying to achieve? At a minimum, this | |||
| the SAIN orchestrator to know the YANG modules that are being | requires the SAIN orchestrator to know the YANG modules that are | |||
| configured on the devices to enable the service. Note that if the | being configured on the devices to enable the service. Note that, | |||
| service model or the network model is known to the SAIN | if the service model or the network model is known to the SAIN | |||
| orchestrator, the latter can exploit it. In that case, the intent | orchestrator, the latter can exploit it. In that case, the intent | |||
| could be directly extracted and include more details, such as the | could be directly extracted and include more details, such as the | |||
| notion of sites for a VPN, which is out of scope of the device | notion of sites for a VPN, which is out of scope of the device | |||
| configuration. | configuration. | |||
| * Decompose the service instance into subservices representing the | * Decompose the service instance into subservices representing the | |||
| network features on which the service instance relies. | network features on which the service instance relies. | |||
| The SAIN orchestrator must be able to analyze configuration pushed to | The SAIN orchestrator must be able to analyze the configuration | |||
| various devices for configuring a service instance and produce the | pushed to various devices of a service instance and produce the | |||
| assurance graph for that service instance. | assurance graph for that service instance. | |||
| To schematize what a SAIN orchestrator does, assume that the | To schematize what a SAIN orchestrator does, assume that a service | |||
| configuration for a service instance touches two devices and | instance touches two devices and configures a virtual tunnel | |||
| configure on each device a virtual tunnel interface. Then: | interface on each device. Then: | |||
| * Capturing the intent would start by detecting that the service | * Capturing the intent would start by detecting that the service | |||
| instance is actually a tunnel between the two devices, and stating | instance is actually a tunnel between the two devices and stating | |||
| that this tunnel must be functional. This solution is minimally | that this tunnel must be operational. This solution is minimally | |||
| invasive as it does not require modifying or knowing the service | invasive, as it does not require modifying or knowing the service | |||
| model. If the service model or network model is known by the SAIN | model. If the service model or network model is known by the SAIN | |||
| orchestrator, it can be used to further capture the intent and | orchestrator, it can be used to further capture the intent and | |||
| include more information such as Service Level Objectives. For | include more information, such as Service-Level Objectives (e.g., | |||
| instance, the latency and bandwidth requirements for the tunnel, | the latency and bandwidth requirements for the tunnel) if present | |||
| if present in the service model | in the service model. | |||
| * Decomposing the service instance into subservices would result in | * Decomposing the service instance into subservices would result in | |||
| the assurance graph depicted in Figure 2, for instance. | the assurance graph depicted in Figure 2, for instance. | |||
| The assurance graph, or more precisely the subservices and | The assurance graph, or more precisely the subservices and | |||
| dependencies that a SAIN orchestrator can instantiate, should be | dependencies that a SAIN orchestrator can instantiate, should be | |||
| curated. The organization of such a process is out of scope for this | curated. The organization of such a process (i.e., ensuring that | |||
| document and should aim to: | existing subservices are reused as much as possible and avoiding | |||
| circular dependencies) is out of scope for this document. | ||||
| * Ensure that existing subservices are reused as much as possible. | ||||
| * Avoid circular dependencies. | ||||
| To be applied, SAIN requires a mechanism mapping a service instance | To be applied, SAIN requires a mechanism mapping a service instance | |||
| to the configuration actually required on the devices for that | to the configuration actually required on the devices for that | |||
| service instance to run. While the Figure 1 makes a distinction | service instance to run. While Figure 1 makes a distinction between | |||
| between the SAIN orchestrator and a different component providing the | the SAIN orchestrator and a different component providing the service | |||
| service instance configuration, in practice those two components are | instance configuration, in practice those two components are most | |||
| mostly likely combined. The internals of the orchestrator are out of | likely combined. The internals of the orchestrator are out of scope | |||
| scope of this document. | of this document. | |||
| 3.3. Subservices | 3.3. Subservices | |||
| A subservice corresponds to subpart or a feature of the network | A subservice corresponds to a subpart or a feature of the network | |||
| system that is needed for a service instance to function properly. | system that is needed for a service instance to function properly. | |||
| In the context of SAIN, a subservice is associated with its assurance, | In the context of SAIN, a subservice is associated with its assurance, | |||
| that is the method for assuring that a subservice behaves correctly. | which is the method for assuring that a subservice behaves correctly. | |||
| Subservices, just as with services, have high-level parameters that | Subservices, just as with services, have high-level parameters that | |||
| specify the instance to be assured. The needed parameters depend on | specify the instance to be assured. The needed parameters depend on | |||
| the subservice type. For example, assuring a device requires a | the subservice type. For example, assuring a device requires a | |||
| specific deviceId as parameter. For example, assuring an interface | specific deviceId as a parameter and assuring an interface requires a | |||
| requires a specific combination of deviceId and interfaceId. | specific combination of deviceId and interfaceId. | |||
| When designing a new type of subservice, one should carefully define | When designing a new type of subservice, one should carefully define | |||
| what is the assured object or functionality. Then, the parameters | what is the assured object or functionality. Then, the parameters | |||
| must be chosen as a minimal set that completely identify the object | must be chosen as a minimal set that completely identifies the object | |||
| (see examples from the previous paragraph). Parameters cannot change | (see examples from the previous paragraph). Parameters cannot change | |||
| during the lifecycle of a subservice. For instance, an IP address is | during the life cycle of a subservice. For instance, an IP address | |||
| a good parameter when assuring a connectivity towards that address | is a good parameter when assuring a connectivity towards that address | |||
| (i.e. a given device can reach a given IP address), however it's not | (i.e., a given device can reach a given IP address); however, it's | |||
| a good parameter to identify an interface as the IP address assigned | not a good parameter to identify an interface, as the IP address | |||
| to that interface can be changed. | assigned to that interface can be changed. | |||
| A subservice is also characterized by a list of metrics to fetch and | A subservice is also characterized by a list of metrics to fetch and | |||
| a list of operations to apply to these metrics in order to infer a | a list of operations to apply to these metrics in order to infer a | |||
| health status. | health status. | |||
| 3.4. Building the Expression Graph from the Assurance Graph | 3.4. Building the Expression Graph from the Assurance Graph | |||
| From the assurance graph is derived a so-called global computation | From the assurance graph, a so-called global computation graph is | |||
| graph. First, each subservice instance is transformed into a set of | derived. First, each subservice instance is transformed into a set | |||
| subservice expressions that take metrics and constants as input | of subservice expressions that take metrics and constants as input | |||
| (i.e., sources of the DAG) and produce the status of the subservice, | (i.e., sources of the DAG) and produce the status of the subservice | |||
| based on some heuristics. For instance, the health of an interface | based on some heuristics. For instance, the health of an interface | |||
| is 0 (minimal score) with the symptom "interface admin-down" if the | is 0 (minimal score) with the symptom "interface admin-down" if the | |||
| interface is disabled in the configuration. Then for each service | interface is disabled in the configuration. Then, for each service | |||
| instance, the service expressions are constructed by combining the | instance, the service expressions are constructed by combining the | |||
| subservice expressions of its dependencies. The way service | subservice expressions of its dependencies. The way service | |||
| expressions are combined depends on the dependency types (impacting | expressions are combined depends on the dependency types (impacting | |||
| or informational). Finally, the global computation graph is built by | or informational). Finally, the global computation graph is built by | |||
| combining the service expressions, to get a global view of all | combining the service expressions to get a global view of all | |||
| subservices. In other words, the global computation graph encodes | subservices. In other words, the global computation graph encodes | |||
| all the operations needed to produce health statuses from the | all the operations needed to produce health statuses from the | |||
| collected metrics. | collected metrics. | |||
| The two types of dependencies for combining subservices are: | The two types of dependencies for combining subservices are: | |||
| Informational Dependency: Type of dependency whose health score | Informational Dependency: | |||
| does not impact the health score of its parent subservice or | The type of dependency whose health score does not impact the | |||
| service instance(s) in the assurance graph. However, the symptoms | health score of its parent subservice or service instance(s) in | |||
| should be taken into account in the parent service instance or | the assurance graph. However, the symptoms should be taken into | |||
| subservice instance(s), for informational reasons. | account in the parent service instance or subservice instance(s) | |||
| for informational reasons. | ||||
| Impacting Dependency: Type of dependency whose score impacts the | Impacting Dependency: | |||
| score of its parent subservice or service instance(s) in the | The type of dependency whose health score impacts the health score | |||
| assurance graph. The symptoms are taken into account in the | of its parent subservice or service instance(s) in the assurance | |||
| parent service instance or subservice instance(s), as the | graph. The symptoms are taken into account in the parent service | |||
| impacting reasons. | instance or subservice instance(s) as the impacting reasons. | |||
| The set of dependency type presented here is not exhaustive. More | The set of dependency types presented here is not exhaustive. More | |||
| specific dependency types can be defined by extending the YANG model. | specific dependency types can be defined by extending the YANG | |||
| For instance, a connectivity subservice depending on several path | module. For instance, a connectivity subservice depending on several | |||
| subservices is only partially impacted if only one of these paths | path subservices is partially impacted if only one of these paths | |||
| fails. Adding these new dependency types requires defining the | fails. Adding these new dependency types requires defining the | |||
| corresponding operation for combining statuses of subservices. | corresponding operation for combining statuses of subservices. | |||
| Subservices shall not be dependent on the protocol used to retrieve | Subservices shall not be dependent on the protocol used to retrieve | |||
| the metrics. To justify this, let's consider the interface | the metrics. To justify this, let's consider the interface | |||
| operational status. Depending on the device capabilities, this | operational status. Depending on the device capabilities, this | |||
| status can be collected by an industry-accepted YANG module (IETF, | status can be collected by an industry-accepted YANG module (e.g., | |||
| OpenConfig [OpenConfig]), by a vendor-specific YANG module, or even | IETF or OpenConfig [OpenConfig]), by a vendor-specific YANG module, | |||
| by a MIB module. If the subservice was dependent on the mechanism to | or even by a MIB module. If the subservice was dependent on the | |||
| collect the operational status, then we would need multiple | mechanism to collect the operational status, then we would need | |||
| subservice definitions in order to support all different mechanisms. | multiple subservice definitions in order to support all different | |||
| This also implies that, while waiting for all the metrics to be | mechanisms. This also implies that, while waiting for all the | |||
| available via standard YANG modules, SAIN agents might have to | metrics to be available via standard YANG modules, SAIN agents might | |||
| retrieve metric values via non-standard YANG models, via MIB modules, | have to retrieve metric values via nonstandard YANG data models, MIB | |||
| Command Line Interface (CLI), etc., effectively implementing a | modules, the Command-Line Interface (CLI), etc., effectively | |||
| normalization layer between data models and information models. | implementing a normalization layer between data models and | |||
| information models. | ||||
| In order to keep subservices independent of metric collection method, | In order to keep subservices independent of metric collection method | |||
| or, expressed differently, to support multiple combinations of | (or, expressed differently, to support multiple combinations of | |||
| platforms, OSes, and even vendors, the architecture introduces the | platforms, OSes, and even vendors), the architecture introduces the | |||
| concept of "metric engine". The metric engine maps each device- | concept of "metric engine". The metric engine maps each device- | |||
| independent metric used in the subservices to a list of device- | independent metric used in the subservices to a list of device- | |||
| specific metric implementations that precisely define how to fetch | specific metric implementations that precisely define how to fetch | |||
| values for that metric. The mapping is parameterized by the | values for that metric. The mapping is parameterized by the | |||
| characteristics (model, OS version, etc.) of the device from which | characteristics (i.e., model, OS version, etc.) of the device from | |||
| the metrics are fetched. This metric engine is included in the SAIN | which the metrics are fetched. This metric engine is included in the | |||
| agent. | SAIN agent. | |||
| 3.5. Open Interfaces with YANG Modules | 3.5. Open Interfaces with YANG Modules | |||
| The interfaces between the architecture components are open thanks to | The interfaces between the architecture components are open thanks to | |||
| the YANG modules specified in | the YANG modules specified in [RFC9418]; they specify objects for | |||
| [I-D.ietf-opsawg-service-assurance-yang]; they specify objects for | ||||
| assuring network services based on their decomposition into so-called | assuring network services based on their decomposition into so-called | |||
| subservices, according to the SAIN architecture. | subservices, according to the SAIN architecture. | |||
| These modules are intended for the following use cases: | These modules are intended for the following use cases: | |||
| * Assurance graph configuration: | * Assurance graph configuration: | |||
| - Subservices: configure a set of subservices to assure, by | - Subservices: Configure a set of subservices to assure by | |||
| specifying their types and parameters. | specifying their types and parameters. | |||
| - Dependencies: configure the dependencies between the | - Dependencies: Configure the dependencies between the | |||
| subservices, along with their types. | subservices, along with their types. | |||
| * Assurance telemetry: export the health status of the subservices, | * Assurance telemetry: Export the health status of the subservices, | |||
| along with the observed symptoms. | along with the observed symptoms. | |||
| Some examples of YANG instances can be found in Appendix A of | Some examples of YANG instances can be found in Appendix A of | |||
| [I-D.ietf-opsawg-service-assurance-yang]. | [RFC9418]. | |||
| 3.6. Handling Maintenance Windows | 3.6. Handling Maintenance Windows | |||
| Whenever network components are under maintenance, the operator wants | Whenever network components are under maintenance, the operator wants | |||
| to inhibit the emission of symptoms from those components. A typical | to inhibit the emission of symptoms from those components. A typical | |||
| use case is device maintenance, during which the device is not | use case is device maintenance, during which the device is not | |||
| supposed to be operational. As such, symptoms related to the device | supposed to be operational. As such, symptoms related to the device | |||
| health should be ignored. Symptoms related to the device-specific | health should be ignored. Symptoms related to the device-specific | |||
| subservices, such as the interfaces, might also be ignored because | subservices, such as the interfaces, might also be ignored because | |||
| their state changes are probably the consequence of the maintenance. | their state changes are probably the consequence of the maintenance. | |||
| The ietf-service-assurance model proposed in | The ietf-service-assurance model described in [RFC9418] enables | |||
| [I-D.ietf-opsawg-service-assurance-yang] enables flagging subservices | flagging subservices as under maintenance and, in that case, requires | |||
| as under maintenance, and, in that case, requires a string that | a string that identifies the person or process that requested the | |||
| identifies the person or process who requested the maintenance. When | maintenance. When a service or subservice is flagged as under | |||
| a service or subservice is flagged as under maintenance, it must | maintenance, it must report a generic "Under Maintenance" symptom for | |||
| report a generic "Under Maintenance" symptom, for propagation towards | propagation towards subservices that depend on this specific | |||
| subservices that depend on this specific subservice. Any other | subservice. Any other symptom from this service or by one of its | |||
| symptom from this service, or by one of its impacting dependencies | impacting dependencies must not be reported. | |||
| must not be reported. | ||||
| We illustrate this mechanism on three independent examples based on | We illustrate this mechanism on three independent examples based on | |||
| the assurance graph depicted in Figure 2: | the assurance graph depicted in Figure 2: | |||
| * Device maintenance, for instance upgrading the device OS. The | * Device maintenance, for instance, upgrading the device OS. The | |||
| operator flags the subservice "Peer1" device as under maintenance. | operator flags the subservice "Peer1" device as under maintenance. | |||
| This inhibits the emission of symptoms, except "Under | This inhibits the emission of symptoms, except "Under Maintenance" | |||
| Maintenance", from "Peer1 Physical Interface", "Peer1 Tunnel | from "Peer1 Physical Interface", "Peer1 Tunnel Interface", and | |||
| Interface" and "Tunnel Service Instance". All other subservices | "Tunnel Service Instance". All other subservices are unaffected. | |||
| are unaffected. | ||||
| * Interface maintenance, for instance replacing a broken optic. The | * Interface maintenance, for instance, replacing a broken optic. | |||
| operator flags the subservice "Peer1 Physical Interface" as under | The operator flags the subservice "Peer1 Physical Interface" as | |||
| maintenance. This inhibits the emission of symptoms, except | under maintenance. This inhibits the emission of symptoms, except | |||
| "Under Maintenance" from "Peer1 Tunnel Interface" and "Tunnel | "Under Maintenance" from "Peer1 Tunnel Interface" and "Tunnel | |||
| Service Instance". All other subservices are unaffected. | Service Instance". All other subservices are unaffected. | |||
| * Routing protocol maintenance, for instance modifying parameters or | * Routing protocol maintenance, for instance, modifying parameters | |||
| redistribution. The operator marks the subservice "IS-IS Routing | or redistribution. The operator marks the subservice "IS-IS | |||
| Protocol" as under maintenance. This inhibits the emission of | Routing Protocol" as under maintenance. This inhibits the | |||
| symptoms, except "Under Maintenance", from "IP connectivity" and | emission of symptoms, except "Under Maintenance" from "IP | |||
| "Tunnel Service Instance". All other subservices are unaffected. | connectivity" and "Tunnel Service Instance". All other | |||
| subservices are unaffected. | ||||
| In each example above, the subservice under maintenance is completely | In each example above, the subservice under maintenance is completely | |||
| impacting the service instance, putting it under maintenance as well. | impacting the service instance, putting it under maintenance as well. | |||
| There are use cases where the subservice under maintenance only | There are use cases where the subservice under maintenance only | |||
| partially impacts the service instance. For instance, consider a | partially impacts the service instance. For instance, consider a | |||
| service instance supported by both a primary and backup path. If a | service instance supported by both a primary and backup path. If a | |||
| subservice impacting the primary path is under maintenance, the | subservice impacting the primary path is under maintenance, the | |||
| service instance might still be functional but degraded. In that | service instance might still be functional but degraded. In that | |||
| case, the status of the service instance might include "Primary path | case, the status of the service instance might include "Primary path | |||
| Under Maintenance", "No redundancy" as well as other symptoms from | Under Maintenance", "No redundancy", as well as other symptoms from | |||
| the backup path to explain the lower health score. In general, the | the backup path to explain the lower health score. In general, the | |||
| computation of the service instance status from the subservices is | computation of the service instance status from the subservices is | |||
| done in the SAIN collector whose implementation is out of scope for | done in the SAIN collector whose implementation is out of scope for | |||
| this document. | this document. | |||
| The maintenance of a subservice might modify or hide modifications of | The maintenance of a subservice might modify or hide modifications of | |||
| the structure of the assurance graph. Therefore, unflagging a | the structure of the assurance graph. Therefore, unflagging a | |||
| subservice as under maintenance should trigger an update of the | subservice as under maintenance should trigger an update of the | |||
| assurance graph. | assurance graph. | |||
| 3.7. Flexible Functional Architecture | 3.7. Flexible Functional Architecture | |||
| The SAIN architecture is flexible in terms of components. While the | The SAIN architecture is flexible in terms of components. While the | |||
| SAIN architecture in Figure 1 makes a distinction between two | SAIN architecture in Figure 1 makes a distinction between two | |||
| components, the service orchestrator and the SAIN orchestrator, in | components, the service orchestrator and the SAIN orchestrator, in | |||
| practice those two components are mostly likely combined. Similarly, | practice the two components are most likely combined. Similarly, the | |||
| the SAIN agents are displayed in Figure 1 as being separate | SAIN agents are displayed in Figure 1 as being separate components. | |||
| components. Practically, the SAIN agents could be either independent | In practice, the SAIN agents could be either independent components | |||
| components or directly integrated in monitored entities. A practical | or directly integrated in monitored entities. A practical example is | |||
| example is an agent in a router. | an agent in a router. | |||
| The SAIN architecture is also flexible in terms of services and | The SAIN architecture is also flexible in terms of services and | |||
| subservices. In the proposed architecture, the SAIN orchestrator is | subservices. In the defined architecture, the SAIN orchestrator is | |||
| coupled to a service orchestrator which defines the kinds of services | coupled to a service orchestrator, which defines the kinds of | |||
| that the architecture handles. Most examples in this document deal | services that the architecture handles. Most examples in this | |||
| with the notion of Network Service YANG modules, with well-known | document deal with the notion of Network Service YANG Modules with | |||
| services such as L2VPN or tunnels. However, the concept of services | well-known services, such as L2VPN or tunnels. However, the concept | |||
| is general enough to cross into different domains. One of them is | of services is general enough to cross into different domains. One | |||
| the domain of service management on network elements, which also | of them is the domain of service management on network elements, | |||
| require their own assurance. Examples include a DHCP server on a | which also require their own assurance. Examples include a DHCP | |||
| Linux server, a data plane, an IPFIX export, etc. The notion of | server on a Linux server, a data plane, an IPFIX export, etc. The | |||
| "service" is generic in this architecture and depends on the service | notion of "service" is generic in this architecture and depends on | |||
| orchestrator and underlying network system, as illustrated by the | the service orchestrator and underlying network system, as | |||
| following examples: | illustrated by the following examples: | |||
| * if a main service orchestrator coordinates several lower level | * If a main service orchestrator coordinates several lower-level | |||
| controllers, a service for the controller can be a subservice from | controllers, a service for the controller can be a subservice from | |||
| the point of view of the orchestrator. | the point of view of the orchestrator. | |||
| * A DHCP server/data plane/IPFIX export can be considered as | * A DHCP server / data plane / IPFIX export can be considered | |||
| subservices for a device. | subservices for a device. | |||
| * A routing instance can be considered as a subservice for a L3VPN. | * A routing instance can be considered a subservice for an L3VPN. | |||
| * A tunnel can be considered as a subservice for an application in | * A tunnel can be considered a subservice for an application in the | |||
| the cloud. | cloud. | |||
| * A service function can be considered as a subservice for a service | * A service function can be considered a subservice for a service | |||
| function chain [RFC7665]. | function chain [RFC7665]. | |||
| The assurance graph is created to be flexible and open, regardless of | The assurance graph is created to be flexible and open, regardless of | |||
| the subservice types, locations, or domains. | the subservice types, locations, or domains. | |||
| The SAIN architecture is also flexible in terms of distributed | The SAIN architecture is also flexible in terms of distributed | |||
| graphs. As shown in Figure 1, the architecture comprises several | graphs. As shown in Figure 1, the architecture comprises several | |||
| agents. Each agent is responsible for handling a subgraph of the | agents. Each agent is responsible for handling a subgraph of the | |||
| assurance graph. The collector is responsible for fetching the sub- | assurance graph. The collector is responsible for fetching the | |||
| graphs from the different agents and gluing them together. As an | subgraphs from the different agents and gluing them together. As an | |||
| example, in the graph from Figure 2, the subservices relative to Peer | example, in the graph from Figure 2, the subservices relative to Peer | |||
| 1 might be handled by a different agent than the subservices relative | 1 might be handled by a different agent than the subservices relative | |||
| to Peer 2 and the Connectivity and IS-IS subservices might be handled | to Peer 2, and the Connectivity and IS-IS subservices might be | |||
| by yet another agent. The agents will export their partial graph and | handled by yet another agent. The agents will export their partial | |||
| the collector will stitch them together as dependencies of the | graph, and the collector will stitch them together as dependencies of | |||
| service instance. | the service instance. | |||
| And finally, the SAIN architecture is flexible in terms of what it | And finally, the SAIN architecture is flexible in terms of what it | |||
| monitors. Most, if not all examples, in this document refer to | monitors. Most, if not all, examples in this document refer to | |||
| physical components, but this is not a constraint. Indeed, the | physical components, but this is not a constraint. Indeed, the | |||
| assurance of virtual components would follow the same principles and | assurance of virtual components would follow the same principles, and | |||
| an assurance graph composed of virtualized components (or a mix of | an assurance graph composed of virtualized components (or a mix of | |||
| virtualized and physical ones) is supported by this architecture. | virtualized and physical ones) is supported by this architecture. | |||
| 3.8. Time window for symptoms history | 3.8. Time Window for Symptoms' History | |||
| The health status reported via the YANG modules contains, for each | The health status reported via the YANG modules contains, for each | |||
| subservice, the list of symptoms. Symptoms have a start and end | subservice, the list of symptoms. Symptoms have a start and end | |||
| date, making it possible to report symptoms that are no longer | date, making it possible to report symptoms that are no longer | |||
| occurring. | occurring. | |||
| The SAIN agent might have to remove some symptoms for specific | The SAIN agent might have to remove some symptoms for specific | |||
| subservices, because there are outdated and not relevant any | subservices because they are outdated and no longer relevant | |||
| longer, or simply because the SAIN agent needs to free up some space. | or simply because the SAIN agent needs to free up some space. | |||
| Regardless of the reason, it's important for a SAIN collector | Regardless of the reason, it's important for a SAIN collector | |||
| (re-)connecting to a SAIN agent to understand the effect of this | connecting/reconnecting to a SAIN agent to understand the effect of | |||
| garbage collection. | this garbage collection. | |||
| Therefore, the SAIN agent contains a YANG object specifying the date | Therefore, the SAIN agent contains a YANG object specifying the date | |||
| and time at which the symptoms' history starts for the subservice | and time at which the symptoms' history starts for the subservice | |||
| instances. The subservice reports only symptoms that are occurring | instances. The subservice reports only symptoms that are occurring | |||
| or that have been occurring after the history start date. | or that have been occurring after the history start date. | |||
| 3.9. New Assurance Graph Generation | 3.9. New Assurance Graph Generation | |||
| The assurance graph will change over time, because services and | The assurance graph will change over time, because services and | |||
| subservices come and go (changing the dependencies between | subservices come and go (changing the dependencies between | |||
| subservices), or as a result of resolving maintenance issues. | subservices) or as a result of resolving maintenance issues. | |||
| Therefore, an assurance graph version must be maintained, along with | Therefore, an assurance graph version must be maintained, along with | |||
| the date and time of its last generation. The date and time of a | the date and time of its last generation. The date and time of a | |||
| particular subservice instance (again dependencies or under | particular subservice instance (again dependencies or under | |||
| maintenance) might be kept. From a client point of view, an | maintenance) might be kept. From a client point of view, an | |||
| assurance graph change is triggered by the value of the assurance- | assurance graph change is triggered by the value of the assurance- | |||
| graph-version and assurance-graph-last-change YANG leaves. At that | graph-version and assurance-graph-last-change YANG leaves. At that | |||
| point in time, the client (collector) follows the following process: | point in time, the client (collector) follows the following process: | |||
| * Keep the previous assurance-graph-last-change value (let's call it | * Keep the previous assurance-graph-last-change value (let's call it | |||
| time T) | time T). | |||
| * Run through all subservice instances and process the subservice | * Run through all the subservice instances and process the | |||
| instances for which the last-change is newer that the time T | subservice instances for which the last-change is newer than the | |||
| | time T. | |||
| * Keep the new assurance-graph-last-change as the new referenced | * Keep the new assurance-graph-last-change as the new referenced | |||
| date and time | date and time. | |||
| 4. Security Considerations | 4. IANA Considerations | |||
| | This document has no IANA actions. | |||
| | 5. Security Considerations | |||
| The SAIN architecture helps operators to reduce the mean time to | The SAIN architecture helps operators to reduce the mean time to | |||
| detect and mean time to repair. However, the SAIN agents must be | detect and the mean time to repair. However, the SAIN agents must be | |||
| secured: a compromised SAIN agent may be sending wrong root causes or | secured; a compromised SAIN agent may be sending incorrect root | |||
| symptoms to the management systems. Securing the agents falls back | causes or symptoms to the management systems. Securing the agents | |||
| to ensuring the integrity and confidentiality of the assurance graph. | falls back to ensuring the integrity and confidentiality of the | |||
| This can be partially achieved by correctly setting permissions of | assurance graph. This can be partially achieved by correctly setting | |||
| each node in the YANG model as described in Section 6 of | permissions of each node in the YANG data model, as described in | |||
| [I-D.ietf-opsawg-service-assurance-yang]. | Section 6 of [RFC9418]. | |||
| Except for the configuration of telemetry, the agents do not need | Except for the configuration of telemetry, the agents do not need | |||
| "write access" to the devices they monitor. This configuration is | "write access" to the devices they monitor. This configuration is | |||
| applied with a YANG module, whose protection is covered by Secure | applied with a YANG module, whose protection is covered by Secure | |||
| Shell (SSH) [RFC6242] for NETCONF or TLS [RFC8446] for RESTCONF. | Shell (SSH) [RFC6242] for the Network Configuration Protocol | |||
| Devices should be configured so that agents have their own | (NETCONF) or TLS [RFC8446] for RESTCONF. Devices should be | |||
| credentials with write access only for the YANG nodes configuring the | configured so that agents have their own credentials with write | |||
| telemetry. | access only for the YANG nodes configuring the telemetry. | |||
| The data collected by SAIN could potentially be compromising to the | The data collected by SAIN could potentially be compromising to the | |||
| network or provide more insight into how the network is designed. | network or provide more insight into how the network is designed. | |||
| Considering the data that SAIN requires (including CLI access in some | Considering the data that SAIN requires (including CLI access in some | |||
| cases), one should weigh data access concerns with the impact that | cases), one should weigh data access concerns with the impact that | |||
| reduced visibility will have on being able to rapidly identify root | reduced visibility will have on being able to rapidly identify root | |||
| causes. | causes. | |||
| For building the assurance graph, the SAIN orchestrator needs to | For building the assurance graph, the SAIN orchestrator needs to | |||
| obtain the configuration from the service orchestrator. The latter | obtain the configuration from the service orchestrator. The latter | |||
| should restrict access of the SAIN orchestrator to information needed | should restrict access of the SAIN orchestrator to information needed | |||
| to build the assurance graph. | to build the assurance graph. | |||
| If a closed loop system relies on this architecture then the well | If a closed loop system relies on this architecture, then the well- | |||
| known issue of those systems also applies, i.e., a lying device or | known issue of those systems also applies, i.e., a lying device or | |||
| compromised agent could trigger partial reconfiguration of the | compromised agent could trigger partial reconfiguration of the | |||
| service or network. The SAIN architecture neither augments nor | service or network. The SAIN architecture neither augments nor | |||
| reduces this risk. An extension of SAIN, out of scope for this | reduces this risk. An extension of SAIN, which is out of scope for | |||
| document, could detect discrepancies between symptoms reported by | this document, could detect discrepancies between symptoms reported | |||
| different agents and thus detect anomalies if an agent or a device is | by different agents, and thus detect anomalies if an agent or a | |||
| lying. | device is lying. | |||
| If NTP service goes down, the devices clocks might lose their | If NTP service goes down, the devices clocks might lose their | |||
| synchronization. In that case, correlating information from | synchronization. In that case, correlating information from | |||
| different devices, such as detecting symptoms about a link or | different devices, such as detecting symptoms about a link or | |||
| correlating symptoms from different devices, will give inaccurate | correlating symptoms from different devices, will give inaccurate | |||
| results. | results. | |||
| 5. IANA Considerations | 6. References | |||
| This document includes no request to IANA. | ||||
| 6. Contributors | ||||
| * Youssef El Fathi | ||||
| * Eric Vyncke | ||||
| 7. References | ||||
| 7.1. Normative References | ||||
| [I-D.ietf-opsawg-service-assurance-yang] | 6.1. Normative References | |||
| Claise, B., Quilbeuf, J., Lucente, P., Fasano, P., and T. | ||||
| Arumugam, "YANG Modules for Service Assurance", Work in | ||||
| Progress, Internet-Draft, draft-ietf-opsawg-service- | ||||
| assurance-yang-10, 28 November 2022, | ||||
| <https://www.ietf.org/archive/id/draft-ietf-opsawg- | ||||
| service-assurance-yang-10.txt>. | ||||
| [RFC8309] Wu, Q., Liu, W., Farrel, A., and RFC Publisher, "Service | [RFC8309] Wu, Q., Liu, W., and A. Farrel, "Service Models | |||
| Models Explained", RFC 8309, DOI 10.17487/RFC8309, January | Explained", RFC 8309, DOI 10.17487/RFC8309, January 2018, | |||
| 2018, <https://www.rfc-editor.org/info/rfc8309>. | <https://www.rfc-editor.org/info/rfc8309>. | |||
| [RFC8969] Wu, Q., Ed., Boucadair, M., Ed., Lopez, D., Xie, C., Geng, | [RFC8969] Wu, Q., Ed., Boucadair, M., Ed., Lopez, D., Xie, C., and | |||
| L., and RFC Publisher, "A Framework for Automating Service | L. Geng, "A Framework for Automating Service and Network | |||
| and Network Management with YANG", RFC 8969, | Management with YANG", RFC 8969, DOI 10.17487/RFC8969, | |||
| DOI 10.17487/RFC8969, January 2021, | January 2021, <https://www.rfc-editor.org/info/rfc8969>. | |||
| <https://www.rfc-editor.org/info/rfc8969>. | ||||
| 7.2. Informative References | [RFC9418] Claise, B., Quilbeuf, J., Lucente, P., Fasano, P., and T. | |||
| | Arumugam, "YANG Modules for Service Assurance", RFC 9418, | |||
| | DOI 10.17487/RFC9418, June 2023, | |||
| | <https://www.rfc-editor.org/info/rfc9418>. | |||
| [I-D.ietf-opsawg-yang-vpn-service-pm] | 6.2. Informative References | |||
| Wu, B., Wu, Q., Boucadair, M., de Dios, O. G., and B. Wen, | ||||
| "A YANG Model for Network and VPN Service Performance | ||||
| Monitoring", Work in Progress, Internet-Draft, draft-ietf- | ||||
| opsawg-yang-vpn-service-pm-15, 11 November 2022, | ||||
| <https://www.ietf.org/archive/id/draft-ietf-opsawg-yang- | ||||
| vpn-service-pm-15.txt>. | ||||
| [OpenConfig] | [OpenConfig] | |||
| "OpenConfig", <https://openconfig.net>. | "OpenConfig", <https://openconfig.net>. | |||
| [Piovesan2017] | [Piovesan2017] | |||
| Piovesan, A. and E. Griffor, "Reasoning About Safety and | Piovesan, A. and E. Griffor, "7 - Reasoning About Safety | |||
| Security: The Logic of Assurance", 2017, | and Security: The Logic of Assurance", | |||
| | DOI 10.1016/B978-0-12-803773-7.00007-3, 2017, | |||
| <https://doi.org/10.1016/B978-0-12-803773-7.00007-3>. | <https://doi.org/10.1016/B978-0-12-803773-7.00007-3>. | |||
| [RFC2865] Rigney, C., Willens, S., Rubens, A., Simpson, W., and RFC | [RFC2865] Rigney, C., Willens, S., Rubens, A., and W. Simpson, | |||
| Publisher, "Remote Authentication Dial In User Service | "Remote Authentication Dial In User Service (RADIUS)", | |||
| (RADIUS)", RFC 2865, DOI 10.17487/RFC2865, June 2000, | RFC 2865, DOI 10.17487/RFC2865, June 2000, | |||
| <https://www.rfc-editor.org/info/rfc2865>. | <https://www.rfc-editor.org/info/rfc2865>. | |||
| [RFC5424] Gerhards, R. and RFC Publisher, "The Syslog Protocol", | [RFC5424] Gerhards, R., "The Syslog Protocol", RFC 5424, | |||
| RFC 5424, DOI 10.17487/RFC5424, March 2009, | DOI 10.17487/RFC5424, March 2009, | |||
| <https://www.rfc-editor.org/info/rfc5424>. | <https://www.rfc-editor.org/info/rfc5424>. | |||
| [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., Kasch, W., and | [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, | |||
| RFC Publisher, "Network Time Protocol Version 4: Protocol | "Network Time Protocol Version 4: Protocol and Algorithms | |||
| and Algorithms Specification", RFC 5905, | Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010, | |||
| DOI 10.17487/RFC5905, June 2010, | ||||
| <https://www.rfc-editor.org/info/rfc5905>. | <https://www.rfc-editor.org/info/rfc5905>. | |||
| [RFC6242] Wasserman, M. and RFC Publisher, "Using the NETCONF | [RFC6242] Wasserman, M., "Using the NETCONF Protocol over Secure | |||
| Protocol over Secure Shell (SSH)", RFC 6242, | Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, June 2011, | |||
| DOI 10.17487/RFC6242, June 2011, | ||||
| <https://www.rfc-editor.org/info/rfc6242>. | <https://www.rfc-editor.org/info/rfc6242>. | |||
| [RFC7011] Claise, B., Ed., Trammell, B., Ed., Aitken, P., and RFC | [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, | |||
| Publisher, "Specification of the IP Flow Information | "Specification of the IP Flow Information Export (IPFIX) | |||
| Export (IPFIX) Protocol for the Exchange of Flow | Protocol for the Exchange of Flow Information", STD 77, | |||
| Information", STD 77, RFC 7011, DOI 10.17487/RFC7011, | RFC 7011, DOI 10.17487/RFC7011, September 2013, | |||
| September 2013, <https://www.rfc-editor.org/info/rfc7011>. | <https://www.rfc-editor.org/info/rfc7011>. | |||
| [RFC7149] Boucadair, M., Jacquenet, C., and RFC Publisher, | [RFC7149] Boucadair, M. and C. Jacquenet, "Software-Defined | |||
| "Software-Defined Networking: A Perspective from within a | Networking: A Perspective from within a Service Provider | |||
| Service Provider Environment", RFC 7149, | Environment", RFC 7149, DOI 10.17487/RFC7149, March 2014, | |||
| DOI 10.17487/RFC7149, March 2014, | ||||
| <https://www.rfc-editor.org/info/rfc7149>. | <https://www.rfc-editor.org/info/rfc7149>. | |||
| [RFC7665] Halpern, J., Ed., Pignataro, C., Ed., and RFC Publisher, | [RFC7665] Halpern, J., Ed. and C. Pignataro, Ed., "Service Function | |||
| "Service Function Chaining (SFC) Architecture", RFC 7665, | Chaining (SFC) Architecture", RFC 7665, | |||
| DOI 10.17487/RFC7665, October 2015, | DOI 10.17487/RFC7665, October 2015, | |||
| <https://www.rfc-editor.org/info/rfc7665>. | <https://www.rfc-editor.org/info/rfc7665>. | |||
| [RFC7950] Bjorklund, M., Ed. and RFC Publisher, "The YANG 1.1 Data | [RFC7950] Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language", | |||
| Modeling Language", RFC 7950, DOI 10.17487/RFC7950, August | RFC 7950, DOI 10.17487/RFC7950, August 2016, | |||
| 2016, <https://www.rfc-editor.org/info/rfc7950>. | <https://www.rfc-editor.org/info/rfc7950>. | |||
| [RFC8199] Bogdanovic, D., Claise, B., Moberg, C., and RFC Publisher, | [RFC8199] Bogdanovic, D., Claise, B., and C. Moberg, "YANG Module | |||
| "YANG Module Classification", RFC 8199, | Classification", RFC 8199, DOI 10.17487/RFC8199, July | |||
| DOI 10.17487/RFC8199, July 2017, | 2017, <https://www.rfc-editor.org/info/rfc8199>. | |||
| <https://www.rfc-editor.org/info/rfc8199>. | ||||
| [RFC8446] Rescorla, E. and RFC Publisher, "The Transport Layer | [RFC8446] Rescorla, E., "The Transport Layer Security (TLS) Protocol | |||
| Security (TLS) Protocol Version 1.3", RFC 8446, | Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018, | |||
| DOI 10.17487/RFC8446, August 2018, | ||||
| <https://www.rfc-editor.org/info/rfc8446>. | <https://www.rfc-editor.org/info/rfc8446>. | |||
| [RFC8466] Wen, B., Fioccola, G., Ed., Xie, C., Jalil, L., and RFC | [RFC8466] Wen, B., Fioccola, G., Ed., Xie, C., and L. Jalil, "A YANG | |||
| Publisher, "A YANG Data Model for Layer 2 Virtual Private | Data Model for Layer 2 Virtual Private Network (L2VPN) | |||
| Network (L2VPN) Service Delivery", RFC 8466, | Service Delivery", RFC 8466, DOI 10.17487/RFC8466, October | |||
| DOI 10.17487/RFC8466, October 2018, | 2018, <https://www.rfc-editor.org/info/rfc8466>. | |||
| <https://www.rfc-editor.org/info/rfc8466>. | ||||
| [RFC8641] Clemm, A., Voit, E., and RFC Publisher, "Subscription to | [RFC8641] Clemm, A. and E. Voit, "Subscription to YANG Notifications | |||
| YANG Notifications for Datastore Updates", RFC 8641, | for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641, | |||
| DOI 10.17487/RFC8641, September 2019, | September 2019, <https://www.rfc-editor.org/info/rfc8641>. | |||
| <https://www.rfc-editor.org/info/rfc8641>. | ||||
| [RFC8907] Dahm, T., Ota, A., Medway Gash, D.C., Carrel, D., Grant, | [RFC8907] Dahm, T., Ota, A., Medway Gash, D.C., Carrel, D., and L. | |||
| L., and RFC Publisher, "The Terminal Access Controller | Grant, "The Terminal Access Controller Access-Control | |||
| Access-Control System Plus (TACACS+) Protocol", RFC 8907, | System Plus (TACACS+) Protocol", RFC 8907, | |||
| DOI 10.17487/RFC8907, September 2020, | DOI 10.17487/RFC8907, September 2020, | |||
| <https://www.rfc-editor.org/info/rfc8907>. | <https://www.rfc-editor.org/info/rfc8907>. | |||
| [RFC9315] Clemm, A., Ciavaglia, L., Granville, L. Z., Tantsura, J., | [RFC9315] Clemm, A., Ciavaglia, L., Granville, L. Z., and J. | |||
| and RFC Publisher, "Intent-Based Networking - Concepts and | Tantsura, "Intent-Based Networking - Concepts and | |||
| Definitions", RFC 9315, DOI 10.17487/RFC9315, October | Definitions", RFC 9315, DOI 10.17487/RFC9315, October | |||
| 2022, <https://www.rfc-editor.org/info/rfc9315>. | 2022, <https://www.rfc-editor.org/info/rfc9315>. | |||
| Appendix A. Changes between revisions | [RFC9375] Wu, B., Ed., Wu, Q., Ed., Boucadair, M., Ed., Gonzalez de | |||
| Dios, O., and B. Wen, "A YANG Data Model for Network and | ||||
| [[RFC editor: please remove this section before publication.]] | VPN Service Performance Monitoring", RFC 9375, | |||
| DOI 10.17487/RFC9375, April 2023, | ||||
| v12 - 13 | <https://www.rfc-editor.org/info/rfc9375>. | |||
| * Addressing IESG telechat feedback | ||||
| v11 - 12 | ||||
| * Addressing comments from Last call | ||||
| v10 - v11 | ||||
| * Adding reference to example of network performance model | ||||
| v09 - v10 | ||||
| * Addressing comments from Rob Wilton | ||||
| v08 - v09 | ||||
| * Addressing comments from Michael Richardson | ||||
| v07 - v08 | ||||
| * Propagating removal of under-maintenance flag from the YANG module | ||||
| v06-07 | ||||
| Addressing comments from Dhruv Dhody and applying pending changes | ||||
| v03 - v04 | ||||
| * Address comments from Mohamed Boucadair | ||||
| v00 - v01 | ||||
| * Cover the feedback received during the WG call for adoption | ||||
| Acknowledgements | Acknowledgements | |||
| The authors would like to thank Stephane Litkowski, Charles Eckel, | The authors would like to thank Stephane Litkowski, Charles Eckel, | |||
| Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, Stefan Vallin, | Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, Stefan Vallin, | |||
| Eric Vyncke, Mohamed Boucadair, Dhruv Dhody, Michael Richardson and | Éric Vyncke, Mohamed Boucadair, Dhruv Dhody, Michael Richardson, and | |||
| Rob Wilton for their reviews and feedback. | Rob Wilton for their reviews and feedback. | |||
| | Contributors | |||
| | * Youssef El Fathi | |||
| | * Éric Vyncke | |||
| Authors' Addresses | Authors' Addresses | |||
| Benoit Claise | Benoit Claise | |||
| Huawei | Huawei | |||
| Email: benoit.claise@huawei.com | Email: benoit.claise@huawei.com | |||
| Jean Quilbeuf | Jean Quilbeuf | |||
| Huawei | Huawei | |||
| Email: jean.quilbeuf@huawei.com | Email: jean.quilbeuf@huawei.com | |||
| Diego R. Lopez | Diego R. Lopez | |||
| Telefonica I+D | Telefonica I+D | |||
| Don Ramon de la Cruz, 82 | Don Ramon de la Cruz, 82 | |||
| Madrid 28006 | 28006 Madrid | |||
| Spain | Spain | |||
| Email: diego.r.lopez@telefonica.com | Email: diego.r.lopez@telefonica.com | |||
| Dan Voyer | Dan Voyer | |||
| Bell Canada | Bell Canada | |||
| Canada | Canada | |||
| Email: daniel.voyer@bell.ca | Email: daniel.voyer@bell.ca | |||
| Thangam Arumugam | Thangam Arumugam | |||
| Cisco Systems, Inc. | Consultant | |||
| Milpitas (California), | Milpitas, California | |||
| United States of America | United States of America | |||
| Email: tarumuga@cisco.com | Email: thangavelu@yahoo.com | |||
| End of changes. 151 change blocks. | ||||
| 591 lines changed or deleted | 545 lines changed or added | |||
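The collector process listed in Section 3.9 (keep the previous assurance-graph-last-change as time T, process the subservice instances whose last-change is newer than T, then keep the new value as the reference) can be illustrated with a minimal sketch. Only the leaf names assurance-graph-version, assurance-graph-last-change, and last-change come from the text; the `Subservice` and `Collector` classes, the `on_graph_update` method, and the in-memory representation are assumptions made for illustration and are not defined by either document.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Subservice:
    # Hypothetical in-memory view of one subservice instance.
    id: str
    last_change: datetime  # mirrors the per-instance last-change value


class Collector:
    """Hypothetical collector state for tracking assurance graph changes."""

    def __init__(self) -> None:
        self.graph_version: Optional[int] = None
        # Previous assurance-graph-last-change, called "time T" in the text.
        self.reference_time: Optional[datetime] = None

    def on_graph_update(self, version: int, last_change: datetime,
                        subservices: List[Subservice]) -> List[str]:
        """Return the ids of the subservice instances to (re)process."""
        # No change in either leaf: nothing to do.
        if (version == self.graph_version
                and last_change == self.reference_time):
            return []
        # Step 1: keep the previous assurance-graph-last-change as T.
        t = self.reference_time
        # Step 2: select instances whose last-change is newer than T.
        # On the first poll there is no T, so every instance is selected.
        changed = [s.id for s in subservices
                   if t is None or s.last_change > t]
        # Step 3: keep the new value as the new reference date and time.
        self.graph_version = version
        self.reference_time = last_change
        return changed
```

A collector built this way would call `on_graph_update` each time it reads the two graph-level leaves, and would refetch only the returned subservice instances. Treating the first poll as "everything changed" is a design choice of this sketch, not something the text prescribes.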