<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="4"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>

<rfc xmlns:xi="http://www.w3.org/2001/XInclude" submissionType="IETF" category="info"
consensus="true" docName="draft-ietf-opsawg-service-assurance-architecture-13" number="9417" ipr="trust200902" obsoletes="" updates="" xml:lang="en" tocInclude="true"
tocDepth="4" symRefs="true" sortRefs="true" version="3">

  <!-- xml2rfc v2v3 conversion 3.16.0 -->
  <front>
    <title abbrev="SAIN Architecture">Service Assurance for Intent-Based Networking Architecture</title>
    <seriesInfo name="RFC" value="9417"/>
    <author fullname="Benoit Claise" initials="B" surname="Claise">
      <organization>Huawei</organization>
      <address>
        <email>benoit.claise@huawei.com</email>
      </address>
    </author>
    <author fullname="Jean Quilbeuf" initials="J" surname="Quilbeuf">
      <organization>Huawei</organization>
      <address>
        <email>jean.quilbeuf@huawei.com</email>
      </address>
    </author>
    <author fullname="Diego R. Lopez" initials="D" surname="Lopez">
      <organization>Telefonica I+D</organization>
      <address>
        <postal>
          <street>Don Ramon de la Cruz, 82</street>
          <city>Madrid</city>
          <code>28006</code>
          <country>Spain</country>
        </postal>
        <email>diego.r.lopez@telefonica.com</email>
      </address>
    </author>
    <author fullname="Dan Voyer" initials="D" surname="Voyer">
      <organization>Bell Canada</organization>
      <address>
        <postal>
          <street/>
          <city/>
          <country>Canada</country>
        </postal>
        <email>daniel.voyer@bell.ca</email>
      </address>
    </author>
    <author fullname="Thangam Arumugam" initials="T" surname="Arumugam">
      <organization>Cisco Systems, Inc.</organization>
      <organization>Consultant</organization>
      <address>
        <postal>
          <street/>
          <city>Milpitas</city>
          <region>California</region>
          <country>United States of America</country>
        </postal>
        <email>tarumuga@cisco.com</email>
        <email>thangavelu@yahoo.com</email>
      </address>
    </author>
    <date year="2023" month="June"/>
    <area>ops</area>
    <workgroup>opsawg</workgroup>
    <abstract>
      <t>
        This document describes an architecture that provides some assurance that service instances are running as expected.
        As services rely upon multiple subservices provided by a variety of elements, including the underlying network devices and functions,
          getting the assurance of a healthy service is only possible with a holistic view of all involved elements.
          This architecture not only helps to correlate the service degradation with symptoms of a specific network component but also lists the services impacted by the failure or degradation of a specific network component.
      </t>
    </abstract>
  </front>
  <middle>

    <section anchor="intro" numbered="true" toc="default">
      <name>Introduction</name>

       <t>
        Network Service YANG Modules <xref target="RFC8199" format="default"/> describe the configuration, state data, operations, and notifications of abstract representations of services implemented on one or multiple network elements.
      </t>
      <t>
        Service orchestrators use Network Service YANG Modules that will infer network-wide configuration and, therefore, the invocation of the appropriate device modules (<xref target="RFC8969" format="default" sectionFormat="of" section="3"/>).
           Knowing that a configuration is applied doesn't imply that the provisioned service instance is up and running as expected.
           For instance, the service might be degraded because of a failure in the network, the service quality may be degraded, or a service function may be reachable at the IP level but not provide its intended function.
           Thus, the network operator must monitor the service's operational data at the same time as the configuration (<xref target="RFC8969" format="default" sectionFormat="of" section="3.3"/>).
           To fuel that task, the industry has been standardizing on telemetry to push network element performance information (e.g., <xref target="RFC9375" format="default"/>).
      </t>
      <t>
        A network administrator needs to monitor its network and services as a whole, independently of the management protocols.
           With different protocols come different data models and different ways to model the same type of information.
           When network administrators deal with multiple management protocols, the network management entities have to perform the difficult and time-consuming job of mapping data models,
           e.g., the model used for configuration with the model used for monitoring when separate models or protocols are used.
           This problem is compounded by a large, disparate set of data sources (e.g., MIB modules, YANG data models <xref target="RFC7950" format="default"/>, IP Flow Information Export (IPFIX) information elements <xref target="RFC7011" format="default"/>, syslog plain text <xref target="RFC5424" format="default"/>, Terminal Access Controller Access-Control System Plus (TACACS+) <xref target="RFC8907" format="default"/>, RADIUS <xref target="RFC2865" format="default"/>, etc.).
           In order to avoid this data model mapping, the industry converged on model-driven telemetry to stream the service operational data, reusing the YANG data models used for configuration.
           Model-driven telemetry greatly facilitates the notion of closed-loop automation, whereby events and updated operational states streamed from the network drive remediation change back into the network.
      </t>
      <t>
        However, it proves difficult for network operators to correlate the service degradation with the network root cause,
        for example, "Why does my layer 3 virtual private network (L3VPN) fail to connect?" or "Why is this specific service not highly responsive?"
           The reverse, i.e., which services are impacted when a network component fails or degrades, is also important for operators,
           for example, "Which services are impacted when this specific optic decibel milliwatt (dBm) begins to degrade?",
             "Which applications are impacted by an imbalance in this Equal-Cost Multipath (ECMP) bundle?", or "Is that issue actually impacting any other customers?"
           This task usually falls under the so-called "Service Impact Analysis" functional block.
      </t>
      <t>
           This document defines an architecture implementing Service Assurance for Intent-based Networking (SAIN).
           Intent-based approaches are often declarative, starting from a statement of "The service works as expected" and trying to enforce it.
           However, some already-defined services might have been designed using a different approach.
           Aligned with <xref target="RFC7149" format="default" sectionFormat="of" section="3.3"/> and instead of requiring a declarative intent as a starting point,
           this architecture focuses on already-defined services and tries to infer the meaning of "The service works as expected".
           To do so, the architecture works from an assurance graph, deduced from the configuration pushed to the device for enabling the service instance.
           If the SAIN orchestrator supports it, the service model (<xref target="RFC8309" format="default" sectionFormat="of" section="2"/>) or the network model (<xref target="RFC8969" format="default" sectionFormat="of" section="2.1"/>) can also be used to build the assurance graph.
           In that case and if the service model includes the declarative intent as well, the SAIN orchestrator can rely on the declared intent instead of inferring it.
           The assurance graph may also be explicitly completed to add an intent not exposed in the service model itself.
      </t>
      <t>
           The assurance graph of a service instance is decomposed into components, which are then assured independently.
           The top of the assurance graph represents the service instance to assure, and its children represent components identified as its direct dependencies; each component can have dependencies as well.
            Components involved in the assurance graph of a service are called subservices.
           The SAIN orchestrator updates the assurance graph automatically when the service instance is modified.
      </t>
      <t>
          When a service is degraded, the SAIN architecture will highlight where in the assurance service graph to look, as opposed to going hop by hop to troubleshoot the issue.
          More precisely, the SAIN architecture will associate to each service instance a list of symptoms originating from specific subservices, corresponding to components of the network.
          These components are good candidates for explaining the source of a service degradation.
          Not only can this architecture help to correlate service degradation with network root cause/symptoms, but it can deduce from the assurance graph the list of service instances impacted by a component degradation/failure.
          This added value informs the operational team where to focus its attention for maximum return.
          Indeed, the operational team is likely to focus its priority on the degrading/failing components impacting the highest number of its customers, especially the ones with Service-Level Agreement (SLA) contracts involving penalties in case of failure.
      </t>
      <t>
        This architecture provides the building blocks to assure both physical and virtual entities and is flexible with respect to services and subservices of (distributed) graphs and components (<xref target="flexible_architecture" format="default"/>).
      </t>
      <t>
            The architecture presented in this document is implemented by a set of YANG modules defined in a companion document <xref target="RFC9418" format="default"/>.
            These YANG modules properly define the interfaces between the various components of the architecture in order to foster interoperability.
      </t>
    </section>
    <section anchor="terminology" numbered="true" toc="default">
      <name>Terminology</name>
      <dl newline="false" spacing="normal">
        <dt>SAIN agent:</dt>
	<dd>A functional component that communicates with a device, a set of devices,
          or another agent to build an expression graph from a received assurance graph and
          perform the corresponding computation of the health status and symptoms. A SAIN agent might
          be running directly on the device it monitors.</dd>
          <dt>Assurance case:</dt>
	  <dd>"An assurance case is a structured argument, supported by evidence, intended to justify that a system is acceptably assured relative to a concern (such as safety or security) in the intended operating environment" <xref target="Piovesan2017" format="default"/>.</dd>
          <dt>Service instance:</dt>
	  <dd>A specific instance of a service.</dd>
          <dt>Intent:</dt>
          <dd>"A set of operational goals (that a network should meet) and outcomes (that a network is supposed to deliver) defined in a declarative manner without specifying how to achieve or implement them" <xref target="RFC9315" format="default"/>.</dd>
          <dt>Subservice:</dt>
	  <dd>A part or functionality of the network system that can be independently assured as a single entity in an assurance graph.</dd>
          <dt>Assurance graph:</dt>
	  <dd>A Directed Acyclic Graph (DAG) representing the assurance case for one or several service instances.
          The nodes (also known as vertices in the context of DAG) are the service instances themselves and the subservices; the edges indicate a dependency relation.</dd>
          <dt>SAIN collector:</dt>
	  <dd>A functional component that fetches or receives the computer-consumable output of the SAIN agent(s) and processes it locally (including displaying it in a user-friendly form).</dd>
          <dt>DAG:</dt>
	  <dd>Directed Acyclic Graph.</dd>
          <dt>ECMP:</dt>
	  <dd>Equal-Cost Multipath.</dd>
          <dt>Expression graph:</dt>
	  <dd><t>A generic term for a DAG representing a computation in SAIN. More specific terms are listed below:</t>
      <dl newline="true" spacing="normal">
        <dt>Subservice expressions:</dt>
	<dd>An expression graph representing all the computations to execute for a subservice.</dd>
        <dt>Service expressions:</dt>
	<dd>An expression graph representing all the computations to execute for a service instance, i.e., including the computations for all dependent subservices.</dd>
        <dt>Global computation graph:</dt>
	<dd>An expression graph representing all the computations to execute for all service instances (i.e., all computations performed).</dd>
      </dl>
	  </dd>
	  <dt>Dependency:</dt>
	  <dd>The directed relationship between subservice instances in the assurance graph.</dd>
          <dt>Metric:</dt>
	  <dd>A piece of information retrieved from the network running the assured service.</dd>
          <dt>Metric engine:</dt>
	  <dd>A functional component, part of the SAIN agent, that maps metrics to a list of candidate metric implementations, depending on the network element.</dd>
          <dt>Metric implementation:</dt>
	  <dd>The actual way of retrieving a metric from a network element.</dd>
          <dt>Network Service YANG Module:</dt>
	  <dd>The characteristics of a service, as agreed upon with consumers of that service <xref target="RFC8199" format="default"/>.</dd>
          <dt>Service orchestrator:</dt>
	  <dd>"Network Service YANG Modules describe the characteristics of a service, as agreed upon with consumers of that service. That is, a service module does not expose the detailed configuration parameters of all participating network elements and features but describes an abstract model that allows instances of the service to be decomposed into instance data according to the Network Element YANG Modules of the participating network elements. The service-to-element decomposition is a separate process; the details depend on how the network operator chooses to realize the service. For the purpose of this document, the term "orchestrator" is used to describe a system implementing such a process" <xref target="RFC8199" format="default"/>.</dd>
          <dt>SAIN orchestrator:</dt>
	  <dd>A functional component that is in charge of fetching the configuration specific to each service instance and converting it into an assurance graph.</dd>
          <dt>Health status:</dt>
	  <dd>The score and symptoms indicating whether a service instance or a subservice is "healthy". A non-maximal score must always be explained by one or more symptoms.</dd>
          <dt>Health score:</dt>
	  <dd>An integer ranging from 0 to 100 that indicates the health of a subservice.
          A score of 0 means that the subservice is broken, a score of 100 means that the subservice in question is operating as expected, and
          the special value -1 can be used to specify that no value could be computed for that health score, for instance, if some metric needed for that computation could not be collected.</dd>
          <dt>Strongly connected component:</dt>
	  <dd>A subset of a directed graph such that there
          is a (directed) path from any node of the subset to any other node. A
          DAG does not contain any strongly connected component.</dd>
          <dt>Symptom:</dt>
	  <dd>A reason explaining why a service instance or a subservice is not completely healthy.</dd>
      </dl>
    </section>
    <section anchor="architecture" numbered="true" toc="default">
      <name>A Functional Architecture</name>
      <t>
        The goal of SAIN is to assure that service instances are operating as expected (i.e., the observed service is matching the expected service) and, if not, to pinpoint what is wrong.
          More precisely, SAIN computes a score for each service instance and outputs symptoms explaining that score.
          The only valid situation where no symptoms are returned is when the score is maximal, indicating that no issues were detected for that service instance.
          The score augmented with the symptoms is called the health status. The exact meaning of the health score value is out of scope of this document. However, the following constraints should be followed: the higher the score, the better the service health is, and the two extrema are 0, meaning the service is completely broken, and 100, meaning the service is completely operational.
      </t>
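      <t>
        As a non-normative illustration, the constraints above can be captured in a few lines of Python. This is only a sketch: the HealthStatus class, its field names, and the MAX_SCORE and NO_VALUE constants are hypothetical and are not defined by this architecture. It also assumes that the special value -1 described in <xref target="terminology" format="default"/> counts as a non-maximal score and therefore needs at least one symptom.
      </t>
      <sourcecode type="python"><![CDATA[
from dataclasses import dataclass

MAX_SCORE = 100   # service or subservice is completely operational
NO_VALUE = -1     # no score could be computed (assumed non-maximal)


@dataclass(frozen=True)
class HealthStatus:
    """A health score together with the symptoms explaining it."""
    score: int
    symptoms: tuple = ()

    def __post_init__(self):
        # The score is either the special value -1 or lies in 0..100.
        in_range = 0 <= self.score <= MAX_SCORE
        if self.score != NO_VALUE and not in_range:
            raise ValueError("score must be -1 or within 0..100")
        # Only a maximal score may come without symptoms.
        if self.score != MAX_SCORE and not self.symptoms:
            raise ValueError("a non-maximal score needs a symptom")


healthy = HealthStatus(100)
degraded = HealthStatus(60, ("Interface has high error rate",))
]]></sourcecode>
      <t>
        In this sketch, constructing a status with a score of 60 and no symptom raises an error, which mirrors the rule that a non-maximal score must always be explained.
      </t>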
      <t>
        The SAIN architecture is a generic architecture, which generates an assurance graph from service instance(s), as specified in <xref target="inferring" format="default"/>.
          This architecture is applicable to not only multiple environments (e.g., wireline and wireless)
          but also different domains (e.g., 5G network function virtualization (NFV) domain with a virtual infrastructure manager (VIM), etc.)
          and, as already noted, for physical or virtual devices, as well as virtual functions.
          Thanks to the distributed graph design principle, graphs from different environments and orchestrators can be combined to obtain the graph of a service instance that spans over multiple domains.
      </t>
      <t>
        As an example of a service, let us consider a point-to-point layer 2 virtual private network (L2VPN).
        <xref target="RFC8466" format="default"/> specifies the parameters for such a service.
          Examples of symptoms might be symptoms reported by specific subservices, including "Interface has high error rate", "Interface flapping", or "Device almost out of memory", as well as symptoms more specific to the service (such as "Site disconnected from VPN").
      </t>
      <t>
          To compute the health status of an instance of such a service, the service definition is decomposed into an assurance graph formed by subservices linked through dependencies. Each subservice is then turned into an expression graph that details how to fetch metrics from the devices and compute the health status of the subservice. The subservice expressions are combined according to the dependencies between the subservices in order to obtain the expression graph that computes the health status of the service instance.
      </t>
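      <t>
        The combination step can be sketched as follows. This example is non-normative and rests on explicit assumptions: this document does not prescribe how scores are combined, so taking the minimum score over a node and its dependencies is only one possible choice, and the names GRAPH, LOCAL, and service_health are hypothetical. The symptom string is taken from the L2VPN example above.
      </t>
      <sourcecode type="python"><![CDATA[
# Each node maps to the subservices it directly depends on.
GRAPH = {
    "l2vpn": ["interface", "device"],
}

# Locally computed (score, symptoms) for each node, before combination.
LOCAL = {
    "l2vpn": (100, []),
    "interface": (60, ["Interface has high error rate"]),
    "device": (100, []),
}


def service_health(graph, local, node):
    """Combine the local status of a node with those of its dependencies.

    Assumption: the combined score is the minimum of all scores
    involved, and the symptoms are the union of all symptoms.
    """
    score, own_symptoms = local[node]
    symptoms = list(own_symptoms)
    for dep in graph.get(node, []):
        dep_score, dep_symptoms = service_health(graph, local, dep)
        score = min(score, dep_score)
        for symptom in dep_symptoms:
            if symptom not in symptoms:
                symptoms.append(symptom)
    return score, symptoms


status = service_health(GRAPH, LOCAL, "l2vpn")
]]></sourcecode>
      <t>
        With the sample data, the service instance inherits the score and the symptom of its degraded interface, so the symptom list points at the subservice to investigate. A real implementation would evaluate shared subservices once rather than recursing into them repeatedly.
      </t>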
      <t>
         The overall SAIN architecture is presented in <xref target="figure_1" format="default"/>.
          Based on the service configuration provided by the service orchestrator, the SAIN orchestrator decomposes the assurance graph.
          It then sends to the SAIN agents the assurance graph along with some other configuration options.
          The SAIN agents are responsible for building the expression graph and computing the health statuses in a distributed manner.
          The collector is in charge of collecting and displaying the current inferred health status of the service instances and subservices.
          The collector also detects changes in the assurance graph structures (e.g., an
          occurrence of a switchover from primary to backup path) and
          forwards the information to the orchestrator, which reconfigures the agents.
          Finally, the automation loop is closed by having the SAIN collector provide feedback to the network/service orchestrator.
      </t>
      <t>
    In order to make agents, orchestrators, and collectors from different vendors interoperable, their interface is defined as a YANG module in a companion document <xref target="RFC9418" format="default"/>.
          In <xref target="figure_1" format="default"/>, the communications that are normalized by this YANG module are tagged with a "Y".
          The use of this YANG module is further explained in <xref target="open_interfaces_with_YANG_modules" format="default"/>.
      </t>
      <figure anchor="figure_1">
        <name>SAIN Architecture</name>
        <artwork name="" type="" align="left" alt=""><![CDATA[
     +-----------------+
     | Service         |
     | Orchestrator    |<----------------------+
     |                 |                       |
     +-----------------+                       |
        |            ^                         |
        |            | Network                 |
        |            | Service                 | Feedback
        |            | Instance                | Loop
        |            | Configuration           |
        |            |                         |
        |            V                         |
        |        +-----------------+  Graph  +-------------------+
        |        | SAIN            | Updates | SAIN              |
        |        | Orchestrator    |<--------| Collector         |
        |        +-----------------+         +-------------------+
        |            |                          ^
        |           Y| Configuration            | Health Status
        |            | (Assurance Graph)       Y| (Score + Symptoms)
        |            V                          | Streamed
        |     +-------------------+             | via Telemetry
        |     |+-------------------+            |
        |     ||+-------------------+           |
        |     +|| SAIN              |-----------+
        |      +| Agent             |
        |       +-------------------+
        |               ^ ^ ^
        |               | | |
        |               | | |  Metric Collection
        V               V V V
    +-------------------------------------------------------------+
    |           (Network) System                                  |
    |                                                             |
    +-------------------------------------------------------------+
        ]]></artwork>
      </figure>
      <t>
        In order to produce the score assigned to a service instance, the various involved components perform the following tasks:
      </t>
      <ul spacing="normal">
        <li>
              Analyze the configuration pushed to the network device(s) for configuring the service instance.
              From there, determine which information (called a metric) must be collected from the device(s) and which operations to apply to the metrics to compute the health status.
        </li>
        <li>
            Stream (via telemetry, such as YANG-Push <xref target="RFC8641" format="default"/>) operational and config metric values when possible, else continuously poll.
        </li>
        <li>
            Continuously compute the health status of the service instances based on the metric values.
        </li>
      </ul>
      <t>
          The SAIN architecture requires time synchronization, with the Network Time Protocol (NTP) <xref target="RFC5905" format="default"/> as a candidate, between all elements: monitored entities, SAIN agents, the service orchestrator, the SAIN collector, and the SAIN orchestrator. This guarantees that all symptoms in the system are correlated with the right assurance graph version.
      </t>
      <section anchor="inferring" numbered="true" toc="default">
        <name>Translating a Service Instance Configuration into an Assurance Graph</name>
        <t>
          In order to structure the assurance of a service instance, the SAIN orchestrator decomposes the service instance into so-called subservice instances.
            Each subservice instance focuses on a specific feature or subpart of the service.
        </t>
        <t>
          The decomposition into subservices is an important function of the architecture for the following reasons:
        </t>
        <ul spacing="normal">
          <li>
              The result of this decomposition provides a relational picture of a service instance, which can be represented as a graph (called an assurance graph) to the operator.
          </li>
          <li>
              Subservices provide a scope for particular expertise and thereby enable contribution from external experts.
                For instance, the subservice dealing with the optic's health should be reviewed and extended by an expert in optical interfaces.
          </li>
          <li>
              Subservices that are common to several service instances are reused for reducing the amount of computation needed.
                For instance, the subservice assuring a given interface is reused by any service instance relying on that interface.
          </li>
        </ul>
        <t>
          The assurance graph of a service instance is a DAG representing the structure of the assurance case for the service instance. The nodes of this graph are service instances or subservice instances. Each edge of this graph indicates a dependency between the two nodes at its extremities, i.e., the service or subservice at the source of the edge depends on the service or subservice at the destination of the edge.
        </t>
        <t>
          <xref target="figure_2" format="default"/> depicts a simplistic example of the assurance graph for a tunnel service. The node at the top is the service instance; the nodes below are its dependencies. In the example, the tunnel service instance depends on the "peer1" and "peer2" tunnel interfaces (the tunnel interfaces created on the peer1 and peer2 devices, respectively), which in turn depend on the respective physical interfaces, which finally depend on the respective "peer1" and "peer2" devices. The tunnel service instance also depends on the IP connectivity that depends on the IS-IS routing protocol.
        </t>
        <figure anchor="figure_2">
          <name>Assurance Graph Example</name>
          <artwork name="" type="" align="left" alt=""><![CDATA[
                         +------------------+
                         | Tunnel           |
                         | Service Instance |
                         +------------------+
                                   |
              +--------------------+-------------------+
              |                    |                   |
              v                    v                   v
         +-------------+    +--------------+    +-------------+
         | Peer1       |    | IP           |    | Peer2       |
         | Tunnel      |    | Connectivity |    | Tunnel      |
         | Interface   |    |              |    | Interface   |
         +-------------+    +--------------+    +-------------+
                |                  |                  |
                |    +-------------+--------------+   |
                |    |             |              |   |
                v    v             v              v   v
         +-------------+    +-------------+     +-------------+
         | Peer1       |    | IS-IS       |     | Peer2       |
         | Physical    |    | Routing     |     | Physical    |
         | Interface   |    | Protocol    |     | Interface   |
         +-------------+    +-------------+     +-------------+
                |                                     |
                v                                     v
         +-------------+                        +-------------+
         |             |                        |             |
         | Peer1       |                        | Peer2       |
         | Device      |                        | Device      |
         +-------------+                        +-------------+
         ]]></artwork>
        </figure>
        <t>
          Depicting the assurance graph helps the operator to understand (and assert) the decomposition.
            The assurance graph shall be maintained during normal operation with addition, modification, and removal of service instances.
            A change in the network configuration or topology shall automatically be reflected in the assurance graph.
            As a first example, a change of the routing protocol from IS-IS to OSPF would change the assurance graph accordingly.
            As a second example, assume that ECMP is in place for the source router for that specific tunnel; in that case, multiple interfaces must now be monitored, in addition to monitoring the ECMP health itself.
        </t>
        <section anchor="circular_dependencies" numbered="true" toc="default">
          <name>Circular Dependencies</name>
          <t>
            The edges of the assurance graph represent dependencies. An
            assurance graph is a DAG if and only if there are no circular
            dependencies among the subservices, and every assurance
            graph should avoid circular dependencies. However, in some cases,
            circular dependencies might appear in the assurance graph.
          </t>
          <t>
            First, the assurance graph of a whole system is obtained by
            combining the assurance graph of every service running on that
            system. Here, combining means that two subservices having the
            same type and the same parameters are in fact the same
            subservice and thus a single node in the graph. For instance,
            the subservice of type "device" with the only parameter
            (the device ID) set to "PE1" will appear only once in the
            whole assurance graph, even if several service instances rely
            on that device. Now, if two engineers design assurance graphs for
            two different services, and Engineer A decides that an interface
            depends on the link it is connected to, but Engineer B decides that
            the link depends on the interface it is connected to, then when
            combining the two assurance graphs, we will have a circular
            dependency interface -&gt; link -&gt; interface.
          </t>
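          <t>
            The combination step described above can be sketched as follows. This is only an illustrative sketch, not part of the architecture: the helper names (subservice, merge, find_cycle) and the parameter names (device, name, id) are hypothetical. A subservice is identified by its type and parameters, so identical subservices collapse into a single node when graphs are merged, and a depth-first search reports the resulting circular dependency.
          </t>
          <sourcecode type="python"><![CDATA[
```python
def subservice(kind, **params):
    """Build a hashable identity from a subservice type and its parameters."""
    return (kind, frozenset(params.items()))


def merge(*graphs):
    """Union the edges of several graphs; equal identities become one node."""
    merged = {}
    for graph in graphs:
        for source, destinations in graph.items():
            merged.setdefault(source, set()).update(destinations)
            for destination in destinations:
                merged.setdefault(destination, set())
    return merged


def find_cycle(graph):
    """Return one circular dependency as a list of nodes, or None for a DAG."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GREY:
                # nxt is already on the current path: a cycle is closed.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        color[node] = BLACK
        path.pop()
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical parameter values, for illustration only.
interface = subservice("interface", device="PE1", name="eth0")
link = subservice("link", id="L1")
graph_a = {interface: {link}}  # Engineer A: the interface depends on the link
graph_b = {link: {interface}}  # Engineer B: the link depends on the interface

merged = merge(graph_a, graph_b)
cycle = find_cycle(merged)  # interface -> link -> interface
```
]]></sourcecode>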
          <t>
              Another case possibly resulting in circular dependencies is when subservices are not properly identified.
              Assume that we want to assure a cloud-based computing cluster that runs containers.
              We could represent the cluster by a subservice and the network service connecting containers on the cluster by another subservice.
              We would likely model that as the network service depending on the cluster, because the network service runs in a container supported by the cluster.
              Conversely, the cluster depends on the network service for connectivity between containers, which creates a circular dependency.
              A finer decomposition might distinguish between the resources for executing containers (a part of our cluster subservice) and the communication between the containers (which could be modeled in the same way as communication between routers).
          </t>
          <t>
            In any case, it is likely that circular dependencies will show up in
            the assurance graph. A first step would be to detect
            circular dependencies as soon as possible in the SAIN
            architecture. Such a detection could be carried out by
            the SAIN orchestrator. Whenever a circular dependency
            is detected, the newly added service would not be
            monitored until more careful modeling or alignment
            between the different teams (Engineers A and B) removes the circular
            dependency.
          </t>
          <t>
            As a more elaborate solution, we could consider a graph transformation:
          </t>
          <ul spacing="normal">
            <li>Decompose the graph into strongly connected components.</li>
            <li>
              <t>
               For each strongly connected component:
              </t>
              <ul spacing="normal">
                <li>remove all edges between nodes of the strongly connected component;</li>
                <li>add a new "synthetic" node for the strongly connected component;</li>
                <li>for each edge pointing to a node in the strongly connected component, change the destination to the "synthetic" node; and</li>
                <li>add a dependency from the "synthetic" node to every node in the strongly connected component.</li>
              </ul>
            </li>
          </ul>
          <t>
            Such an algorithm would include all symptoms detected by any
            subservice in one of the strongly connected components and make them
            available to any subservice that depends on it.
            <xref target="graph_transformation" format="default"/> shows an example
            of such a transformation. On the left-hand side, the nodes c, d, e,
            and f form a strongly connected component. The status of node a should
            depend on the status of nodes c, d, e, f, g, and h, but this is hard to
            compute because of the circular dependency. On the right-hand side,
            node a depends on all these nodes as well, but the circular
            dependency has been removed.
          </t>
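          <t>
            The transformation steps listed above can be sketched as follows, applied to the left-hand graph of the figure. This is a sketch under stated assumptions rather than a normative algorithm: Tarjan's algorithm is used here to find the strongly connected components (any such algorithm would do), node names are assumed to be strings, components made of a single node are left untouched, and the naming of the synthetic node is hypothetical.
          </t>
          <sourcecode type="python"><![CDATA[
```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; returns a list of sets of nodes."""
    index, low, on_stack, stack, result = {}, {}, set(), [], []

    def visit(node):
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for nxt in graph[node]:
            if nxt not in index:
                visit(nxt)
                low[node] = min(low[node], low[nxt])
            elif nxt in on_stack:
                low[node] = min(low[node], index[nxt])
        if low[node] == index[node]:
            # node is the root of a component: pop its members off the stack.
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            result.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return result


def remove_cycles(graph):
    """Replace each multi-node component by a synthetic node, as listed above."""
    new_graph = {node: set(deps) for node, deps in graph.items()}
    for component in strongly_connected_components(graph):
        if len(component) == 1:
            continue  # assumption: single-node components need no change
        synthetic = "synthetic(" + ",".join(sorted(component)) + ")"
        for node in graph:
            if node in component:
                # Remove the edges between nodes of the component.
                new_graph[node] -= component
            elif new_graph[node] & component:
                # Redirect edges entering the component to the synthetic node.
                new_graph[node] = (new_graph[node] - component) | {synthetic}
        # The synthetic node depends on every node of the component.
        new_graph[synthetic] = set(component)
    return new_graph


# Left-hand side of the figure: c, d, e, and f form a cycle.
before = {
    "a": {"c"}, "b": {"d"}, "c": {"d"}, "d": {"e"},
    "e": {"f", "h"}, "f": {"c", "g"}, "g": set(), "h": set(),
}
after = remove_cycles(before)
```
]]></sourcecode>
          <t>
            In this sketch, the resulting graph has nodes a and b depending on the synthetic node, which in turn depends on c, d, e, and f, while the edges from f to g and from e to h are preserved.
          </t>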
          <figure anchor="graph_transformation">
            <name>Graph Transformation</name>
            <artwork name="" type="" align="left" alt=""><![CDATA[
      +---+    +---+          |                +---+    +---+
      | a |    | b |          |                | a |    | b |
      +---+    +---+          |                +---+    +---+
        |        |            |                  |        |
        v        v            |                  v        v
      +---+    +---+          |                +------------+
      | c |--->| d |          |                |  synthetic |
      +---+    +---+          |                +------------+
        ^        |            |               /   |      |   \
        |        |            |              /    |      |    \
        |        v            |             v     v      v     v
      +---+    +---+          |          +---+  +---+  +---+  +---+
      | f |<---| e |          |          | f |  | c |  | d |  | e |
      +---+    +---+          |          +---+  +---+  +---+  +---+
        |        |            |            |                    |
        v        v            |            v                    v
      +---+    +---+          |          +---+                +---+
      | g |    | h |          |          | g |                | h |
      +---+    +---+          |          +---+                +---+

         Before                                     After
      Transformation                           Transformation
          ]]></artwork>
          </figure>
          <t>
            We consider a concrete example to illustrate this transformation.
            Let's assume that Engineer A is building an assurance graph dealing with IS-IS and Engineer B is building an assurance graph dealing with OSPF.
            The graph from Engineer A could contain the following:
          </t>
          <figure anchor="is-is_link">
            <name>Fragment of the Assurance Graph from Engineer A</name>
            <artwork name="" type="" align="left" alt=""><![CDATA[
                +------------+
                | IS-IS Link |
                +------------+
                      |
                      v
                +------------+
                | Phys. Link |
                +------------+
                  |       |
                  v       v
       +-------------+  +-------------+
       | Interface 1 |  | Interface 2 |
       +-------------+  +-------------+
          ]]></artwork>
          </figure>
          <t>
            The graph from Engineer B could contain the following:
          </t>
          <figure anchor="ospf_link">
            <name>Fragment of the Assurance Graph from Engineer B</name>
            <artwork name="" type="" align="left" alt=""><![CDATA[
                +------------+
                | OSPF Link  |
                +------------+
                  |   |   |
                  v   |   v
     +-------------+  |  +-------------+
     | Interface 1 |  |  | Interface 2 |
     +-------------+  |  +-------------+
                   |  |   |
                   v  v   v
                +------------+
                | Phys. Link |
                +------------+
           ]]></artwork>
          </figure>
          <t>
            The Interface subservices and the Physical Link subservice are common to both fragments above.
            Each of these subservices appears only once in the graph merging the two fragments.
            Dependencies from both fragments are included in the merged graph, resulting in a circular dependency:
          </t>
          <figure anchor="ospf_isis_circ_dep">
            <name>Merging Graphs from Engineers A and B</name>
            <artwork name="" type="" align="left" alt=""><![CDATA[
      +------------+      +------------+
      | IS-IS Link |      | OSPF Link  |---+
      +------------+      +------------+   |
            |               |     |        |
            |     +---------+     |        |
            v     v               |        |
      +------------+              |        |
      | Phys. Link |<-------+     |        |
      +------------+        |     |        |
        |  ^     |          |     |        |
        |  |     +-------+  |     |        |
        v  |             v  |     v        |
      +-------------+  +-------------+     |
      | Interface 1 |  | Interface 2 |     |
      +-------------+  +-------------+     |
            ^                              |
            |                              |
            +------------------------------+
          ]]></artwork>
          </figure>
          <t>
            The solution presented above would result in a graph looking as follows, where a new "synthetic" node is included.
            Using that transformation, all dependencies are indirectly satisfied for the nodes outside the circular dependency, in the sense that both IS-IS and OSPF links have indirect dependencies to the two interfaces and the link.
            However, the dependencies between the link and the
            interfaces are lost since they were causing the circular dependency.
          </t>
          <figure anchor="ospf_isis_no_circ_dep">
            <name>Removing Circular Dependencies after Merging Graphs from Engineers A and B</name>
            <artwork name="" type="" align="left" alt=""><![CDATA[
            +------------+      +------------+
            | IS-IS Link |      | OSPF Link  |
            +------------+      +------------+
                       |          |
                       v          v
                      +------------+
                      |  synthetic |
                      +------------+
                            |
                +-----------+-------------+
                |           |             |
                v           v             v
      +-------------+ +------------+ +-------------+
      | Interface 1 | | Phys. Link | | Interface 2 |
      +-------------+ +------------+ +-------------+
          ]]></artwork>
          </figure>
        </section>
      </section>
      <section anchor="intent" numbered="true" toc="default">
        <name>Intent and Assurance Graph</name>
        <t>
          The SAIN orchestrator analyzes the configuration of a service instance to do the following:
        </t>
        <ul spacing="normal">
          <li>
              Try to capture the intent of the service instance, i.e., what the service instance is trying to achieve.
                At a minimum, this requires the SAIN orchestrator to know the YANG modules that are being configured on the devices to enable the service.
                Note that, if the service model or the network model is known to the SAIN orchestrator, the latter can exploit it.
                In that case, the intent could be directly extracted and include more details, such as the notion of sites for a VPN, which is out of scope of the device configuration.
            </li>
          <li>
              Decompose the service instance into subservices representing the network features on which the service instance relies.
            </li>
        </ul>
<t>
   The SAIN orchestrator must be able to analyze the configuration pushed to
   various devices for a service instance and produce the
   assurance graph for that service instance.
        </t>
        <t>
   To schematize what a SAIN orchestrator does, assume that the configuration for
   a service instance touches two devices and
   configures a virtual tunnel interface on each device. Then:
        </t>
        <ul spacing="normal">
          <li>Capturing the intent would start by detecting that the service
     instance is actually a tunnel between the two devices and stating
     that this tunnel must be operational.
                This solution is minimally invasive, as it does not require modifying or knowing the service model.
                If the service model or network model is known by the SAIN orchestrator, it can be used to further capture the intent and include more information, such as Service-Level Objectives (e.g.,
                the latency and bandwidth requirements for the tunnel) if present in the service model.
            </li>
          <li>
              Decomposing the service instance into subservices would result in the assurance graph depicted in <xref target="figure_2" format="default"/>, for instance.
            </li>
        </ul>
        <t>
            The assurance graph, or more precisely the subservices and dependencies that a SAIN orchestrator can instantiate, should be curated.
              The organization of such a process (i.e., ensuring that existing subservices are reused as much as possible
  and avoiding circular dependencies) is out of scope for this
  document.
        </t>
        <t>
          To be applied, SAIN requires a mechanism mapping a service instance to the configuration actually required on the devices for that service instance to run.
            While <xref target="figure_1" format="default"/> makes a distinction between the SAIN orchestrator and a different component providing the service instance configuration, in practice those two components are most likely combined.
            The internals of the orchestrator are out of scope of this document.
        </t>
      </section>
      <section anchor="subservices" numbered="true" toc="default">
        <name>Subservices</name>
        <t>
          A subservice corresponds to a subpart or a feature of the network system that is needed for a service instance to function properly.
            In the context of SAIN, a subservice is associated with its assurance, which is the method for assuring that a subservice behaves correctly.
        </t>
        <t>
          Subservices, just as with services, have high-level parameters that specify the instance to be assured.
            The needed parameters depend on the subservice type.
            For example, assuring a device requires a specific deviceId as a parameter and
            assuring an interface requires a specific combination of deviceId and interfaceId.
        </t>
        <t>
          When designing a new type of subservice, one should carefully define what the assured object or functionality is.
   Then, the parameters
   must be chosen as a minimal set that completely identifies the object
   (see examples from the previous paragraph).
            Parameters cannot change during the life cycle of a subservice.
            For instance, an IP address is a good parameter when assuring connectivity towards that address (i.e., a given device can reach a given IP address); however, it's not a good parameter to identify an interface, as the IP address assigned to that interface can be changed.
        </t>
        <t>
          A subservice is also characterized by a list of metrics to fetch and a list of operations to apply to these metrics in order to infer a health status.
        </t>
      </section>
      <section anchor="building_the_expression_graph_from_the_assurance_graph" numbered="true" toc="default">
        <name>Building the Expression Graph from the Assurance Graph</name>
        <t>
          From the assurance graph, a so-called global computation graph is derived.
            First, each subservice instance is transformed into a set of subservice expressions that take metrics and constants as input (i.e., sources of the DAG) and produce the status of the subservice based on some heuristics.
            For instance, the health of an interface is 0 (minimal score) with the symptom "interface admin-down" if the interface is disabled in the configuration.
            Then, for each service instance, the service expressions are constructed by combining the subservice expressions of its dependencies.
            The way service expressions are combined depends on the dependency types (impacting or informational).
            Finally, the global computation graph is built by combining the service expressions to get a global view of all subservices.
            In other words, the global computation graph encodes all the operations needed to produce health statuses from the collected metrics.
        </t>
        <t>
          The two types of dependencies for combining subservices are:
        </t>
        <dl newline="true" spacing="normal">
          <dt>Informational Dependency:</dt>
          <dd>The type of dependency whose health score does not impact the health score of its parent subservice or service instance(s) in the assurance graph. However, the symptoms should be taken into account in the parent service instance or subservice instance(s) for informational reasons.</dd>
          <dt>Impacting Dependency:</dt>
          <dd>The type of dependency whose health score impacts the health score of its parent subservice or service instance(s) in the assurance graph.
              The symptoms are taken into account in the parent service instance or subservice instance(s) as the impacting reasons.</dd>
        </dl>
        <t>
          The set of dependency types presented here is not exhaustive.
          More specific dependency types can be defined by extending the YANG module.
          For instance, a connectivity subservice depending on several path subservices is only partially impacted if only one of these paths fails.
          Adding these new dependency types requires defining the corresponding operation for combining statuses of subservices.
        </t>
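        <t>
          As an illustration of combining statuses according to dependency types, consider the following sketch. It is not a definition taken from the architecture: the Status class, the combine function, the 0 to 100 health scale, and the use of the minimum as the operation for impacting dependencies are all assumptions made for the example. A new dependency type would need its own branch with its own combining operation.
        </t>
        <sourcecode type="python"><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class Status:
    health: int  # assumed scale: 0 (worst) to 100 (best)
    symptoms: list = field(default_factory=list)


def combine(own, dependencies):
    """Combine a node's own status with (dependency_type, Status) pairs."""
    health = own.health
    symptoms = list(own.symptoms)
    for dependency_type, status in dependencies:
        if dependency_type == "impacting":
            # Assumed operation: the parent is at most as healthy
            # as each of its impacting dependencies.
            health = min(health, status.health)
        elif dependency_type != "informational":
            raise ValueError(f"no combining operation for {dependency_type!r}")
        # Symptoms are carried upward for both dependency types.
        symptoms.extend(status.symptoms)
    return Status(health, symptoms)


interface = Status(0, ["interface admin-down"])
healthy = Status(100)

impacted = combine(healthy, [("impacting", interface)])
informed = combine(healthy, [("informational", interface)])
```
]]></sourcecode>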
        <t>
          Subservices shall not be dependent on the protocol used to retrieve the metrics.
            To justify this, let's consider the interface operational status.
            Depending on the device capabilities, this status can be collected by an industry-accepted YANG module (e.g., IETF or OpenConfig <xref target="OpenConfig" format="default"/>), by a vendor-specific YANG module, or even by a MIB module.
            If the subservice were dependent on the mechanism to collect the operational status, then we would need multiple subservice definitions in order to support all different mechanisms.
            This also implies that, while waiting for all the metrics to be available via standard YANG modules, SAIN agents might have to retrieve metric values via nonstandard YANG data models, MIB modules, the Command-Line Interface (CLI), etc., effectively implementing a normalization layer between data models and information models.
        </t>
        <t>
             In order to keep subservices independent of the metric collection method
   (or, expressed differently, to support multiple combinations of
   platforms, OSes, and even vendors), the architecture introduces the
   concept of "metric engine".
          The metric engine maps each device-independent metric used in the subservices to a list of device-specific metric implementations that precisely define how to fetch values for that metric.
          The mapping is parameterized by the characteristics (i.e., model, OS version, etc.) of the device from which the metrics are fetched.
          This metric engine is included in the SAIN agent.
        </t>
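        <t>
          A metric engine of this kind could be sketched as a lookup table, as shown below. Everything in this sketch is hypothetical and for illustration only: the metric name, the Implementation record, the device model "model-x", and the choice of locators are assumptions, and a real engine would also take the OS version and other characteristics into account.
        </t>
        <sourcecode type="python"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Implementation:
    protocol: str      # how the value is fetched, e.g., "netconf" or "snmp"
    locator: str       # where the value lives for that protocol
    models: frozenset  # device models covered (empty set means any model)


# Device-independent metric name -> ordered device-specific implementations.
METRIC_ENGINE = {
    "interface-oper-status": [
        Implementation(
            "netconf",
            "/interfaces/interface/state/oper-status",
            frozenset({"model-x"}),
        ),
        # Fallback for devices that only expose a MIB module.
        Implementation("snmp", "IF-MIB::ifOperStatus", frozenset()),
    ],
}


def resolve(metric, device_model):
    """Pick the first implementation of `metric` that fits the device model."""
    for implementation in METRIC_ENGINE[metric]:
        if not implementation.models or device_model in implementation.models:
            return implementation
    raise LookupError(f"no implementation of {metric} for {device_model}")


modern = resolve("interface-oper-status", "model-x")
legacy = resolve("interface-oper-status", "older-model")
```
]]></sourcecode>
        <t>
          The subservice only refers to the device-independent name; which protocol ends up being used is decided by the lookup.
        </t>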
      </section>
      <section anchor="open_interfaces_with_YANG_modules" numbered="true" toc="default">
        <name>Open Interfaces with YANG Modules</name>
        <t>
            The interfaces between the architecture components are open thanks to the YANG modules specified in <xref target="RFC9418" format="default"/>;
            they specify objects for assuring network services based on their decomposition into so-called subservices, according to the SAIN architecture.
        </t>
        <t>
          These modules are intended for the following use cases:
        </t>
        <ul spacing="normal">
          <li>
            <t>
              Assurance graph configuration:
            </t>
            <ul spacing="normal">
              <li>
                  Subservices: Configure a set of subservices to assure by specifying their types and parameters.
                </li>
              <li>
                  Dependencies: Configure the dependencies between the subservices, along with their types.
                </li>
            </ul>
          </li>
          <li>
              Assurance telemetry: Export the health status of the subservices, along with the observed symptoms.
            </li>
        </ul>
        <t>
          Some examples of YANG instances can be found in <xref target="RFC9418" format="default" sectionFormat="of" section="A"/>.
        </t>
      </section>
      <section anchor="maintenance" numbered="true" toc="default">
        <name>Handling Maintenance Windows</name>
        <t>
              Whenever network components are under maintenance, the operator wants to inhibit the emission of symptoms from those components.
              A typical use case is device maintenance, during which the device is not supposed to be operational.
              As such, symptoms related to the device health should be ignored.
              Symptoms related to the device-specific subservices, such as the interfaces, might also be ignored because their state changes are probably the consequence of the maintenance.
        </t>
        <t>
              The ietf-service-assurance model described in <xref target="RFC9418" format="default"/> enables flagging subservices as under maintenance and, in that case, requires a string that identifies the person or process that requested the maintenance.
              When a service or subservice is flagged as under maintenance, it must report a generic "Under Maintenance" symptom for propagation towards subservices that depend on this specific subservice.
              Any other symptom from this service or from one of its impacting dependencies must not be reported.
        </t>
        <t>
             We illustrate this mechanism on three independent examples based on the assurance graph depicted in <xref target="figure_2" format="default"/>:
        </t>
        <ul spacing="normal">
          <li>  Device maintenance, for instance, upgrading the device OS. The operator
                 flags the subservice "Peer1" device as under maintenance.
                 This inhibits the emission of symptoms, except "Under Maintenance", from "Peer1
                 Physical Interface", "Peer1 Tunnel Interface", and "Tunnel Service
                 Instance". All other subservices are unaffected.
               </li>
          <li>
                 Interface maintenance, for instance, replacing a broken optic.
                 The operator flags the subservice "Peer1 Physical Interface" as under maintenance.
                 This inhibits the emission of symptoms, except "Under Maintenance",
                 from "Peer1 Tunnel Interface" and "Tunnel Service Instance". All
               other subservices are unaffected.
               </li>
          <li>
                 Routing protocol maintenance, for instance, modifying parameters or
               redistribution. The operator marks the subservice "IS-IS Routing Protocol" as under maintenance.
               This inhibits the emission of symptoms, except "Under Maintenance", from "IP Connectivity" and "Tunnel Service Instance".
               All other subservices are unaffected.
               </li>
        </ul>
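        <t>
          The three examples above can be reproduced with a small sketch that walks the dependencies upward from the flagged subservice. This is an illustration under stated assumptions: the DEPENDS_ON table and the dependents function are hypothetical, and the table encodes only the dependencies spelled out in the prose description of the tunnel example, all assumed to be impacting.
        </t>
        <sourcecode type="python"><![CDATA[
```python
# Each subservice maps to the subservices it depends on (assumed impacting).
DEPENDS_ON = {
    "Tunnel Service Instance": [
        "Peer1 Tunnel Interface",
        "IP Connectivity",
        "Peer2 Tunnel Interface",
    ],
    "Peer1 Tunnel Interface": ["Peer1 Physical Interface"],
    "Peer2 Tunnel Interface": ["Peer2 Physical Interface"],
    "Peer1 Physical Interface": ["Peer1 Device"],
    "Peer2 Physical Interface": ["Peer2 Device"],
    "IP Connectivity": ["IS-IS Routing Protocol"],
    "IS-IS Routing Protocol": [],
    "Peer1 Device": [],
    "Peer2 Device": [],
}


def dependents(graph, flagged):
    """Subservices that directly or indirectly depend on `flagged`.

    These are the nodes that would report only "Under Maintenance".
    """
    affected = set()
    changed = True
    while changed:
        changed = False
        for node, dependencies in graph.items():
            if node in affected:
                continue
            if any(d == flagged or d in affected for d in dependencies):
                affected.add(node)
                changed = True
    return affected


device_case = dependents(DEPENDS_ON, "Peer1 Device")
interface_case = dependents(DEPENDS_ON, "Peer1 Physical Interface")
routing_case = dependents(DEPENDS_ON, "IS-IS Routing Protocol")
```
]]></sourcecode>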
        <t>
              In each example above, the subservice under maintenance completely impacts the service instance, putting it under maintenance as well.
              There are use cases where the subservice under maintenance only partially impacts the service instance.
              For instance, consider a service instance supported by both a primary and backup path.
              If a subservice impacting the primary path is under maintenance, the service instance might still be functional but degraded.
              In that case, the status of the service instance might include "Primary path Under Maintenance", "No redundancy", as well as other symptoms from the backup path to explain the lower health score.
              In general, the computation of the service instance status from the subservices is done in the SAIN collector, whose implementation is out of scope for this document.
        </t>
        <t>
              The maintenance of a subservice might modify or hide modifications of the structure of the assurance graph.
              Therefore, unflagging a subservice as under maintenance should trigger an update of the assurance graph.
        </t>
      </section>
      <section anchor="flexible_architecture" numbered="true" toc="default">
        <name>Flexible Functional Architecture</name>
        <t>
          The SAIN architecture is flexible in terms of components. While the
          SAIN architecture in <xref target="figure_1" format="default"/> makes a distinction between two components,
            the service orchestrator and the SAIN orchestrator, in practice, the two components are most likely combined.
          Similarly, the SAIN agents are displayed in <xref target="figure_1" format="default"/> as being separate components. In practice, the SAIN agents could be either independent
          components or directly integrated in monitored entities.
          A practical example is an agent in a router.
        </t>
        <t>
            The SAIN architecture is also flexible in terms of services and subservices.
            In the defined architecture, the SAIN orchestrator is coupled to a service orchestrator, which defines the kinds of services that the architecture handles.
            Most examples in this document deal with the notion of Network Service YANG Modules with well-known services, such as L2VPN or tunnels.
            However, the concept of services is general enough to cross into different domains.
            One of them is the domain of service management on network elements, which also require their own assurance.
            Examples include a DHCP server on a Linux server, a data plane, an IPFIX export, etc.
            The notion of "service" is generic in this architecture and depends on the service orchestrator and underlying network system, as illustrated by the following examples:
        </t>
        <ul spacing="normal">
          <li>If a main service orchestrator coordinates several lower-level controllers, a service for the controller can be a subservice from the point of view of the orchestrator.</li>
          <li>A DHCP server / data plane / IPFIX export can be considered as subservices for a device.</li>
          <li>A routing instance can be considered as a subservice for an L3VPN.</li>
          <li>A tunnel can be considered as a subservice for an application in the cloud.</li>
          <li>A service function can be considered as a subservice for a service function chain <xref target="RFC7665" format="default"/>.</li>
        </ul>
        <t>
            The assurance graph is created to be flexible and open, regardless of the subservice types, locations, or domains.
        </t>
        <t>
          The SAIN architecture is also flexible in terms of distributed graphs.
          As shown in <xref target="figure_1" format="default"/>, the architecture comprises several agents.
          Each agent is responsible for handling a subgraph of the assurance graph.
          The collector is responsible for fetching the subgraphs from the different
          agents and gluing them together.  As an example, in the graph from <xref target="figure_2" format="default"/>, the subservices relative to Peer 1 might be handled by a
          different agent than the subservices relative to Peer 2, and the Connectivity
          and IS-IS subservices might be handled by yet another agent.  The agents will
          export their partial graphs, and the collector will stitch them together as
          dependencies of the service instance.
        </t>
        <t>
          Finally, the SAIN architecture is flexible in terms of what it monitors.
          Most, if not all, examples in this document refer to physical components, but
          this is not a constraint. Indeed, the assurance of virtual components would
          follow the same principles, and an assurance graph composed of virtualized
          components (or a mix of virtualized and physical ones) is supported by
          this architecture.
        </t>
      </section>
      <section anchor="garbage_collection" numbered="true" toc="default">
        <name>Time Window for Symptoms' History</name>
        <t>
              The health status reported via the YANG modules contains, for each subservice, the list of symptoms.
              Symptoms have a start and end date, making it possible to report symptoms that are no longer occurring.
        </t>
        <t>
          The SAIN agent might have to remove some symptoms for specific subservices because
          they are outdated and no longer relevant or simply because the SAIN agent needs to
          free up some space. Regardless of the reason, it's important for a SAIN collector
          connecting/reconnecting to a SAIN agent to understand the effect of this garbage collection.
        </t>
        <t>
            Therefore, the SAIN agent contains a YANG object specifying the date and time at which
            the symptoms' history starts for the subservice instances.
            The subservice reports only symptoms that are occurring or that have been occurring after the history start date.
        </t>
      </section>
      <section anchor="new_assurance_graph_generation" numbered="true" toc="default">
        <name>New Assurance Graph Generation</name>
        <t>
          The assurance graph will change over time, because services and subservices come and go (changing the dependencies between subservices) or as a result of resolving maintenance issues. Therefore, an assurance graph version must be maintained, along with the date and time of its last generation. The date and time of the last change to a particular subservice instance (again, its dependencies or maintenance status) might also be kept. From a client point of view, an assurance graph change is signaled by the value of the assurance-graph-version and assurance-graph-last-change YANG leaves. At that point in time, the client (collector) proceeds as follows:
        </t>
        <ul spacing="normal">
          <li>
              Keep the previous assurance-graph-last-change value (let's call it time T).
          </li>
          <li>
              Run through all the subservice instances and process the subservice instances for which the last-change is newer than the time T.
          </li>
          <li>
              Keep the new assurance-graph-last-change as the new referenced date and time.
          </li>
        </ul>
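        <t>
          The client process above can be sketched as follows.
          This is a minimal illustration under stated assumptions: the class name GraphChangeTracker, its method, and the dictionary mapping subservice instance identifiers to their last-change timestamps are hypothetical and stand in for values that a collector would read from the YANG leaves.
        </t>
        <sourcecode type="python">
```python
from datetime import datetime, timezone


class GraphChangeTracker:
    """Hypothetical collector-side helper for assurance graph changes."""

    def __init__(self, last_change):
        # Reference date and time kept from the previous run (time T).
        self.reference = last_change

    def process(self, graph_last_change, subservices):
        """Return the ids of subservice instances changed after time T.

        `graph_last_change` is the new assurance-graph-last-change value.
        `subservices` maps an instance id to its last-change timestamp.
        """
        # Step 1: keep the previous assurance-graph-last-change value (T).
        previous = self.reference
        # Step 2: select the instances whose last-change is newer than T.
        changed = sorted(
            sid for sid, ts in subservices.items() if ts > previous
        )
        # Step 3: the new value becomes the referenced date and time.
        self.reference = graph_last_change
        return changed


if __name__ == "__main__":
    t0 = datetime(2023, 1, 1, tzinfo=timezone.utc)
    t1 = datetime(2023, 1, 2, tzinfo=timezone.utc)
    tracker = GraphChangeTracker(t0)
    print(tracker.process(t1, {"peer1-device": t0, "tunnel": t1}))
```
        </sourcecode>
        <t>
          Only instances changed strictly after T are reprocessed, so a second call with the same timestamps returns an empty list.
        </t>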
      </section>
    </section>
    <section anchor="iana" numbered="true" toc="default">
      <name>IANA Considerations</name>
      <t>This document has no IANA actions.
      </t>
    </section>
    <section anchor="security" numbered="true" toc="default">
      <name>Security Considerations</name>
      <t>The SAIN architecture helps operators to reduce the mean time to detect and the mean time to repair.
         However, the SAIN agents must be secured; a compromised SAIN agent may be sending incorrect root causes or symptoms to the management systems.
         Securing the agents comes down to ensuring the integrity and confidentiality of the assurance graph.
          This can be partially achieved by correctly setting permissions of each node in the YANG data model, as described in <xref target="RFC9418" format="default" sectionFormat="of" section="6"/>.
      </t>
      <t>
         Except for the configuration of telemetry, the agents do not need "write access" to the devices they monitor.
          This configuration is applied with a YANG module, whose protection is covered by Secure Shell (SSH) <xref target="RFC6242" format="default"/> for the Network Configuration Protocol (NETCONF) or TLS <xref target="RFC8446" format="default"/> for RESTCONF.
          Devices should be configured so that agents have their own credentials with write access only for the YANG nodes configuring the telemetry.
      </t>
      <t>
         The data collected by SAIN could potentially be compromising to the network or provide more insight into how the network is designed.
          Considering the data that SAIN requires (including CLI access in some cases), one should weigh data access concerns against the impact that reduced visibility will have on being able to rapidly identify root causes.
      </t>
      <t>
          For building the assurance graph, the SAIN orchestrator needs to obtain the configuration from the service orchestrator.
          The latter should restrict access of the SAIN orchestrator to information needed to build the assurance graph.
      </t>
      <t>
        If a closed loop system relies on this architecture, then the well-known issue of those systems also applies, i.e., a lying device or compromised agent could trigger partial reconfiguration of the service or network.
          The SAIN architecture neither augments nor reduces this risk.
          An extension of SAIN, which is out of scope for this document, could detect discrepancies between symptoms reported by different agents, and thus detect anomalies if an agent or a device is lying.
      </t>
      <t>
         If the NTP service goes down, the devices' clocks might lose their synchronization.
          In that case, correlating information from different devices, such as detecting symptoms about a link or correlating symptoms from different devices, will give inaccurate results.
      </t>
    </section>

  </middle>
  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>

<!-- [I-D.ietf-opsawg-service-assurance-yang] RFC 9418 -->

<reference anchor='RFC9418' target='https://www.rfc-editor.org/info/rfc9418'>
<front>
<title>YANG Modules for Service Assurance</title>
<author initials="B." surname="Claise" fullname="Benoit Claise">
</author>
<author initials="J." surname="Quilbeuf" fullname="Jean Quilbeuf">
</author>
<author initials="P." surname="Lucente" fullname="Paolo Lucente">
</author>
<author initials="P." surname="Fasano" fullname="Paolo Fasano">
</author>
<author initials="T." surname="Arumugam" fullname="Thangam Arumugam">
</author>
<date month="June" year="2023"/>
</front>
<seriesInfo name="RFC" value="9418"/>
<seriesInfo name="DOI" value="10.17487/RFC9418"/>
</reference>

        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8309.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8969.xml"/>
      </references>
      <references>
        <name>Informative References</name>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2865.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5424.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5905.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.6242.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7011.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7149.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7665.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7950.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8199.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8446.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8466.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8641.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8907.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9315.xml"/>

<!-- [I-D.ietf-opsawg-yang-vpn-service-pm] RFC 9375 -->
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9375.xml"/>

        <!--   <?rfc include="reference.I-D.irtf-nmrg-ibn-intent-classification"?> -->

      <reference anchor="Piovesan2017" target="https://doi.org/10.1016/B978-0-12-803773-7.00007-3">
          <front>
            <title>7 - Reasoning About Safety and Security: The Logic of Assurance</title>
            <author initials="A." surname="Piovesan" fullname="A. Piovesan">
              <organization/>
            </author>
            <author initials="E." surname="Griffor" fullname="E. Griffor">
              <organization/>
            </author>
            <date year="2017"/>
          </front>
	  <seriesInfo name="DOI" value="10.1016/B978-0-12-803773-7.00007-3"/>
        </reference>

        <reference anchor="OpenConfig" target="https://openconfig.net">
          <front>
            <title>OpenConfig</title>
            <author/>
                <date/>
          </front>
        </reference>
      </references>
    </references>
    <section numbered="false" toc="default">
      <name>Acknowledgements</name>
      <t>
          The authors would like to thank <contact fullname="Stephane Litkowski"/>, <contact fullname="Charles Eckel"/>, <contact fullname="Rob Wilton"/>, <contact fullname="Vladimir Vassiliev"/>, <contact fullname="Gustavo Alburquerque"/>, <contact fullname="Stefan Vallin"/>, <contact fullname="Éric Vyncke"/>, <contact fullname="Mohamed Boucadair"/>, <contact fullname="Dhruv Dhody"/>, and <contact fullname="Michael Richardson"/> for their reviews and feedback.
      </t>
    </section>
    <section numbered="false" toc="default">
      <name>Contributors</name>
      <ul spacing="normal">
        <li><t><contact fullname="Youssef El Fathi"/></t></li>
        <li><t><contact fullname="Éric Vyncke"/></t></li>
      </ul>
    </section>

  </back>
</rfc>
<!-- Local Variables: -->
<!-- fill-column:72 -->
<!-- End: -->