rfc9417xml2.original.xml   rfc9417.xml 
<?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [ <!DOCTYPE rfc [
<!ENTITY nbsp "&#160;">
<!ENTITY zwsp "&#8203;">
<!ENTITY nbhy "&#8209;">
<!ENTITY wj "&#8288;">
]> ]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<?rfc toc="yes"?> <rfc xmlns:xi="http://www.w3.org/2001/XInclude" submissionType="IETF" category="
<?rfc tocompact="yes"?> info"
<?rfc tocdepth="4"?> consensus="true" docName="draft-ietf-opsawg-service-assurance-architecture-13" n
<?rfc tocindent="yes"?> umber="9417" ipr="trust200902" obsoletes="" updates="" xml:lang="en" tocInclude=
<?rfc symrefs="yes"?> "true"
<?rfc sortrefs="yes"?> tocDepth="4" symRefs="true" sortRefs="true" version="3">
<?rfc comments="yes"?>
<?rfc inline="yes"?> <!-- xml2rfc v2v3 conversion 3.16.0 -->
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" ipr="trust200902" docName="draft-ietf-opsawg-service-assura
nce-architecture-13">
<front> <front>
<title abbrev="SAIN Architecture">Service Assurance for Intent-based Network <title abbrev="SAIN Architecture">Service Assurance for Intent-Based Network
ing Architecture</title> ing Architecture</title>
<seriesInfo name="RFC" value="9417"/>
<author fullname="Benoit Claise" initials="B" surname="Claise"> <author fullname="Benoit Claise" initials="B" surname="Claise">
<organization>Huawei</organization> <organization>Huawei</organization>
<address> <address>
<email>benoit.claise@huawei.com</email> <email>benoit.claise@huawei.com</email>
</address> </address>
</author> </author>
<author fullname="Jean Quilbeuf" initials="J" surname="Quilbeuf "> <author fullname="Jean Quilbeuf" initials="J" surname="Quilbeuf ">
<organization>Huawei</organization> <organization>Huawei</organization>
<address> <address>
<email>jean.quilbeuf@huawei.com</email> <email>jean.quilbeuf@huawei.com</email>
</address> </address>
</author> </author>
<author fullname="Diego R. Lopez" initials="D" surname="Lopez "> <author fullname="Diego R. Lopez" initials="D" surname="Lopez ">
<organization>Telefonica I+D</organization> <organization>Telefonica I+D</organization>
<address> <address>
<postal> <postal>
<street>Don Ramon de la Cruz, 82</street> <street>Don Ramon de la Cruz, 82</street>
<city>Madrid 28006</city> <city>Madrid</city>
<code>28006</code>
<country>Spain</country> <country>Spain</country>
</postal> </postal>
<email>diego.r.lopez@telefonica.com</email> <email>diego.r.lopez@telefonica.com</email>
</address> </address>
</author> </author>
<author fullname="Dan Voyer" initials="D" surname="Voyer "> <author fullname="Dan Voyer" initials="D" surname="Voyer ">
<organization>Bell Canada</organization> <organization>Bell Canada</organization>
<address> <address>
<postal> <postal>
<street/> <street/>
<city/> <city/>
<country>Canada</country> <country>Canada</country>
</postal> </postal>
<email>daniel.voyer@bell.ca</email> <email>daniel.voyer@bell.ca</email>
</address> </address>
</author> </author>
<author fullname="Thangam Arumugam" initials="T" surname="Arumugam"> <author fullname="Thangam Arumugam" initials="T" surname="Arumugam">
<organization>Cisco Systems, Inc.</organization> <organization>Consultant</organization>
<address> <address>
<postal> <postal>
<street/> <street/>
<city>Milpitas (California)</city> <city>Milpitas</city>
<region>California</region>
<country>United States of America</country> <country>United States of America</country>
</postal> </postal>
<email>tarumuga@cisco.com</email> <email>thangavelu@yahoo.com</email>
</address> </address>
</author> </author>
<date/> <date year="2023" month="June"/>
<area>OPS</area> <area>ops</area>
<workgroup>OPSAWG</workgroup> <workgroup>opsawg</workgroup>
<abstract> <abstract>
<t> <t>
This document describes an architecture that aims at assuring that servi This document describes an architecture that provides some assurance tha
ce instances are running as expected. t service instances are running as expected.
As services rely upon multiple sub-services provided by a variety of ele As services rely upon multiple subservices provided by a variety of elem
ments including the underlying network devices and functions, ents, including the underlying network devices and functions,
getting the assurance of a healthy service is only possible with a holistic view of all involved elements. getting the assurance of a healthy service is only possible with a holistic view of all involved elements.
This architecture not only helps to correlate the service degradation with symptoms of a specific network component but also to list the services impacted by the failure or degradation of a specific network component. This architecture not only helps to correlate the service degradation with symptoms of a specific network component but, it also lists the services impacted by the failure or degradation of a specific network component.
</t> </t>
</abstract> </abstract>
</front> </front>
<middle> <middle>
<section title="Terminology" anchor="terminology"> <section anchor="intro" numbered="true" toc="default">
<t> <name>Introduction</name>
SAIN agent: A functional component that communicates with a device, a
set of devices,
or another agent to build an expression graph from a received assuranc
e graph and
perform the corresponding computation of the health status and symptom
s. A SAIN agent might
be running directly on the device it monitors.
</t>
<t>
Assurance case: "An assurance case is a structured argument, supported
by evidence, intended to justify that a system is acceptably assured relative t
o a concern (such as safety or security) in the intended operating environment"
<xref target="Piovesan2017"/>.
</t>
<t>
Service instance: A specific instance of a service.
</t>
<t>
Intent:
"A set of operational goals (that a network should meet) and outcomes
(that a network is supposed to deliver), defined in a declarative manner without
specifying how to achieve or implement them" <xref target="RFC9315"/>.
</t>
<t>
Subservice: Part or functionality of the network system that can be in
dependently assured as a single entity in assurance graph.
</t>
<t>
Assurance graph: A Directed Acyclic Graph (DAG) representing the assur
ance case for one or several service instances.
The nodes (also known as vertices in the context of DAG) are the servi
ce instances themselves and the subservices, the edges indicate a dependency rel
ation.
</t>
<t>
SAIN collector: A functional component that fetches or receives the co
mputer-consumable output of the SAIN agent(s) and process it locally (including
displaying it in a user-friendly form).
</t>
<t>
DAG: Directed Acyclic Graph.
</t>
<t>
ECMP: Equal Cost Multiple Paths
</t>
<t>
Expression graph: A generic term for a DAG representing a computation
in SAIN. More specific terms are:
<list style="symbols">
<t>Subservice expressions: Is an expression graph representing all
the computations to execute for a subservice.</t>
<t>Service expressions: Is an expression graph representing all th
e computations to execute for a service instance, i.e., including the computatio
ns for all dependent subservices.</t>
<t>Global computation graph: Is an expression graph representing a
ll the computations to execute for all services instances (i.e., all computatio
ns performed).</t>
</list>
</t>
<t>
Dependency: The directed relationship between subservice instances in
the assurance graph.
</t>
<t>
Metric: A piece of information retrieved from the network running the
assured service.
</t>
<t>
Metric engine: A functional component, part of the SAIN agent, that ma
ps metrics to a list of candidate metric implementations depending on the networ
k element.
</t>
<t>
Metric implementation: Actual way of retrieving a metric from a networ
k element.
</t>
<t>
Network service YANG module: describes the characteristics of a servic
e as agreed upon with consumers of that service <xref target="RFC8199"/>.
</t>
<t>
Service orchestrator: Quoting RFC8199, "Network Service YANG Modules d
escribe the characteristics of a service, as agreed upon with consumers of that
service. That is, a service module does not expose the detailed configuration pa
rameters of all participating network elements and features but describes an abs
tract model that allows instances of the service to be decomposed into instance
data according to the Network Element YANG Modules of the participating network
elements. The service-to-element decomposition is a separate process; the detail
s depend on how the network operator chooses to realize the service. For the pur
pose of this document, the term "orchestrator" is used to describe a system impl
ementing such a process."
</t>
<t> <t>
SAIN orchestrator: A functional component that is in charge of fetching the configuration specific to each service instance and converting it into an assurance graph. Network Service YANG Modules <xref target="RFC8199" format="default"/> describe the configuration, state data, operations, and notifications of abstract representations of services implemented on one or multiple network elements.
</t> </t>
<t> <t>
Health status: Score and symptoms indicating whether a service instanc Service orchestrators use Network Service YANG Modules that will infer n
e or a subservice is "healthy". A non-maximal score must always be explained by etwork-wide configuration and, therefore, the invocation of the appropriate devi
one or more symptoms. ce modules (<xref target="RFC8969" format="default" sectionFormat="of" section="
3"/>).
Knowing that a configuration is applied doesn't imply that the provis
ioned service instance is up and running as expected.
For instance, the service might be degraded because of a failure in t
he network, the service quality may be degraded, or a service function may be re
achable at the IP level but does not provide its intended function.
Thus, the network operator must monitor the service's operational dat
a at the same time as the configuration (<xref target="RFC8969" format="default"
sectionFormat="of" section="3.3"/>).
To fuel that task, the industry has been standardizing on telemetry t
o push network element performance information (e.g., <xref target="RFC9375" for
mat="default"/>).
</t> </t>
<t> <t>
Health score: Integer ranging from 0 to 100 indicating the health of a A network administrator needs to monitor its network and services as a w
subservice. hole, independently of the management protocols.
A score of 0 means that the subservice is broken, a score of 100 means With different protocols come different data models and different way
that the subservice in question is operating as expected. s to model the same type of information.
The special value -1 can be used to specify that no value could be com When network administrators deal with multiple management protocols,
puted for that health-score, for instance if some metric needed for that computa the network management entities have to perform the difficult and time-consuming
tion could not be collected. job of mapping data models,
e.g., the model used for configuration with the model used for monito
ring when separate models or protocols are used.
This problem is compounded by a large, disparate set of data sources
(e.g., MIB modules, YANG data models <xref target="RFC7950" format="default"/>,
IP Flow Information Export (IPFIX) information elements <xref target="RFC7011" f
ormat="default"/>, syslog plain text <xref target="RFC5424" format="default"/>,
Terminal Access Controller Access-Control System Plus (TACACS+) <xref target="RF
C8907" format="default"/>, RADIUS <xref target="RFC2865" format="default"/>, etc
.).
In order to avoid this data model mapping, the industry converged on
model-driven telemetry to stream the service operational data, reusing the YANG
data models used for configuration.
Model-driven telemetry greatly facilitates the notion of closed-loop
automation, whereby events and updated operational states streamed from the netw
ork drive remediation change back into the network.
</t> </t>
<t> <t>
Strongly connected component: subset of a directed graph such that the However, it proves difficult for network operators to correlate the serv
re ice degradation with the network root cause,
is a (directed) path from any node of the subset to any other node. A for example, "Why does my layer 3 virtual private network (L3VPN) fail t
DAG does not contain any strongly connected component. o connect?" or "Why is this specific service not highly responsive?"
The reverse, i.e., which services are impacted when a network compone
nt fails or degrades, is also important for operators,
for example, "Which services are impacted when this specific optic de
cibel milliwatt (dBm) begins to degrade?",
"Which applications are impacted by an imbalance in this Equal-Cost
Multipath (ECMP) bundle?", or "Is that issue actually impacting any other custo
mers?"
This task usually falls under the so-called "Service Impact Analysis"
functional block.
</t> </t>
<t> <t>
Symptom: Reason explaining why a service instance or a subservice is n This document defines an architecture implementing Service Assurance
ot completely healthy. for Intent-based Networking (SAIN).
</t> Intent-based approaches are often declarative, starting from a statem
</section> ent of "The service works as expected" and trying to enforce it.
However, some already-defined services might have been designed using
<section anchor="intro" title="Introduction"> a different approach.
Aligned with <xref target="RFC7149" format="default" sectionFormat="o
<t> f" section="3.3"/>, and instead of requiring a declarative intent as a starting
Network service YANG modules <xref target="RFC8199"/> describe the conf point,
iguration, state data, operations, and notifications of abstract representations this architecture focuses on already-defined services and tries to in
of services implemented on one or multiple network elements. fer the meaning of "The service works as expected".
</t>
<t>
Service orchestrators use Network service YANG modules that will infer n
etwork-wide configuration and, therefore the invocation of the appropriate devic
e modules (Section 3 of <xref target="RFC8969"/>).
Knowing that a configuration is applied doesn't imply that the provis
ioned service instance is up and running as expected.
For instance, the service might be degraded because of a failure in t
he network, the service quality may be degraded, or a service function may be re
achable at the IP level but does not provide its intended function.
Thus, the network operator must monitor the service’s operational dat
a at the same time as the configuration (Section 3.3 of <xref target="RFC8969"/>
).
To feed that task, the industry has been standardizing on telemetry t
o push network element performance information (e.g., <xref target="I-D.ietf-ops
awg-yang-vpn-service-pm"/>).
</t>
<t>
A network administrator needs to monitor their network and services as a
whole, independently of the management protocols.
With different protocols come different data models, and different wa
ys to model the same type of information.
When network administrators deal with multiple management protocols,
the network management entities have to perform the difficult and time-consuming
job of mapping data models:
e.g., the model used for configuration with the model used for monito
ring when separate models or protocols are used.
This problem is compounded by a large, disparate set of data sources
(MIB modules, YANG models <xref target="RFC7950"/>, IPFIX information elements <
xref target="RFC7011"/>, syslog plain text <xref target="RFC5424"/>, TACACS+ <xr
ef target="RFC8907"/>, RADIUS <xref target="RFC2865"/>, etc.).
In order to avoid this data model mapping, the industry converged on
model-driven telemetry to stream the service operational data, reusing the YANG
models used for configuration.
Model-driven telemetry greatly facilitates the notion of closed-loop
automation whereby events and updated operational state streamed from the networ
k drive remediation changes back into the network.
</t>
<t>
However, it proves difficult for network operators to correlate the serv
ice degradation with the network root cause.
For example, "Why does my layer 3 virtual private network (L3VPN) fai
l to connect?" or "Why is this specific service not highly responsive?".
The reverse, i.e., which services are impacted when a network compone
nt fails or degrades, is also important for operators.
For example, "Which services are impacted when this specific optic de
cibel milliwatt (dBm) begins to degrade?",
"Which applications are impacted by an imbalance in this equal cost
multiple paths (ECMP) bundle?", or "Is that issue actually impacting any other
customers?".
This task usually falls under the so-called "Service Impact Analysis"
functional block.
</t>
<t>
In this document, we propose an architecture implementing Service Ass
urance for Intent-Based Networking (SAIN).
Intent-based approaches are often declarative, starting from a statem
ent of “The service works as expected” and trying to enforce it.
However, some already defined services might have been designed using
a different approach.
Aligned with Section 3.3 of <xref target="RFC7149"/>, and instead of
requiring a declarative intent as a starting point,
this architecture focuses on already defined services and tries to in
fer the meaning of “The service works as expected”.
To do so, the architecture works from an assurance graph, deduced from the configuration pushed to the device for enabling the service instance. To do so, the architecture works from an assurance graph, deduced from the configuration pushed to the device for enabling the service instance.
If the SAIN orchestrator supports it, the service model (Section 2 of <xref target="RFC8309"/>) or the network model (Section 2.1 of <xref target="RFC8969"/>) can also be used to build the assurance graph. If the SAIN orchestrator supports it, the service model (<xref target="RFC8309" format="default" sectionFormat="of" section="2"/>) or the network model (<xref target="RFC8969" format="default" sectionFormat="of" section="2.1"/>) can also be used to build the assurance graph.
In that case and if the service model includes the declarative intent as well, the SAIN orchestrator can rely on the declared intent instead of inferring it. In that case and if the service model includes the declarative intent as well, the SAIN orchestrator can rely on the declared intent instead of inferring it.
The assurance graph may also be explicitly completed to add an intent not exposed in the service model itself. The assurance graph may also be explicitly completed to add an intent not exposed in the service model itself.
</t> </t>
<t> <t>
The assurance graph of a service instance is decomposed into components, which are then assured independently. The assurance graph of a service instance is decomposed into components, which are then assured independently.
The top of the assurance graph represents the service instance to assure, and its children represent components identified as its direct dependencies; each component can have dependencies as well. The top of the assurance graph represents the service instance to assure, and its children represent components identified as its direct dependencies; each component can have dependencies as well.
Components involved in the assurance graph of a service are called subservices. Components involved in the assurance graph of a service are called subservices.
The SAIN orchestrator updates automatically the assurance graph when The SAIN orchestrator updates the assurance graph automatically when
the service instance is modified. the service instance is modified.
</t> </t>
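The decomposition described in this paragraph can be pictured with a small sketch. It is illustrative only and is not taken from the RFC or its companion YANG modules: the node names, the dictionary layout, and the helper functions `is_dag` and `all_dependencies` are assumptions made for the example. It models an assurance graph as a mapping from each node to its direct dependencies, checks that the graph is acyclic, and collects every subservice a service instance relies on.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical assurance graph: each node maps to its direct dependencies.
# The top node is the service instance; all other nodes are subservices.
assurance_graph = {
    "service/l2vpn-1": ["subservice/tunnel-1", "subservice/interface-a"],
    "subservice/tunnel-1": ["subservice/device-1", "subservice/device-2"],
    "subservice/interface-a": ["subservice/device-1"],
    "subservice/device-1": [],
    "subservice/device-2": [],
}


def is_dag(graph):
    """Return True if the dependency graph contains no cycle."""
    try:
        # prepare() raises CycleError when the graph has a cycle.
        TopologicalSorter(graph).prepare()
    except CycleError:
        return False
    return True


def all_dependencies(graph, node):
    """Return every subservice reachable from `node` (direct or indirect)."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen
```

In this sketch, a change to the service instance would be reflected by rebuilding the mapping and re-running `is_dag` before the graph is used.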
<t> <t>
When a service is degraded, the SAIN architecture will highlight where in the assurance service graph to look, as opposed to going hop by hop to troubleshoot the issue. When a service is degraded, the SAIN architecture will highlight where in the assurance service graph to look, as opposed to going hop by hop to troubleshoot the issue.
More precisely, the SAIN architecture will associate to each service instance a list of symptoms originating from specific subservices, corresponding to components of the network. More precisely, the SAIN architecture will associate to each service instance a list of symptoms originating from specific subservices, corresponding to components of the network.
These components are good candidates for explaining the source of a service degradation. These components are good candidates for explaining the source of a service degradation.
Not only can this architecture help to correlate service degradation with network root cause/symptoms, but it can deduce from the assurance graph the list of service instances impacted by a component degradation/failure. Not only can this architecture help to correlate service degradation with network root cause/symptoms, but it can deduce from the assurance graph the list of service instances impacted by a component degradation/failure.
This added value informs the operational team where to focus its attention for maximum return. This added value informs the operational team where to focus its attention for maximum return.
Indeed, the operational team is likely to focus their priority on the Indeed, the operational team is likely to focus their priority on the
degrading/failing components impacting the highest number of their customers, es degrading/failing components impacting the highest number of their customers, es
pecially the ones with the SLA contracts involving penalties in case of failure. pecially the ones with the Service-Level Agreement (SLA) contracts involving pen
</t> alties in case of failure.
<t> </t>
This architecture provides the building blocks to assure both physical a <t>
nd virtual entities and is flexible with respect to services and subservices, of This architecture provides the building blocks to assure both physical a
(distributed) graphs, and of components (<xref target="flexible_architecture"/> nd virtual entities and is flexible with respect to services and subservices of
). (distributed) graphs and components (<xref target="flexible_architecture" format
</t> ="default"/>).
<t> </t>
The architecture presented in this document is implemented by a set <t>
of YANG modules defined in a companion document <xref target="I-D.ietf-opsawg-se The architecture presented in this document is implemented by a set
rvice-assurance-yang"/>. of YANG modules defined in a companion document <xref target="RFC9418" format="d
These YANG modules properly define the interfaces between the variou efault"/>.
s components of the architecture in order to foster interoperability. These YANG modules properly define the interfaces between the variou
</t> s components of the architecture to foster interoperability.
</t>
</section> </section>
<section anchor="terminology" numbered="true" toc="default">
<section anchor="architecture" title="A Functional Architecture"> <name>Terminology</name>
<dl newline="false" spacing="normal">
<dt>SAIN agent:</dt>
<dd>A functional component that communicates with a device, a set of devi
ces,
or another agent to build an expression graph from a received assuranc
e graph and
perform the corresponding computation of the health status and symptom
s. A SAIN agent might
be running directly on the device it monitors.</dd>
<dt>Assurance case:</dt>
<dd>"An assurance case is a structured argument, supported by evidence,
intended to justify that a system is acceptably assured relative to a concern (
such as safety or security) in the intended operating environment" <xref target=
"Piovesan2017" format="default"/>.</dd>
<dt>Service instance:</dt>
<dd>A specific instance of a service.</dd>
<dt>Intent:</dt>
<dd>"A set of operational goals (that a network should meet) and outco
mes (that a network is supposed to deliver) defined in a declarative manner with
out specifying how to achieve or implement them" <xref target="RFC9315" format="
default"/>.</dd>
<dt>Subservice:</dt>
<dd>A part or functionality of the network system that can be independe
ntly assured as a single entity in an assurance graph.</dd>
<dt>Assurance graph:</dt>
<dd>A Directed Acyclic Graph (DAG) representing the assurance case for
one or several service instances.
The nodes (also known as vertices in the context of DAG) are the servi
ce instances themselves and the subservices; the edges indicate a dependency rel
ation.</dd>
<dt>SAIN collector:</dt>
<dd>A functional component that fetches or receives the computer-consum
able output of the SAIN agent(s) and processes it locally (including displaying
it in a user-friendly form).</dd>
<dt>DAG:</dt>
<dd>Directed Acyclic Graph.</dd>
<dt>ECMP:</dt>
<dd>Equal-Cost Multipath.</dd>
<dt>Expression graph:</dt>
<dd><t>A generic term for a DAG representing a computation in SAIN. Mor
e specific terms are listed below:</t>
<dl newline="true" spacing="normal">
<dt>Subservice expressions:</dt>
<dd>An expression graph representing all the computations to execute for
a subservice.</dd>
<dt>Service expressions:</dt>
<dd>An expression graph representing all the computations to execute for
a service instance, i.e., including the computations for all dependent subservic
es.</dd>
<dt>Global computation graph:</dt>
<dd>An expression graph representing all the computations to execute for
all services instances (i.e., all computations performed).</dd>
</dl>
</dd>
<dt>Dependency:</dt>
<dd>The directed relationship between subservice instances in the assur
ance graph.</dd>
<dt>Metric:</dt>
<dd>A piece of information retrieved from the network running the assur
ed service.</dd>
<dt>Metric engine:</dt>
<dd>A functional component, part of the SAIN agent, that maps metrics t
o a list of candidate metric implementations, depending on the network element.<
/dd>
<dt>Metric implementation:</dt>
<dd>The actual way of retrieving a metric from a network element.</dd>
<dt>Network Service YANG Module:</dt>
<dd>The characteristics of a service, as agreed upon with consumers of
that service <xref target="RFC8199" format="default"/>.</dd>
<dt>Service orchestrator:</dt>
<dd>"Network Service YANG Modules describe the characteristics of a ser
vice, as agreed upon with consumers of that service. That is, a service module d
oes not expose the detailed configuration parameters of all participating networ
k elements and features but describes an abstract model that allows instances of
the service to be decomposed into instance data according to the Network Elemen
t YANG Modules of the participating network elements. The service-to-element dec
omposition is a separate process; the details depend on how the network operator
chooses to realize the service. For the purpose of this document, the term "orc
hestrator" is used to describe a system implementing such a process" <xref targe
t="RFC8199" format="default"/>.</dd>
<dt>SAIN orchestrator:</dt>
<dd>A functional component that is in charge of fetching the configurat
ion specific to each service instance and converting it into an assurance graph.
</dd>
<dt>Health status:</dt>
<dd>The score and symptoms indicating whether a service instance or a s
ubservice is "healthy". A non-maximal score must always be explained by one or m
ore symptoms.</dd>
<dt>Health score:</dt>
<dd>An integer ranging from 0 to 100 that indicates the health of a sub
service.
A score of 0 means that the subservice is broken, a score of 100 means
that the subservice in question is operating as expected, and
the special value -1 can be used to specify that no value could be com
puted for that health score, for instance, if some metric needed for that comput
ation could not be collected.</dd>
<dt>Strongly connected component:</dt>
<dd>A subset of a directed graph such that there
is a (directed) path from any node of the subset to any other node. A
DAG does not contain any strongly connected component.</dd>
<dt>Symptom:</dt>
<dd>A reason explaining why a service instance or a subservice is not c
ompletely healthy.</dd>
</dl>
</section>
<section anchor="architecture" numbered="true" toc="default">
<name>A Functional Architecture</name>
<t> <t>
The goal of SAIN is to assure that service instances are operating as expected (i.e., the observed service is matching the expected service) and if not, to pinpoint what is wrong. The goal of SAIN is to assure that service instances are operating as expected (i.e., the observed service is matching the expected service) and, if not, to pinpoint what is wrong.
More precisely, SAIN computes a score for each service instance and outputs symptoms explaining that score. More precisely, SAIN computes a score for each service instance and outputs symptoms explaining that score.
The only valid situation where no symptoms are returned is when the score is maximal, indicating that no issues were detected for that service instance. The only valid situation where no symptoms are returned is when the score is maximal, indicating that no issues were detected for that service instance.
The score augmented with the symptoms is called the health status. The exact meaning of the health score value is out of scope of this document. However the following constraints should be followed: the higher the score, the better the service health is; the two extrema being 0 meaning the service is completely broken and 100 meaning the service is completely operational. The score augmented with the symptoms is called the health status. The exact meaning of the health score value is out of scope of this document. However, the following constraints should be followed: the higher the score, the better the service health is and the two extrema are 0 meaning the service is completely broken, and 100 meaning the service is completely operational.
</t> </t>
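The constraints stated in this paragraph can be expressed as a small data type. The sketch below is an assumption-laden illustration, not the data model of the architecture: the class name `HealthStatus`, its fields, and the constant `MAX_SCORE` are hypothetical. It enforces only the two rules given in the text, namely that the score lies between 0 and 100 and that a non-maximal score comes with at least one symptom. The special value -1 from the Terminology section is deliberately left out.

```python
from dataclasses import dataclass

MAX_SCORE = 100  # "completely operational" per the text; 0 is "completely broken"


@dataclass(frozen=True)
class HealthStatus:
    """Hypothetical container: a score augmented with its symptoms."""

    score: int
    symptoms: tuple = ()

    def __post_init__(self):
        if not 0 <= self.score <= MAX_SCORE:
            raise ValueError("score must be between 0 and 100")
        # No symptoms is valid only when the score is maximal.
        if self.score < MAX_SCORE and not self.symptoms:
            raise ValueError("a non-maximal score needs at least one symptom")
```

For instance, `HealthStatus(100)` is accepted, whereas `HealthStatus(80)` is rejected until a symptom such as "Interface flapping" is supplied.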
<t> <t>
The SAIN architecture is a generic architecture, which generates an assu The SAIN architecture is a generic architecture, which generates an assu
rance graph from service instance(s), as specified in <xref target="inferring"/> rance graph from service instance(s), as specified in <xref target="inferring" f
). ormat="default"/>.
This architecture is applicable to multiple environments (e.g. wirelin This architecture is applicable to not only multiple environments (e.g
e, wireless), ., wireline and wireless)
but also different domains (e.g. 5G network function virtualization (N but also different domains (e.g., 5G network function virtualization (
FV) domain with a virtual infrastructure manager (VIM), etc.), NFV) domain with a virtual infrastructure manager (VIM), etc.)
and as already noted, for physical or virtual devices, as well as virt and, as already noted, for physical or virtual devices, as well as vir
ual functions. tual functions.
Thanks to the distributed graph design principle, graphs from differen Thanks to the distributed graph design principle, graphs from differen
t environments/orchestrator can be combined to obtain the graph of a service ins t environments and orchestrators can be combined to obtain the graph of a servic
tance that spans over multiple domains. e instance that spans over multiple domains.
</t> </t>
<t> <t>
As an example of a service, let us consider a point-to-point level 2 vir As an example of a service, let us consider a point-to-point layer 2 vir
tual private network (L2VPN). tual private network (L2VPN).
<xref target="RFC8466"/> specifies the parameters for such a service. <xref target="RFC8466" format="default"/> specifies the parameters for s
Examples of symptoms might be symptoms reported by specific subservice uch a service.
s "Interface has high error rate" or "Interface flapping", or "Device almost out Examples of symptoms might be symptoms reported by specific subservice
of memory" as well as symptoms more specific to the service such as "Site disco s, including "Interface has high error rate", "Interface flapping", or "Device a
nnected from VPN". lmost out of memory", as well as symptoms more specific to the service (such as
"Site disconnected from VPN").
</t> </t>
<t> <t>
To compute the health status of an instance of such a service, the service definition is decomposed into an assurance graph formed by subservices linked through dependencies. Each subservice is then turned into an expression graph that details how to fetch metrics from the devices and compute the health status of the subservice. The subservice expressions are combined according to the dependencies between the subservices in order to obtain the expression graph which computes the health status of the service instance. To compute the health status of an instance of such a service, the service definition is decomposed into an assurance graph formed by subservices linked through dependencies. Each subservice is then turned into an expression graph that details how to fetch metrics from the devices and compute the health status of the subservice. The subservice expressions are combined according to the dependencies between the subservices in order to obtain the expression graph that computes the health status of the service instance.
</t> </t>
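The combination of subservice expressions along dependencies can be sketched as follows. This is only an illustration: the node names, the 0 to 100 score scale, and the rule that a node's combined score is the minimum of its own local score and the scores of its dependencies are all assumptions made for the example, not something the text above prescribes.

```python
from typing import Dict, List, Optional

# Hypothetical assurance graph: each node maps to the nodes it depends on.
DEPENDENCIES: Dict[str, List[str]] = {
    "l2vpn": ["interface-a", "interface-b"],
    "interface-a": ["device-a"],
    "interface-b": ["device-b"],
    "device-a": [],
    "device-b": [],
}

# Hypothetical local scores, as if already computed from collected metrics.
LOCAL_SCORES: Dict[str, int] = {
    "l2vpn": 100,
    "interface-a": 100,
    "interface-b": 50,   # e.g. "Interface has high error rate"
    "device-a": 100,
    "device-b": 100,
}


def health(node: str,
           deps: Dict[str, List[str]],
           local_scores: Dict[str, int],
           cache: Optional[Dict[str, int]] = None) -> int:
    """Combine a node's local score with the scores of its dependencies.

    Assumption: the combined score is the minimum of all inputs.
    The cache ensures a shared subservice is evaluated only once.
    """
    if cache is None:
        cache = {}
    if node not in cache:
        scores = [local_scores[node]]
        scores.extend(health(d, deps, local_scores, cache) for d in deps[node])
        cache[node] = min(scores)
    return cache[node]


if __name__ == "__main__":
    print(health("l2vpn", DEPENDENCIES, LOCAL_SCORES))
```

A real expression graph would also carry symptoms alongside the score; the sketch only shows how dependencies drive the order of evaluation.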
<t> <t>
The overall SAIN architecture is presented in <xref target="figure_1"/>. The overall SAIN architecture is presented in <xref target="figure_1" format="default"/>.
Based on the service configuration provided by the service orchestrator, the SAIN orchestrator decomposes the assurance graph. Based on the service configuration provided by the service orchestrator, the SAIN orchestrator decomposes the assurance graph.
It then sends to the SAIN agents the assurance graph along with some other configuration options. It then sends to the SAIN agents the assurance graph along with some other configuration options.
The SAIN agents are responsible for building the expression graph and computing the health statuses in a distributed manner. The SAIN agents are responsible for building the expression graph and computing the health statuses in a distributed manner.
The collector is in charge of collecting and displaying the current inferred health status of the service instances and subservices. The collector is in charge of collecting and displaying the current inferred health status of the service instances and subservices.
The collector also detects changes in the assurance graph structures, The
for instance when a switchover from primary to backup path occurs, and forwards collector also detects changes in the assurance graph structures (e.g., an
to the orchestrator, which reconfigures the agents. occurrence of a switchover from primary to backup path) and
Finally, the automation loop is closed by having the SAIN collector pr forwards the information to the orchestrator, which reconfigures the agents.
oviding feedback to the network/service orchestrator. Finally, the automation loop is closed by having the SAIN collector pr
ovide feedback to the network/service orchestrator.
</t> </t>
<t> <t>
In order to make agents, orchestrators and collectors from different vendors In order to make agents, orchestrators, and collectors from different vendor
interoperable, their interface is defined as a YANG model in a companion docume s interoperable, their interface is defined as a YANG module in a companion docu
nt <xref target="I-D.ietf-opsawg-service-assurance-yang"/>. ment <xref target="RFC9418" format="default"/>.
In <xref target="figure_1"/>, the communications that are normalized b In <xref target="figure_1" format="default"/>, the communications that
y this YANG model are tagged with a "Y". are normalized by this YANG module are tagged with a "Y".
The use of this YANG model is further explained in <xref target="open_ The use of this YANG module is further explained in <xref target="open
interfaces_with_YANG_modules"/>. _interfaces_with_YANG_modules" format="default"/>.
</t> </t>
<t> <figure anchor="figure_1">
<figure anchor="figure_1" title="SAIN Architecture"> <name>SAIN Architecture</name>
<artwork><![CDATA[ <artwork name="" type="" align="left" alt=""><![CDATA[
+-----------------+ +-----------------+
| Service | | Service |
| Orchestrator |<----------------------+ | Orchestrator |<----------------------+
| | | | | |
+-----------------+ | +-----------------+ |
| ^ | | ^ |
| | Network | | | Network |
| | Service | Feedback | | Service | Feedback
| | Instance | Loop | | Instance | Loop
| | Configuration | | | Configuration |
| | | | | |
| V | | V |
| +-----------------+ Graph +-------------------+ | +-----------------+ Graph +-------------------+
| | SAIN | updates | SAIN | | | SAIN | Updates | SAIN |
| | Orchestrator |<--------| Collector | | | Orchestrator |<--------| Collector |
| +-----------------+ +-------------------+ | +-----------------+ +-------------------+
| | ^ | | ^
| Y| Configuration | Health Status | Y| Configuration | Health Status
| | (assurance graph) Y| (Score + Symptoms) | | (Assurance Graph) Y| (Score + Symptoms)
| V | Streamed | V | Streamed
| +-------------------+ | via Telemetry | +-------------------+ | via Telemetry
| |+-------------------+ | | |+-------------------+ |
| ||+-------------------+ | | ||+-------------------+ |
| +|| SAIN |-----------+ | +|| SAIN |-----------+
| +| agent | | +| Agent |
| +-------------------+ | +-------------------+
| ^ ^ ^ | ^ ^ ^
| | | | | | | |
| | | | Metric Collection | | | | Metric Collection
V V V V V V V V
+-------------------------------------------------------------+ +-------------------------------------------------------------+
| (Network) System | | (Network) System |
| | | |
+-------------------------------------------------------------+ +-------------------------------------------------------------+
]]></artwork> ]]></artwork>
</figure></t> </figure>
<t> <t>
In order to produce the score assigned to a service instance, the various involved components perform the following tasks: In order to produce the score assigned to a service instance, the various involved components perform the following tasks:
<list style="symbols"> </t>
<t> <ul spacing="normal">
<li>
Analyze the configuration pushed to the network device(s) for configuring the service instance. Analyze the configuration pushed to the network device(s) for configuring the service instance.
From there, determine which information (called a metric) must be collected from the device(s) and which operations to apply to the metrics to compute the health status. From there, determine which information (called a metric) must be collected from the device(s) and which operations to apply to the metrics to compute the health status.
</t> </li>
<t> <li>
Stream (via telemetry <xref target="RFC8641"/>) operational and conf Stream (via telemetry, such as YANG-Push <xref target="RFC8641" form
ig metric values when possible, else continuously poll. at="default"/>) operational and config metric values when possible, else continu
</t> ously poll.
<t> </li>
Continuously compute the health status of the service instances, bas <li>
ed on the metric values. Continuously compute the health status of the service instances base
</t> d on the metric values.
</list> </li>
</t> </ul>
<t> <t>
The SAIN architecture requires time synchronization, with Network Time Protocol (NTP) <xref target="RFC5905"/> as a candidate, between all elements: monitored entities, SAIN agents, Service orchestrator, the SAIN collector, as well as the SAIN orchestrator. This guarantees the correlations of all symptoms in the system, correlated with the right assurance graph version. The SAIN architecture requires time synchronization, with the Network Time Protocol (NTP) <xref target="RFC5905" format="default"/> as a candidate, between all elements: monitored entities, SAIN agents, service orchestrator, the SAIN collector, as well as the SAIN orchestrator. This guarantees the correlations of all symptoms in the system, correlated with the right assurance graph version.
</t> </t>
<section anchor="inferring" title="Translating a Service Instance Configur <section anchor="inferring" numbered="true" toc="default">
ation into an Assurance Graph"> <name>Translating a Service Instance Configuration into an Assurance Gra
ph</name>
<t> <t>
In order to structure the assurance of a service instance, the SAIN orchestrator decomposes the service instance into so-called subservice instances. In order to structure the assurance of a service instance, the SAIN orchestrator decomposes the service instance into so-called subservice instances.
Each subservice instance focuses on a specific feature or subpart of the service. Each subservice instance focuses on a specific feature or subpart of the service.
</t> </t>
<t> <t>
The decomposition into subservices is an important function of the arc The decomposition into subservices is an important function of the arc
hitecture, for the following reasons: hitecture for the following reasons:
<list style="symbols"> </t>
<t> <ul spacing="normal">
The result of this decomposition provides a relational picture of <li>
a service instance, that can be represented as a graph (called assurance graph) The result of this decomposition provides a relational picture of
to the operator. a service instance, which can be represented as a graph (called an assurance gra
</t> ph) to the operator.
<t> </li>
<li>
Subservices provide a scope for particular expertise and thereby enable contribution from external experts. Subservices provide a scope for particular expertise and thereby enable contribution from external experts.
For instance, the subservice dealing with the optics health shou For instance, the subservice dealing with the optic's health sho
ld be reviewed and extended by an expert in optical interfaces. uld be reviewed and extended by an expert in optical interfaces.
</t> </li>
<t> <li>
Subservices that are common to several service instances are reused for reducing the amount of computation needed. Subservices that are common to several service instances are reused for reducing the amount of computation needed.
For instance, the subservice assuring a given interface is reused by any service instance relying on that interface. For instance, the subservice assuring a given interface is reused by any service instance relying on that interface.
</t> </li>
</list> </ul>
</t>
<t> <t>
The assurance graph of a service instance is a DAG representing the structure of the assurance case for the service instance. The nodes of this graph are service instances or subservice instances. Each edge of this graph indicates a dependency between the two nodes at its extremities: the service or subservice at the source of the edge depends on the service or subservice at the destination of the edge. The assurance graph of a service instance is a DAG representing the structure of the assurance case for the service instance. The nodes of this graph are service instances or subservice instances. Each edge of this graph indicates a dependency between the two nodes at its extremities, i.e., the service or subservice at the source of the edge depends on the service or subservice at the destination of the edge.
</t> </t>
<t> <t>
<xref target="figure_2"/> depicts a simplistic example of the assuranc e graph for a tunnel service. The node at the top is the service instance, the n odes below are its dependencies. In the example, the tunnel service instance dep ends on the "peer1" and "peer2" tunnel interfaces (the tunnel interfaces created on the peer1 and peer2 devices, respectively), which in turn depend on the resp ective physical interfaces, which finally depend on the respective "peer1" and " peer2" devices. The tunnel service instance also depends on the IP connectivity that depends on the IS-IS routing protocol. <xref target="figure_2" format="default"/> depicts a simplistic exampl e of the assurance graph for a tunnel service. The node at the top is the servic e instance; the nodes below are its dependencies. In the example, the tunnel ser vice instance depends on the "peer1" and "peer2" tunnel interfaces (the tunnel i nterfaces created on the peer1 and peer2 devices, respectively), which in turn d epend on the respective physical interfaces, which finally depend on the respect ive "peer1" and "peer2" devices. The tunnel service instance also depends on the IP connectivity that depends on the IS-IS routing protocol.
</t> </t>
<t> <figure anchor="figure_2">
<figure anchor="figure_2" title="Assurance Graph Example"> <name>Assurance Graph Example</name>
<artwork><![CDATA[ <artwork name="" type="" align="left" alt=""><![CDATA[
+------------------+ +------------------+
| Tunnel | | Tunnel |
| Service Instance | | Service Instance |
+------------------+ +------------------+
| |
+--------------------+-------------------+ +--------------------+-------------------+
| | | | | |
v v v v v v
+-------------+ +--------------+ +-------------+ +-------------+ +--------------+ +-------------+
| Peer1 | | IP | | Peer2 | | Peer1 | | IP | | Peer2 |
skipping to change at line 375 skipping to change at line 361
| Interface | | Protocol | | Interface | | Interface | | Protocol | | Interface |
+-------------+ +-------------+ +-------------+ +-------------+ +-------------+ +-------------+
| | | |
v v v v
+-------------+ +-------------+ +-------------+ +-------------+
| | | | | | | |
| Peer1 | | Peer2 | | Peer1 | | Peer2 |
| Device | | Device | | Device | | Device |
+-------------+ +-------------+ +-------------+ +-------------+
]]></artwork> ]]></artwork>
</figure> </figure>
</t>
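The tunnel example described above can be written down as a small dependency map. The sketch below is illustrative only: the node identifiers are hypothetical names derived from the description of the figure, and the use of Python's standard `graphlib` module (Python 3.9 or later) is a choice made for this example. Each key depends on the nodes listed as its value.

```python
from graphlib import TopologicalSorter

# Hypothetical identifiers for the nodes described in the tunnel example.
ASSURANCE_GRAPH = {
    "tunnel-service": ["peer1-tunnel-if", "peer2-tunnel-if", "ip-connectivity"],
    "peer1-tunnel-if": ["peer1-physical-if"],
    "peer2-tunnel-if": ["peer2-physical-if"],
    "peer1-physical-if": ["peer1-device"],
    "peer2-physical-if": ["peer2-device"],
    "ip-connectivity": ["is-is"],
    "peer1-device": [],
    "peer2-device": [],
    "is-is": [],
}


def evaluation_order(graph):
    """Return the nodes so that every dependency precedes its dependents.

    TopologicalSorter raises graphlib.CycleError if the graph is not a DAG.
    """
    return list(TopologicalSorter(graph).static_order())


def all_dependencies(graph, node):
    """Return every subservice that `node` depends on, directly or not."""
    seen = set()
    stack = list(graph[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


if __name__ == "__main__":
    print(evaluation_order(ASSURANCE_GRAPH))
    print(sorted(all_dependencies(ASSURANCE_GRAPH, "tunnel-service")))
```

Evaluating nodes in this order means the health of each device and interface is available before the tunnel service instance itself is evaluated.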
<t> <t>
Depicting the assurance graph helps the operator to understand (and assert) the decomposition. Depicting the assurance graph helps the operator to understand (and assert) the decomposition.
The assurance graph shall be maintained during normal operation with addition, modification and removal of service instances. The assurance graph shall be maintained during normal operation with addition, modification, and removal of service instances.
A change in the network configuration or topology shall automatically be reflected in the assurance graph. A change in the network configuration or topology shall automatically be reflected in the assurance graph.
As a first example, a change of routing protocol from IS-IS to OSPF As a first example, a change of the routing protocol from IS-IS to O
would change the assurance graph accordingly. SPF would change the assurance graph accordingly.
As a second example, assuming that ECMP is in place for the source r As a second example, assume that the ECMP is in place for the source
outer for that specific tunnel; in that case, multiple interfaces must now be mo router for that specific tunnel; in that case, multiple interfaces must now be
nitored, on top of the monitoring the ECMP health itself. monitored, in addition to monitoring the ECMP health itself.
</t> </t>
<section anchor="circular_dependencies" title="Circular Dependencies"> <section anchor="circular_dependencies" numbered="true" toc="default">
<t> <name>Circular Dependencies</name>
<t>
The edges of the assurance graph represent dependencies. An The edges of the assurance graph represent dependencies. An
assurance graph is a DAG if and only if there are no circular assurance graph is a DAG if and only if there are no circular
dependencies among the subservices, and every assurance dependencies among the subservices, and every assurance
graph should avoid circular dependencies. However, in some cases, graph should avoid circular dependencies. However, in some cases,
circular dependencies might appear in the assurance graph. circular dependencies might appear in the assurance graph.
</t> </t>
<t> <t>
First, the assurance graph of a whole system is obtained by First, the assurance graph of a whole system is obtained by
combining the assurance graph of every service running on that combining the assurance graph of every service running on that
system. Here combining means that two subservices having the system. Here, combining means that two subservices having the
same type and the same parameters are in fact the same same type and the same parameters are in fact the same
subservice and thus a single node in the graph. For instance, subservice and thus a single node in the graph. For instance,
the subservice of type "device" with the only parameter the subservice of type "device" with the only parameter
(the device ID) set to "PE1" will appear only once in the (the device ID) set to "PE1" will appear only once in the
whole assurance graph even if several service instances rely whole assurance graph, even if several service instances rely
on that device. Now, if two engineers design assurance graphs for on that device. Now, if two engineers design assurance graphs for
two different services, and engineer A decides that an interface two different services, and Engineer A decides that an interface
depends on the link it is connected to, but engineer B decides that depends on the link it is connected to, but Engineer B decides that
the link depends on the interface it is connected to, then when the link depends on the interface it is connected to, then when
combining the two assurance graphs, we will have a circular combining the two assurance graphs, we will have a circular
dependency interface -&gt; link -&gt; interface. dependency interface -&gt; link -&gt; interface.
</t> </t>
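The combining rule described above (same type and same parameters means same node) can be sketched as follows. The helper names `subservice` and `merge`, and the parameter names used in the example, are hypothetical; the sketch only shows how two individually acyclic graphs can yield a circular dependency once merged.

```python
def subservice(kind, **params):
    """Identity of a subservice: its type plus its parameters.

    Two subservices with equal type and parameters compare equal, so they
    collapse into a single node when used as dictionary keys.
    """
    return (kind, tuple(sorted(params.items())))


def merge(*graphs):
    """Merge assurance graphs given as {node: set of dependencies}."""
    merged = {}
    for graph in graphs:
        for node, deps in graph.items():
            merged.setdefault(node, set()).update(deps)
            for dep in deps:
                merged.setdefault(dep, set())
    return merged


# Hypothetical subservices shared by both engineers.
interface = subservice("interface", device="PE1", name="eth0")
link = subservice("link", id="link-1")

# Engineer A: the interface depends on the link.
graph_a = {interface: {link}}
# Engineer B: the link depends on the interface.
graph_b = {link: {interface}}

merged = merge(graph_a, graph_b)

if __name__ == "__main__":
    for node, deps in merged.items():
        print(node, "depends on", deps)
```

In the merged result the interface depends on the link and the link depends on the interface, which is the circular dependency described in the text.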
<t> <t>
Another case possibly resulting in circular dependencies is when subservices are not properly identified. Another case possibly resulting in circular dependencies is when subservices are not properly identified.
Assume that we want to assure a cloud-based computing cluster that runs containers. Assume that we want to assure a cloud-based computing cluster that runs containers.
We could represent the cluster by a subservice and the network service connecting containers on the cluster by another subservice. We could represent the cluster by a subservice and the network service connecting containers on the cluster by another subservice.
We will likely model that the network service depends on the cluster, because the network service runs in a container supported by the cluster. We would likely model that as the network service depending on the cluster, because the network service runs in a container supported by the cluster.
Conversely, the cluster depends on the network service for connectivity between containers, which creates a circular dependency. Conversely, the cluster depends on the network service for connectivity between containers, which creates a circular dependency.
A finer decomposition might distinguish between the resources for executing containers (a part of our cluster subservice) and the communication between the containers (which could be modelled in the same way as communication between routers). A finer decomposition might distinguish between the resources for executing containers (a part of our cluster subservice) and the communication between the containers (which could be modeled in the same way as communication between routers).
</t> </t>
<t> <t>
In any case, it is likely that circular dependencies will show up in In any case, it is likely that circular dependencies will show up in
the assurance graph. A first step would be to detect the assurance graph. A first step would be to detect
circular dependencies as soon as possible in the SAIN circular dependencies as soon as possible in the SAIN
architecture. Such a detection could be carried out by architecture. Such a detection could be carried out by
the SAIN orchestrator. Whenever a circular dependency the SAIN orchestrator. Whenever a circular dependency
is detected, the newly added service would not be is detected, the newly added service would not be
monitored until more careful modelling or alignment monitored until more careful modeling or alignment
between the different teams (engineer A and B) remove the circular between the different teams (Engineers A and B) remove the circular
dependency. dependency.
</t> </t>
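One way the SAIN orchestrator could perform this early detection is a depth-first search run each time a service is added. The sketch below is an assumption about how such a check might look, not a prescribed mechanism; `find_cycle` and `add_service` are hypothetical helper names, and graphs are plain dictionaries mapping a node to the set of nodes it depends on.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is a DAG."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for dep in graph.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path: a cycle is closed.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


def add_service(current, new_service):
    """Merge new_service into current only if the result stays acyclic.

    Returns (graph, cycle). When a cycle is found, the current graph is
    returned unchanged, i.e. the new service is not monitored yet.
    """
    candidate = {node: set(deps) for node, deps in current.items()}
    for node, deps in new_service.items():
        candidate.setdefault(node, set()).update(deps)
    cycle = find_cycle(candidate)
    if cycle:
        return current, cycle
    return candidate, None
```

Returning the offending cycle, rather than a plain boolean, gives the teams involved a concrete list of subservices to realign.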
<t> <t>
As more elaborate solution we could consider a graph transformation: As a more elaborate solution, we could consider a graph transformati
<list style="symbols"> on:
<t>Decompose the graph into strongly connected components.</t> </t>
<t> <ul spacing="normal">
<li>Decompose the graph into strongly connected components.</li>
<li>
<t>
For each strongly connected component: For each strongly connected component:
<list style="symbols"> </t>
<t>Remove all edges between nodes of the strongly connected com <ul spacing="normal">
ponent</t> <li>remove all edges between nodes of the strongly connected com
<t>Add a new "synthetic" node for the strongly connected compon ponent;</li>
ent</t> <li>add a new "synthetic" node for the strongly connected compon
<t>For each edge pointing to a node in the strongly connected c ent;</li>
omponent, change the destination to the "synthetic" node</t> <li>for each edge pointing to a node in the strongly connected c
<t>Add a dependency from the "synthetic" node to every node in omponent, change the destination to the "synthetic" node; and</li>
the strongly connected component.</t> <li>add a dependency from the "synthetic" node to every node in
</list> the strongly connected component.</li>
</t> </ul>
</list> </li>
</t> </ul>
<t> <t>
Such an algorithm would include all symptoms detected by any Such an algorithm would include all symptoms detected by any
subservice in one of the strongly component and make it subservice in one of the strongly connected components and make it
available to any subservice that depends on it. available to any subservice that depends on it.
<xref target="graph_transformation"/> shows an example <xref target="graph_transformation" format="default"/> shows an example
of such a transformation. On the left-hand side, the nodes c, d, e of such a transformation. On the left-hand side, the nodes c, d, e,
and f form a strongly connected component. The status of node a should and f form a strongly connected component. The status of node a should
depend on the status of nodes c, d, e, f, g, and h, but this is hard to depend on the status of nodes c, d, e, f, g, and h, but this is hard to
compute because of the circular dependency. On the right hand-side, compute because of the circular dependency. On the right-hand side,
a depends on all these nodes as well, but there the circular node a depends on all these nodes as well, but the circular
dependency has been removed. dependency has been removed.
</t> </t>
<t> <figure anchor="graph_transformation">
<figure anchor="graph_transformation" title="Graph transformation"> <name>Graph Transformation</name>
<artwork><![CDATA[ <artwork name="" type="" align="left" alt=""><![CDATA[
+---+ +---+ | +---+ +---+ +---+ +---+ | +---+ +---+
| a | | b | | | a | | b | | a | | b | | | a | | b |
+---+ +---+ | +---+ +---+ +---+ +---+ | +---+ +---+
| | | | | | | | | |
v v | v v v v | v v
+---+ +---+ | +------------+ +---+ +---+ | +------------+
| c |--->| d | | | synthetic | | c |--->| d | | | synthetic |
+---+ +---+ | +------------+ +---+ +---+ | +------------+
^ | | / | | \ ^ | | / | | \
| | | / | | \ | | | / | | \
skipping to change at line 480 skipping to change at line 468
+---+ +---+ | +---+ +---+ +---+ +---+ +---+ +---+ | +---+ +---+ +---+ +---+
| | | | | | | | | |
v v | v v v v | v v
+---+ +---+ | +---+ +---+ +---+ +---+ | +---+ +---+
| g | | h | | | g | | h | | g | | h | | | g | | h |
+---+ +---+ | +---+ +---+ +---+ +---+ | +---+ +---+
Before After Before After
Transformation Transformation Transformation Transformation
]]></artwork> ]]></artwork>
</figure> </figure>
</t>
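The graph transformation listed above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it uses Tarjan's algorithm to find the strongly connected components (the text does not name a specific algorithm), the naming scheme "synthetic-N" is hypothetical, and the edge set of the example graph is an assumption chosen so that c, d, e, and f form the cycle as described. Every node must appear as a key of the dictionary.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; returns a list of sets of nodes."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []

    def visit(node):
        index[node] = lowlink[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for dep in graph[node]:
            if dep not in index:
                visit(dep)
                lowlink[node] = min(lowlink[node], lowlink[dep])
            elif dep in on_stack:
                lowlink[node] = min(lowlink[node], index[dep])
        if lowlink[node] == index[node]:
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components


def remove_cycles(graph):
    """Replace each multi-node component by a hub-like "synthetic" node."""
    result = {node: set(deps) for node, deps in graph.items()}
    cyclic = [c for c in strongly_connected_components(graph) if len(c) > 1]
    for number, component in enumerate(cyclic):
        synthetic = f"synthetic-{number}"
        for node in component:
            # Remove all edges between nodes of the component.
            result[node].difference_update(component)
        for node in list(result):
            # Redirect edges pointing into the component to the synthetic node.
            if node not in component and result[node].intersection(component):
                result[node].difference_update(component)
                result[node].add(synthetic)
        # The synthetic node depends on every node of the component.
        result[synthetic] = set(component)
    return result


# Assumed edge set: a and b enter the cycle c, d, e, f; g and h hang below it.
EXAMPLE = {
    "a": {"c"}, "b": {"d"}, "c": {"d"}, "d": {"e"},
    "e": {"f", "h"}, "f": {"c", "g"}, "g": set(), "h": set(),
}

if __name__ == "__main__":
    for node, deps in sorted(remove_cycles(EXAMPLE).items()):
        print(node, sorted(deps))
```

With this edge set, nodes a and b end up depending on the synthetic node, which in turn depends on c, d, e, and f, while the dependencies on g and h are preserved. Single-node self-loops are not handled by this sketch.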
<t> <t>
We consider a concrete example to illustrate this transformation. We consider a concrete example to illustrate this transformation.
Lets assume that Engineer A is building an assurance graph dealing with IS-IS and Engineer B is building an assurance graph dealing with OSPF. Let's assume that Engineer A is building an assurance graph dealing with IS-IS and Engineer B is building an assurance graph dealing with OSPF.
The graph from Engineer A could contain the following: The graph from Engineer A could contain the following:
</t> </t>
<t> <figure anchor="is-is_link">
<figure anchor="is-is_link" title="Fragment of assurance graph from En <name>Fragment of the Assurance Graph from Engineer A</name>
gineer A"> <artwork name="" type="" align="left" alt=""><![CDATA[
<artwork><![CDATA[
+------------+ +------------+
| IS-IS Link | | IS-IS Link |
+------------+ +------------+
| |
v v
+------------+ +------------+
| Phys. Link | | Phys. Link |
+------------+ +------------+
| | | |
v v v v
skipping to change at line 504 skipping to change at line 490
| |
v v
+------------+ +------------+
| Phys. Link | | Phys. Link |
+------------+ +------------+
| | | |
v v v v
+-------------+ +-------------+ +-------------+ +-------------+
| Interface 1 | | Interface 2 | | Interface 1 | | Interface 2 |
+-------------+ +-------------+ +-------------+ +-------------+
]]></artwork> ]]></artwork>
</figure> </figure>
</t>
<t> <t>
The graph from Engineer B could contain the following: The graph from Engineer B could contain the following:
</t> </t>
<t> <figure anchor="ospf_link">
<figure anchor="ospf_link" title="Fragment of assurance graph from Engin <name>Fragment of the Assurance Graph from Engineer B</name>
eer B"> <artwork name="" type="" align="left" alt=""><![CDATA[
<artwork><![CDATA[
+------------+ +------------+
| OSPF Link | | OSPF Link |
+------------+ +------------+
| | | | | |
v | v v | v
+-------------+ | +-------------+ +-------------+ | +-------------+
| Interface 1 | | | Interface 2 | | Interface 1 | | | Interface 2 |
+-------------+ | +-------------+ +-------------+ | +-------------+
| | | | | |
v v v v v v
skipping to change at line 528 skipping to change at line 511
| | | | | |
v | v v | v
+-------------+ | +-------------+ +-------------+ | +-------------+
| Interface 1 | | | Interface 2 | | Interface 1 | | | Interface 2 |
+-------------+ | +-------------+ +-------------+ | +-------------+
| | | | | |
v v v v v v
+------------+ +------------+
| Phys. Link | | Phys. Link |
+------------+ +------------+
]]></artwork> ]]></artwork>
</figure> </figure>
</t>
<t> <t>
Each Interface subservice and the Physical Link subservice are commo The Interface subservices and the Physical Link subservice are commo
n to both fragments above. n to both fragments above.
Each of these subservice appears only once in the graph merging the Each of these subservices appear only once in the graph merging the
two fragments. two fragments.
Dependencies from both fragments are included in the merged graph, resulting in a circular dependency: Dependencies from both fragments are included in the merged graph, resulting in a circular dependency:
</t> </t>
<t> <figure anchor="ospf_isis_circ_dep">
<figure anchor="ospf_isis_circ_dep" title="Merging graphs from A and B"> <name>Merging Graphs from Engineers A and B</name>
<artwork><![CDATA[ <artwork name="" type="" align="left" alt=""><![CDATA[
+------------+ +------------+ +------------+ +------------+
| IS-IS Link | | OSPF Link |---+ | IS-IS Link | | OSPF Link |---+
+------------+ +------------+ | +------------+ +------------+ |
| | | | | | | |
| +-------- + | | | +-------- + | |
v v | | v v | |
+------------+ | | +------------+ | |
| Phys. Link |<-------+ | | | Phys. Link |<-------+ | |
+------------+ | | | +------------+ | | |
| ^ | | | | | ^ | | | |
skipping to change at line 559 skipping to change at line 539
+------------+ | | | +------------+ | | |
| ^ | | | | | ^ | | | |
| | +-------+ | | | | | +-------+ | | |
v | v | v | v | v | v |
+-------------+ +-------------+ | +-------------+ +-------------+ |
| Interface 1 | | Interface 2 | | | Interface 1 | | Interface 2 | |
+-------------+ +-------------+ | +-------------+ +-------------+ |
^ | ^ |
| | | |
+------------------------------+ +------------------------------+
]]></artwork> ]]></artwork>
</figure> </figure>
</t>
<t> <t>
The solution presented above would result in graph looking as follows, where a new "synthetic" node is included. The solution presented above would result in a graph looking as follows, where a new "synthetic" node is included.
Using that transformation, all dependencies are indirectly satisfied for the nodes outside the circular dependency, in the sense that both IS-IS and OSPF links have indirect dependencies to the two interfaces and the link. Using that transformation, all dependencies are indirectly satisfied for the nodes outside the circular dependency, in the sense that both IS-IS and OSPF links have indirect dependencies to the two interfaces and the link.
However, the dependencies between the link and the interfaces are lo However, the dependencies between the link and the
st as they were causing the circular dependency. interfaces are lost since they were causing the circular dependency.
</t> </t>
<t> <figure anchor="ospf_isis_no_circ_dep">
<figure anchor="ospf_isis_no_circ_dep" title="Removing circular depend <name>Removing Circular Dependencies after Merging Graphs from Engin
encies after merging graphs from A and B"> eers A and B</name>
<artwork><![CDATA[ <artwork name="" type="" align="left" alt=""><![CDATA[
+------------+ +------------+ +------------+ +------------+
| IS-IS Link | | OSPF Link | | IS-IS Link | | OSPF Link |
+------------+ +------------+ +------------+ +------------+
| | | |
v v v v
+------------+ +------------+
| synthetic | | synthetic |
+------------+ +------------+
| |
+-----------+-------------+ +-----------+-------------+
skipping to change at line 587 skipping to change at line 565
+------------+ +------------+
| synthetic | | synthetic |
+------------+ +------------+
| |
+-----------+-------------+ +-----------+-------------+
| | | | | |
v v v v v v
+-------------+ +------------+ +-------------+ +-------------+ +------------+ +-------------+
| Interface 1 | | Phys. Link | | Interface 2 | | Interface 1 | | Phys. Link | | Interface 2 |
+-------------+ +------------+ +-------------+ +-------------+ +------------+ +-------------+
]]></artwork> ]]></artwork>
</figure> </figure>
</t> </section>
</section>
</section> </section>
<section anchor="intent" numbered="true" toc="default">
<section anchor="intent" title="Intent and Assurance Graph"> <name>Intent and Assurance Graph</name>
<t> <t>
The SAIN orchestrator analyzes the configuration of a service instance The SAIN orchestrator analyzes the configuration of a service instance
to: to do the following:
<list style="symbols"> </t>
<t> <ul spacing="normal">
Try to capture the intent of the service instance, i.e., what is t <li>
he service instance trying to achieve. Try to capture the intent of the service instance, i.e., What is t
At least, this requires the SAIN orchestrator to know the YANG m he service instance trying to achieve?
odules that are being configured on the devices to enable the service. At a minimum, this requires the SAIN orchestrator to know the YA
Note that if the service model or the network model is known to NG modules that are being configured on the devices to enable the service.
the SAIN orchestrator, the latter can exploit it. Note that, if the service model or the network model is known to
the SAIN orchestrator, the latter can exploit it.
In that case, the intent could be directly extracted and include more details, such as the notion of sites for a VPN, which is out of scope of the device configuration. In that case, the intent could be directly extracted and include more details, such as the notion of sites for a VPN, which is out of scope of the device configuration.
</t> </li>
<t> <li>
Decompose the service instance into subservices representing the network features on which the service instance relies. Decompose the service instance into subservices representing the network features on which the service instance relies.
</t> </li>
</list> </ul>
</t> <t>
<t> The SAIN orchestrator must be able to analyze the configuration pushed to
The SAIN orchestrator must be able to analyze configuration pushed to various devices of a service instance and produce the
various devices for configuring a service instance and produce the assurance gra assurance graph for that service instance.
ph for that service instance.
</t> </t>
<t> <t>
To schematize what a SAIN orchestrator does, assume that the configura To schematize what a SAIN orchestrator does, assume that
tion for a service instance touches two devices and configure on each device a v a service instance touches two devices and
irtual tunnel interface. Then: configures a virtual tunnel interface on each device. Then:
<list style="symbols">
<t>
Capturing the intent would start by detecting that the service ins
tance is actually a tunnel between the two devices, and stating that this tunnel
must be functional.
This solution is minimally invasive as it does not require modif
ying nor knowing the service model.
If the service model or network model is known by the SAIN orche
strator, it can be used to further capture the intent and include more informati
on such as Service Level Objectives.
For instance, the latency and bandwidth requirements for the tun
nel, if present in the service model
</t>
<t>
Decomposing the service instance into subservices would result in
the assurance graph depicted in <xref target="figure_2"/>, for instance.
</t>
</list>
</t> </t>
<ul spacing="normal">
<li>Capturing the intent would start by detecting that the service
instance is actually a tunnel between the two devices and stating
that this tunnel must be operational.
This solution is minimally invasive, as it does not require modi
fying nor knowing the service model.
If the service model or network model is known by the SAIN orche
strator, it can be used to further capture the intent and include more informati
on, such as Service-Level Objectives (e.g.,
the latency and bandwidth requirements for the tunnel) if presen
t in the service model.
</li>
<li>
Decomposing the service instance into subservices would result in
the assurance graph depicted in <xref target="figure_2" format="default"/>, for
instance.
</li>
</ul>
<t> <t>
          The assurance graph, or more precisely the subservices and dependencies that a SAIN orchestrator can instantiate, should be curated. The assurance graph, or more precisely the subservices and dependencies that a SAIN orchestrator can instantiate, should be curated.
The organization of such a process is out-of-scope for this document The organization of such a process (i.e., ensure that existing sub
and should aim to: services are reused as much as possible
<list style="symbols"> and avoid circular dependencies) is out-of-scope for this
<t>Ensure that existing subservices are reused as much as possib document.
le.</t>
<t>Avoid circular dependencies.</t>
</list>
</t> </t>
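One way to check the "avoid circular dependencies" goal above is to attempt a topological sort of the assurance graph. The sketch below is illustrative only: the node names and edges are an assumption loosely modeled on the tunnel example, and `evaluation_order` is a hypothetical helper, not part of SAIN.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical assurance graph: each node maps to the subservices it depends on.
ASSURANCE_GRAPH = {
    "Tunnel Service Instance": {
        "Peer1 Tunnel Interface",
        "Peer2 Tunnel Interface",
        "IP Connectivity",
    },
    "Peer1 Tunnel Interface": {"Peer1 Physical Interface"},
    "Peer1 Physical Interface": {"Peer1 Device"},
    "Peer2 Tunnel Interface": {"Peer2 Physical Interface"},
    "Peer2 Physical Interface": {"Peer2 Device"},
    "IP Connectivity": {"IS-IS Routing Protocol"},
}


def evaluation_order(graph):
    """Return the nodes with every dependency listed before its dependents.

    Raises ValueError if the graph contains a circular dependency.
    """
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # exc.args[1] holds the nodes forming the detected cycle.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc
```

A curation process could run such a check whenever a new subservice type or dependency is proposed, rejecting any change that makes the sort fail.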
<t> <t>
To be applied, SAIN requires a mechanism mapping a service instance to the configuration actually required on the devices for that service instance to run. To be applied, SAIN requires a mechanism mapping a service instance to the configuration actually required on the devices for that service instance to run.
          While the <xref target="figure_1"/> makes a distinction between the SAIN orchestrator and a different component providing the service instance configuration, in practice those two components are mostly likely combined. While <xref target="figure_1" format="default"/> makes a distinction between the SAIN orchestrator and a different component providing the service instance configuration, in practice those two components are most likely combined.
The internals of the orchestrator are out of scope of this document. The internals of the orchestrator are out of scope of this document.
</t> </t>
</section> </section>
<section anchor="subservices" numbered="true" toc="default">
<section anchor="subservices" title="Subservices"> <name>Subservices</name>
<t> <t>
A subservice corresponds to subpart or a feature of the network system A subservice corresponds to a subpart or a feature of the network syst
that is needed for a service instance to function properly. em that is needed for a service instance to function properly.
In the context of SAIN, a subservice is associated to its assurance, In the context of SAIN, a subservice is associated to its assurance,
that is the method for assuring that a subservice behaves correctly. which is the method for assuring that a subservice behaves correctly.
</t> </t>
<t> <t>
        Subservices, just as with services, have high-level parameters that specify the instance to be assured. Subservices, just as with services, have high-level parameters that specify the instance to be assured.
The needed parameters depend on the subservice type. The needed parameters depend on the subservice type.
For example, assuring a device requires a specific deviceId as param For example, assuring a device requires a specific deviceId as a par
eter. ameter and
For example, assuring an interface requires a specific combination o assuring an interface requires a specific combination of deviceId an
f deviceId and interfaceId. d interfaceId.
</t> </t>
<t> <t>
When designing a new type of subservice, one should carefully define When designing a new type of subservice, one should carefully define w
what is the assured object or functionality. hat is the assured object or functionality.
Then, the parameters must be chosen as a minimal set that completely Then, the parameters
identify the object (see examples from the previous paragraph). must be chosen as a minimal set that completely identifies the object
Parameters cannot change during the lifecycle of a subservice. (see examples from the previous paragraph).
For instance, an IP address is a good parameter when assuring a conn Parameters cannot change during the life cycle of a subservice.
ectivity towards that address (i.e. a given device can reach a given IP address) For instance, an IP address is a good parameter when assuring a conn
, however it’s not a good parameter to identify an interface as the IP address a ectivity towards that address (i.e., a given device can reach a given IP address
ssigned to that interface can be changed. ); however, it's not a good parameter to identify an interface, as the IP addres
s assigned to that interface can be changed.
</t> </t>
<t> <t>
A subservice is also characterized by a list of metrics to fetch and a list of operations to apply to these metrics in order to infer a health status. A subservice is also characterized by a list of metrics to fetch and a list of operations to apply to these metrics in order to infer a health status.
</t> </t>
</section> </section>
<section anchor="building_the_expression_graph_from_the_assurance_graph" n
<section anchor="building_the_expression_graph_from_the_assurance_graph" t umbered="true" toc="default">
itle="Building the Expression Graph from the Assurance Graph"> <name>Building the Expression Graph from the Assurance Graph</name>
<t> <t>
From the assurance graph is derived a so-called global computation gra From the assurance graph, a so-called global computation graph is deri
ph. ved.
First, each subservice instance is transformed into a set of subserv First, each subservice instance is transformed into a set of subserv
ice expressions that take metrics and constants as input (i.e., sources of the D ice expressions that take metrics and constants as input (i.e., sources of the D
AG) and produce the status of the subservice, based on some heuristics. AG) and produce the status of the subservice based on some heuristics.
          For instance, the health of an interface is 0 (minimal score) with the symptom "interface admin-down" if the interface is disabled in the configuration. For instance, the health of an interface is 0 (minimal score) with the symptom "interface admin-down" if the interface is disabled in the configuration.
          Then for each service instance, the service expressions are constructed by combining the subservice expressions of its dependencies. Then, for each service instance, the service expressions are constructed by combining the subservice expressions of its dependencies.
          The way service expressions are combined depends on the dependency types (impacting or informational). The way service expressions are combined depends on the dependency types (impacting or informational).
          Finally, the global computation graph is built by combining the service expressions, to get a global view of all subservices. Finally, the global computation graph is built by combining the service expressions to get a global view of all subservices.
          In other words, the global computation graph encodes all the operations needed to produce health statuses from the collected metrics. In other words, the global computation graph encodes all the operations needed to produce health statuses from the collected metrics.
</t> </t>
<t> <t>
The two types of dependencies for combining subservices are: The two types of dependencies for combining subservices are:
<list> </t>
<t> <dl newline="true" spacing="normal">
Informational Dependency: Type of dependency whose health score do <dt>Informational Dependency:</dt>
es not impact the health score of its parent subservice or service instance(s) i <dd>The type of dependency whose health score does not impact the healt
n the assurance graph. However, the symptoms should be taken into account in the h score of its parent subservice or service instance(s) in the assurance graph.
parent service instance or subservice instance(s), for informational reasons. However, the symptoms should be taken into account in the parent service instanc
</t> e or subservice instance(s) for informational reasons.</dd>
<t> <dt>Impacting Dependency:</dt>
Impacting Dependency: Type of dependency whose score impacts the s <dd>The type of dependency whose health score impacts the health score
core of its parent subservice or service instance(s) in the assurance graph. of its parent subservice or service instance(s) in the assurance graph.
The symptoms are taken into account in the parent service instance The symptoms are taken into account in the parent service instance
or subservice instance(s), as the impacting reasons. or subservice instance(s) as the impacting reasons.</dd>
</t> </dl>
</list> <t>
The set of dependency type presented here is not exhaustive. The set of dependency types presented here is not exhaustive.
More specific dependency types can be defined by extending the YANG mo More specific dependency types can be defined by extending the YANG mo
del. dule.
For instance, a connectivity subservice depending on several path subs For instance, a connectivity subservice depending on several path subs
ervices is only partially impacted if only one of these paths fails. ervices is partially impacted if only one of these paths fails.
Adding these new dependency types requires defining the corresponding operation for combining statuses of subservices. Adding these new dependency types requires defining the corresponding operation for combining statuses of subservices.
</t> </t>
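The difference between the two dependency types can be sketched as follows. This is a minimal illustration under stated assumptions: the 0 to 100 score scale, the use of `min` to combine impacting scores, and the `Status` and `combine` names are all hypothetical choices, since the actual combination operation is left to the implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Status:
    score: int  # assumed scale: 0 (minimal score) to 100 (healthy)
    symptoms: list = field(default_factory=list)


def combine(own, dependencies):
    """Combine a node's own status with the statuses of its dependencies.

    `dependencies` is a list of (kind, Status) pairs, where kind is either
    "impacting" or "informational".
    """
    score = own.score
    symptoms = list(own.symptoms)
    for kind, dep in dependencies:
        if kind == "impacting":
            # Assumption: an impacting dependency caps the parent's score.
            score = min(score, dep.score)
        elif kind != "informational":
            raise ValueError(f"unknown dependency type: {kind}")
        # Symptoms are carried to the parent for both dependency types.
        symptoms.extend(dep.symptoms)
    return Status(score, symptoms)
```

A more specific dependency type, such as the partially impacting path dependency mentioned above, would add another branch with its own combination rule.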
<t> <t>
        Subservices shall not be dependent on the protocol used to retrieve the metrics. Subservices shall not be dependent on the protocol used to retrieve the metrics.
          To justify this, let's consider the interface operational status. To justify this, let's consider the interface operational status.
          Depending on the device capabilities, this status can be collected by an industry-accepted YANG module (IETF, Openconfig <xref target="OpenConfig"/>), by a vendor-specific YANG module, or even by a MIB module. Depending on the device capabilities, this status can be collected by an industry-accepted YANG module (e.g., IETF or Openconfig <xref target="OpenConfig" format="default"/>), by a vendor-specific YANG module, or even by a MIB module.
          If the subservice was dependent on the mechanism to collect the operational status, then we would need multiple subservice definitions in order to support all different mechanisms. If the subservice was dependent on the mechanism to collect the operational status, then we would need multiple subservice definitions in order to support all different mechanisms.
          This also implies that, while waiting for all the metrics to be available via standard YANG modules, SAIN agents might have to retrieve metric values via non-standard YANG models, via MIB modules, Command Line Interface (CLI), etc., effectively implementing a normalization layer between data models and information models. This also implies that, while waiting for all the metrics to be available via standard YANG modules, SAIN agents might have to retrieve metric values via nonstandard YANG data models, MIB modules, the Command-Line Interface (CLI), etc., effectively implementing a normalization layer between data models and information models.
</t> </t>
<t> <t>
In order to keep subservices independent of metric collection method, In order to keep subservices independent of metric collection metho
or, expressed differently, to support multiple combinations of platforms, OSes, d
and even vendors, the architecture introduces the concept of "metric engine". (or, expressed differently, to support multiple combinations of
platforms, OSes, and even vendors), the architecture introduces the
concept of "metric engine".
        The metric engine maps each device-independent metric used in the subservices to a list of device-specific metric implementations that precisely define how to fetch values for that metric. The metric engine maps each device-independent metric used in the subservices to a list of device-specific metric implementations that precisely define how to fetch values for that metric.
        The mapping is parameterized by the characteristics (model, OS version, etc.) of the device from which the metrics are fetched. The mapping is parameterized by the characteristics (i.e., model, OS version, etc.) of the device from which the metrics are fetched.
This metric engine is included in the SAIN agent. This metric engine is included in the SAIN agent.
</t> </t>
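The metric engine described above can be pictured as a lookup table keyed by device characteristics. The sketch below is only an illustration: the metric name `interface-oper-status`, the `capability` key, and the `resolve` function are hypothetical; the two source strings refer to the ietf-interfaces YANG module and the IF-MIB, used here purely as opaque labels.

```python
# Device-independent metric -> ordered list of
# (required device characteristics, device-specific implementation).
METRIC_IMPLEMENTATIONS = {
    "interface-oper-status": [
        ({"capability": "ietf-interfaces"},
         "yang:/ietf-interfaces:interfaces/interface/oper-status"),
        ({"capability": "snmp"},
         "mib:IF-MIB::ifOperStatus"),
    ],
}


def resolve(metric, device):
    """Return the first implementation whose requirements the device meets.

    `device` is a dict of characteristics (model, OS version, and so on).
    Raises LookupError when no implementation matches.
    """
    for required, implementation in METRIC_IMPLEMENTATIONS.get(metric, []):
        if all(device.get(key) == value for key, value in required.items()):
            return implementation
    raise LookupError(f"no implementation of {metric!r} for {device!r}")
```

Because the subservice only names `interface-oper-status`, supporting a new platform means adding an entry to the table rather than defining a new subservice.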
</section> </section>
<section anchor="open_interfaces_with_YANG_modules" numbered="true" toc="d
<section anchor="open_interfaces_with_YANG_modules" title="Open Interfaces efault">
with YANG Modules"> <name>Open Interfaces with YANG Modules</name>
<t> <t>
          The interfaces between the architecture components are open thanks to the YANG modules specified in <xref target="I-D.ietf-opsawg-service-assurance-yang"/>; The interfaces between the architecture components are open thanks to the YANG modules specified in <xref target="RFC9418" format="default"/>;
          they specify objects for assuring network services based on their decomposition into so-called subservices, according to the SAIN architecture. they specify objects for assuring network services based on their decomposition into so-called subservices, according to the SAIN architecture.
</t> </t>
<t> <t>
These modules are intended for the following use cases: These modules are intended for the following use cases:
<list style="symbols"> </t>
<ul spacing="normal">
<li>
<t> <t>
Assurance graph configuration: Assurance graph configuration:
<list style="symbols">
<t>
Subservices: configure a set of subservices to assure, by spec
ifying their types and parameters.
</t>
<t>
Dependencies: configure the dependencies between the subservic
es, along with their types.
</t>
</list>
</t> </t>
<t> <ul spacing="normal">
Assurance telemetry: export the health status of the subservices, <li>
along with the observed symptoms. Subservices: Configure a set of subservices to assure by speci
</t> fying their types and parameters.
</list> </li>
</t> <li>
Dependencies: Configure the dependencies between the subservic
es, along with their types.
</li>
</ul>
</li>
<li>
Assurance telemetry: Export the health status of the subservices,
along with the observed symptoms.
</li>
</ul>
<t> <t>
        Some examples of YANG instances can be found in Appendix A of <xref target="I-D.ietf-opsawg-service-assurance-yang"/>. Some examples of YANG instances can be found in <xref target="RFC9418" format="default" sectionFormat="of" section="A"/>.
</t> </t>
</section> </section>
<section anchor="maintenance" numbered="true" toc="default">
<section anchor="maintenance" title="Handling Maintenance Windows"> <name>Handling Maintenance Windows</name>
<t> <t>
            Whenever network components are under maintenance, the operator wants to inhibit the emission of symptoms from those components. Whenever network components are under maintenance, the operator wants to inhibit the emission of symptoms from those components.
            A typical use case is device maintenance, during which the device is not supposed to be operational. A typical use case is device maintenance, during which the device is not supposed to be operational.
            As such, symptoms related to the device health should be ignored. As such, symptoms related to the device health should be ignored.
            Symptoms related to the device-specific subservices, such as the interfaces, might also be ignored because their state changes are probably the consequence of the maintenance. Symptoms related to the device-specific subservices, such as the interfaces, might also be ignored because their state changes are probably the consequence of the maintenance.
</t> </t>
<t> <t>
The ietf-service-assurance model proposed in <xref target="I-D.iet The ietf-service-assurance model described in <xref target="RFC941
f-opsawg-service-assurance-yang"/> enables flagging subservices as under mainten 8" format="default"/> enables flagging subservices as under maintenance and, in
ance, and, in that case, requires a string that identifies the person or process that case, requires a string that identifies the person or process that requeste
who requested the maintenance. d the maintenance.
When a service or subservice is flagged as under maintenance, it m When a service or subservice is flagged as under maintenance, it m
ust report a generic "Under Maintenance" symptom, for propagation towards subser ust report a generic "Under Maintenance" symptom for propagation towards subserv
vices that depend on this specific subservice. ices that depend on this specific subservice.
Any other symptom from this service, or by one of its impacting de Any other symptom from this service or by one of its impacting dep
pendencies must not be reported. endencies must not be reported.
</t> </t>
<t> <t>
We illustrate this mechanism on three independent examples based on We illustrate this mechanism on three independent examples based on
the assurance graph depicted in <xref target="figure_2"/>: the assurance graph depicted in <xref target="figure_2" format="default"/>:
<list style="symbols"> </t>
<t> Device maintenance, for instance upgrading the device OS. Th <ul spacing="normal">
e operator <li> Device maintenance, for instance, upgrading the device OS. The o
perator
flags the subservice "Peer1" device as under maintenance. flags the subservice "Peer1" device as under maintenance.
This inhibits the emission of symptoms, except "Under Maintenan This inhibits the emission of symptoms, except "Under Maintenan
ce", from "Peer1 ce" from "Peer1
Physical Interface", "Peer1 Tunnel Interface" and "Tunnel Servi Physical Interface", "Peer1 Tunnel Interface", and "Tunnel Serv
ce ice
Instance". All other subservices are unaffected. Instance". All other subservices are unaffected.
</t> </li>
<t> <li>
Interface maintenance, for instance replacing a broken optic. Interface maintenance, for instance, replacing a broken optic.
The operator flags the subservice "Peer1 Physical Interface" as under maintenance. The operator flags the subservice "Peer1 Physical Interface" as under maintenance.
               This inhibits the emission of symptoms, except "Under Maintenance" This inhibits the emission of symptoms, except "Under Maintenance"
               from "Peer1 Tunnel Interface" and "Tunnel Service Instance". All from "Peer1 Tunnel Interface" and "Tunnel Service Instance". All
other subservices are unaffected. other subservices are unaffected.
</t> </li>
<t> <li>
Routing protocol maintenance, for instance modifying parameters Routing protocol maintenance, for instance, modifying parameter
or s or
redistribution. The operator marks the subservice "IS-IS Routing Protocol" as under maintenance. redistribution. The operator marks the subservice "IS-IS Routing Protocol" as under maintenance.
             This inhibits the emission of symptoms, except "Under Maintenance", from "IP connectivity" and "Tunnel Service Instance". This inhibits the emission of symptoms, except "Under Maintenance" from "IP connectivity" and "Tunnel Service Instance".
All other subservices are unaffected. All other subservices are unaffected.
</t> </li>
</list> </ul>
</t> <t>
<t>
            In each example above, the subservice under maintenance is completely impacting the service instance, putting it under maintenance as well. In each example above, the subservice under maintenance is completely impacting the service instance, putting it under maintenance as well.
            There are use cases where the subservice under maintenance only partially impacts the service instance. There are use cases where the subservice under maintenance only partially impacts the service instance.
            For instance, consider a service instance supported by both a primary and backup path. For instance, consider a service instance supported by both a primary and backup path.
            If a subservice impacting the primary path is under maintenance, the service instance might still be functional but degraded. If a subservice impacting the primary path is under maintenance, the service instance might still be functional but degraded.
            In that case, the status of the service instance might include "Primary path Under Maintenance", "No redundancy" as well as other symptoms from the backup path to explain the lower health score. In that case, the status of the service instance might include "Primary path Under Maintenance", "No redundancy", as well as other symptoms from the backup path to explain the lower health score.
            In general, the computation of the service instance status from the subservices is done in the SAIN collector whose implementation is out of scope for this document. In general, the computation of the service instance status from the subservices is done in the SAIN collector whose implementation is out of scope for this document.
</t> </t>
<t> <t>
The maintenance of a subservice might modify or hide modifications of the structure of the assurance graph. The maintenance of a subservice might modify or hide modifications of the structure of the assurance graph.
            Therefore, unflagging a subservice as under maintenance should trigger an update of the assurance graph. Therefore, unflagging a subservice as under maintenance should trigger an update of the assurance graph.
</t> </t>
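The three fully impacting examples above amount to walking the assurance graph upward from the flagged subservice. The following sketch assumes a hypothetical `DEPENDS_ON` table whose edges are inferred from those examples, and a hypothetical `inhibited_by` helper; it covers only the fully impacting case, not the partial one.

```python
# Hypothetical impacting dependencies, inferred from the examples in the text.
DEPENDS_ON = {
    "Tunnel Service Instance": [
        "Peer1 Tunnel Interface",
        "Peer2 Tunnel Interface",
        "IP Connectivity",
    ],
    "Peer1 Tunnel Interface": ["Peer1 Physical Interface"],
    "Peer1 Physical Interface": ["Peer1 Device"],
    "Peer2 Tunnel Interface": ["Peer2 Physical Interface"],
    "Peer2 Physical Interface": ["Peer2 Device"],
    "IP Connectivity": ["IS-IS Routing Protocol"],
}


def inhibited_by(flagged, depends_on=DEPENDS_ON):
    """Return the nodes that transitively depend on the flagged subservice.

    These nodes would report only the "Under Maintenance" symptom.
    """
    inhibited = set()
    changed = True
    while changed:  # repeat until no new dependent is found
        changed = False
        for node, deps in depends_on.items():
            if node == flagged or node in inhibited:
                continue
            if any(dep == flagged or dep in inhibited for dep in deps):
                inhibited.add(node)
                changed = True
    return inhibited
```

For example, `inhibited_by("IS-IS Routing Protocol")` yields the connectivity subservice and the tunnel service instance, leaving every Peer1 and Peer2 subservice unaffected.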
</section> </section>
<section anchor="flexible_architecture" numbered="true" toc="default">
<section anchor="flexible_architecture" title="Flexible Functional Archite <name>Flexible Functional Architecture</name>
cture">
<t> <t>
The SAIN architecture is flexible in terms of components. While the The SAIN architecture is flexible in terms of components. While the
SAIN architecture in <xref target="figure_1"/> makes a distinction bet SAIN architecture in <xref target="figure_1" format="default"/> makes
ween two components, a distinction between two components,
the service orchestrator and the SAIN orchestrator, in practice thos the service orchestrator and the SAIN orchestrator, in practice the
e two components are mostly likely combined. two components are most likely combined.
Similarly, the SAIN agents are displayed in <xref target="figure_1"/> Similarly, the SAIN agents are displayed in <xref target="figure_1" fo
as being separate components. Practically, the SAIN agents could be either indep rmat="default"/> as being separate components. In practice, the SAIN agents coul
endent d be either independent
components or directly integrated in monitored entities. components or directly integrated in monitored entities.
A practical example is an agent in a router. A practical example is an agent in a router.
</t> </t>
<t> <t>
          The SAIN architecture is also flexible in terms of services and subservices. The SAIN architecture is also flexible in terms of services and subservices.
In the proposed architecture, the SAIN orchestrator is coupled to a In the defined architecture, the SAIN orchestrator is coupled to a s
service orchestrator which defines the kinds of services that the architecture h ervice orchestrator, which defines the kinds of services that the architecture h
andles. andles.
Most examples in this document deal with the notion of Network Servi Most examples in this document deal with the notion of Network Servi
ce YANG modules, with well-known services such as L2VPN or tunnels. ce YANG Modules with well-known services, such as L2VPN or tunnels.
          However, the concept of services is general enough to cross into different domains. However, the concept of services is general enough to cross into different domains.
          One of them is the domain of service management on network elements, which also require their own assurance. One of them is the domain of service management on network elements, which also require their own assurance.
          Examples include a DHCP server on a Linux server, a data plane, an IPFIX export, etc. Examples include a DHCP server on a Linux server, a data plane, an IPFIX export, etc.
The notion of "service" is generic in this architecture and depends on the service orchestrator and underlying network system, as illustrated by the following examples: The notion of "service" is generic in this architecture and depends on the service orchestrator and underlying network system, as illustrated by the following examples:
<list style="symbols"> </t>
<t>if a main service orchestrator coordinates several lower leve <ul spacing="normal">
l controllers, a service for the controller can be a subservice from the point o <li>If a main service orchestrator coordinates several lower-level con
f view of the orchestrator.</t> trollers, a service for the controller can be a subservice from the point of vie
<t>A DHCP server/data plane/IPFIX export can be considered as su w of the orchestrator.</li>
bservices for a device.</t> <li>A DHCP server / data plane / IPFIX export can be considered subser
<t>A routing instance can be considered as a subservice for a L3 vices for a device.</li>
VPN.</t> <li>A routing instance can be considered a subservice for an L3VPN.</l
<t>A tunnel can be considered as a subservice for an application i>
in the cloud.</t> <li>A tunnel can be considered a subservice for an application in the
<t>A service function can be considered as a subservice for a se cloud.</li>
rvice function chain <xref target="RFC7665"/>.</t> <li>A service function can be considered a subservice for a service fu
</list> nction chain <xref target="RFC7665" format="default"/>.</li>
</ul>
<t>
          The assurance graph is created to be flexible and open, regardless of the subservice types, locations, or domains. The assurance graph is created to be flexible and open, regardless of the subservice types, locations, or domains.
</t> </t>
<t> <t>
The SAIN architecture is also flexible in terms of distributed graphs. The SAIN architecture is also flexible in terms of distributed graphs.
        As shown in <xref target="figure_1"/>, the architecture comprises several agents. As shown in <xref target="figure_1" format="default"/>, the architecture comprises several agents.
        Each agent is responsible for handling a subgraph of the assurance graph. Each agent is responsible for handling a subgraph of the assurance graph.
The collector is responsible for fetching the sub-graphs from the diff The collector is responsible for fetching the subgraphs from the diffe
erent rent
agents and gluing them together. As an example, in the graph from <x agents and gluing them together. As an example, in the graph from <x
ref target="figure_2"/>, the subservices relative to Peer 1 might be handled by ref target="figure_2" format="default"/>, the subservices relative to Peer 1 mig
a ht be handled by a
different agent than the subservices relative to Peer 2 and the Connec different agent than the subservices relative to Peer 2, and the Conne
tivity ctivity
        and IS-IS subservices might be handled by yet another agent. The agents will and IS-IS subservices might be handled by yet another agent. The agents will
        export their partial graph and the collector will stitch them together as export their partial graph, and the collector will stitch them together as
dependencies of the service instance. dependencies of the service instance.
</t> </t>
<t> <t>
        And finally, the SAIN architecture is flexible in terms of what it monitors. And finally, the SAIN architecture is flexible in terms of what it monitors.
        Most, if not all examples, in this document refer to physical components, but Most, if not all, examples in this document refer to physical components, but
        this is not a constraint. Indeed, the assurance of virtual components would this is not a constraint. Indeed, the assurance of virtual components would
        follow the same principles and an assurance graph composed of virtualized follow the same principles, and an assurance graph composed of virtualized
components (or a mix of virtualized and physical ones) is supported by components (or a mix of virtualized and physical ones) is supported by
this architecture. this architecture.
</t> </t>
</section> </section>
<section anchor="garbage_collection" numbered="true" toc="default">
<section anchor="garbage_collection" title="Time window for symptoms histo <name>Time Window for Symptoms' History</name>
ry"> <t>
<t>
The health status reported via the YANG modules contains, for each subservice, the list of symptoms. The health status reported via the YANG modules contains, for each subservice, the list of symptoms.
            Symptoms have a start and end date, making it possible to report symptoms that are no longer occurring. Symptoms have a start and end date, making it possible to report symptoms that are no longer occurring.
</t> </t>
<t> <t>
The SAIN agent might have to remove some symptoms for specific subserv The SAIN agent might have to remove some symptoms for specific subserv
ice symptoms, because ice symptoms because
there are outdated and not relevant any longer, or simply because the they are outdated and no longer relevant or simply because the SAIN ag
SAIN agent needs to ent needs to
        free up some space. Regardless of the reason, it's important for a SAIN collector free up some space. Regardless of the reason, it's important for a SAIN collector
        (re-)connecting to a SAIN agent to understand the effect of this garbage collection. connecting/reconnecting to a SAIN agent to understand the effect of this garbage collection.
</t> </t>
<t> <t>
Therefore, the SAIN agent contains a YANG object specifying the date and time at which Therefore, the SAIN agent contains a YANG object specifying the date and time at which
the symptoms' history starts for the subservice instances. the symptoms' history starts for the subservice instances.
The subservice reports only symptoms that are occurring or that have been occurring after the history start date. The subservice reports only symptoms that are occurring or that have been occurring after the history start date.
</t> </t>
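The reporting rule above can be sketched as a simple filter. The `Symptom` tuple and the `reportable` function are hypothetical names for illustration, and treating "occurring after the history start date" as "ended strictly after it" is one possible reading of the text, not a normative one.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Symptom(NamedTuple):
    description: str
    start: datetime
    end: Optional[datetime]  # None while the symptom is still occurring


def reportable(symptoms, history_start):
    """Keep symptoms that are ongoing or that ended after history_start."""
    return [
        s for s in symptoms
        if s.end is None or s.end > history_start
    ]
```

A collector that reconnects can compare the advertised history start with the last time it synchronized to decide whether some symptoms may have been garbage collected in between.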
</section> </section>
<section anchor="new_assurance_graph_generation" title="New Assurance Graph Generation"> <section anchor="new_assurance_graph_generation" numbered="true" toc="default">
 <name>New Assurance Graph Generation</name>
<t> <t>
The assurance graph will change over time, because services and subservices come and go (changing the dependencies between subservices), or as a result of resolving maintenance issues. Therefore, an assurance graph version must be maintained, along with the date and time of its last generation. The date and time of a particular subservice instance (again dependencies or under maintenance) might be kept. From a client point of view, an assurance graph change is triggered by the value of the assurance-graph-version and assurance-graph-last-change YANG leaves. At that point in time, the client (collector) follows the following process: The assurance graph will change over time, because services and subservices come and go (changing the dependencies between subservices) or as a result of resolving maintenance issues. Therefore, an assurance graph version must be maintained, along with the date and time of its last generation. The date and time of a particular subservice instance (again dependencies or under maintenance) might be kept. From a client point of view, an assurance graph change is triggered by the value of the assurance-graph-version and assurance-graph-last-change YANG leaves. At that point in time, the client (collector) follows the following process:
<list style="symbols">
<t>
Keep the previous assurance-graph-last-change value (let's call it time T)
</t>
<t>
Run through all subservice instances and process the subservice instances for which the last-change is newer that the time T
</t>
<t>
Keep the new assurance-graph-last-change as the new referenced date and time
</t>
</list>
</t> </t>
<ul spacing="normal">
<li>
Keep the previous assurance-graph-last-change value (let's call it time T).
</li>
<li>
Run through all the subservice instances and process the subservice instances for which the last-change is newer than the time T.
</li>
<li>
Keep the new assurance-graph-last-change as the new referenced date and time.
</li>
</ul>
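<t>
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the Collector class, its attribute names, and the dictionary of subservice instances are hypothetical; only the notion of an assurance-graph-last-change value and a per-instance last-change value comes from the text.
</t>

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Collector:
    # Hypothetical collector state, not part of the SAIN YANG modules.
    last_seen_change: datetime
    processed: list = field(default_factory=list)

    def on_graph_change(self, graph_last_change, subservices):
        """subservices maps a subservice instance id to its last-change time."""
        # Step 1: keep the previous assurance-graph-last-change value as T.
        t = self.last_seen_change
        # Step 2: process only the instances whose last-change is newer than T.
        for instance_id, last_change in subservices.items():
            if last_change > t:
                self.processed.append(instance_id)
        # Step 3: the new value becomes the reference date and time.
        self.last_seen_change = graph_last_change


utc = timezone.utc
collector = Collector(last_seen_change=datetime(2023, 6, 1, tzinfo=utc))
collector.on_graph_change(
    graph_last_change=datetime(2023, 6, 3, tzinfo=utc),
    subservices={
        "interface-1": datetime(2023, 5, 30, tzinfo=utc),  # unchanged since T
        "tunnel-7": datetime(2023, 6, 3, tzinfo=utc),      # changed after T
    },
)
```

<t>
In this sketch, only tunnel-7 is processed, and the next graph change is compared against the new reference time rather than the original one.
</t>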
</section> </section>
</section> </section>
<section anchor="iana" numbered="true" toc="default">
<section anchor="security" title="Security Considerations"> <name>IANA Considerations</name>
<t> <t>This document has no IANA actions.
The SAIN architecture helps operators to reduce the mean time to detect </t>
and mean time to repair. </section>
However, the SAIN agents must be secured: a compromised SAIN agent may <section anchor="security" numbered="true" toc="default">
be sending wrong root causes or symptoms to the management systems. <name>Security Considerations</name>
<t>The SAIN architecture helps operators to reduce the mean time to detect
and the mean time to repair.
However, the SAIN agents must be secured; a compromised SAIN agent may
be sending incorrect root causes or symptoms to the management systems.
Securing the agents falls back to ensuring the integrity and confidentiality of the assurance graph. Securing the agents falls back to ensuring the integrity and confidentiality of the assurance graph.
This can be partially achieved by correctly setting permissions of each node in the YANG model as described in Section 6 of <xref target="I-D.ietf-opsawg-service-assurance-yang"/>. This can be partially achieved by correctly setting permissions of each node in the YANG data model, as described in <xref target="RFC9418" format="default" sectionFormat="of" section="6"/>.
</t> </t>
<t> <t>
Except for the configuration of telemetry, the agents do not need "write access" to the devices they monitor. Except for the configuration of telemetry, the agents do not need "write access" to the devices they monitor.
This configuration is applied with a YANG module, whose protection is covered by Secure Shell (SSH) <xref target="RFC6242"/> for NETCONF or TLS <xref target="RFC8446"/> for RESTCONF. This configuration is applied with a YANG module, whose protection is covered by Secure Shell (SSH) <xref target="RFC6242" format="default"/> for the Network Configuration Protocol (NETCONF) or TLS <xref target="RFC8446" format="default"/> for RESTCONF.
Devices should be configured so that agents have their own credentials with write access only for the YANG nodes configuring the telemetry. Devices should be configured so that agents have their own credentials with write access only for the YANG nodes configuring the telemetry.
</t> </t>
<t> <t>
The data collected by SAIN could potentially be compromising to the network or provide more insight into how the network is designed. The data collected by SAIN could potentially be compromising to the network or provide more insight into how the network is designed.
Considering the data that SAIN requires (including CLI access in some cases), one should weigh data access concerns with the impact that reduced visibility will have on being able to rapidly identify root causes. Considering the data that SAIN requires (including CLI access in some cases), one should weigh data access concerns with the impact that reduced visibility will have on being able to rapidly identify root causes.
</t> </t>
<t> <t>
For building the assurance graph, the SAIN orchestrator needs to obtain the configuration from the service orchestrator. For building the assurance graph, the SAIN orchestrator needs to obtain the configuration from the service orchestrator.
The latter should restrict access of the SAIN orchestrator to information needed to build the assurance graph. The latter should restrict access of the SAIN orchestrator to information needed to build the assurance graph.
</t> </t>
<t> <t>
If a closed loop system relies on this architecture then the well known issue of those systems also applies, i.e., a lying device or compromised agent could trigger partial reconfiguration of the service or network. If a closed loop system relies on this architecture, then the well-known issue of those systems also applies, i.e., a lying device or compromised agent could trigger partial reconfiguration of the service or network.
The SAIN architecture neither augments nor reduces this risk. The SAIN architecture neither augments nor reduces this risk.
An extension of SAIN, out of scope for this document, could detect discrepancies between symptoms reported by different agents and thus detect anomalies if an agent or a device is lying. An extension of SAIN, which is out of scope for this document, could detect discrepancies between symptoms reported by different agents, and thus detect anomalies if an agent or a device is lying.
</t> </t>
<t> <t>
If NTP service goes down, the devices' clocks might lose their synchronization. If NTP service goes down, the devices' clocks might lose their synchronization.
In that case, correlating information from different devices, such as detecting symptoms about a link or correlating symptoms from different devices, will give inaccurate results. In that case, correlating information from different devices, such as detecting symptoms about a link or correlating symptoms from different devices, will give inaccurate results.
</t> </t>
</section> </section>
</middle>
<back>
<references>
<name>References</name>
<references>
<name>Normative References</name>
<section anchor="iana" title="IANA Considerations"> <!-- [I-D.ietf-opsawg-service-assurance-yang] RFC 9418 -->
<t>
This document includes no request to IANA.
</t>
</section>
<section title="Contributors"> <reference anchor='RFC9418' target='https://www.rfc-editor.org/info/rfc9418'>
<t> <front>
<list style="symbols"> <title>YANG Modules for Service Assurance</title>
<t>Youssef El Fathi</t> <author initials="B." surname="Claise" fullname="Benoit Claise">
<t>Eric Vyncke</t> </author>
</list> <author initials="J." surname="Quilbeuf" fullname="Jean Quilbeuf">
</t> </author>
</section> <author initials="P." surname="Lucente" fullname="Paolo Lucente">
</author>
<author initials="P." surname="Fasano" fullname="Paolo Fasano">
</author>
<author initials="T." surname="Arumugam" fullname="Thangam Arumugam">
</author>
<date month="June" year="2023"/>
</front>
<seriesInfo name="RFC" value="9418"/>
<seriesInfo name="DOI" value="10.17487/RFC9418"/>
</reference>
</middle> <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8309.xml"/>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8969.xml"/>
 </references>
 <references>
 <name>Informative References</name>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2865.xml"/>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5424.xml"/>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5905.xml"/>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.6242.xml"/>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7011.xml"/>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7149.xml"/>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7665.xml"/>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7950.xml"/>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8199.xml"/>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8446.xml"/>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8466.xml"/>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8641.xml"/>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8907.xml"/>
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9315.xml"/>
 <!-- [I-D.ietf-opsawg-yang-vpn-service-pm] RFC 9375 -->
 <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9375.xml"/>
<!-- <?rfc include="reference.I-D.irtf-nmrg-ibn-intent-classification"
?> -->
<back>
<references title="Normative References">
<?rfc include="reference.I-D.ietf-opsawg-service-assurance-yang"?>
<?rfc include="reference.RFC.8309"?>
<?rfc include="reference.RFC.8969"?>
</references>
<references title="Informative References">
<?rfc include='reference.RFC.2865'?>
<?rfc include='reference.RFC.5424'?>
<?rfc include='reference.RFC.5905'?>
<?rfc include='reference.RFC.6242'?>
<?rfc include="reference.RFC.7011"?>
<?rfc include="reference.RFC.7149"?>
<?rfc include="reference.RFC.7665"?>
<?rfc include='reference.RFC.7950'?>
<?rfc include="reference.RFC.8199"?>
<?rfc include="reference.RFC.8446"?>
<?rfc include="reference.RFC.8466"?>
<?rfc include="reference.RFC.8641"?>
<?rfc include="reference.RFC.8907"?>
<?rfc include="reference.RFC.9315"?>
<?rfc include="reference.I-D.ietf-opsawg-yang-vpn-service-pm"?>
<!-- <?rfc include="reference.I-D.irtf-nmrg-ibn-intent-classification"?>
-->
<reference anchor="Piovesan2017" target="https://doi.org/10.1016/B978-0-12-803773-7.00007-3"> <reference anchor="Piovesan2017" target="https://doi.org/10.1016/B978-0-12-803773-7.00007-3">
<front> <front>
<title>Reasoning About Safety and Security: The Logic of Assurance</title> <title>7 - Reasoning About Safety and Security: The Logic of Assurance</title>
<author initials="A." surname="Piovesan" fullname="A. Piovesan"><organization/></author> <author initials="A." surname="Piovesan" fullname="A. Piovesan">
<author initials="E." surname="Griffor" fullname="E. Griffor"><organization/></author> <organization/>
 </author>
 <author initials="E." surname="Griffor" fullname="E. Griffor">
<date year="2017" /> <organization/>
</front> </author>
</reference> <date year="2017"/>
<reference anchor="OpenConfig" target="https://openconfig.net"> </front>
<front> <seriesInfo name="DOI" value="10.1016/B978-0-12-803773-7.00007-3"/>
<title>OpenConfig</title>
<author/>
<date/>
</front>
</reference> </reference>
<reference anchor="OpenConfig" target="https://openconfig.net">
<front>
<title>OpenConfig</title>
<author/>
</front>
</reference>
</references>
</references> </references>
<?rfc needLines="100"?> <section numbered="false" toc="default">
<name>Acknowledgements</name>
<section title="Changes between revisions"> <t>
<t>[[RFC editor: please remove this section before publication.]]</t> The authors would like to thank <contact fullname="Stephane Litkowski"/>, <contact fullname="Charles Eckel"/>, <contact fullname="Rob Wilton"/>, <contact fullname="Vladimir Vassiliev"/>, <contact fullname="Gustavo Alburquerque"/>, <contact fullname="Stefan Vallin"/>, <contact fullname="Éric Vyncke"/>, <contact fullname="Mohamed Boucadair"/>, <contact fullname="Dhruv Dhody"/>, <contact fullname="Michael Richardson"/>, and <contact fullname="Rob Wilton"/> for their reviews and feedback.
<t>v12 - 13
<list style="symbols">
<t> Addressing IESG telechat feedback</t>
</list>
</t>
<t>v11 - 12
<list style="symbols">
<t> Addressing comments from Last call</t>
</list>
</t>
<t>v10 - v11
<list style="symbols">
<t>Adding reference to example of network performance model</t>
</list>
</t>
<t>v09 - v10
<list style="symbols">
<t>Addressing comments from Rob Wilton</t>
</list>
</t>
<t>v08 - v09
<list style="symbols">
<t>Addressing comments from Michael Richardson</t>
</list>
</t>
<t>v07 - v08
<list style="symbols">
<t>Propagating removal of under-maintenance flag from the YANG module </t>
</list>
</t>
<t>v06-07
<list>
<t>Addressing comments from Dhruv Dhody and applying pending changes</t>
</list>
</t>
<t>v03 - v04
<list style="symbols">
<t>Address comments from Mohamed Boucadair</t>
</list>
</t>
<t>v00 - v01
<list style="symbols">
<t>Cover the feedback received during the WG call for adoption</t>
</list>
</t> </t>
</section> </section>
<section numbered="false" toc="default">
<name>Contributors</name>
<ul spacing="normal">
<li><t><contact fullname="Youssef El Fathi"/></t></li>
<li><t><contact fullname="Éric Vyncke"/></t></li>
</ul>
</section>
<section title="Acknowledgements" numbered="no"> <!--[rfced] Terminology questions
<t>
The authors would like to thank Stephane Litkowski, Charles Eckel, Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, Stefan Vallin, Eric Vyncke, Mohamed Boucadair, Dhruv Dhody, Michael Richardson and Rob Wilton for their reviews and feedback. c) We have received guidance from the YANG Doctors
 that "YANG module" and "YANG data model" are preferred.
 Some occurrences may need an update, for example:
</t> Original:
</section> The use of this YANG model is further
explained in Section 3.5.
Where Section 3.5 is "Open Interfaces with YANG Modules.”
Please review and specify any needed updates.
-->
</back> </back>
</rfc> </rfc>
<!-- Local Variables: --> <!-- Local Variables: -->
<!-- fill-column:72 --> <!-- fill-column:72 -->
<!-- End: --> <!-- End: -->