| rfc9696v1.txt | rfc9696.txt | |||
|---|---|---|---|---|
| Internet Engineering Task Force (IETF) Y. Wei, Ed. | Internet Engineering Task Force (IETF) Y. Wei, Ed. | |||
| Request for Comments: 9696 Z. Zhang | Request for Comments: 9696 Z. Zhang | |||
| Category: Informational ZTE Corporation | Category: Informational ZTE Corporation | |||
| ISSN: 2070-1721 D. Afanasiev | ISSN: 2070-1721 D. Afanasiev | |||
| Yandex | Yandex | |||
| P. Thubert | P. Thubert | |||
| Cisco Systems | Individual | |||
| T. Przygienda | T. Przygienda | |||
| Juniper Networks | Juniper Networks | |||
| December 2024 | March 2025 | |||
| Routing in Fat Trees (RIFT) Applicability and Operational Considerations | Routing in Fat Trees (RIFT) Applicability and Operational Considerations | |||
| Abstract | Abstract | |||
| This document discusses the properties, applicability, and | This document discusses the properties, applicability, and | |||
| operational considerations of Routing in Fat Trees (RIFT) in | operational considerations of Routing in Fat Trees (RIFT) in | |||
| different network scenarios with the intention of providing a rough | different network scenarios with the intention of providing a rough | |||
| guide on how RIFT can be deployed to simplify routing operations in | guide on how RIFT can be deployed to simplify routing operations in | |||
| Clos topologies and their variations. | Clos topologies and their variations. | |||
| skipping to change at line 41 | skipping to change at line 41 | |||
| Internet Engineering Steering Group (IESG). Not all documents | Internet Engineering Steering Group (IESG). Not all documents | |||
| approved by the IESG are candidates for any level of Internet | approved by the IESG are candidates for any level of Internet | |||
| Standard; see Section 2 of RFC 7841. | Standard; see Section 2 of RFC 7841. | |||
| Information about the current status of this document, any errata, | Information about the current status of this document, any errata, | |||
| and how to provide feedback on it may be obtained at | and how to provide feedback on it may be obtained at | |||
| https://www.rfc-editor.org/info/rfc9696. | https://www.rfc-editor.org/info/rfc9696. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2024 IETF Trust and the persons identified as the | Copyright (c) 2025 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (https://trustee.ietf.org/license-info) in effect on the date of | (https://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
| to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
| include Revised BSD License text as described in Section 4.e of the | include Revised BSD License text as described in Section 4.e of the | |||
| Trust Legal Provisions and are provided without warranty as described | Trust Legal Provisions and are provided without warranty as described | |||
| skipping to change at line 112 | skipping to change at line 112 | |||
| 8.2. Informative References | 8.2. Informative References | |||
| Acknowledgments | Acknowledgments | |||
| Contributors | Contributors | |||
| Authors' Addresses | Authors' Addresses | |||
| 1. Introduction | 1. Introduction | |||
| This document discusses the properties and applicability of "RIFT: | This document discusses the properties and applicability of "RIFT: | |||
| Routing in Fat Trees" [RFC9692] in different deployment scenarios and | Routing in Fat Trees" [RFC9692] in different deployment scenarios and | |||
| highlights the operational simplicity of the technology compared to | highlights the operational simplicity of the technology compared to | |||
| traditional routing solutions. It also documents special | classical routing solutions. It also documents special | |||
| considerations when RIFT is used with or without overlays and/or | considerations when RIFT is used with or without overlays and/or | |||
| controllers and how RIFT identifies miscablings and reroutes around | controllers and how RIFT identifies miscablings and reroutes around | |||
| node and link failures. | node and link failures. | |||
| 2. Terminology | 2. Terminology | |||
| This document uses the terminology defined in [RFC9692]. The most | This document uses the terminology defined in [RFC9692]. The most | |||
| frequently used terms and their definitions from that document are | frequently used terms and their definitions from that document are | |||
| listed here. | listed here. | |||
| skipping to change at line 138 | skipping to change at line 138 | |||
| 2-leaf shortcuts and multiple level shortcuts are possible and | 2-leaf shortcuts and multiple level shortcuts are possible and | |||
| described further in the document. | described further in the document. | |||
| Crossbar: | Crossbar: | |||
| Physical arrangement of ports in a switching matrix without | Physical arrangement of ports in a switching matrix without | |||
| implying any further scheduling or buffering disciplines. | implying any further scheduling or buffering disciplines. | |||
| Directed Acyclic Graph (DAG): | Directed Acyclic Graph (DAG): | |||
| A finite directed graph with no directed cycles (loops). If links | A finite directed graph with no directed cycles (loops). If links | |||
| in a Clos are considered as either being all directed towards the | in a Clos are considered as either being all directed towards the | |||
| top or vice versa, each of two such graphs is a DAG. | top or bottom, each of such two graphs is a DAG. | |||
| Disaggregation: | Disaggregation: | |||
| The process in which a node decides to advertise more specific | The process in which a node decides to advertise more specific | |||
| prefixes southwards, either positively to attract the | prefixes southwards, either positively to attract the | |||
| corresponding traffic or negatively to repel it. Disaggregation | corresponding traffic or negatively to repel it. Disaggregation | |||
| is performed to prevent traffic loss and suboptimal routing to the | is performed to prevent traffic loss and suboptimal routing to the | |||
| more specific prefixes. | more specific prefixes. | |||
| Leaf: | Leaf: | |||
| A node without southbound adjacencies. Level 0 implies a leaf in | A node without southbound adjacencies. Level 0 implies a leaf in | |||
| skipping to change at line 181 | skipping to change at line 181 | |||
| as links and address prefixes. A TIE always has a direction and a | as links and address prefixes. A TIE always has a direction and a | |||
| type. North TIEs (sometimes abbreviated as N-TIEs) are used when | type. North TIEs (sometimes abbreviated as N-TIEs) are used when | |||
| dealing with TIEs in the northbound representation, and South-TIEs | dealing with TIEs in the northbound representation, and South-TIEs | |||
| (sometimes abbreviated as S-TIEs) are used for the southbound | (sometimes abbreviated as S-TIEs) are used for the southbound | |||
| equivalent. TIEs have different types, such as node and prefix | equivalent. TIEs have different types, such as node and prefix | |||
| TIEs. | TIEs. | |||
| 3. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks | 3. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks | |||
| Clos [CLOS] topologies (commonly called a Fat Tree/network in modern | Clos [CLOS] topologies (commonly called a Fat Tree/network in modern | |||
| IP fabric considerations as a homonym to the original definition of | IP fabric considerations as a similar term for the original | |||
| the term Fat Tree [FATTREE]) have gained prominence in today's | definition of the term Fat Tree [FATTREE]) have gained prominence in | |||
| networking, primarily as a result of the paradigm shift towards a | today's networking, primarily as a result of the paradigm shift | |||
| centralized data-center-based architecture that delivers a majority | towards a centralized data-center-based architecture that delivers a | |||
| of computation and storage services. | majority of computation and storage services. | |||
| Current routing protocols were geared towards a network with an | Current routing protocols were geared towards a network with an | |||
| irregular topology with isotropic properties and a low degree of | irregular topology with isotropic properties and a low degree of | |||
| connectivity. When applied to Fat Tree topologies: | connectivity. When applied to Fat Tree topologies: | |||
| * They tend to need extensive configuration or provisioning during | * They tend to need extensive configuration or provisioning during | |||
| initialization and adding or removing nodes from the fabric. | initialization and adding or removing nodes from the fabric. | |||
| * For link-state routing protocols, all nodes including spine-and- | * For link-state routing protocols, all nodes including spine-and- | |||
| leaf nodes learn the entire network topology and routing | leaf nodes learn the entire network topology and routing | |||
| skipping to change at line 276 | skipping to change at line 276 | |||
| v ++--++ +-+-++ ++--++ ++--++ + | v ++--++ +-+-++ ++--++ ++--++ + | |||
| |LEAF| |LEAF| |LEAF| |LEAF| LEVEL 0 | |LEAF| |LEAF| |LEAF| |LEAF| LEVEL 0 | |||
| +----+ +----+ +----+ +----+ | +----+ +----+ +----+ +----+ | |||
| Figure 1: RIFT Overview | Figure 1: RIFT Overview | |||
| A spine node only has information necessary for its level, which is | A spine node only has information necessary for its level, which is | |||
| all destinations south of the node based on SPF calculation, the | all destinations south of the node based on SPF calculation, the | |||
| default route, and potentially disaggregated routes. | default route, and potentially disaggregated routes. | |||
| RIFT combines the advantages of both link-state and distance-vector: | RIFT combines the advantages of both link-state and distance-vector | |||
| | protocols: | |||
| * Fastest possible convergence | * Fastest possible convergence | |||
| * Automatic detection of topology | * Automatic detection of topology | |||
| * Minimal routes/information on Top-of-Rack (ToR) switches, aka leaf | * Minimal routes/information on Top-of-Rack (ToR) switches, aka leaf | |||
| nodes | nodes | |||
| * High degree of ECMP | * High degree of ECMP | |||
| skipping to change at line 299 | skipping to change at line 300 | |||
| * Maximum propagation speed with flexible prefixes in an update | * Maximum propagation speed with flexible prefixes in an update | |||
| There are two types of link-state databases that are "north | There are two types of link-state databases that are "north | |||
| representation" North Topology Information Elements (N-TIEs) and | representation" North Topology Information Elements (N-TIEs) and | |||
| "south representation" South Topology Information Elements (S-TIEs). | "south representation" South Topology Information Elements (S-TIEs). | |||
| The N-TIEs contain a link-state topology description of lower levels, | The N-TIEs contain a link-state topology description of lower levels, | |||
| and the S-TIEs simply carry default and disaggregated routes for the | and the S-TIEs simply carry default and disaggregated routes for the | |||
| lower levels. | lower levels. | |||
| RIFT also eliminates major disadvantages of link-state and distance- | RIFT also eliminates major disadvantages of link-state and distance- | |||
| vector with the following: | vector protocols with the following: | |||
| * Reduced and balanced flooding | * Reduced and balanced flooding | |||
| * Level-constrained automatic neighbor discovery | * Level-constrained automatic neighbor discovery | |||
| To achieve this, RIFT builds on the art of IGPs, such as OSPF, IS-IS, | To achieve this, RIFT builds on the art of IGPs, such as OSPF, IS-IS, | |||
| Mobile Ad Hoc Network (MANET), and Internet of Things (IoT) to | Mobile Ad Hoc Network (MANET), and Internet of Things (IoT) to | |||
| provide unique features: | provide unique features: | |||
| * Automatic (positive or negative) route disaggregation of northward | * Automatic (positive or negative) route disaggregation of northward | |||
| skipping to change at line 363 | skipping to change at line 364 | |||
| 4.2.1. Horizontal Links | 4.2.1. Horizontal Links | |||
| RIFT is not limited to pure Clos divided into PoD and multi-planes | RIFT is not limited to pure Clos divided into PoD and multi-planes | |||
| but supports horizontal (East-West) links below the ToF level. Those | but supports horizontal (East-West) links below the ToF level. Those | |||
| links are used only for last resort northbound forwarding when a | links are used only for last resort northbound forwarding when a | |||
| spine loses all its northbound links or cannot compute a default | spine loses all its northbound links or cannot compute a default | |||
| route through them. | route through them. | |||
| A full-mesh connectivity between nodes on the same level can be | A full-mesh connectivity between nodes on the same level can be | |||
| employed and that allows North SPF (N-SPF) to provide for any node | deployed, which allows North SPF (N-SPF) to provide for any node | |||
| losing all its northbound adjacencies (as long as any of the other | losing all its northbound adjacencies (as long as any of the other | |||
| nodes in the level are northbound connected) to still participate in | nodes in the level are northbound connected) and still participate in | |||
| northbound forwarding. | northbound forwarding. | |||
| Note that a "ring" of horizontal links at any level below ToF does | Note that a "ring" of horizontal links at any level below ToF does | |||
| not provide a "ring-based protection" scheme since the SPF | not provide a "ring-based protection" scheme since the SPF | |||
| computation would have to deal with breaking of "loops", an | computation would have to deal with breaking of "loops", an | |||
| application for which RIFT is not intended. | application for which RIFT is not intended. | |||
| 4.2.2. Vertical Shortcuts | 4.2.2. Vertical Shortcuts | |||
| Through relaxations of the specified adjacency forming rules, RIFT | Through relaxations of the specified adjacency forming rules, RIFT | |||
| skipping to change at line 409 | skipping to change at line 410 | |||
| operation specified for East-West links and the southbound | operation specified for East-West links and the southbound | |||
| reflection between nodes are not applicable. Also, ZTP will | reflection between nodes are not applicable. Also, ZTP will | |||
| derive a sense of depth that will eliminate some links. | derive a sense of depth that will eliminate some links. | |||
| Variations of ZTP could be derived to meet specific objectives, | Variations of ZTP could be derived to meet specific objectives, | |||
| e.g., make it so that most routers have at least two parents to | e.g., make it so that most routers have at least two parents to | |||
| reach the ToF. | reach the ToF. | |||
| * RIFT applies to any Destination-Oriented DAG (DODAG) where there's | * RIFT applies to any Destination-Oriented DAG (DODAG) where there's | |||
| only one ToF node and the problem of disaggregation does not | only one ToF node and the problem of disaggregation does not | |||
| exist. In that case, RIFT operates very much like RPL [RFC6550], | exist. In that case, RIFT operates very much like RPL [RFC6550], | |||
| but uses Link State for southbound routes (downwards in RPL's | but uses link-state information for southbound routes (downwards | |||
| terms). For an arbitrary DAG with multiple destinations (ToFs), | in RPL's terms). For an arbitrary DAG with multiple destinations | |||
| the way disaggregation happens has to be considered. | (ToFs), the way disaggregation happens has to be considered. | |||
| * Positive Disaggregation expects that most of the ToF nodes reach | * Positive Disaggregation expects that most of the ToF nodes reach | |||
| most of the leaves, so disaggregation is the exception as opposed | most of the leaves, so disaggregation is the exception as opposed | |||
| to the rule. When this is no longer true, it makes sense to turn | to the rule. When this is no longer true, it makes sense to turn | |||
| off disaggregation and route between the ToF nodes over a ring, a | off disaggregation and route between the ToF nodes over a ring, a | |||
| full mesh, a transit network, or a form of area zero. Then again, | full mesh, a transit network, or a form of area zero. Then again, | |||
| this operation is similar to RPL operating as a single DODAG with | this operation is similar to RPL operating as a single DODAG with | |||
| a virtual root. | a virtual root. | |||
| * In order to aggregate and disaggregate routes, RIFT requires that | * In order to aggregate and disaggregate routes, RIFT requires that | |||
| skipping to change at line 433 | skipping to change at line 434 | |||
| fabric. This can be achieved with a ring as suggested by RIFT | fabric. This can be achieved with a ring as suggested by RIFT | |||
| [RFC9692], by some preconfiguration, or by using a synchronization | [RFC9692], by some preconfiguration, or by using a synchronization | |||
| with a common repository where all the active prefixes are | with a common repository where all the active prefixes are | |||
| registered. | registered. | |||
| 4.2.4. Reachability of Internal Nodes in the Fabric | 4.2.4. Reachability of Internal Nodes in the Fabric | |||
| RIFT does not require that nodes have reachable addresses in the | RIFT does not require that nodes have reachable addresses in the | |||
| fabric, though it is clearly desirable for operational purposes. | fabric, though it is clearly desirable for operational purposes. | |||
| Under normal operating conditions, this can be easily achieved by | Under normal operating conditions, this can be easily achieved by | |||
| injecting the node's loopback address into North and South Prefix | injecting the node's loopback address into Prefix North TIEs and | |||
| TIEs or other implementation-specific mechanisms. | Prefix South TIEs or other implementation-specific mechanisms. | |||
| Special considerations arise when a node loses all northbound | Special considerations arise when a node loses all northbound | |||
| adjacencies but is not at the top of the fabric. If a spine node | adjacencies but is not at the top of the fabric. If a spine node | |||
| loses all northbound links, the spine node doesn't advertise a | loses all northbound links, the spine node doesn't advertise a | |||
| default route. But if the level of the spine node is auto-determined | default route. But if the level of the spine node is auto-determined | |||
| by ZTP, it will "fall down" as depicted in Figure 8. | by ZTP, it will "fall down" as depicted in Figure 8. | |||
| 4.3. Use Cases | 4.3. Use Cases | |||
| 4.3.1. Data Center Topologies | 4.3.1. Data Center Topologies | |||
| 4.3.1.1. Data Center Fabrics | 4.3.1.1. Data Center Fabrics | |||
| RIFT is suited for applying in data center (DC) IP fabrics underlay | RIFT is suited for applying underlay routing in data center (DC) IP | |||
| routing, vast majority of which seem to be currently (and for the | fabrics, with the vast majority of these IP fabrics being Clos | |||
| foreseeable future) Clos architectures. It significantly simplifies | architectures (and will be for the foreseeable future). It | |||
| operation and deployment of such fabrics as described in Section 5 | significantly simplifies operation and deployment of such fabrics as | |||
| for environments compared to extensive proprietary provisioning and | described in Section 5 for environments compared to extensive | |||
| operational solutions. | proprietary provisioning and operational solutions. | |||
| 4.3.1.2. Adaptations to Other Proposed Data Center Topologies | 4.3.1.2. Adaptations to Other Proposed Data Center Topologies | |||
| . +-----+ +-----+ | . +-----+ +-----+ | |||
| . | | | | | . | | | | | |||
| .+-+ S0 | | S1 | | .+-+ S0 | | S1 | | |||
| .| ++---++ ++---++ | .| ++---++ ++---++ | |||
| .| | | | | | .| | | | | | |||
| .| | +------------+ | | .| | +------------+ | | |||
| .| | | +------------+ | | .| | | +------------+ | | |||
| skipping to change at line 507 | skipping to change at line 508 | |||
| environments close to content producers (server farms connection via | environments close to content producers (server farms connection via | |||
| DC fabrics) but in proximity to content consumers as well. Consumers | DC fabrics) but in proximity to content consumers as well. Consumers | |||
| are often clustered in metro areas with their own network | are often clustered in metro areas with their own network | |||
| architectures that can benefit from simplified, regular Clos | architectures that can benefit from simplified, regular Clos | |||
| structures. Thus, they can also benefit from RIFT. | structures. Thus, they can also benefit from RIFT. | |||
| 4.3.3. Building Cabling | 4.3.3. Building Cabling | |||
| Commercial edifices are often cabled in topologies that are either | Commercial edifices are often cabled in topologies that are either | |||
| Clos or its isomorphic equivalents. The Clos can grow rather high | Clos or its isomorphic equivalents. The Clos can grow rather high | |||
| with many levels. That presents a challenge for traditional routing | with many levels. That presents a challenge for classical routing | |||
| protocols (except BGP [RFC4271] and Private Network-Network Interface | protocols (except BGP [RFC4271] and Private Network-Network Interface | |||
| (PNNI) [PNNI], which is largely phased-out by now) that do not | (PNNI) [PNNI], which is largely phased-out by now) that do not | |||
| support an arbitrary number of levels, which RIFT does naturally. | support an arbitrary number of levels, which RIFT does naturally. | |||
| Moreover, due to the limited sizes of forwarding tables in network | Moreover, due to the limited sizes of forwarding tables in network | |||
| elements of building cabling, the minimum FIB size RIFT maintains | elements of building cabling, the minimum FIB size RIFT maintains | |||
| under normal conditions is cost-effective in terms of hardware and | under normal conditions is cost-effective in terms of hardware and | |||
| operational costs. | operational costs. | |||
| 4.3.4. Internal Router Switching Fabrics | 4.3.4. Internal Router Switching Fabrics | |||
| skipping to change at line 542 | skipping to change at line 543 | |||
| The Cloud Central Office (CloudCO) is a new stage of the telecom | The Cloud Central Office (CloudCO) is a new stage of the telecom | |||
| Central Office. It takes the advantage of Software-Defined | Central Office. It takes the advantage of Software-Defined | |||
| Networking (SDN) and Network Function Virtualization (NFV) in | Networking (SDN) and Network Function Virtualization (NFV) in | |||
| conjunction with general purpose hardware to optimize current | conjunction with general purpose hardware to optimize current | |||
| networks. The following figure illustrates this architecture at a | networks. The following figure illustrates this architecture at a | |||
| high level. It describes a single instance or macro-node of CloudCO | high level. It describes a single instance or macro-node of CloudCO | |||
| that provides a number of value-added services (VASes), a Broadband | that provides a number of value-added services (VASes), a Broadband | |||
| Access Abstraction (BAA), and virtualized network services. An | Access Abstraction (BAA), and virtualized network services. An | |||
| Access I/O module faces a CloudCO access node and the Customer | Access I/O module faces a CloudCO access node and the Customer | |||
| Premises Equipment (CPE) behind it. A Network I/O module is facing | Premises Equipment (CPE) behind it. A Network I/O module is facing | |||
| the core network. The two I/O modules are interconnected by a leaf | the core network. The two I/O modules are interconnected by a spine- | |||
| and spine fabric [TR-384]. | and-leaf fabric [TR-384]. | |||
| +---------------------+ +----------------------+ | +---------------------+ +----------------------+ | |||
| | Spine | | Spine | | | Spine | | Spine | | |||
| | Switch | | Switch | | | Switch | | Switch | | |||
| +------+---+------+-+-+ +--+-+-+-+-----+-------+ | +------+---+------+-+-+ +--+-+-+-+-----+-------+ | |||
| | | | | | | | | | | | | | | | | | | | | | | | | | | |||
| | | | | | +-------------------------------+ | | | | | | | +-------------------------------+ | | |||
| | | | | | | | | | | | | | | | | | | | | | | | | | | |||
| | | | | +-------------------------+ | | | | | | | | +-------------------------+ | | | | |||
| | | | | | | | | | | | | | | | | | | | | | | | | | | |||
| skipping to change at line 615 | skipping to change at line 616 | |||
| scenarios. | scenarios. | |||
| * RIFT automatically negotiates Bidirectional Forwarding Detection | * RIFT automatically negotiates Bidirectional Forwarding Detection | |||
| (BFD) per link. This allows for IP and micro-BFD [RFC7130] to | (BFD) per link. This allows for IP and micro-BFD [RFC7130] to | |||
| replace Link Aggregation Groups (LAGs) that hide bandwidth | replace Link Aggregation Groups (LAGs) that hide bandwidth | |||
| imbalances in case of constituent failures. Further automatic | imbalances in case of constituent failures. Further automatic | |||
| link validation techniques similar to those in [RFC5357] could be | link validation techniques similar to those in [RFC5357] could be | |||
| supported as well. | supported as well. | |||
| * RIFT inherently solves many problems associated with the use of | * RIFT inherently solves many problems associated with the use of | |||
| traditional routing topologies with dense meshes and high degrees | classical routing topologies with dense meshes and high degrees of | |||
| of ECMP by including automatic bandwidth balancing, flood | ECMP by including automatic bandwidth balancing, flood reduction, | |||
| reduction, and automatic disaggregation on failures while | and automatic disaggregation on failures while providing maximum | |||
| providing maximum aggregation of prefixes in default scenarios. | aggregation of prefixes in default scenarios. ECMP in RIFT | |||
| ECMP in RIFT eliminates the need for more Loop-Free Alternate | eliminates the need for more Loop-Free Alternate (LFA) procedures. | |||
| (LFA) procedures. | ||||
| * RIFT reduces FIB size towards the bottom of the IP fabric where | * RIFT reduces FIB size towards the bottom of the IP fabric where | |||
| most nodes reside and allows with that for cheaper hardware on the | most nodes reside. This allows for cheaper hardware on the edges | |||
| edges and introduction of modern IP fabric architectures that | and introduction of modern IP fabric architectures that encompass | |||
| encompass, e.g., server multihoming. | server multihoming and other mechanisms. | |||
| * RIFT provides valley-free routing that is loop free. A valley- | * RIFT provides valley-free routing that is loop free. A valley- | |||
| free path allows for reversal of direction at most once from a | free path allows for reversal of direction at most once from a | |||
| packet heading northbound to southbound while permitting traversal | packet heading northbound to southbound while permitting traversal | |||
| of horizontal links in the northbound phase. This allows for the | of horizontal links in the northbound phase. This allows for the | |||
| use of any such valley-free path in bisectional fabric bandwidth | use of any such valley-free path in bisectional fabric bandwidth | |||
| between two destinations irrespective of their metrics that can be | between two destinations irrespective of their metrics that can be | |||
| used to balance load on the fabric in different ways. Valley-free | used to balance load on the fabric in different ways. Valley-free | |||
| routing eliminates the need for any specific micro-loop avoidance | routing eliminates the need for any specific micro-loop avoidance | |||
| procedures for RIFT. | procedures for RIFT. | |||
| skipping to change at line 699 | skipping to change at line 699 | |||
| | +-----------+ | | + +---+linkSL7+-+ | + | | +-----------+ | | + +---+linkSL7+-+ | + | |||
| | | | | | | | | | | | | | | | | | | |||
| +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ | |||
| |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 | |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 | |||
| +-+-----+ +-+-----+ +-----+-+ +-+-----+ | +-+-----+ +-+-----+ +-----+-+ +-+-----+ | |||
| + + + + | + + + + | |||
| Prefix111 Prefix112 Prefix121 Prefix122 | Prefix111 Prefix112 Prefix121 Prefix122 | |||
| Figure 4: Suboptimal Routing Upon Link Failure Use Case | Figure 4: Suboptimal Routing Upon Link Failure Use Case | |||
| As shown in Figure 4, as the result of the south reflection between | As shown in Figure 4, as the result of the south reflection, Spine121 | |||
| Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and | and Spine 122 know each other via Leaf121 or Leaf 122 at level 1. | |||
| Spine 122 know each other at level 1. | ||||
| Without disaggregation mechanisms, the packet from leaf121 to | Without disaggregation mechanisms, the packet from leaf121 to | |||
| prefix122 will probably go up through linkSL5 to linkTS3 when linkSL6 | prefix122 will probably go up through linkSL5 to linkTS3 when linkSL6 | |||
| fails. Then, the packet will go down through linkTS4 to linkSL8 to | fails. Then, the packet will go down through linkTS4 to linkSL8 to | |||
| Leaf122 or go up through linkSL5 to linkTS6, then go down through | Leaf122 or go up through linkSL5 to linkTS6, then go down through | |||
| linkTS8 and linkSL8 to Leaf122 based on the pure default route. This | linkTS8 and linkSL8 to Leaf122 based on the pure default route. This | |||
| is the case of suboptimal routing or bow tying. | is the case of suboptimal routing or bow tying. | |||
| With disaggregation mechanisms, Spine122 will detect the failure | With disaggregation mechanisms, Spine122 will detect the failure | |||
| according to the reflected node S-TIE from Spine121 when linkSL6 | according to the reflected node S-TIE from Spine121 when linkSL6 | |||
| skipping to change at line 788 | skipping to change at line 787 | |||
| unique in the RIFT network and the level of the node in the Fat Tree, | unique in the RIFT network and the level of the node in the Fat Tree, | |||
| which determines which peers are northward "parents" and which are | which determines which peers are northward "parents" and which are | |||
| southward "children". | southward "children". | |||
| ZTP is always on, but its decisions can be overridden when a network | ZTP is always on, but its decisions can be overridden when a network | |||
| administrator prefers to impose its own configuration. In that case, | administrator prefers to impose its own configuration. In that case, | |||
| it is the responsibility of the administrator to ensure that the | it is the responsibility of the administrator to ensure that the | |||
| configured parameters are correct, i.e., ensure that the System ID of | configured parameters are correct, i.e., ensure that the System ID of | |||
| each node is unique and that the administratively set levels truly | each node is unique and that the administratively set levels truly | |||
| reflect the relative position of the nodes in the fabric. It is | reflect the relative position of the nodes in the fabric. It is | |||
| recommended to let ZTP configure the network, and when not, it is | recommended to let ZTP configure the network, and when ZTP does not | |||
| recommended to configure the level of all the nodes to avoid an | configure the network, it is recommended to configure the level of | |||
| undesirable interaction between ZTP and the manual configuration. | all the nodes to avoid an undesirable interaction between ZTP and the | |||
| manual configuration. | ||||
| ZTP requires that the administrator points out the ToF nodes to set | ZTP requires that the administrator points out the ToF nodes to set | |||
| the baseline from which the fabric topology is derived. The ToF | the baseline from which the fabric topology is derived. The ToF | |||
| nodes are configured with the TOP_OF_FABRIC flag, which are initial | nodes are configured with the TOP_OF_FABRIC flag, which are initial | |||
| 'seeds' needed for other ZTP nodes to derive their level in the | 'seeds' needed for other ZTP nodes to derive their level in the | |||
| topology. ZTP computes the level of each node based on the Highest | topology. ZTP computes the level of each node based on the Highest | |||
| Available Level (HAL) of the potential parent closest to that | Available Level (HAL) of the potential parent closest to that | |||
| baseline, which represents the superspine. In a fashion, RIFT can be | baseline, which represents the superspine. In a fashion, RIFT can be | |||
| seen as a distance-vector protocol that computes a set of feasible | seen as a distance-vector protocol that computes a set of feasible | |||
| successors towards the superspine and autoconfigures the rest of the | successors towards the superspine and autoconfigures the rest of the | |||
| skipping to change at line 976 ¶ | skipping to change at line 976 ¶ | |||
| | | | +--------------------------------+ | | | | +--------------------------------+ | |||
| | | | | | | | | | | |||
| | | | | | | | | | | |||
| | | | | | | | | | | |||
| | | | | | | | | | | |||
| + + + + | + + + + | |||
| +-1--2--3--4--+ | +-1--2--3--4--+ | |||
| | Leaf1 | ...... | | Leaf1 | ...... | |||
| +-------------+ | +-------------+ | |||
| Figure 9: Fallen Spine | Figure 9: Additional Cabling Constraint Example | |||
| RIFT allows implementations to provide programmable plug-ins that can | RIFT allows implementations to provide programmable plug-ins that can | |||
| adjust ZTP operation or capture information during computation. | adjust ZTP operation or capture information during computation. | |||
| While defining this is outside the scope of this document, such a | While defining this is outside the scope of this document, such a | |||
| mechanism could be used to extend the miscabling functionality. | mechanism could be used to extend the miscabling functionality. | |||
| For other protocols to achieve this, it would require additional | For other protocols to achieve this, it would require additional | |||
| operational overhead. Consider a fabric that is using unnumbered | operational overhead. Consider a fabric that is using unnumbered | |||
| OSPF links; it is still very likely that a miscabled link will form | OSPF links; it is still very likely that a miscabled link will form | |||
| an adjacency. Each attempt to move cables to the correct port may | an adjacency. Each attempt to move cables to the correct port may | |||
| skipping to change at line 1134 ¶ | skipping to change at line 1134 ¶ | |||
| way, the multiple routes are equally valid and should be conserved in | way, the multiple routes are equally valid and should be conserved in | |||
| the case of anycast. Without further information from the | the case of anycast. Without further information from the | |||
| redistributed routing protocol, it is impossible to sort out a | redistributed routing protocol, it is impossible to sort out a | |||
| movement from a redistribution that happens asynchronously on | movement from a redistribution that happens asynchronously on | |||
| different leaves. RIFT [RFC9692] expects that anycast addresses are | different leaves. RIFT [RFC9692] expects that anycast addresses are | |||
| advertised within the timing precision, which is typically the case | advertised within the timing precision, which is typically the case | |||
| with a low-precision timing and a multihomed node. Beyond that time | with a low-precision timing and a multihomed node. Beyond that time | |||
| interval, RIFT interprets the lag as a mobility and only the freshest | interval, RIFT interprets the lag as a mobility and only the freshest | |||
| route is retained. | route is retained. | |||
| When using IPv6 [RFC8200], RIFT suggests to leverage [RFC8505] as the | When using IPv6 [RFC8200], RIFT suggests leveraging 6LoWPAN ND | |||
| IPv6 ND interaction between the mobile node and the leaf. This not | [RFC8505] as the IPv6 ND interaction between the mobile node and the | |||
| only provides a sequence counter but also a lifetime and a security | leaf. This not only provides a sequence counter but also a lifetime | |||
| token that may be used to protect the ownership of an address | and a security token that may be used to protect the ownership of an | |||
| [RFC8928]. When using [RFC8505], the parallel registration of an | address [RFC8928]. When using 6LoWPAN ND [RFC8505], the parallel | |||
| anycast address to multiple leaves is done with the same sequence | registration of an anycast address to multiple leaves is done with | |||
| counter, whereas the sequence counter is incremented when the point | the same sequence counter, whereas the sequence counter is | |||
| of attachment changes. This way, it is possible to differentiate a | incremented when the point of attachment changes. This way, it is | |||
| mobile node from a multihomed node, even when the mobility happens | possible to differentiate a mobile node from a multihomed node, even | |||
| within the timing precision. It is also possible for a mobile node | when the mobility happens within the timing precision. It is also | |||
| to be multihomed as well, e.g., to change only one of its points of | possible for a mobile node to be multihomed as well, e.g., to change | |||
| attachment. | only one of its points of attachment. | |||
| 5.9. IPv4 over IPv6 | 5.9. IPv4 over IPv6 | |||
| RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. An | RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. An | |||
| IPv6 Address Family (AF) configures via the usual ND mechanisms and | IPv6 Address Family (AF) configures via the usual ND mechanisms and | |||
| then V4 can use V6 next-hops analogous to [RFC8950]. It is expected | then V4 can use V6 next-hops analogous to [RFC8950]. It is expected | |||
| that the whole fabric supports the same type of forwarding of AFs on | that the whole fabric supports the same type of forwarding of AFs on | |||
| all the links. RIFT provides an indication whether a node is capable | all the links. RIFT provides an indication whether a node is capable | |||
| of V4-forwarding and implementations are possible where different | of V4-forwarding and implementations are possible where different | |||
| routing tables are computed per AF as long as the computation remains | routing tables are computed per AF as long as the computation remains | |||
| skipping to change at line 1188 ¶ | skipping to change at line 1188 ¶ | |||
| +---+----+ +---+----+ | +---+----+ +---+----+ | |||
| | V4 | | V4 | | | V4 | | V4 | | |||
| | subnet | | subnet | | | subnet | | subnet | | |||
| +--------+ +--------+ | +--------+ +--------+ | |||
| Figure 10: IPv4 over IPv6 | Figure 10: IPv4 over IPv6 | |||
| 5.10. In-Band Reachability of Nodes | 5.10. In-Band Reachability of Nodes | |||
| RIFT doesn't precondition that nodes of the fabric have reachable | RIFT doesn't precondition that nodes of the fabric have reachable | |||
| addresses, but the operational reasons to reach the internal nodes | addresses, but operational reasons to reach the internal nodes may | |||
| may exist. Figure 11 shows an example that the network management | exist. Figure 11 shows an example that the network management | |||
| station (NMS) attaches to Leaf1. | station (NMS) attaches to Leaf1. | |||
| +-------+ +-------+ | +-------+ +-------+ | |||
| | ToF1 | | ToF2 | | | ToF1 | | ToF2 | | |||
| ++---- ++ ++-----++ | ++---- ++ ++-----++ | |||
| | | | | | | | | | | |||
| | +----------+ | | | +----------+ | | |||
| | +--------+ | | | | +--------+ | | | |||
| | | | | | | | | | | |||
| ++-----++ +--+---++ | ++-----++ +--+---++ | |||
| skipping to change at line 1224 ¶ | skipping to change at line 1224 ¶ | |||
| If the NMS wants to access Leaf2, it simply works because the | If the NMS wants to access Leaf2, it simply works because the | |||
| loopback address of Leaf2 is flooded in its Prefix North TIE. | loopback address of Leaf2 is flooded in its Prefix North TIE. | |||
| If the NMS wants to access Spine2, it also works because a spine node | If the NMS wants to access Spine2, it also works because a spine node | |||
| always advertises its loopback address in the Prefix North TIE. The | always advertises its loopback address in the Prefix North TIE. The | |||
| NMS may reach Spine2 from Leaf1-Spine2 or Leaf1-Spine1-ToF1/ | NMS may reach Spine2 from Leaf1-Spine2 or Leaf1-Spine1-ToF1/ | |||
| ToF2-Spine2. | ToF2-Spine2. | |||
| If the NMS wants to access ToF2, ToF2's loopback address needs to be | If the NMS wants to access ToF2, ToF2's loopback address needs to be | |||
| injected into its Prefix South TIE. This TIE must be seen by all | injected into its Prefix South TIE. This TIE must be seen by all | |||
| nodes at the level below -- the spine nodes in Figure 9 -- that must | nodes at the level below -- the spine nodes in Figure 11 -- that must | |||
| form a ceiling for all the traffic coming from below (south). | form a ceiling for all the traffic coming from below (south). | |||
| Otherwise, the traffic from the NMS may follow the default route to | Otherwise, the traffic from the NMS may follow the default route to | |||
| the wrong ToF Node, e.g., ToF1. | the wrong ToF Node, e.g., ToF1. | |||
| In the case of failure between ToF2 and spine nodes, ToF2's loopback | In the case of failure between ToF2 and spine nodes, ToF2's loopback | |||
| address must be disaggregated recursively all the way to the leaves. | address must be disaggregated recursively all the way to the leaves. | |||
| In a partitioned ToF, even with recursive disaggregation, a ToF node | In a partitioned ToF, even with recursive disaggregation, a ToF node | |||
| is only reachable within its plane. | is only reachable within its plane. | |||
| A possible alternative to recursive disaggregation is to use a ring | A possible alternative to recursive disaggregation is to use a ring | |||
| that interconnects the ToF nodes to transmit packets between them for | that interconnects the ToF nodes to transmit packets between them for | |||
| their loopback addresses only. The idea is that this is mostly | their loopback addresses only. The idea is that this is mostly | |||
| control traffic and should not alter the load-balancing properties of | control traffic and should not alter the load-balancing properties of | |||
| the fabric. | the fabric. | |||
| 5.11. Dual-Homing Servers | 5.11. Dual-Homing Servers | |||
| Each RIFT node may operate in ZTP mode. It has no configuration | Each RIFT node may operate in ZTP mode. It has no configuration | |||
| (unless it is a ToF at the top of the topology or the must operate in | (unless it is a ToF node at the top of the topology or if it must | |||
| the topology as leaf and/or support leaf-2-leaf procedures), and it | operate in the topology as a leaf and/or support leaf-2-leaf | |||
| will fully configure itself after being attached to the topology. | procedures), and it will fully configure itself after being attached | |||
| to the topology. | ||||
| +---+ +---+ +---+ | +---+ +---+ +---+ | |||
| |ToF| |ToF| |ToF| ToF | |ToF| |ToF| |ToF| ToF | |||
| +---+ +---+ +---+ | +---+ +---+ +---+ | |||
| | | | | | | | | | | | | | | |||
| | +----------------+ | | | | +----------------+ | | | |||
| | +----------------+ | | | +----------------+ | | |||
| | | | | | | | | | | | | | | |||
| +----------+--+ +--+----------+ | +----------+--+ +--+----------+ | |||
| | ToR1 | | ToR2 | Spine | | ToR1 | | ToR2 | Spine | |||
| skipping to change at line 1270 ¶ | skipping to change at line 1271 ¶ | |||
| | | | | | +-----------------+ | | | | | | | +-----------------+ | | |||
| | | | | +--------------+ | | | | | | | | +--------------+ | | | | |||
| | | | | | | | | | | | | | | | | | | |||
| +---+ +---+ +---+ +---+ | +---+ +---+ +---+ +---+ | |||
| | | | | | | | | | | | | | | | | | | |||
| +---+ +---+ ............. +---+ +---+ | +---+ +---+ ............. +---+ +---+ | |||
| SV(1) SV(2) SV(n-1) SV(n) Leaf | SV(1) SV(2) SV(n-1) SV(n) Leaf | |||
| Figure 12: Dual-Homing Servers | Figure 12: Dual-Homing Servers | |||
| Sometimes people may prefer to disaggregate from ToR to servers from | Sometimes people may prefer to disaggregate from ToR nodes to servers | |||
| start on, i.e. the servers have couple tens of routes in FIB from | from startup, i.e., the servers have multiple routes in the FIB from | |||
| start on beside default routes to avoid breakages at rack level. | startup other than default routes to avoid breakages at the rack | |||
| Full disaggregation of the fabric could be achieved by configuration | level. Full disaggregation of the fabric could be achieved by | |||
| supported by RIFT. | configuration supported by RIFT. | |||
| 5.12. Fabric with a Controller | 5.12. Fabric with a Controller | |||
| There are many different ways to deploy the controller. One | There are many different ways to deploy the controller. One | |||
| possibility is attaching a controller to the RIFT domain from ToF and | possibility is attaching a controller to the RIFT domain from ToF and | |||
| another possibility is attaching a controller from the leaf. | another possibility is attaching a controller from the leaf. | |||
| +------------+ | +------------+ | |||
| | Controller | | | Controller | | |||
| ++----------++ | ++----------++ | |||
| skipping to change at line 1326 ¶ | skipping to change at line 1327 ¶ | |||
| If the controller is attaching from a leaf to the fabric, no special | If the controller is attaching from a leaf to the fabric, no special | |||
| provisions are needed. | provisions are needed. | |||
| 5.13. Internet Connectivity Within Underlay | 5.13. Internet Connectivity Within Underlay | |||
| If global addressing is running without overlay, an external default | If global addressing is running without overlay, an external default | |||
| route needs to be advertised through the RIFT fabric to achieve | route needs to be advertised through the RIFT fabric to achieve | |||
| internet connectivity. For the purpose of forwarding of the entire | internet connectivity. For the purpose of forwarding of the entire | |||
| RIFT fabric, an internal fabric prefix needs to be advertised in the | RIFT fabric, an internal fabric prefix needs to be advertised in the | |||
| South Prefix TIE by ToF and spine nodes. | Prefix South TIE by ToF and spine nodes. | |||
| 5.13.1. Internet Default on the Leaf | 5.13.1. Internet Default on the Leaf | |||
| In the case that the internet gateway is a leaf, the leaf node as the | In the case that the internet gateway is a leaf, the leaf node as the | |||
| internet gateway needs to advertise a default route in its Prefix | internet gateway needs to advertise a default route in its Prefix | |||
| North TIE. | North TIE. | |||
| 5.13.2. Internet Default on the ToFs | 5.13.2. Internet Default on the ToFs | |||
| In the case that the internet gateway is a ToF, the ToF and spine | In the case that the internet gateway is a ToF, the ToF and spine | |||
| skipping to change at line 1567 ¶ | skipping to change at line 1568 ¶ | |||
| <https://www.rfc-editor.org/info/rfc8655>. | <https://www.rfc-editor.org/info/rfc8655>. | |||
| [RFC8950] Litkowski, S., Agrawal, S., Ananthamurthy, K., and K. | [RFC8950] Litkowski, S., Agrawal, S., Ananthamurthy, K., and K. | |||
| Patel, "Advertising IPv4 Network Layer Reachability | Patel, "Advertising IPv4 Network Layer Reachability | |||
| Information (NLRI) with an IPv6 Next Hop", RFC 8950, | Information (NLRI) with an IPv6 Next Hop", RFC 8950, | |||
| DOI 10.17487/RFC8950, November 2020, | DOI 10.17487/RFC8950, November 2020, | |||
| <https://www.rfc-editor.org/info/rfc8950>. | <https://www.rfc-editor.org/info/rfc8950>. | |||
| [RFC9692] Przygienda, T., Ed., Head, J., Ed., Sharma, A., Thubert, | [RFC9692] Przygienda, T., Ed., Head, J., Ed., Sharma, A., Thubert, | |||
| P., Rijsman, B., and D. Afanasiev, "RIFT: Routing in Fat | P., Rijsman, B., and D. Afanasiev, "RIFT: Routing in Fat | |||
| Trees", RFC 9692, DOI 10.17487/RFC9692, December 2024, | Trees", RFC 9692, DOI 10.17487/RFC9692, March 2025, | |||
| <https://www.rfc-editor.org/info/rfc9692>. | <https://www.rfc-editor.org/info/rfc9692>. | |||
| [TR-384] Broadband Forum Technical Report, "TR-384: Cloud Central | [TR-384] Broadband Forum Technical Report, "TR-384: Cloud Central | |||
| Office Reference Architectural Framework", TR-384, Issue | Office Reference Architectural Framework", TR-384, Issue | |||
| 1, January 2018, | 1, January 2018, | |||
| <https://www.broadband-forum.org/pdfs/tr-384-1-0-0.pdf>. | <https://www.broadband-forum.org/pdfs/tr-384-1-0-0.pdf>. | |||
| 8.2. Informative References | 8.2. Informative References | |||
| [CLOS] Yuan, X., "On Nonblocking Folded-Clos Networks in Computer | [CLOS] Yuan, X., "On Nonblocking Folded-Clos Networks in Computer | |||
| skipping to change at line 1674 ¶ | skipping to change at line 1675 ¶ | |||
| Nanjing | Nanjing | |||
| 210012 | 210012 | |||
| China | China | |||
| Email: zhang.zheng@zte.com.cn | Email: zhang.zheng@zte.com.cn | |||
| Dmitry Afanasiev | Dmitry Afanasiev | |||
| Yandex | Yandex | |||
| Email: fl0w@yandex-team.ru | Email: fl0w@yandex-team.ru | |||
| Pascal Thubert | Pascal Thubert | |||
| Cisco Systems, Inc | Individual | |||
| Building D | ||||
| 45 Allee des Ormes - BP1200 | ||||
| 06254 Mougins - Sophia Antipolis | ||||
| France | France | |||
| Phone: +33 497 23 26 34 | Email: pascal.thubert@gmail.com | |||
| Email: pthubert@cisco.com | ||||
| Tony Przygienda | Tony Przygienda | |||
| Juniper Networks | Juniper Networks | |||
| 1194 N. Mathilda Ave | 1194 N. Mathilda Ave | |||
| Sunnyvale, CA 94089 | Sunnyvale, CA 94089 | |||
| United States of America | United States of America | |||
| Email: prz@juniper.net | Email: prz@juniper.net | |||
| End of changes. 29 change blocks. | ||||
| 75 lines changed or deleted | 72 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. | ||||
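
The anycast-versus-mobility rule in the paragraph preceding Section 5.9 can be illustrated with a small sketch. It is not taken from RFC 9692 or RFC 8505. The names `Advertisement`, `classify`, and `retained`, the field layout, and the `precision` value are all hypothetical and exist only for illustration. The sketch assumes that each leaf reports a timestamp and, when 6LoWPAN ND [RFC8505] is in use, a sequence counter. It compares counters as plain integers and ignores the wraparound handling a real implementation would need.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Advertisement:
    """Hypothetical record of one leaf advertising a host prefix."""
    leaf: str
    timestamp: float            # seconds on an assumed fabric-wide clock
    seq: Optional[int] = None   # sequence counter, if the host registered one


def classify(old: Advertisement, new: Advertisement, precision: float) -> str:
    """Return 'anycast' when both routes stay valid, else 'mobility'."""
    if old.seq is not None and new.seq is not None:
        # Same counter on several leaves: parallel (anycast/multihomed)
        # registration. A changed counter: the point of attachment moved.
        return "anycast" if old.seq == new.seq else "mobility"
    # No counter available: fall back to the timing-precision window.
    if abs(new.timestamp - old.timestamp) <= precision:
        return "anycast"
    return "mobility"


def retained(old: Advertisement, new: Advertisement,
             precision: float) -> List[Advertisement]:
    """Routes kept after comparing two advertisements of the same prefix."""
    if classify(old, new, precision) == "anycast":
        return [old, new]
    if old.seq is not None and new.seq is not None:
        # Simplification: a larger counter is treated as fresher.
        return [max(old, new, key=lambda a: a.seq)]
    # Mobility without counters: only the freshest route is retained.
    return [max(old, new, key=lambda a: a.timestamp)]
```

With an assumed precision of 0.1 seconds, two advertisements 0.05 seconds apart and without counters are treated as anycast. The same two advertisements carrying counters 5 and 6 are treated as a move, which is the case the sequence counter is meant to resolve inside the timing window.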