Network Working Group                                            N. Zong
Internet-Draft                                                 L. Dunbar
Intended status: Informational                       Huawei Technologies
Expires: March 10, 2014                                         M. Shore
                                                    No Mountain Software
                                                      September 06, 2013


   Problem Statement for Reliable Virtualized Network Function (VNF)
                draft-zong-vnfpool-problem-statement-01

Abstract

   Virtualization technology has been widely adopted by both Network
   Operators and Data Center providers to provide service with reduced
   operational and capital costs, automated deployment, and enhanced
   elasticity. A challenge is how to achieve the reliability and high
   availability capabilities of the virtualized network function to
   facilitate reliable service.

   This document focuses on the problems related to the reliability and
   high availability aspects of virtualized network function. A
   discussion of reliable virtualized network function pools is
   presented to scope the solution space. Related work, together with
   its potential for reuse and extension, is also introduced.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 10, 2014.

Zong, et al.             Expires March 10, 2014                 [Page 1]

Internet-Draft                Reliable VNF                September 2013

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Background
   2.  Terminology
   3.  Problems
     3.1.  VNF Instance Selection and Status Monitoring
     3.2.  Backup Selection and Announcement
     3.3.  Service State Synchronization
     3.4.  Transition Handling
     3.5.  Policy Enforcement
   4.  Working Scope
     4.1.  Reference Architecture
     4.2.  Proposed Working Scope
   5.  Related Works
     5.1.  Reliable Server Pool
     5.2.  Virtual Router Redundancy Protocol
     5.3.  VNF Forwarding Graph
   6.  Security Considerations
   7.  IANA Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Background

   Network functions such as firewall (FW), Deep Packet Inspection
   (DPI), Load Balancer (LB), and WAN Optimization are conventionally
   deployed as sets of dedicated devices in both Network Operators'
   networks and Data Center (DC) networks, as the building blocks of
   services. In recent years, virtualization technology, from server
   virtualization and network virtualization to network function
   virtualization, has been gaining wider industry adoption by both
   Network Operators and DC providers, to achieve elastic service
   offerings and reduced operational cost [NFV-WP]. The European
   Telecommunications Standards Institute (ETSI) has launched an
   Industry Specification Group (ISG) to study the use cases,
   requirements and architecture of Network Function Virtualization
   (NFV) from the Network Operators' perspective.

   Building upon general purpose servers, instances of a Virtualized
   Network Function (VNF) can be placed in various locations, including
   DC networks, Network Operator networks and even customer premises.
   Furthermore, there are potentially more factors that cause VNF
   instance transition (e.g. scaling, migration) or even failure, such
   as resource contention among instances, hardware status change, and
   hardware/software failure at various levels. Therefore, a major
   challenge is how to achieve the reliability and high availability
   capabilities of the VNF under highly distributed and dynamic
   conditions of the VNF instances. For example, the Reliability and
   Availability Working Group (RELAV WG) of ETSI ISG NFV aims to
   identify the resiliency problems and requirements of the services
   provided by Network Operators [NFV-REL]. An overview of VNF use
   cases focusing on the reliability and high availability issues can
   be found in [VNFP-UC].

   In this document, we first give an overview of the problems related
   to the reliability and high availability aspects of VNF. We then
   present an applicable architecture of reliable VNF pools to scope
   potential solutions. Finally, we refer to some related work for
   potential reuse and extension of existing approaches.

2.  Terminology

   Reliability and High Availability: capability of a functional entity
   to consistently provide its function under various dynamic and even
   unexpected conditions such as fault, overload, etc.

   Virtualized Network Function (VNF): a VNF provides the same
   functional behavior and interfaces as the equivalent network
   function, but is deployed as software instances running on top of a
   virtualization platform [NFV-TERM].

   VNF Pool: a group of VNF instances providing the same network
   function.

   Pool Element (PE): a VNF instance inside a VNF pool.

   Pool User (PU): an entity that requests the network function
   provided by a VNF pool.

   Pool Manager (PM): an entity that manages pool elements, and
   interacts with pool users to provide the network function.

3.  Problems

   Many network services require multiple network functions to be
   performed sequentially on data packets. A traditional model for a
   multi-tier service is shown below, where, for each network function,
   all instances connect to the corresponding entrance point (e.g. LB),
   which is responsible for sending/receiving data packets to/from the
   selected instance(s) and for steering the data packets between
   different network functions.

                      Service (e.g. VoIP, Web)
   +--------------+  +--------------+       +--------------+
   |  function#1  |  |  function#2  |       |  function#n  |
   | +----------+ |  | +----------+ |       | +----------+ |
   | | Instance | |  | | Instance | |... ...| | Instance | |
   | +----------+ |  | +----------+ |       | +----------+ |
   |      |data   |  |      |data   |       |      |data   |
   |      |conn   |  |      |conn   |       |      |conn   |
   | +----------+ |  | +----------+ |       | +----------+ |
   | | Entrance | |  | | Entrance | |       | | Entrance | |
   | |  Point   | |  | |  Point   | |       | |  Point   | |
   | +----------+ |  | +----------+ |       | +----------+ |
   +-----+--------+  +-------+------+       +-------+------+
         |data conn          |data conn             |
         +-------------------+----------------------+

                   Figure 1: Multi-tier Service.

   Such a model works well when all instances of the same network
   function are topologically close to each other. However, VNF
   instances are highly distributed in DC networks, Network Operator
   networks and even customer premises. When VNF instances are
   topologically far from each other, there could be many network
   links/nodes between an instance and the corresponding entrance point
   for steering the data packets. For two VNF instances of different
   network functions, it is possible that they are on the same physical
   server, but the entrance points are many links/nodes away. To
   improve network efficiency, it is desirable to establish direct data
   connections between VNF instances, as shown below.

                     Service (e.g. VoIP, Web)
   +----------+           +----------+           +----------+
   |  VNF#1   | data conn |  VNF#2   | data conn |  VNF#n   |
   | Instance |-----------| Instance |- ... ... -| Instance |
   +----------+           +----------+           +----------+
                               ^
                               | Virtualization
   +--------------------------------------------------------+
   |                Virtualization Platform                 |
   +--------------------------------------------------------+

             Figure 2: VNF Instances Direct Connection.

   Many of today's dedicated network devices have built-in failure
   protection and recovery mechanisms for reliability and high
   availability. However, VNF instances are software instances running
   on general purpose servers via a virtualization platform.
   There are potentially more factors that cause VNF instance
   transition or even failure, such as:

   1) hardware failure;

   2) hardware status change such as server over-utilization or network
      congestion;

   3) software failure at various levels including hypervisor, Virtual
      Machine (VM), and VNF instance;

   4) performance degradation due to resource contention between VNF
      instances;

   5) instance migration due to server consolidation, or configuration
      change due to policy.

   Generally, VNF instance transition refers to the actions taken to
   address varying conditions of hardware, software or configuration.
   Transition could be scaling in/out or scaling up/down of an
   instance. Transition could also be replacing an instance in the same
   location, or moving an instance to another location.

   Therefore, the major challenge is how to achieve the reliability and
   high availability capabilities of the VNF during VNF instance
   transition or failure in the model where VNF instances are directly
   connected. The potential issues to be addressed are described in the
   following subsections.

3.1.  VNF Instance Selection and Status Monitoring

   One basic goal of reliable VNF is to select a suitable VNF instance
   from a group of candidates and to replace the VNF instance in case
   of instance failure. The issues are:

   1) Who is responsible for selecting the VNF instance, and how?

   2) Who is responsible for monitoring the status of an instance, and
      how?

3.2.  Backup Selection and Announcement

   Before a VNF instance fails, one or more backup instances of the
   same network function have to be selected and announced to the
   directly connected instances in the adjacent VNFs. The issues are:

   1) Who is responsible for selecting and announcing the backup
      instances, and how?

   2) How to deal with the transition or failure of a backup instance?

3.3.  Service State Synchronization

   For a stateful network function, the service state of the VNF
   instance should also be synchronized between the VNF instance and
   its backup instances. The issues are:

   1) Who is responsible for collecting and keeping the service state
      of the instance, and how?

   2) How to synchronize service state with backup instances?

3.4.  Transition Handling

   It is important to maintain service reliability during VNF instance
   transition. The issues are:

   1) Who is responsible for notifying the directly connected instances
      in the adjacent VNFs of the VNF instance transition, and how?

   2) How to re-establish the network connection and session between a
      new VNF instance and the directly connected instances with an
      acceptable level of service continuity?

3.5.  Policy Enforcement

   There could be some policies reflecting the different reliability
   classes of the service and hence affecting the selection of VNF
   instances. Examples would include isolation policies requiring that
   VNF instances be placed on separate physical servers or separate DC
   sites. Another example is to place some VNF instances in
   topologically close locations. The issues are:

   1) Who is responsible for receiving and enforcing the policy?

   2) Who is responsible for collecting enough information (e.g.
      network topology, view of instance distribution) to enforce the
      policy, and how?

4.  Working Scope

   Reliability and high availability aspects of VNF fall within a
   broader problem space than has been identified in this draft. The
   current focus of our work is to develop tools to improve VNF
   reliability and availability, which will be applicable to a wide
   range of services.

4.1.  Reference Architecture

   There are a number of existing technologies for providing reliable
   and highly available functions, such as Reliable Server Pool
   (RSerPool) [RFC5351] and Virtual Router Redundancy Protocol (VRRP)
   [RFC5798].
   Although these technologies are applicable to different scenarios
   using different protocols, the underlying idea is similar. Both
   technologies provide the service with an abstract object (e.g. pool
   handle in RSerPool, Virtual Router ID in VRRP) that represents a
   group of functional instances. The dynamic mapping of the abstract
   object to the actual serving instance, or the selection of the
   serving instance, is managed internally to the group to cover the
   failover procedure. The advantage of this model is that it provides
   the service with a reliable and highly available function in a
   manner that is transparent to both end-hosts and other service
   sub-systems.

   Based on the above model, we can derive a reference architecture for
   reliable and highly available VNF called reliable VNF pools, which
   is illustrated below. Note that we reuse the term "pool" to
   represent a group of VNF instances without loss of generality.

                         +-----------------+
                         |    Pool User    |
                         +-----------------+
                              ^        ^
                              |  (2)   |
                  +-----------+        +-----------+
                  |                                |
                  v                                v
           +--------------+      (4)       +--------------+
           | Pool Manager |<-------------->| Pool Manager |
           +--------------+                +--------------+
                  ^                                ^
                  |              (1)               |
                  v                                v
   +------------------------------+   +------------------------------+
   |+----------+     +----------+ |   | +----------+     +----------+|
   ||  VNF#1   |     |  VNF#1   | |(3)| |  VNF#2   |     |  VNF#2   ||
   || Instance | ... | Instance |<+---+>| Instance | ... | Instance ||
   |+----------+     +----------+ |   | +----------+     +----------+|
   |          VNF#1 Pool          |   |          VNF#2 Pool          |
   +------------------------------+   +------------------------------+

                    Figure 3: Reliable VNF Pools.

   In the architecture of reliable VNF pools, there are multiple VNF
   pools, each containing a group of VNF instances providing the same
   network function. Each VNF pool has a Pool Manager (PM) that manages
   an abstract object of the VNF pool, including the VNF identity (e.g.
   "FW", "DPI") and the associated addresses of the VNF instances.
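
   As an illustration only, the abstract object kept by a PM can be
   sketched as a mapping from a VNF identity to the addresses of its
   instances, with selection and failover handled inside the pool. This
   is not part of any proposed solution: the class name PoolManager,
   its methods, the "first healthy instance" selection rule and the
   example addresses are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PoolManager:
    """Hypothetical PM: one VNF identity mapped to its instance addresses."""

    vnf_identity: str  # e.g. "FW" or "DPI"
    # address -> True while the instance is considered healthy
    instances: Dict[str, bool] = field(default_factory=dict)

    def register(self, address: str) -> None:
        """Add a VNF instance to the pool and mark it healthy."""
        self.instances[address] = True

    def report_failure(self, address: str) -> None:
        """Mark a known instance as failed so it is no longer selected."""
        if address in self.instances:
            self.instances[address] = False

    def select(self) -> str:
        """Return the first healthy instance, in registration order."""
        for address, healthy in self.instances.items():
            if healthy:
                return address
        raise LookupError(f"no healthy instance in pool {self.vnf_identity!r}")


# Assumed usage: the pool user only ever asks for "FW"; which instance
# serves the request is decided inside the pool.
pm = PoolManager("FW")
pm.register("192.0.2.10")
pm.register("192.0.2.11")
serving = pm.select()
pm.report_failure(serving)
backup = pm.select()  # a different instance now serves the same identity
```

   The point of the sketch is only that the caller holds the identity,
   not an instance address, so a failed instance can be replaced without
   the caller changing what it asks for.
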
   A Pool User (PU) could be either an application end-host or a
   service sub-system (e.g. an orchestrator in a DC service) requesting
   a network function from the PM. The PM could also be a designated
   VNF instance in the VNF pool, in the case where VNF instances
   self-organize to select a designated instance. In such a case, the
   PM itself provides the network function to the PU as well.

4.2.  Proposed Working Scope

   Based on the reference architecture of reliable VNF pools, we
   believe the following aspects need to be considered in providing
   reliable and highly available VNF (the numbers correspond to the
   interfaces labeled in Figure 3):

   1) The communication between VNF instances and the responsible PM to
      transmit messages such as VNF Instance Selection, Status
      Monitoring, and Service State Synchronization;

   2) The communication between PU and PM to address issues like Policy
      Enforcement and VNF Instance query/response;

   3) The communication between VNF instances in different VNF pools to
      transmit messages such as Backup Announcement, Service State
      Synchronization, and Transition Handling;

   4) The communication between different PMs to achieve fault-tolerance
      of the PMs themselves, including pool information redundancy.

   The main purpose of this section is to scope the solution space. The
   proposed solution will be addressed in a separate draft.

5.  Related Works

   In this section, we refer to some related work for potential reuse
   and extension.

5.1.  Reliable Server Pool

   Reliable Server Pool (RSerPool) supports high availability and
   scalability of applications through the use of pools of servers
   [RFC5351]. The main functions of RSerPool involve server pool
   management, as well as receiving requests from a client to bind to a
   desired server. The main protocols developed by RSerPool are the
   Aggregate Server Access Protocol (ASAP) [RFC5352] and the Endpoint
   Handlespace Redundancy Protocol (ENRP) [RFC5353]. The architecture
   of RSerPool is shown below.

        +--------------+
        |  Pool User   |
        +--------------+
               ^
               | ASAP
               V
        +--------------+   ENRP   +--------------+
        | ENRP Server  |<-------->| ENRP Server  |
        +--------------+          +--------------+
               ^
               | ASAP
               V
   +--------------------------------------------------+
   | +----------+   +----------+         +----------+ |
   | |    PE    |   |    PE    | ... ... |    PE    | |
   | +----------+   +----------+         +----------+ |
   |                   Server Pool                    |
   +--------------------------------------------------+

                  Figure 4: Reliable Server Pool.

   RSerPool is similar and applicable to reliable VNF pools in the
   following ways:

   1) Pool Elements (PEs) can be regarded as VNF instances, and an ENRP
      Server can be regarded as a PM;

   2) ASAP could be applicable to reliable VNF pools for VNF instance
      management, policy enforcement, backup announcement, service
      state synchronization, transition handling, and so on;

   3) ENRP could be applicable to reliable VNF pools for fault-tolerant
      PMs.

   Potential extensions to meet the full objective of reliable VNF
   pools include:

   1) Extend ASAP to support more policy enforcement such as failure
      isolation;

   2) Extend ASAP to support more efficient instance transition.

5.2.  Virtual Router Redundancy Protocol

   Virtual Router Redundancy Protocol (VRRP) specifies an election
   protocol that dynamically assigns responsibility for a virtual
   router to one of the VRRP routers (i.e. the Master) on a LAN
   [RFC5798]. The election process provides dynamic failover in the
   forwarding responsibility should the Master become unavailable. The
   advantage of VRRP is a higher availability default path without
   requiring configuration of dynamic routing or router discovery
   protocols on every end-host. An example is shown below.

          +---------------+       +---------------+
          | VRRP Router#1 |       | VRRP Router#2 |
          |(Master, IP A) |       |(Backup, IP B) |
          |               |       |               |  VRID=1
          +---------------+       +---------------+
                  |                       |
         ---------+-----------------------+---------
                              ^
                              | (IP A)
                              |
                           +--+--+
                           | Host|
                           +-----+

                 Figure 5: Virtual Router (VRID=1)

   VRRP is similar and applicable to reliable VNF pools in the
   following ways:

   1) VRRP Routers can be regarded as VNF instances;

   2) The Master advertisement and the Master/Backup transition
      procedure can be part of the function of a PM acting as the
      designated VNF instance in the VNF pool.

   The gaps between VRRP and reliable VNF pools include:

   1) In VRRP, the loss of the Master is infrequent, while in reliable
      VNF pools the more frequent transition of VNF instances means
      that failover efficiency is a more pressing concern;

   2) There is no policy enforcement related to reliability in VRRP.

5.3.  VNF Forwarding Graph

   A VNF forwarding graph (a.k.a. service chain in a wider sense)
   defines the sequence of VNF instances that a user session must
   traverse [NFV-UC]. An example of a VNF forwarding graph is a
   topology in which user packets traverse a sequence of VNF instances
   of Intrusion Detection System (IDS), FW, Network Address Translation
   (NAT) and LB. Different services have different VNF forwarding
   graphs based on specific user needs and therefore service logic.

   The VNF forwarding graph and reliable VNF pools are independent of,
   but complementary to, each other: the VNF forwarding graph
   determines the sequential relation between VNF instances, while
   reliable VNF pools maintain the reliability and high availability of
   the VNF.

6.  Security Considerations

   Any technology which allows the insertion, deletion, reordering, or
   manipulation of network functions has the potential to be subverted
   by an attacker, with serious consequences.
   Distributed VNFs introduce an additional attack vector, in which bad
   actors join several VNFs of a service. Replay attacks have the
   potential to cause denial of service, or to reorder, add, or remove
   VNFs. VNF reliability technologies must provide cryptographic
   protections against spoofing and insertion attacks as well as replay
   attacks, in the form of client authentication, origin authentication
   on VNF reliability management (control plane) traffic, and replay
   protections. There may be circumstances under which an attacker
   masquerading as a VNF manager can introduce data leakage or similar
   attacks, and consequently server authentication would be required as
   well.

7.  IANA Considerations

   This document has no actions for IANA.

8.  Acknowledgements

   The authors would like to thank Daniel King from Lancaster
   University, UK for his valuable comments on this draft.

9.  References

9.1.  Normative References

   TBD.

9.2.  Informative References

   [NFV-WP]    NFV Whitepaper: "Network Function Virtualization",
               issue 1, 2012,
               http://portal.etsi.org/NFV/NFV_White_Paper.pdf.

   [NFV-REL]   ETSI GS NFV REL 001: "Network Function Virtualization;
               Resiliency Requirements", Version 0.0.1, 2013.

   [VNFP-UC]   L. Xia, Q. Wu and D. King, "Use cases and Requirements
               for Virtual Service Node Pool Management",
               draft-xia-vsnpool-management-use-case-01, August 2013.

   [NFV-TERM]  ETSI GS NFV 003: "Terminology for Main Conceptional
               Entities in NFV", Version 0.0.4, 2013.

   [RFC5351]   P. Lei, L. Ong, M. Tuexen and T. Dreibholz, "An Overview
               of Reliable Server Pooling Protocols", RFC5351,
               September 2008.

   [RFC5352]   R. Stewart, Q. Xie, M. Stillman and M. Tuexen,
               "Aggregate Server Access Protocol (ASAP)", RFC5352,
               September 2008.

   [RFC5353]   Q. Xie, R. Stewart, M. Stillman, M. Tuexen and A.
               Silverton, "Endpoint Handlespace Redundancy Protocol
               (ENRP)", RFC5353, September 2008.

   [RFC5798]   S. Nadas, "Virtual Router Redundancy Protocol (VRRP)
               Version 3 for IPv4 and IPv6", RFC5798, March 2010.

   [NFV-UC]    ETSI GS NFV 001: "Network Function Virtualization; Use
               Cases", Version 0.0.2, 2013.

Authors' Addresses

   Ning Zong
   Huawei Technologies

   Email: zongning@huawei.com


   Linda Dunbar
   Huawei Technologies

   Email: linda.dunbar@huawei.com


   Melinda Shore
   No Mountain Software

   Email: melinda.shore@nomountain.net