INTERNET DRAFT                                          S. Bandyopadhyay
draft-shyam-mshn-ipv6-18.txt                            January 31, 2014
Intended status: Proposed Standard
Expires: July 31, 2014

          Mesh Structured Hierarchical Networking and IPv6
                    draft-shyam-mshn-ipv6-18.txt

Abstract

This document describes an approach to reorganizing the entire network within a large address space. It describes how a three-tier mesh structured hierarchy can be established by dividing the entire space into regions and sub-regions inside each of them. It addresses issues relevant to this architecture in the context of IPv6. It also proposes an approach by which an IP switch based network can perform as well as an ATM network when processing real-time traffic.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on July 31, 2014.

Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Bandyopadhyay            Expires July 31, 2014                  [Page 1]
Internet Draft               MSHN and IPv6              January 31, 2014

Table of Contents

   1. Introduction
   2. A Three-tier mesh structured hierarchical network
      2.1. Route propagation
      2.2. Determination of prefix lengths
         2.2.1. A pseudo optimal distribution of prefixes in a 64bit architecture
         2.2.2. Whether to go for a two-tier or three-tier hierarchy
      2.3. Issues related to Satellite communications
      2.4. Solution for site multihoming
         2.4.1. Multihoming, VPN and load sharing
         2.4.2. Multihoming and IP Mobility
   3. Processing of real time packets (QoS issue)
      3.1. Dual mode operation
   4. Refinements over existing IPv6 specification
      4.1. Distributed processing and Multicasting
   5. Expected changes at the application layer
   6. IANA Consideration
   7. Security Consideration
   8. Acknowledgments
   9. Normative References
   10. Informative References
   11. Author's Address

1. Introduction

The transition from IPv4 to IPv6 is in progress. Work has been done to upgrade individual nodes (workstations) from IPv4 to IPv6. There are also established documents describing how routers/switches can support IPv4 and IPv6 packets simultaneously in order to make the transition possible [1]. The CIDR [2] based hierarchical architecture of the existing 32-bit system is expected to continue in IPv6 with a larger address space.
There are documents expressing concern that BGP tables are becoming too large in the existing system [3]. At the same time, there are proposals to extend the Autonomous System number from 16 bits to 32 bits to meet demand [4]. The challenge lies in making the transition from IPv4 to a real-IP world smooth, with the fewest possible changes.

An ATM network performs faster than a network of IP switches, and the difference is more prominent for real-time applications. ATM networks, however, use bandwidth less efficiently than IP-switch based networks.

This document describes approaches that allow an IP-switch based network to process real-time applications as fast as an ATM network, and a mesh structured hierarchical network with flat address space for routing convenience. It also provides a solution for site multihoming of stub networks.

2. A Three-tier mesh structured hierarchical network

The existing system works with an Autonomous System (AS) layer and an inter-AS layer, using CIDR. To meet demand within the 32-bit address space, Autonomous Systems of various sizes maintain a CIDR based hierarchical architecture. With the help of NAT [5], a stub network can maintain a user-ID space as large as a class A network and still communicate with the rest of the world using very few real IP addresses. With CIDR and NAT applied across the entire space, most of the 32-bit address space is effectively used as network ID. This is why the 16-bit 'Autonomous System Number' has proved insufficient for the growing number of customers. If the same approach is continued with a larger network ID, the load on the switches will become too high.

With a traditional CIDR based hierarchy, a node of higher prefix can be divided into a number of nodes with lower prefixes.
Each divided node can be subdivided further into nodes with still lower prefixes, and this process can continue until no further division is possible. At each point, the network designer has to anticipate the future expansion of the network so that the resource is never exhausted. This leads the designer to allocate far more resources than are needed, which leaves address space unused, and the concept of the H-D (host-density) ratio comes into play. The problem is aggravated once a resource does get exhausted. For example, a node of prefix /16 can be divided into a number of nodes of prefix /24. If any one of the /24 nodes is exhausted, it cannot use the resources of the other /24 nodes even if they are available.

Transition from private IP to real IP may not be a simple task, because service providers have relied heavily on NAT to provide Internet services. For example, a large educational institute meets its current requirement with 4 real IP addresses: one for its mail server, one for its web server, one for its ftp server and one for its proxy server, which provides web based services to all of its users. These four types of services are used by organizations of any size (400 users or even 40000). In the current provider network, each such organization is served with 4 IP addresses, and the CIDR based tree has been built from these components. When private IP is replaced with real IP, each customer network will require IP addresses according to its size and requirements. So, even if a CIDR based architecture is maintained in the real IP space, the existing provider network needs to be reorganized.
The desired approach is to assign address blocks proportional to the sizes (bandwidth) of the ports of the switches of the provider network.

As Autonomous Systems of various sizes are supported, Autonomous Systems and the nodes inside them can be viewed as lying on the same plane within the address space. If the network can be viewed as lying on different planes, routing becomes simpler. If the network is designed with a fixed prefix length for the Autonomous System everywhere, routing information for the rest of the network is confined to the other part of the network prefix. This means that the maximum AS size is assigned to every AS irrespective of its actual size. This is possible with a large address space divided into a number of regions of fixed size. The entire network can thus be viewed as a network of inter-AS layer nodes. Each node in the inter-AS layer can act as (i) a router in the inter-AS layer only, (ii) a router in the inter-AS layer with an Autonomous System attached to it through a single point of attachment, or (iii) an Autonomous System with multiple Autonomous System border routers (ASBRs) appearing like a mesh. A two-tier mesh structured hierarchy is thus established between the AS layer and the inter-AS layer, with each AS having a fixed prefix length.

By definition, an Autonomous System is a small area within the entire network that maintains its own independent identity and communicates with the rest of the world through specific border routers.
Similarly, if a larger area (say, a region or state) is considered as a network of Autonomous Systems that maintains its own identity by communicating with the rest of the world through border routers (say, state border routers), a mesh structured hierarchy can be established within the inter-AS layer. The inter-AS layer is then split into inter-AS-top and inter-AS-bottom. To maintain this hierarchy, each node of inter-AS-top needs multiple regional or state border routers (SBRs) through which it communicates with the rest of the world, in the same way that an Autonomous System maintains ASBRs. The entire network then appears as a network of nodes of the inter-AS-top layer. To maintain the hierarchy, each node of inter-AS-top needs a fixed prefix length, i.e. each node of inter-AS-top is assigned a maximum (fixed) number of Autonomous System nodes.

Thus, with a three-tier mesh structured hierarchy in the network layer, the network ID can be viewed as A.B.C. If pA, pB and pC are the prefix lengths of the inter-AS-top, inter-AS-bottom and AS layers respectively, there will be up to 2^pA nodes at the topmost layer, up to 2^pB inter-AS-bottom nodes inside each top level node and up to 2^pC nodes inside each AS. The entire space is thus divided into a fixed number of regions, and each region into a fixed number of sub-regions. This division is expected to be based on geography, population density, demand and related factors.

Let nMaxInterASTopNodes be the maximum number of nodes that may be assigned at the topmost layer, nMaxInterASBottomNodes that at the inter-AS-bottom layer and nMaxASNodes that at the AS layer, where nMaxInterASTopNodes <= 2^pA, nMaxInterASBottomNodes <= 2^pB and nMaxASNodes <= 2^pC.

2.1. Route propagation

With the hierarchy established, routing information that originates inside one node of inter-AS-top does not need to be propagated to another node of inter-AS-top. The entire routing information of the inter-AS-top layer needs to be propagated to the inter-AS-bottom layer. So each router of the inter-AS layer has two tables, one for inter-AS-top and another for the inter-AS-bottom of the inter-AS-top node that it belongs to.

BGP (with a small modification) will work well with one rule applied at the SBRs: an SBR does not propagate the routing information of the inter-AS-bottom layer of its domain to an SBR of a neighboring domain. That is, the SBR of one top layer node propagates only inter-AS-top routing information to the SBR of another top layer node. Inside a node of inter-AS-top, routing information of both inter-AS-top and inter-AS-bottom needs to be propagated from one ASBR to a neighboring ASBR. Inside a top layer node A, routing information for another top layer node B has two parts: the list of SBRs through which a packet traverses from top layer node A to B, and the list of ASBRs through which the packet traverses from one AS to another inside A. In terms of BGP, the AS_PATH attribute is split into two parts, one for the top layer and one for the bottom layer. Within the same node A, routing information from one AS to another AS has no top layer information, i.e. the top layer part is set to NULL.

Similarly, each node of the AS layer has three tables of routing entries: one for inter-AS-top, one for inter-AS-bottom and one for the routing information inside the Autonomous System itself.

Introducing hierarchy at the inter-AS layer reduces the size of the routing table substantially. If hardware resources allow a flat address space to be maintained at each layer, the problems related to CIDR can be avoided.
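The propagation rule of section 2.1 can be sketched as a filter that a border router applies before advertising a route to a neighbor. This is a minimal illustration only: the type and function names (route_layer, peer_kind, split_as_path, should_advertise) and the fixed MAX_HOPS bound are assumptions made for the example, and none of them is part of BGP or defined by this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; they are not BGP definitions. */
enum route_layer { LAYER_INTER_AS_TOP, LAYER_INTER_AS_BOTTOM };
enum peer_kind {
    PEER_SBR_OTHER_TOP_NODE,   /* SBR of a neighboring top layer node */
    PEER_ASBR_SAME_TOP_NODE    /* ASBR inside the same top layer node */
};

#define MAX_HOPS 16            /* arbitrary bound chosen for the sketch */

/* AS_PATH split into a top layer part and a bottom layer part. */
struct split_as_path {
    unsigned top_hops[MAX_HOPS];     /* SBR level hops */
    size_t   top_len;                /* 0 models the NULL top layer part */
    unsigned bottom_hops[MAX_HOPS];  /* ASBR level hops inside the node */
    size_t   bottom_len;
};

struct route {
    enum route_layer     layer;
    struct split_as_path path;
};

/* An SBR facing another top layer node forwards only inter-AS-top
 * routes; inside a top layer node both layers are propagated. */
bool should_advertise(const struct route *r, enum peer_kind peer)
{
    assert(r != NULL);
    if (peer == PEER_SBR_OTHER_TOP_NODE)
        return r->layer == LAYER_INTER_AS_TOP;
    return true;
}

/* A route between two ASes of the same top layer node carries no
 * top layer path information. */
bool is_intra_top_node(const struct route *r)
{
    assert(r != NULL);
    return r->path.top_len == 0;
}
```

A real implementation would also have to carry the split path on the wire; that encoding is not specified in this section, so it is left out of the sketch.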
With flat address space, no hierarchical relationship needs to be established between any two nodes in the same layer, so all the nodes inside each layer can be used until they are exhausted. With flat address space (i.e. without prefix reduction), BGP tables will have at most nMaxInterASTopNodes + nMaxInterASBottomNodes entries.

An IGP such as OSPF has a provision to divide an AS into smaller areas. OSPF hides the topology of an area from the rest of the Autonomous System, and this information hiding significantly reduces routing traffic. With subnetting support, OSPF attaches an IP address mask to a route to indicate the range of IP addresses it describes. This reduces routing traffic because the nodes inside the range need not be described individually, but it introduces another level of hierarchy. If subnetting is avoided in the AS layer (at the cost of additional computation in the SPF tree), each area can be configured dynamically from a free pool of addresses according to its requirements. So an AS can be divided into a number of areas of heterogeneous sizes, with nodes drawn from a free pool of address space.

Similarly, the concept of area can be introduced in the inter-AS-bottom layer in the way it works in OSPF. The area border routers in the inter-AS-bottom layer have to behave exactly as an ABR behaves in OSPF: an area border router hides the topology inside an area from the rest of the world and distributes the information collected inside the area to the rest. It also distributes the routing information collected from outside to the nodes inside. To implement this, the protocol running in the inter-AS layer (say, BGP) has to introduce a 'cost' factor. This cost factor can be interpreted as the cost of propagating a packet from one AS to another.
The protocols running inside the AS layer (RIP, OSPF, etc.) have to supply the cost for a packet to travel from one ASBR to another, and all the protocols must supply this information consistently. The cost factor is needed when a remote node sends a packet to a node inside an area and more than one area border router is equidistant from that remote node. Thus the inter-AS-bottom layer (i.e. one inter-AS-top level node) can be divided into a number of areas of heterogeneous sizes, with AS nodes drawn from a free pool of address space.

BGP adopts a technique called route aggregation, which reduces the routing information carried within a message. Similarly, introducing areas inside the inter-AS-bottom layer will not only reduce the complexity of the protocol but also reduce the size of a BGP packet substantially.

With this architecture, each node (router) inside an AS is represented as A.B.C. Each node may or may not be attached to a network, which acts as a leaf node (i.e. the network does not act as a transit). To use the user-ID space properly and to support customer networks of heterogeneous sizes, the user-ID space needs to be divided into subnet-ID and user-ID. In effect, a VLSM (variable length subnet mask) approach has to be adopted at each node of an AS. So each node of the AS layer acts as the root of a tree whose leaves are independent small customer networks that act as stubs. As the routing information of the inter-AS layer and the AS layer need not be passed inside any node of the VLSM tree, each router inside the tree should maintain a default route for any address outside its network. With this approach, the load on each router of the service providers becomes negligible.
Protocols that support VLSM with MPLS/VPN have to be implemented inside the tree (inside the VLSM tree, all the physical ports of a switch have to be configured with the subnet mask, so MPLS on top of a static routing table should do the rest).

The fundamental assumptions on which this architecture rests can be summarized as follows:

i) The entire network can be viewed as a network of regions or states, where each region or state has its own identity and communicates with the rest of the world through state border routers. Each region or state is a network of Autonomous Systems. Each region, and each Autonomous System inside it, has a fixed (maximum) prefix length.

ii) Hardware resources are sufficient for flat address space to be maintained at the inter-AS layer.

Introduction of a mesh structured hierarchy at the inter-AS layer has several advantages:

o The load at each router is reduced substantially.

o The CIDR style approach and the complexity related to prefix reduction are avoided.

o The mesh structured hierarchy distributes traffic evenly.

o Physical cable connections can be optimized.

o Administrative issues become easier.

2.2. Determination of prefix lengths

With this architecture, an IP address can be described as A.B.C.D, where the D part represents the user id. Each router in the inter-AS layer has two tables, one for inter-AS-top and another for the inter-AS-bottom of the inter-AS-top node that it belongs to, whereas each node of the AS layer has three tables of routing entries: one for inter-AS-top, one for inter-AS-bottom and one for the routing information inside the Autonomous System itself. In the worst case, a node inside an AS needs to maintain nMaxInterASTopNodes + nMaxInterASBottomNodes + nMaxASNodes entries in its routing table.
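The worst-case table sizes stated above can be written out as a small calculation. The sketch below takes the prefix lengths pA, pB and pC as parameters and uses the upper bounds 2^pA, 2^pB and 2^pC in place of nMaxInterASTopNodes, nMaxInterASBottomNodes and nMaxASNodes. The function names are illustrative assumptions, not part of any protocol.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on the number of nodes addressable with `bits` bits. */
uint64_t max_nodes(unsigned bits)
{
    assert(bits < 64);
    return (uint64_t)1 << bits;
}

/* Worst case for a router of the inter-AS layer: one table for
 * inter-AS-top plus one for the inter-AS-bottom of its own node. */
uint64_t inter_as_router_entries(unsigned pA, unsigned pB)
{
    return max_nodes(pA) + max_nodes(pB);
}

/* Worst case for a node inside an AS: the two tables above plus
 * the table for the routes inside its own Autonomous System. */
uint64_t as_node_entries(unsigned pA, unsigned pB, unsigned pC)
{
    return inter_as_router_entries(pA, pB) + max_nodes(pC);
}
```

With, for example, pA = 10, pB = 16 and pC = 12, as_node_entries returns 1024 + 65536 + 4096 = 70656 entries, and inter_as_router_entries returns 66560.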
Areas are allocated dynamically from a free pool of address space more frequently at the AS layer than at the inter-AS-bottom layer. As OSPF supports all the features needed, it can be considered the default choice in the AS layer. The existing implementation of OSPF (Version 2) supports subnetting, by which an entire area can be represented as a combination of network address and subnet mask. With this approach the routing table is reduced substantially. With subnetting removed, every node inside an area will have an entry in the routing table (OSPF Version 1). The determining factor is therefore the maximum number of nodes inside an AS that OSPF can support once subnetting support is removed, and this determines the prefix length of the AS layer.

With hierarchy introduced in the inter-AS layer, the number of entries in the BGP routing table is reduced substantially. Even if pA and pB are both set to 16, the number of routing entries is within the admissible range of the existing BGP protocol. It is, however, the responsibility of IANA to devise a scheme for selecting nMaxInterASTopNodes and nMaxInterASBottomNodes.

Each top level node will have nMaxInterASBottomNodes nodes. It would waste address space if each country were assigned a top level node (e.g. China has a population of 1,306,313,800, whereas Vatican City has only 920, according to a census of 2006). So a moderate value of nMaxInterASBottomNodes is desirable, with which larger countries will have a number of top level nodes; e.g. each state of the USA can be assigned a top level node. With areas introduced in the inter-AS-bottom layer, each top level node can be divided into a number of areas of heterogeneous sizes, so a group of neighboring countries with small populations can share the address space of a top level node.
Similarly, the size of the user-id space has to be decided based on the largest area a VLSM tree should span. All these issues are geopolitical and have to be decided by IANA.

2.2.1. A pseudo optimal distribution of prefixes in a 64bit architecture

To make optimal use of cable connections, the VLSM tree should be as short as possible. A single organization may also prefer to have its user-id space under the same network id. A 16-bit user-id may be insufficient for places like a large university campus, whereas 32 bits would be too large. Hence a 24-bit user-id is a moderate choice; this is the size of the class A address space in IPv4 (also used as private IP space).

As published in 1998 [8], OSPF can support an area with 1600 routers and 30K external LSAs, so 11 bits are needed to support this space. On the assumption that OSPF can support a much larger address space as hardware technology advances, and to keep room for future expansion, 12 bits are assigned to the AS layer. 16 bits are assigned to the inter-AS-bottom layer. So if, on average, a 16-bit equivalent space is used within the user-id space and 8-bit equivalent nodes are used inside an AS (16% of 1600), a top level node (with 16-bit equivalent AS nodes) generates 2^40 IP addresses. This gives 8629 IP addresses per person in Japan (population 127417200; Japan is 10th in the world population list). So even if every country with a population less than or equal to that of Japan is assigned a top level node, and every province/state of the countries with larger populations is assigned a top level node each, the total number of nodes will be well under 1024.
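The two figures used in this estimate, 11 bits for 1600 routers and 8629 addresses per person, can be reproduced with integer arithmetic. The helper names below are assumptions introduced for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest b such that 2^b >= n, for 1 <= n <= 2^63. */
unsigned bits_needed(uint64_t n)
{
    unsigned b = 0;
    assert(n >= 1 && n <= ((uint64_t)1 << 63));
    while (((uint64_t)1 << b) < n)
        b++;
    return b;
}

/* Whole addresses per person when 2^address_bits addresses are
 * shared by `population` people (integer division, rounded down). */
uint64_t addresses_per_person(unsigned address_bits, uint64_t population)
{
    assert(address_bits < 64 && population > 0);
    return ((uint64_t)1 << address_bits) / population;
}
```

Here bits_needed(1600) evaluates to 11, since 2^10 = 1024 < 1600 <= 2048 = 2^11, and addresses_per_person(40, 127417200) evaluates to 8629, where 40 = 16 (user-id in use) + 8 (AS nodes in use) + 16 (inter-AS-bottom).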
If a number of neighboring countries with smaller populations share a top level node, the total number of top level nodes will come down further. This suggests that a 62-bit equivalent space (10 (pA) + 16 (pB) + 12 (pC) + 24 (user-id)) is sufficient for unicast addresses. This distribution expects OSPF to support 65K (64K+1K) external LSAs.

The 64-bit address space may be divided into two 63-bit blocks as follows:

i. Global unicast addresses, with the most significant bit set to 0.

One option for separating router address space from the host computers of customer networks is to assign routers the prefix 01 and host computers the prefix 00. With the three-tier hierarchy, the network ID is represented as A.B.C. Any router inside the VLSM tree, including the root, would then have an address 01A.B.C.router-id, whereas a host interface inside a customer network would be represented as 00A.B.C.uid.

However, the number of nodes representing routers in the provider network will be far smaller than the user-id space of the customer networks. To keep more space for unicast addresses of customer networks and to leave room for future expansion, the entire 63-bit address space with the MSB set to 0 has been assigned to customer networks for unicast addresses. So the distribution will look like 10 (pA) + 17 (pB) + 12 (pC) + 24 (user-id). Router address space will be assigned from the address space with the MSB set to 1.

ii. The address space with the MSB set to 1 will be distributed among the remaining uses. This distribution will be based on the requirements and the work already done in connection with IPv6, along with the following requirements:

a) Router address space: Any node in the router address space will be designated by a prefix followed by A.B.C.router-id. The prefix will be determined based on the distribution of the 63-bit address space.
b) Provider independent address space: This space will be used for customers who would like to retain their numbers even after changing providers. Each of these addresses has to be mapped to an address from the global unicast address space. For customers who would like mobility support, the mapped address can be considered the "Home Address" of the mobile node as defined in the specification of "IP Mobility Support" [9].

c) Address space for multicasting

d) Address space for private IP

2.2.2. Whether to go for a two-tier or three-tier hierarchy

Establishing hierarchy in the inter-AS layer greatly reduces the number of BGP entries, but leads to inefficient use of address space for geopolitical reasons. If the hierarchy in the inter-AS space is removed, the entire 26-bit (10+16) space becomes available to a single layer and the inter-AS space is used to its full extent, but the number of external LSAs (and/or entries in the BGP table) increases dramatically. So it depends on the extent to which OSPF can support external LSAs.

BGP expects the packet length to be limited to 4096 bytes. BGP manages to work within this limit using prefix reduction in the CIDR based environment. As the number of inter-AS nodes increases, BGP has to change this limit in order to work with flat address space. The alternative is to divide the inter-AS space into a number of areas as defined in section 2.1, with the area border routers advertising aggregated information to the rest of the world. BGP may have to incorporate both options at the same time.

As the number of nodes in the inter-AS layer increases, the inter-AS space has to be split into two separate planes to reduce the number of entries in the routing table. So a two-tier hierarchy can be considered an interim state on the way to a three-tier hierarchy.
If currently available data shows that the two-tier hierarchy is sufficient for present needs, it is worth examining how far it can support future needs. Assignment of inter-AS nodes in a two-tier hierarchy should be based on geographical distribution, as if it were part of a three-tier hierarchy. Otherwise, introducing the three-tier hierarchy in the future will be another difficult task.

According to a report from 2011, BGP supports ~400,000 entries in the routing table. With this growing trend, BGP may have to change the packet length limit even in a CIDR based environment. With a two-tier hierarchy the number of entries in the routing table will come down drastically, and with the three-tier approach it will come down further.

2.3. Issues related to Satellite communications

Establishing hierarchy in the inter-AS layer assumes that any two autonomous systems in two different top level nodes communicate only through their SBRs. If two autonomous systems inside the same top level node communicate through satellite, this is considered a direct link between them. Whenever autonomous system 'ASa' of top level node 'A' communicates with autonomous system 'ASb' of top level node 'B' through satellite, they have to go through their state border routers, i.e. a satellite port inside 'A' that communicates with a satellite port inside 'B' is considered a state border router. If multiple such ports exist inside node 'A', all of them are equidistant from any port inside 'B'. This requires any satellite port inside 'B' to know in advance the list of autonomous systems that fall under the purview of each port inside 'A'. So all the satellite ports of 'A' have to exchange this information with all the satellite ports of 'B', and vice versa.
Such a group of autonomous systems can be considered a cluster of autonomous systems inside an area of a top level node. If the number of such ports is small, some heuristics can be applied while assigning AS numbers in order to reduce the processing time during the circuit establishment phase. It becomes difficult to maintain such heuristics once the number of ports becomes large. So, for satellite communication, the advantage of establishing hierarchy inside the inter-AS layer diminishes as the number of satellite ports increases.

If a private corporation maintains its own satellite channel to communicate between its offices at distant locations, all of these offices are considered to be within the user-id space of its network. Service providers that provide satellite services to end-site customers can operate in the usual manner, as they provide connections to customer networks that act as stubs.

2.4. Solution for site multihoming

This is a general solution for site multihoming of stub networks in a real IP world, irrespective of the actual framework of the service provider network.

RFC1122 [10] made an extensive study of the requirements of a multihomed host in a connected environment with a single gateway to the outside world. Some of the requirements suggested in that document related to UDP and the application layer were avoided by TCP/IP implementations, which make sure that the interface address of an outgoing packet is selected based on the route to be followed to the destination address. This criterion holds in a connected environment with a single gateway to the outside world. Once there is more than one gateway to the outside world, either the routing table of the entire world has to be brought in, or the existing system needs some enhancement to make things work.
Whenever a customer network gets service from more than one service provider, the customer network can be viewed as having multiple source-id (user-id) spaces. Each of these IP domains is connected to a different service provider through a different router. So each interface of the customer network will have as many IP addresses as the number of service providers it is connected to, and the number of entries in the routing table will (roughly) be multiplied by the number of IP domains it supports. Communication between any two hosts within the customer network follows the traditional routing mechanism.

To provide multihoming services, a host computer must always forward packets to the router associated with the same IP domain while communicating with the outside world. That is, if a host computer H receives IP addresses 'addr1' and 'addr2' from two service providers P1 and P2, which are connected through routers R1 and R2 respectively, host H has to forward a packet to R1 (or R2) when using 'addr1' (or 'addr2') as its source address to send packets to the outside world. So a host computer, as well as the intermediate routers, has to use default routing based on the domain of the source address in the IP header. To achieve this, host computers and intermediate routers need information about each of their IP domains (net address/net mask) and the associated default router. They need a route entry for the default router of each IP domain. This information should be loaded at system start-up.

As each interface has multiple IP addresses, hosts need a provision to select a default IP domain. Users can change this selection dynamically according to their needs.
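The source-based default routing described above amounts to a lookup from the source address to the default router of its IP domain. The following is a minimal sketch under stated assumptions: addresses are modelled as 32-bit integers for brevity, and the structure ip_domain and the function select_default_router are hypothetical names, not part of any existing TCP/IP implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per IP domain (one per service provider), loaded at
 * start-up. The layout is an assumption made for this sketch. */
struct ip_domain {
    uint32_t net_addr;        /* network address of the domain */
    uint32_t net_mask;        /* network mask of the domain */
    uint32_t default_router;  /* router leading to that provider */
};

/* Picks the default router for an outbound packet from the domain
 * of its SOURCE address. Returns 0 and stores the router in
 * *next_hop, or -1 if the source matches no configured domain. */
int select_default_router(const struct ip_domain *domains, size_t n,
                          uint32_t src, uint32_t *next_hop)
{
    assert(domains != NULL && next_hop != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t mask = domains[i].net_mask;
        if ((src & mask) == (domains[i].net_addr & mask)) {
            *next_hop = domains[i].default_router;
            return 0;
        }
    }
    return -1;
}
```

In terms of the example in the text, a packet whose source address is 'addr1' matches the domain received from P1 and is therefore sent to R1, whatever its destination outside the site.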
If no source address has been specified by an application, the source address has to be selected based on the outgoing interface and the 'default IP domain' as selected by the user. Selection of the 'default IP domain' becomes effective only while initiating communication to the outside world.

UDP based servers that need to support multiple clients simultaneously need to respond to a client's request with the same source address that the client had specified as the destination address. In order to satisfy this, the system needs to introduce two system calls alongside the existing ones (i.e. read, write, send, sendto, recv, recvfrom):

   int recvwithdstaddr (int sockfd, char *buf, int nbytes, int flags,
                        struct sockaddr *from, int fromlen,
                        struct sockaddr *dst, int dstlen);

'recvwithdstaddr' receives data together with the destination address as specified by the sender. It is similar to 'recvfrom', with additional fields for the address of the receiving interface of the host.

   int sendwithsrcaddr (int sockfd, char *buf, int nbytes, int flags,
                        struct sockaddr *to, int tolen,
                        struct sockaddr *src, int srclen);

'sendwithsrcaddr' sends data specifying the source address of the outgoing interface of the host. It is similar to 'sendto', with additional parameters for the source address. It behaves like 'sendto' if no address is specified for 'src'. If the application layer calls 'bind' with an address != INADDR_ANY, then the address specified by 'bind' prevails over 'src' of 'sendwithsrcaddr'.

All UDP based servers need to replace 'sendto' with 'sendwithsrcaddr' and 'recvfrom' with 'recvwithdstaddr'. The current implementation of Net/3 passes a pointer to the protocol control block (PCB) to the IP layer.
Changes in the UDP client applications can be avoided by maintaining a cache of the headers of the incoming and outgoing IP packets at the PCB and making appropriate changes in the IP as well as in the UDP layer. So, if all implementations of TCP/IP follow the same approach, UDP client applications need not be changed. In order to maintain consistency between UDP server and UDP client applications, it will be better if UDP client applications also use 'sendwithsrcaddr' in place of 'sendto' and 'recvwithdstaddr' in place of 'recvfrom'.

In order to use 'sendwithsrcaddr' before using 'recvwithdstaddr', an application program (e.g. a UDP client) needs to know its source address. So, another system call needs to be introduced to get the source address based on the destination address:

   struct in_addr getsrcaddr(struct in_addr *dst);

Applications with RAW sockets need to follow the path of UDP applications. All TCP based applications should work in the usual manner.

Routing of IP packets (in the ip_output module of the hosts and in the ip_forwarding module of the intermediate routers) needs to be modified in the following manner.

If the destination address of the IP header falls within any one of its IP domains, the usual routing mechanism has to be followed with a minimal change. If no source address is specified by the application layer, the source address has to be selected based on the outgoing interface and the domain that the destination address belongs to.

If the destination address falls outside of its IP domains, packets have to be forwarded to one of the default routers. The outgoing interface has to be selected based on the route look up of the default router in the routing table. If no source address is specified by the application layer, the source address has to be selected based on the 'default IP domain' as selected by the user.
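The source address selection rules above can be illustrated with a small user-space sketch in C. This is not the Net/3 change itself: the 'local_domain' structure, the 'select_src_addr' function and the idea of passing the default IP domain as a table index are assumptions made for illustration, and addresses are held as host-order 32-bit integers for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-domain record: the prefix of one IP domain and the
 * address this host was assigned inside it (host byte order). */
struct local_domain {
    uint32_t net_addr;
    uint32_t net_mask;
    uint32_t local_addr;
};

/* Returns nonzero when addr lies inside the prefix of domain d. */
static int in_domain(const struct local_domain *d, uint32_t addr)
{
    return (addr & d->net_mask) == (d->net_addr & d->net_mask);
}

/*
 * Pick a source address for a packet whose application left it unset.
 *  - Destination inside one of our IP domains: use our address in
 *    that domain.
 *  - Destination outside all of our IP domains: use our address in
 *    the user-selected default IP domain (table index def_idx).
 * Returns 0 when the table is empty or def_idx is out of range.
 */
uint32_t select_src_addr(const struct local_domain *tbl, size_t n,
                         size_t def_idx, uint32_t dst)
{
    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (in_domain(&tbl[i], dst))
            return tbl[i].local_addr;
    return (def_idx < n) ? tbl[def_idx].local_addr : 0;
}
```

With two provider domains in the table, a destination inside the first domain yields the host's address in that domain, while any outside destination yields the address in whichever domain the user has marked as default.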
If the customer network maintains a private IP domain, communication using private IP addresses has to be restricted to the private IP space.

An implementation of TCP/IP needs to support multiple IP addresses per interface; also, in order to provide a load sharing facility in a VPN environment (section 2.4.1), it needs to support the weak end system model[10]. Net/3 supports both these features. The following changes are expected in the source code of Net/3.

Introduce the ip_domain structure and some parameters as follows:

   struct ip_domain {
       struct in_addr net_addr;
       struct in_addr net_mask;
       struct in_addr def_router;
   };

   #define MAX_IP_DOMAINS 16

   short num_ipdomains;
   struct ip_domain *ipdomain[MAX_IP_DOMAINS];

If the customer network maintains a private IP domain (along with the user-id space provided by the service providers) and expects its communication to be confined within its own space, the def_router field should be set as NULL.

Upload the IP domain information for all of its IP domains during system start up. Three new sysctl routines have to be introduced under the 'ip' node of the MIB tree (i.e. under CTL_NET, PF_INET, IPPROTO_IP): IPCTL_NUM_DOMAINS, IPCTL_DOMAIN and IPCTL_DEFROUTER. Using 'sysctl', the IPCTL_NUM_DOMAINS entry (i.e. num_ipdomains) has to be configured first. Populate 'num_ipdomains' MIB attributes of domains under IPCTL_DOMAIN, and for each IP_DOMAIN allocate MIB entries for the attributes of that domain (DOMAIN_NET_ADDR, DOMAIN_NET_MASK and DOMAIN_DEF_ROUTER, i.e. the attributes of ip_domain). Users should get a provision to change the IPCTL_DEFROUTER attribute dynamically. As each interface is going to have multiple IP addresses, the variable 'defrouter' has to be assigned a value that matches the field 'def_router' of an entry of 'domaininfo'.

Add a route entry for each of the routers connecting to the service providers during system start up (i.e. when /etc/netstart gets executed).
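A minimal sketch of how the ip_domain table could drive source-based default routing is given below. It compiles in user space against the standard socket headers rather than inside the Net/3 kernel. The function names 'find_ip_domain' and 'def_router_for_src' are hypothetical, the table is passed as a flat array for simplicity, and the "NULL" default router of a confined private domain is modelled here as INADDR_ANY, which is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>   /* struct in_addr, INADDR_ANY */
#include <arpa/inet.h>    /* htonl */

/* Same fields as the ip_domain structure introduced in the text.
 * All addresses are kept in network byte order. */
struct ip_domain {
    struct in_addr net_addr;
    struct in_addr net_mask;
    struct in_addr def_router;  /* INADDR_ANY: no default router (assumption) */
};

/* Return the IP domain whose prefix contains addr, or NULL if none.
 * Masking is byte-order independent as long as all operands agree. */
const struct ip_domain *
find_ip_domain(const struct ip_domain *tbl, size_t n, struct in_addr addr)
{
    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if ((addr.s_addr & tbl[i].net_mask.s_addr) ==
            (tbl[i].net_addr.s_addr & tbl[i].net_mask.s_addr))
            return &tbl[i];
    return NULL;
}

/*
 * Source-based default routing: the default router is taken from the
 * IP domain that the source address belongs to.
 * Returns 0 and fills *router on success; returns -1 when the source
 * belongs to no configured domain or its domain has no default router
 * (e.g. a private domain confined to its own space).
 */
int
def_router_for_src(const struct ip_domain *tbl, size_t n,
                   struct in_addr src, struct in_addr *router)
{
    const struct ip_domain *d = find_ip_domain(tbl, n, src);

    if (d == NULL || d->def_router.s_addr == htonl(INADDR_ANY))
        return -1;
    *router = d->def_router;
    return 0;
}
```

A linear scan is adequate here because MAX_IP_DOMAINS is small; a caller would use the returned router address as the key for the usual route look up.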
Add the following entries in the inpcb structure to store (cache) IP header information:

   struct ip inp_pkt_rcvd; /* cached header of last packet received */
   struct ip inp_pkt_sent; /* cached header of last packet sent */

Execute the following steps in the 'ip_output' routine of the IP stack before it calls 'rtalloc' for route look up.

   If destination address of the IP packet falls outside of its IP
   domains {
       If source address has been specified,
       (i.e. ip->ip_src.s_addr != INADDR_ANY) {
           get default router address based on the source IP domain
           it belongs to.
       } else {
           if (ip->ip_dst.s_addr == inp->inp_pkt_rcvd.ip_src.s_addr) {
               ip->ip_src.s_addr = inp->inp_pkt_rcvd.ip_dst.s_addr;
               get default router address based on the source IP
               domain the source address belongs to.
           } else if (ip->ip_dst.s_addr ==
                      inp->inp_pkt_sent.ip_dst.s_addr) {
               ip->ip_src.s_addr = inp->inp_pkt_sent.ip_src.s_addr;
               get default router address based on the source IP
               domain the source address belongs to.
           } else if destination address is from private address
             space {
               get source address as the private IP address of any
               of its interfaces.
               get default router based on the selected private IP
               address from its IP domains.
           } else {
               get default router based on the selected 'default IP
               domain'.
           }
       }
       use 'rtalloc' to get the next hop address for the default
       router.
       If source address has not been specified {
           select source address based on the outgoing interface
           'ia' and the 'default IP domain' as selected by the user.
       }
       Forward the packet to the next hop.
   } else { /* i.e. destination address is inside its IP domains */
       follow the usual procedure to forward packets with the
       following changes.
       If source address has not been specified {
           If destination address is from private address space {
               select source address based on the outgoing interface
               and the private address assigned to it.
           } else {
               select source address based on the outgoing interface
               and the domain that the destination address belongs
               to.
           }
       }
   }
   store the header info of the packet sent:
   inp->inp_pkt_sent.ip_src.s_addr = ip->ip_src.s_addr;
   inp->inp_pkt_sent.ip_dst.s_addr = ip->ip_dst.s_addr;

The udp_input and rip_input routines have to be updated to store the header of the packet received in inp_pkt_rcvd. In Net/3, the 'ip_forwarding' routine calls 'ip_output', so it should be left as it is.

2.4.1. Multihoming, VPN and load sharing

A corporation that maintains multiple offices, which communicate among themselves through private address space using a VPN, can share the load of outgoing private IP traffic by segregating the private IP domain of each office into a number of sub domains through suitable configuration.

Let us consider one of its offices that gets connected to two providers P1 and P2 and gets address spaces 'unicastNetAddr1'/'unicastNetMask1' and 'unicastNetAddr2'/'unicastNetMask2' respectively. It also gets assigned the private address space 'privateDomainNetAddr'/'privateDomainNetMask' by its corporation. For load sharing, it wants to maintain two sub domains with ID spaces 'subDomainNetAddr1'/'subDomainNetMask1' and 'subDomainNetAddr2'/'subDomainNetMask2' respectively. Sub domain 1 gets associated with the default router CE1 and sub domain 2 gets associated with CE2.
Host computers and intermediate routers will be configured in the following manner.

All hosts of sub domain 1 will have three entries of ip_domain:

   #  net_addr              net_mask              def_router
   1  unicastNetAddr1       unicastNetMask1       CE1
   2  unicastNetAddr2       unicastNetMask2       CE2
   3  privateDomainNetAddr  privateDomainNetMask  CE1

All hosts of sub domain 2 will have three entries of ip_domain:

   #  net_addr              net_mask              def_router
   1  unicastNetAddr1       unicastNetMask1       CE1
   2  unicastNetAddr2       unicastNetMask2       CE2
   3  privateDomainNetAddr  privateDomainNetMask  CE2

All intermediate routers will have four entries of ip_domain:

   #  net_addr              net_mask              def_router
   1  unicastNetAddr1       unicastNetMask1       CE1
   2  unicastNetAddr2       unicastNetMask2       CE2
   3  subDomainNetAddr1     subDomainNetMask1     CE1
   4  subDomainNetAddr2     subDomainNetMask2     CE2

If any of the CE-PE links fails, that particular CE needs to forward its outgoing traffic to the other CE, whose CE-PE link remains active. This can be achieved by a tunneling mechanism or by providing a hot link between the CEs. Such forwarding should be restricted to packets of the private IP space.

CE routers need to communicate among themselves at regular intervals and elect a leader. The elected leader should get the privilege to forward IP broadcast packets to other sites in order to avoid duplicates. Only broadcast packets that originate at the local site need to be forwarded to the other sites.
For a remote site that is connected through PE routers RPE1 and RPE2, the PE router of the local site can share the load of its outgoing traffic by segregating that traffic in a suitable manner. If the link to either RPE1 or RPE2 fails, it needs to forward all the traffic over the active link.

2.4.2. Multihoming and IP Mobility

If a mobile node gets a co-located care-of IP address at its current location[9], it usually uses its 'home address' as its source address while communicating with the correspondent node. As the multihoming scheme for outgoing packets expects the source domain to be the deciding factor for packet forwarding, the transport layer of the mobile node should use IP over IP while forwarding packets. The inner IP header should, as usual, use the home address as the source address; the outer IP header should use the co-located care-of address as the source address.

If the correspondent node is also mobile, packets towards the correspondent node will reach the home agent of the correspondent node. The home agent of the correspondent node should pop the outer IP header and replace it with the header needed to forward the packets to their final destination, in order to avoid further stacking of IP headers. If it so happens that there are applications that need to use IP over IP and the home agent needs to preserve the stack of IP headers, a new protocol type has to be introduced just to specify the mobility aspect.

The co-located care-of IP address has to be bound to one of the IP addresses supported by the service providers (if the mobile node advertises more than one address, the home agent will get confused; there are other implications as well). The transport layer must ensure that the 'home address' gets tightly coupled with this IP address.

3.
Processing of real time packets (QoS issue)

This section attempts to come up with a solution that lets an IP switch based network operate in the most user-friendly manner to transport data traffic (IP) as well as real time (RT) traffic (as RTP[6] packets) in the existing 32-bit system.

In the case of IP routing/switching, the entire packet gets collected at the intermediate router/switch and forwarded based on the forwarding table. Inside the switch/router, the variable length IP packet gets fragmented into smaller size frames at the ingress side. The frames get transported through the switching fabric with a proper priority mechanism (to support QoS), then get reassembled at the egress side and passed through the media to the next hop.

In the case of ATM, packets get fragmented at the ingress edge devices into small size cells. The entire packet gets transported as a stream of cells and gets collected at the egress edge device. The advantage of ATM over IP routing, as far as speed is concerned, is due to the fact that latency is reduced because the entire packet does not get collected, fragmented and reassembled at the intermediate nodes.

So, in the case of an IP switch based network, better performance can be expected if RT packets can be passed without getting fragmented inside the switch, i.e. one RT packet needs to fit inside one internal frame of the switch fabric. Additionally, to make this approach successful, a maximum size of the MPLS label stack has to be defined. Inside the switch, all IP packets will be assumed to carry the same number of MPLS labels, whether they actually carry one or the maximum. In fact, to reduce overhead, this limit should be the minimum number of labels needed to satisfy all the features supported by MPLS, i.e. label stacking of depth n (without limit) needs modification.

If the minimum frame size is selected to fit one RTP packet, overhead becomes too high due to the very large packet header (40 bytes: 20 bytes IP + 8 bytes UDP + 12 bytes RTP).
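The two costs that pull the frame size in opposite directions, header bytes inside every RT frame and padding when data packets are cut into fixed size frames, can be made concrete with a small C sketch. It assumes 4 bytes per MPLS label and treats fragmentation loss as the unused bytes in the last frame of a packet; the function names are hypothetical and the draft's own overhead computation may differ in detail.

```c
#include <assert.h>

/* Bytes added to every packet by an MPLS label stack of depth n
 * (each label stack entry is 4 bytes). */
static int label_bytes(int n)
{
    return 4 * n;
}

/*
 * Padding wasted when a data packet of pkt_len bytes, plus its label
 * stack, is cut into frames of frame_size payload bytes each.
 * Assumption: loss is the unused space in the last frame.
 */
int frag_loss(int pkt_len, int n, int frame_size)
{
    assert(frame_size > 0);
    int total = pkt_len + label_bytes(n);
    int rem = total % frame_size;
    return rem == 0 ? 0 : frame_size - rem;
}

/*
 * Media payload left in a frame when one RT packet exactly fills it:
 * frame size minus 40 header bytes (20 IP + 8 UDP + 12 RTP) and the
 * label stack.
 */
int rt_payload(int n, int frame_size)
{
    return frame_size - 40 - label_bytes(n);
}
```

For example, with a label depth of 3 and a frame size of 152 bytes, a 40-byte data packet occupies 52 bytes of one frame and wastes the remaining 100, while an RT packet that fills the frame carries 100 bytes of payload behind 52 bytes of headers and labels.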
Again, if a large frame size is used, fragmentation loss becomes too high for small packets (say, 40-byte IP packets). So, a compromise is needed that gives a better result based on the IP packet size distribution. The frame size is selected to minimize the total overhead, i.e. the fragmentation loss of data packets plus the header overhead of RT packets.

Studies show that IP data packets of three different sizes are the most common: ~50% of packets are 40 bytes (TCP ACK), ~20% of packets are 576 bytes (path MTU set by X.25) and ~30% of packets are 1500 bytes (path MTU set by Ethernet). Packets of other sizes are few compared to these three categories and are almost evenly distributed. For simplicity of calculation, only traffic of these three categories is considered.

The payload of data traffic is the actual IP packet size, whereas the payload of RT traffic is the payload inside the RTP packet. If totBytes are to be transported across the internet and dataPcnt is the percentage of data traffic:

   data traffic = totBytes*dataPcnt/100
   RT traffic   = totBytes*(100-dataPcnt)/100

Out of the data traffic, 50% of packets are 40 bytes long, 20% are 576 bytes long and 30% are 1500 bytes long. If totDataPkts is the total number of data packets:

   totDataPkts*(50*40/100 + 20*576/100 + 30*1500/100)
       = totBytes*dataPcnt/100
   or, totDataPkts*58520 = totBytes*dataPcnt

Let totBytes = 58520*100 for ease of calculation, i.e. totDataPkts = dataPcnt*100. Then:

   40-byte packets   = 50*totDataPkts/100 i.e. 50*dataPcnt
   576-byte packets  = 20*totDataPkts/100 i.e. 20*dataPcnt
   1500-byte packets = 30*totDataPkts/100 i.e.
30*dataPcnt

   RT traffic (bytes) = totBytes*(100-dataPcnt)/100
                      = 58520*(100-dataPcnt)

If n is the depth of the MPLS label stack, then inside the switch the actual size of a 40-byte packet is 40+4*n bytes, of a 576-byte packet is 576+4*n bytes and of a 1500-byte packet is 1500+4*n bytes.

Let frameSize be the payload of a frame (excluding the frame header) inside the switch. If an RT packet fits exactly inside frameSize:

   RT packet payload = (frameSize-40-4*n) bytes

   Total overhead = packet header overhead (of RT packets)
                  + fragmentation overhead (of data packets)

If a plot is drawn for frameSize = 40+4*n+1 to 1500+4*n for different values of dataPcnt (with dataPcnt = 80 to 100), the minima of the overhead are found at frameSize = (84, 101, 118, 126 and 152) for n==3, at frameSize = (119, 127 and 152) for n==4 and at frameSize = (118, 127 and 152) for n==5. Actual data on IP traffic has to be collected to get the best result. As dataPcnt increases, the minimum is found at a lower frameSize; for lower dataPcnt, the higher frameSize values give the better result. With an average IP packet size of 585 bytes, switches will encounter a loss of 4*(n-1) bytes for packets that need only one label.

In order to make this scheme work, a standard for the maximum label stack size has to be defined. The RTP packet size also has to be standardized. The same scheme is applicable to all switching systems where IP packets get transported on a hop by hop basis, unlike the way it works in ATM networks.

3.1. Dual mode operation

Inside the ingress as well as the egress card, packets need to follow certain functional steps. In order to maximize throughput, a series of processing units work in pipeline mode for these operations. Ingress service cards need to act in dual mode to process RT packets and non-RT packets; i.e.
the RT packets should follow a direct path that does not need fragmentation and its related complexities before they are sent to the VOQs (virtual output queues, from where packets get picked up to be sent to the switching fabric), whereas other packets need to follow a different path for fragmentation operations. This prevents an RT packet from being blocked by the fragmentation of non-RT packets that arrived at the service card before it. So, merely matching the RT packet size to the frameSize of the switch fabric will not achieve the speed of ATM switches. Simulation studies show that significant improvement is achieved once RT packets are sent directly to the VOQs after label processing.

It would be worthwhile for hardware designers to study whether the entire set of data can be placed into queues based on priority, with segmentation done in each queue in parallel before the frames are put into their respective VOQs. The entire operation will be a lot costlier, but simulation results show that in such a case RT packets need not be restricted to fixed size cells. Standardization of label stack depth need not be imposed either.

4. Refinements over existing IPv6 specification

As IPv6 was envisioned long before some of the newer technologies, e.g. MPLS, came into the picture, some refinements can be made over the existing specification. These considerations are related to bandwidth usage and performance inside switches. The previous section shows that a smaller packet size gives a better result for processing of RT packets. So, it is desirable to have the IP packet header as small as possible.

As described earlier, evaluation of the parameters nMaxInterASTopNodes, nMaxInterASBottomNodes and nMaxASNodes is geo-political and has to be decided by IANA.
Once these parameters are determined by mutual agreement, the values of pA, pB, pC and the prefix length of the user id can be determined. If the total length comes out to be less than 128 bits, the length of the IP header will be reduced accordingly.

The 'flow label' field of the IPv6 packet header may not be of any use when MPLS is in use.

ATM used to have 4 priority classes. The first specification of IPv6, RFC 1883[13], used a 4-bit Priority field along with a 24-bit Flow Label field. These two were changed to an 8-bit Traffic Class field and a 20-bit Flow Label field in the current specification, RFC 2460[15]. Too many priority classes may increase processing complexity inside switches. If this field of the IPv6 header is reduced to 4 bits, as it was in RFC 1883, and the 'flow label' field gets removed, another three bytes can be removed from the IPv6 header.

The field 'Hop Limit' has an 8-bit value in the existing specification. The role of this field needs to be discussed properly in the context of a large address space.

4.1. Distributed processing and Multicasting

With the inherent hierarchy involved in this architecture, distributed applications can also be structured in a suitable manner. Say, for a commonly used web based application, there will be a master level server at every top level node. Any change that happens in the application has to be synchronized among these master level servers first. There might be servers at the middle layer (inside each inter-AS-bottom) inside each top level node. Once the changes get reflected at the master node, all the servers at the middle layer need to update themselves from their master level node. This will reduce network traffic substantially.

The inherent hierarchy in the architecture will also help in establishing multicast trees in a similar manner. Work on these issues can progress only after this architecture gets approved.

5.
Expected changes at the application layer

IP packets of size 576 bytes in most cases come out of TCP layers that do not perform path-MTU processing and take the default value that was set in the days of X.25. The 576 factor can be corrected very easily with the path MTU set to 1500. Considering that a label switched path between two arbitrary network points does not change very frequently for any particular type of packet, most applications are expected to become UDP based with negative ACK. TCP in turn might go through changes. Once this comes into effect, the number of 40-byte packets will come down drastically. The switch fabric frame size needs to be determined keeping these two factors in mind, along with changes in the IP packet header. With the existing 32-bit system, frame sizes (excluding the frame header) of 152 and 127 are the most viable solutions in general for label stack depths of 3, 4 and 5.

6. IANA Consideration

This is a first level draft for a proposed standard. Hence, IANA actions should come into play at a later stage, if needed.

7. Security Consideration

This document does not include any security related issues.

8. Acknowledgments

The author would like to thank Professor Amitava Datta of the University of Western Australia for his review and constructive comments.

9. Normative References

[1] Nordmark, E. and R. Gilligan, "Basic Transition Mechanisms for IPv6 Hosts and Routers", RFC 4213, October 2005.

[2] Fuller, V. and T. Li, "Classless Inter-Domain Routing (CIDR): The Internet Address Assignment and Aggregation Plan", RFC 4632, August 2006.

[3] Huston, G., "Commentary on Inter-Domain Routing in the Internet", RFC 3221, December 2001.

[4] Vohra, Q. and E. Chen, "BGP Support for Four-octet AS Number Space", RFC 4893, May 2007.

[5] Srisuresh, P. and K. Egevang, "Traditional IP Network Address Translator (Traditional NAT)", RFC 3022, January 2001.
[6] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 3550, July 2003.

[7] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, February 2006.

[8] Moy, J., "OSPF Standardization Report", RFC 2329, April 1998.

[9] Perkins, C., "IP Mobility Support for IPv4, Revised", RFC 5944, November 2010.

[10] Braden, R., "Requirements for Internet Hosts -- Communication Layers", RFC 1122, October 1989.

10. Informative References

[11] Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.

[12] Rekhter, Y. and T. Li, "A Border Gateway Protocol 4 (BGP-4)", RFC 1771, March 1995.

[13] Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", RFC 1883, December 1995.

[14] Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.

[15] Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", RFC 2460, December 1998.

[16] Rosen, E., Viswanathan, A. and R. Callon, "Multiprotocol Label Switching Architecture", RFC 3031, January 2001.

11. Author's Address

Shyamaprasad Bandyopadhyay
HL No 205/157/7, Inda
Kharagpur 721305, India
Phone: +91 3222 225137
e-mail: shyamb66@gmail.com