Network Working Group Yantao Sun Internet-Draft Beijing Jiaotong University Intended status: Informational Xiaoli Song Expires: January 10, 2013 Bin Liu ZTE Inc. Qiang Liu Jing Cheng Beijing Jiaotong University July 9, 2012 MatrixDCN: A New Network Fabric for Data Centers draft-sun-matrix-dcn-00 Abstract This document introduces describes the requirement of today's data centers and a new type of network topology called MatrixDCN (matrix data center network). MatrixDCN is used to deploy large scale data center network, which can support more than 100 thousands of servers in one data center without network bandwidth bottleneck. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on January 10, 2013. Copyright Notice Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents Yantao Sun, et al. Expires January 10, 2013 [Page 1] Internet-Draft MatrixDCN July 2012 carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1. Acronyms & Definitions . . . . . . . . . . . . . . . . . . 3 2. Conventions used in this document . . . . . . . . . . . . . . 3 3. Network fabric . . . . . . . . . . . . . . . . . . . . . . . . 4 3.1. Components . . . . . . . . . . . . . . . . . . . . . . . . 4 3.2. Multiple Paths . . . . . . . . . . . . . . . . . . . . . . 5 3.3. Addressing . . . . . . . . . . . . . . . . . . . . . . . . 5 4. Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.1. Routing of RS . . . . . . . . . . . . . . . . . . . . . . 6 4.2. Routing of CS . . . . . . . . . . . . . . . . . . . . . . 6 4.3. Routing of AS . . . . . . . . . . . . . . . . . . . . . . 7 4.4. Construction of Routing Table . . . . . . . . . . . . . . 7 4.4.1. Construction for CS . . . . . . . . . . . . . . . . . 7 4.4.2. Construction for RS . . . . . . . . . . . . . . . . . 8 4.4.3. Construction for AS . . . . . . . . . . . . . . . . . 8 4.5. PDU Format . . . . . . . . . . . . . . . . . . . . . . . . 8 4.6. Fault Tolerance . . . . . . . . . . . . . . . . . . . . . 9 5. VM Migration . . . . . . . . . . . . . . . . . . . . . . . . . 10 6. Multiple tenants . . . . . . . . . . . . . . . . . . . . . . . 10 7. Deployment Scenarios . . . . . . . . . . . . . . . . . . . . . 10 8. Security Considerations . . . . . . . . . . . . . . . . . . . 10 9. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 10 10. Reference . . . . . . . . . . . . . . . . . . . . . . . . . . 11 11. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 11 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 11 Yantao Sun, et al. Expires January 10, 2013 [Page 2] Internet-Draft MatrixDCN July 2012 1. Introduction Traditional network topology is a tree-like fabric composed by routers and switches, in which network is divided into 3 layers including core layer, aggregation layer and access layer. All servers are connected to the access switches in access layer. This kind of topology has some problems using in data center network. Firstly, it constrains the scale of data center network. When the scale of network expands, those routers in core layer are apt to be the bandwidth bottleneck in the whole network, since more packages need to be routed between different layer-2 domains through core routers. Secondly, data center network is divided into many small layer-2 domains by core routers, which conduces to that VM's migration is limited to only one layer-2 domains. And last, it's difficult to exploit mass of redundant links between switches as STP protocol must be used to ensure no loops in layer-2 network to avoid broadcast storm. To solve above problems, new network architectures should be introduced to data centers. The new architecture requires the following features: 1) support multiple paths to eliminate bandwidth bottleneck; 2) has regular network topology with good extendibility and maintainability; 3) support VM migration in the entire network; 4) has enough VLANs and permit any endpoints to compose one VLAN. 1.1. Acronyms & Definitions DCN - Data Center Network AS - Access Switch RS - Row Switch CS - Column Switch PDU - Protocol Data Unit OSPF - Open Shortest Path First Routing Protocol VLAN - Virtual Local Area Network 2. Conventions used in this document The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [RFC2119]. Yantao Sun, et al. Expires January 10, 2013 [Page 3] Internet-Draft MatrixDCN July 2012 In this document, these words will appear with that interpretation only when in ALL CAPS. Lower case uses of these words are not to be interpreted as carrying RFC-2119 significance. 3. Network fabric In the researches on data center networks, fat-tree fabric attracts a great many of attentions. Fat-tree is on kind of multi-root tree and a lot of important research works and some practices have been done based on fat-tree fabric. In this proposal, we introduce another variety of multi-root tree called MatrixDCN for building large-scale data center network. Similar with fat-tree, MatrixDCN has regular topology, multiple paths, special addressing and routing matched with its topology. MatrixDCN can support more than 100 thousands of physical servers in a single data center network. And furthermore, it can support virtual machine migration in whole network and huge number of tenants' isolation by modest modification. 3.1. Components In MatrixDCN, there are 3 types of network devices, called Row Switch (RS), Column Switch (CS) and Access Switch (AS). AS switches are deployed as a matrix with multiple rows and columns. For example, one 8 X 8 matrix has 8 rows and 8 columns and 64 AS switches. For a RS, it is deployed on the head of one row and links all the AS switches together in this row, and for a CS, it is deployed on the head of one column and links all the AS switches together in this column. Every AS connects with all the RS and CS switches located at the row and column head of it. Figure 1 is an example of 2 X 2 MatrixDCN. Yantao Sun, et al. Expires January 10, 2013 [Page 4] Internet-Draft MatrixDCN July 2012 +----+ +----+ | CH | ____ | CH | ____ +----+ \ \ +----+ \ \ 10.0.1.1| | 10.0.2.1 | | _________| __ | _____ | | / | | \ | | +----+ +----+ | +----+ | | RH |___| AS | | | AS | | +----+ +----+ | +----+ | 10.1.0.1 10.1.1.1 | 10.1.2.1 | / \ | / \ | +---+ +---+ | +---+ +---+ | | S | | S | | | S | | S | | +---+ +---+ | +---+ +---+ | 10.1.1.2 10.1.1.3 | | _____________ | _______ | / | \ | +----+ +----+ | +----+ | | RH | __ | AS | __/ | AS | __/ +----+ +----+ +----+ 10.2.0.1 10.2.1.1 10.2.2.1 / \ / \ +---+ +---+ +---+ +---+ | S | | S | | S | | S | +---+ +---+ +---+ +---+ 10.2.2.3 Figure 1: demonstration topology 3.2. Multiple Paths To eliminate bandwidth bottleneck in data center networks, MatrixDCN can deploy multiple RS switches in one row and CS switches in one column. Thus, we can have more links between AS and RS/CS, which means more bandwidth, between an AS and its RS and CS switches. If bandwidth between AS and RS(called row bandwidth), bandwidth between AS and CS(called column bandwidth) and access bandwidth of AS are equal, this network is approximately non-blocking. 3.3. Addressing In MatrixDCN, all the devices including servers and switches are assigned an IP address according to their position in the network. Suppose an AS is located at the mth row and nth column, its IP Address is 10.m.n.1/24 and the servers connected with it are set to 10.m.n.x/24. For RS switches located at the head of mth row, their IP will be set as 10.m.0.x/16. For CS switches located at the head of mth column, their IP will be set as 10.0.m.x/255.0.255.0. Yantao Sun, et al. Expires January 10, 2013 [Page 5] Internet-Draft MatrixDCN July 2012 4. Routing The routing is very simple for MatrixDCN, as its topology is very regular and every switch can obtain the knowledge of the entire topology without exchanging link states since the device position is encoded in its address. All the switches in MatrixDCN have routing ability. These switches store routing entry using the standard routing table. The structure of routing table is illustrated in table 1. Table standard IP routing table --------------------------------------------------------------- Subnet | Address |Subnet mask|Next hop|Cost|Create/update time --------------------------------------------------------------- 10.1.0.0|255.255.0.0| 10.1.1.1 | | | --------------------------------------------------------------- 10.2.0.0|255.255.0.0| 10.1.2.1 | | | --------------------------------------------------------------- 4.1. Routing of RS When packets arrive at RS switch, RS will determine the next hop of a packet by the 3rd number of its destination IP address based on IP routing table. If this number is k, the packet will be send to the kth AS switch (next hop) on the row of this RS. The routing table on RS switch of the ith row looks like below: Destination/Subnet mask Next hop 10.X.1.X/255.0.255.0 10.i.1.1 10.X.2.X/255.0.255.0 10.i.2.1 10.X.3.X/255.0.255.0 10.i.3.1 ...... 4.2. Routing of CS When packets arrive at CS switch, CS will determine the next hop of a packet by the 2rd part of its destination IP address based on IP routing table. If this number is k, the packet will be send to the kth AS switch (next hop) on the column of this CS. The routing table on CS switch of the ith row looks like below: Destination/Subnet mask Next hop 10.1.X.X/255.255.0.0 10.1.Col.1 10.2.X.X/255.255.0.0 10.2.Col.1 10.3.X.X/255.255.0.0 10.3.Col.1 ...... Yantao Sun, et al. Expires January 10, 2013 [Page 6] Internet-Draft MatrixDCN July 2012 4.3. Routing of AS When packets arrive at AS switch, AS should determine the next hop is whether RS or CS switch for every packet. For a packet, if its destination IP address is on the same row, it will be sent to the RS switch, and if its destination IP address is on the same column, it will be sent to the CS switch. For the packet whose destination is on different row and column, it can be sent to either RS or CS, and for the packet with destination on the same row and column, it is forwarded through level-2 switching without routing. The routing table of the CS on the crossing position of the ith row and the jth column looks like below: Destination/Subnet mask Next hop 10.Row.0.0/255.255.0.0 10.Row.0.X1 10.Row.0.0/255.255.0.0 10.Row.0.X2 ...... 10.0.Col.0/255.0. 255.0 10.0.Col.X1 10.0.Col.0/255.0. 255.0 10.0.Col.X2 ...... 10.0.0.0/255.0.0.0 10.Row.0.X1 10.0.0.0/255.0.0.0 10.Row.0.X2 ...... 10.0.0.0/255.0.0.0 10.0.Col.X1 10.0.0.0/255.0.0.0 10.0.Col.X2 ...... 4.4. Construction of Routing Table To build the routing table for switches, the connection relationship between adjacent switches should be learned automatically, and to learn connection relationship, every switch will send Hello PDU to all of its active ports periodically. Hello PDU is encapsulated in a UDP packet. A well-known UDP port will be obtained from IANA. 4.4.1. Construction for CS For a CS switch, such as 10.0.n.x, which will receive Hello PDUs from all the AS switches on the same column, its routing table is built according to the following rules: When the CS received a Hello PDU from 10.m.n.1, a routing entry that destination is "10.m.0.0/255.255.0.0" and next hop is "10.m.n.1" will be added/refreshed to its routing table. If Hello PDU can't be received in set time, the corresponding routing entry will be deleted. Yantao Sun, et al. Expires January 10, 2013 [Page 7] Internet-Draft MatrixDCN July 2012 4.4.2. Construction for RS For a RS switch, such as 10.m.0.x, which will receive Hello PDUs from all the AS switches on the same row, its routing table is built according to the following rules: When the RS received a Hello PDU from 10.m.n.1, a routing entry that destination is "10.0.n.0/255.0.255.0" and next hop is "10.m.n.1" will be added/refreshed to its routing table. If Hello PDU can't be received in set time, the corresponding routing entry will be deleted. 4.4.3. Construction for AS For an AS switch, such as 10.m.n.1, which will receive Hello PDUs from all the RS switches of the mth row and all the CS switches of the nth column, its routing table is built according to the following rules: If a Hello PDU received from a RS 10.m.0.x, two routing entry "10.m.0.0/255.255.0.0 10.m.0.x" and "10.0.0.0/255.0.0.0 10.m.0.x" will be added/refreshed to its routing table. If a Hello PDU received from a CS 10.0.n.x, two routing entry "10.0.n.0/255.255.0.0 10.0.0.x" and "10.0.0.0/255.0.0.0 10.m.0.x" will be added/refreshed to its routing table. If Hello PDU can't be received in set time, the corresponding routing entry will be deleted. 4.5. PDU Format Yantao Sun, et al. Expires January 10, 2013 [Page 8] Internet-Draft MatrixDCN July 2012 ---------------------------- | 0 | 8 | 16 | 24 | ---------------------------- |Version |Type| Packet Length| ---------------------------- | Row No | Column No | ---------------------------- | Check Sum | AuType | ---------------------------- | Authentication | ---------------------------- | Authentication | ---------------------------- | Data | ---------------------------- | ...... | ---------------------------- Version: version number of MatrixDCN Routing Protocol. Type: PDU packet type. If value is 1, this is a Hello PDU. If value is 2, this is a Link State Advertisements PDU to notice link fault knowledge. Packet Length: the total length of the PDU including PDU head and data. Row No and Colum No: The position of this switch. Check Sum: Check sum for the total PDU. AuType: Authentication type. 0: no authentication, 1: Plaintext Authentication, 2: MD5 Authentication. Authentication: Authentication infomation. 0: undefined, 1: Key, 2: key ID, MD5 data length and packet number. MD5 data is appended to the back of the packet. AuType and Authentication can refer to the definition of OSPF packet. 4.6. Fault Tolerance Any network node or link fault will conduct communication break, so fault tolerance must be considered in a usable routing protocol. To do this, switches in MatrixDCN should learn the whole network state through Link State Advertisements PDU. The more details will be elaborated in the next version of this document. Yantao Sun, et al. Expires January 10, 2013 [Page 9] Internet-Draft MatrixDCN July 2012 5. VM Migration Subnetting and location-related addressing make possible to build large-scale data center networks, but limit the migration of VMs. To solve this problem, Overlay or IP tunneling technology is introduced to data center networks. By small extension to AS, MatrixDCN can support seamless VM migration in the entire data center without any modification to above routing procedure. The feature of regular topology in MatrixDCN has been considered into this solution and more detail would be specialized in another document. 6. Multiple tenants The present VLAN protocol 802.1q can't satisfy the requirement of data center networks as it can only support only about 4000 VLANs. VXLAN and NVGRE are two similar but competitive draft protocols for solving this problem and can be used in MatrixDCN. Moreover, another solution in consideration of the regular topology will be discussed in other document. 7. Deployment Scenarios MatrixDCN can be used to deploy large-scale data center network. Suppose AS switch has 40 down-link ports with 1G bits speed and 8 up- link ports with 10G bits speed, RS and CS switch has 40 ports with 10G bits speed, those switches are currently main stream switches used in data center, we can build a MatrixDCN with 40 rows X 40 columns. In this MatrixDCN, every row has 4 RS switches and every column has 4 CS switches. For every AS, 4 up-link ports link RS switches and 4 up-link ports link CS switches, and its 40 down-link ports link servers. Thus, this MatrixDCN can contain up to 64,000 servers using 1600 AS switches, 160 RS switches and 160 CS switches. The average available bandwidth for every server is about 1000M bits. 8. Security Considerations The protection for routing information and isolation for networks of different tenants (VLAN) should be considered in this protocol. 9. Conclusion Today's data center produces some new requirement to networks, such as no-blocking, seamless VM migration and multiple tenants.This document introduces MatrixDCN, a new network fabric for data centers. Yantao Sun, et al. Expires January 10, 2013 [Page 10] Internet-Draft MatrixDCN July 2012 MatrixDCN can support up to 100 thousands of servers without network bandwidth bottleneck. This fabric is very simple and extendable, and its routing is very effective. Furthermore, this fabric has some advantages on supporting VM Migration and one-to-many or many-to-many traffic in cloud computing. 10. Reference [RFC2328] J. Moy, "OSPF Version 2", RFC2338, Apr. 1998. [FAT-TREE] M. Al-Fares, A. Loukissas, and A. Vahdat. "A Scalable, Commodity, Data Center Network Architecture",In ACM SIGCOMM 2008. 11. Acknowledgments Thanks for Yuhua Wei, Lizhong Jin, Jinghai Yu, and Zhui GUo. They give important advices for this document. Authors' Addresses Yantao Sun Beijing Jiaotong University No.3 Shang Yuan Cun, Hai Dian District Beijing 100044 China Phone: Email: ytsun@bjtu.edu.cn Xiaoli Song ZTE Inc. ZTE Plaza, No.19 East Huayuan Road,Haidian District Beijing 100191 China Email: song.xiaoli@zte.com.cn Yantao Sun, et al. Expires January 10, 2013 [Page 11] Internet-Draft MatrixDCN July 2012 Bin Liu ZTE Inc. ZTE Plaza, No.19 East Huayuan Road,Haidian District Beijing 100191 China Email: liu.bin21@zte.com.cn Qiang Liu Beijing Jiaotong University No.3 Shang Yuan Cun, Hai Dian District Beijing 100044 China Email: liuq@bjtu.edu.cn Jing Cheng Beijing Jiaotong University No.3 Shang Yuan Cun, Hai Dian District Beijing 100044 China Email: yourney.j@gmail.com Yantao Sun, et al. Expires January 10, 2013 [Page 12]