Network Working Group                                           L. Deng
Internet-Draft                                             China Mobile
Intended status: Informational                         February 13, 2014
Expires: August 17, 2014


               End Point Properties for Peer Selection
                   draft-deng-taps-datacenter-00.txt

Abstract

   Within a data center, the traffic pattern and the performance goals
   for the transport layer are quite different from those on the
   Internet.  This draft discusses use cases for a transport API from
   the perspective of an application running in a data center
   environment, and proposes potential requirements for the design of
   such an API.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 17, 2014.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Usecases
     3.1.  VM related traffic
     3.2.  Application Priorities
     3.3.  Access Type differentiation
     3.4.  Delay Tolerant Traffic
   4.  Transport Optimization in DC
     4.1.  Performance degradation in DC
       4.1.1.  Incast Collapse
       4.1.2.  Long tail of RTT
       4.1.3.  Buffer Pressure
     4.2.  Transport Optimization Goals/Mechanisms
   5.  DC Transport API Considerations
     5.1.  Information flow from app to transport
     5.2.  Information flow from transport to app
   6.  Security Considerations
   7.  IANA Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Author's Address

1.  Introduction
   The traffic pattern inside a data center is quite different from
   that on the Internet.  First, almost all traffic in a data center
   (over 90%) is carried by TCP.  Second, TCP flows deviate widely in
   data volume and duration: most flows are very short, completing in
   less than 2-3 round trips, while most of the traffic volume belongs
   to a few long-lasting flows.  ToR switches are highly multiplexed,
   carrying tens of concurrent TCP flows most of the time.

   This traffic pattern results from a combination of three types of
   data traffic:

   (1)  highly delay-sensitive short flows produced by the distributed
        computing model employed pervasively by delay-sensitive
        applications (web search, social networking);

   (2)  highly delay-sensitive short flows for cluster control and
        management; and

   (3)  delay-tolerant backup/synchronization traffic with large data
        volume.

2.  Terminology

   DC:  Data Center, a facility used to house computer systems and
        associated components, such as telecommunications and storage
        systems.

   ToR:  Top of Rack switch, which usually sits on top of a rack of
        servers, interconnects the local servers within the rack, and
        serves as the entrance to the other parts of the data center.

   VM:  Virtual Machine, a software implementation of a machine (i.e.,
        a computer) that executes programs like a physical machine.

   VM migration:  the process of moving a running virtual machine or
        application between different physical machines.

   NIC:  Network Interface Controller, a computer hardware component
        that connects a computer to a computer network.

   DCB:  Data Center Bridging, a set of enhancements to Ethernet local
        area networks for use in data center environments, such as
        lossless Ethernet.

3.  Usecases

   Besides the web search/query example in the Introduction, other use
   cases for optimized data delivery within a DC are presented below.

3.1.  VM related traffic

   In virtualized data centers, to cope with the reliability concerns
   arising from relatively unreliable commodity hardware platforms, it
   is common practice to keep several identical VM instances running
   on different physical servers as backups for one another.  In this
   case, TCP flows for VM backup or migration, although considerably
   larger in data volume and longer in duration than typical user
   traffic, are also delay sensitive.

3.2.  Application Priorities

   For a data center accommodating multiple applications,
   differentiated resource provisioning in case of congestion is
   common practice, depending on the operator's or service provider's
   marketing and provisioning strategies or on the applications' own
   user expectations.  For instance, physical resources in a data
   center could be shared between a delay-sensitive web search engine
   and document/music sharing applications.  Within the data center,
   traffic from load balancers to servers and from servers to
   databases is multiplexed on the internal DC network.
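   Today, the closest widely deployed primitive for expressing such
   priorities is coarse DSCP marking of outgoing packets, configured
   per socket by the application.  The following minimal sketch
   (Python on a Linux host is an assumption; the constants come from
   the standard socket module) shows a bulk application tagging its
   flow as low priority, which is far coarser than the per-flow hints
   discussed in Section 5:

      import socket

      # Mark a bulk-transfer flow as low-priority background traffic
      # by setting the DSCP field on the socket (illustrative only).
      DSCP_CS1 = 8      # class selector 1, often used for background
      sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      # The DSCP value occupies the upper six bits of the TOS byte.
      sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_CS1 << 2)
      # ... connect to the (hypothetical) backup peer and transfer data.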
3.3.  Access Type differentiation

   Given the various access types for a specific application, the DC
   operator may want to enforce different QoS policies for specific
   groups of users according to their access type.  For instance, if
   the service provider is currently focusing its marketing on mobile
   users, it could prioritize mobile traffic over fixed traffic.
   Among competing service providers, one may want to prioritize
   traffic from its own subscribers over that from third-party users.

3.4.  Delay Tolerant Traffic

   Delay-tolerant traffic, including software upgrades and active
   measurement traffic for bandwidth detection, should not impact the
   productive traffic.

4.  Transport Optimization in DC

   To understand why the DC environment needs a transport service
   different from that of the Internet, it is useful to look first at
   the problems an optimized transport service would have to solve
   from the perspective of a DC application.

4.1.  Performance degradation in DC

   In particular, the following three issues affect transport
   performance in the DC environment.

4.1.1.  Incast Collapse

   For the sake of reduced CAPEX, cheap shallow-buffered ToR switches
   dominate today's data centers.  As a consequence, the buffer of the
   ToR switch in front of an aggregator (the server responsible for
   dividing a task into a group of subtasks and collecting the
   responses from the relevant worker servers for result aggregation)
   is often exhausted the instant the workers submit their subtask
   results over highly synchronized TCP flows, resulting in correlated
   packet loss across the affected flows.  The resulting timeouts
   cause a dramatic performance degradation, since the regular RTT in
   a data center (less than 10 ms) is orders of magnitude smaller than
   the traditional TCP RTO configuration (200 ms).
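   As a back-of-the-envelope illustration of this mismatch, the
   following sketch expresses the cost of a single retransmission
   timeout in units of the data-center RTT (the 200 ms minimum RTO is
   the figure cited above; the sample RTT values are assumptions
   consistent with the "less than 10 ms" regime):

      # Cost of one retransmission timeout, measured in intra-DC RTTs.
      RTO_MIN = 200e-3                   # conventional minimum RTO, 200 ms
      for rtt in (100e-6, 1e-3, 10e-3):  # assumed intra-DC RTT samples
          print(f"RTT {rtt * 1e3:5.1f} ms -> one timeout idles the flow "
                f"for {RTO_MIN / rtt:6.0f} RTTs")

   For the shortest flows, which complete in 2-3 round trips, a single
   timeout therefore dominates the flow completion time.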
4.1.2.  Long tail of RTT

   Due to the greedy nature of traditional TCP congestion control, the
   presence of large-volume long flows steadily builds up the buffer
   queues in the switches along the path, adding considerable queuing
   delay at those switches for the highly delay-sensitive short flows.

4.1.3.  Buffer Pressure

   For the same reason, the standing queues built up by long flows
   also reduce the buffer space actually available to accommodate
   delay-sensitive short flows, even when the two are not submitted at
   the same time.

4.2.  Transport Optimization Goals/Mechanisms

   Since both hardware and software in a data center are typically
   deployed and highly customized by a single operator, various
   proprietary solutions to these issues exist, including cross-layer
   and cross-boundary (network plus end host) hybrid ones.  The
   proposals address some of the following optimization goals:

   (1)  Reduce unnecessary loss/timeouts: since TCP performance loss
        is mainly caused by packet losses and retransmission timeouts,
        it has been proposed that a finer-tuned RTO configuration and
        timing framework can largely mitigate the resulting
        performance degradation [Pannas].  In parallel, the IEEE DCB
        family of standards provides a lossless Ethernet service at
        the link layer, which can be used to avoid packet loss at the
        IP layer and has been demonstrated to be effective as part of
        a coupled solution for DC transport optimization [detail].

   (2)  Mitigate the performance impact of loss/timeouts: delay-based
        congestion control algorithms are expected to be more robust
        to packet losses and timeouts in mitigating the incast
        collapse issue in DCs [vegas].

   (3)  Control/avoid lengthy buffer queues: as queuing delay
        substantially impacts the RTT in the DC environment, the delay
        can be cut, and performance improved, by keeping the buffer
        queues short [dctcp] or even empty [hull].  To do so, the
        sender may sense the queue at the switches through either
        explicit feedback (ECN [dctcp]) or implicit delay variation
        (Vegas [vegas]).

   (4)  Delay-prioritized buffer queuing: during resource-bounded
        periods, it is essential to make efficient use of the limited
        resources to deliver the demanded service, rather than fair-
        sharing among all competitors and ultimately failing them all.
        Proposals have been made to allow applications to explicitly
        indicate a flow's delivery preferences (either as absolute
        deadline information [d3] or as relative priorities [detail])
        in order to improve the overall delivery success rate.

   (5)  Smooth traffic bursts: on one hand, (distributed) applications
        can be refined to introduce random offsets into the submission
        of concurrent short flows; on the other hand, random offsets
        can be introduced into the RTO back-off calculation to
        mitigate retransmission synchronization [Pannas].  Moreover,
        physical pacing at the NIC level has been proposed to counter
        the traffic bursts caused by server performance optimization
        techniques [d2tcp].

5.  DC Transport API Considerations

5.1.  Information flow from app to transport

   (1)  Delivery related: information from the application about its
        expectations on the delivery service provided by the
        transport.  For example, the delivery goal could be specified
        in the form of:

        (1.1)  an absolute delay requirement; or

        (1.2)  a relative priority indication.

   (2)  Retransmission related: information from the application about
        how the transport should deal with packet losses.  For
        example, the information could include:

        (2.1)  whether loss recovery is needed or not;

        (2.2)  if so, the preferred retransmission timeout
               granularity.

   (3)  Pacing related: information from the application about its
        expectations on traffic pacing.  For example, the information
        could include:

        (3.1)  the traffic duration, in case of a "pacing for long
               flows only" policy;

        (3.2)  the burstiness expectation.

5.2.  Information flow from transport to app

   Congestion information, from the network device or the local
   transport layer, about the congestion status of the current
   transport connection.
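   Purely as an illustration of the information flows above, the per-
   connection hints of Section 5.1 and the feedback of Section 5.2
   could be grouped as sketched below.  The class and field names are
   hypothetical (they do not denote any existing socket or transport
   API); an implementation would map such hints onto mechanisms like
   those listed in Section 4.2.

      from dataclasses import dataclass, field
      from typing import Optional

      @dataclass
      class DeliveryHints:                      # Section 5.1, item (1)
          deadline_ms: Optional[float] = None   # (1.1) absolute delay
          priority: Optional[int] = None        # (1.2) relative priority

      @dataclass
      class RetransmissionHints:                # Section 5.1, item (2)
          loss_recovery: bool = True            # (2.1) recovery needed?
          rto_granularity_us: Optional[int] = None  # (2.2) RTO granularity

      @dataclass
      class PacingHints:                        # Section 5.1, item (3)
          expected_duration_ms: Optional[float] = None  # (3.1) duration
          bursty: Optional[bool] = None                 # (3.2) burstiness

      @dataclass
      class TransportHints:                     # app -> transport
          delivery: DeliveryHints = field(default_factory=DeliveryHints)
          retransmission: RetransmissionHints = field(
              default_factory=RetransmissionHints)
          pacing: PacingHints = field(default_factory=PacingHints)

      @dataclass
      class CongestionInfo:                     # transport -> app
          ecn_marked_fraction: Optional[float] = None
          queue_delay_us: Optional[int] = None

      # Example: a latency-critical query versus a bulk backup flow.
      query_hints = TransportHints(
          delivery=DeliveryHints(deadline_ms=10, priority=0))
      backup_hints = TransportHints(
          delivery=DeliveryHints(priority=7),
          pacing=PacingHints(expected_duration_ms=60000, bursty=False))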
Edsall, "Less is more: trading a little bandwidth for ultra-low latency in the data center", 2012. [vegas] Lee, C., Jang, K., and S. Moon, "Reviving delay-based TCP for data centers", 2012. Author's Address Lingli Deng China Mobile Email: denglingli@chinamobile.com Deng Expires August 17, 2014 [Page 7]