Internet Engineering Task Force (IETF)                          J. Touch
Request for Comments: 9040                                   Independent
Obsoletes: 2140                                                 M. Welzl
Category: Informational                                         S. Islam
ISSN: 2070-1721                                       University of Oslo
                                                               July 2021

                   TCP Control Block Interdependence

Abstract

This memo provides guidance to TCP implementers that is intended to
help improve connection convergence to steady-state operation without
affecting interoperability. It updates and replaces RFC 2140's
description of sharing TCP state, as typically represented in TCP
Control Blocks, among similar concurrent or consecutive connections.

Status of This Memo

This document is not an Internet Standards Track specification; it is
published for informational purposes.

This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG). Not all documents
approved by the IESG are candidates for any level of Internet
Standard; see Section 2 of RFC 7841.

Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
https://www.rfc-editor.org/info/rfc9040.

Copyright Notice

Copyright (c) 2021 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.

Table of Contents

1. Introduction
2. Conventions Used in This Document
3. Terminology
4. The TCP Control Block (TCB)
5. TCB Interdependence
6. Temporal Sharing
  6.1. Initialization of a New TCB
  6.2. Updates to the TCB Cache
  6.3. Discussion
7. Ensemble Sharing
  7.1. Initialization of a New TCB
  7.2. Updates to the TCB Cache
  7.3. Discussion
8. Issues with TCB Information Sharing
  8.1. Traversing the Same Network Path
  8.2. State Dependence
  8.3. Problems with Sharing Based on IP Address
9. Implications
  9.1. Layering
  9.2. Other Possibilities
10. Implementation Observations
11. Changes Compared to RFC 2140
12. Security Considerations
13. IANA Considerations
14. References
  14.1. Normative References
  14.2. Informative References
Appendix A. TCB Sharing History
Appendix B. TCP Option Sharing and Caching
Appendix C. Automating the Initial Window in TCP over Long
        Timescales
  C.1. Introduction
  C.2. Design Considerations
  C.3. Proposed IW Algorithm
  C.4. Discussion
  C.5. Observations
Acknowledgments
Authors' Addresses

1. Introduction

TCP is a connection-oriented reliable transport protocol layered over
IP [RFC0793]. Each TCP connection maintains state, usually in a data
structure called the "TCP Control Block (TCB)". The TCB contains
information about the connection state, its associated local process,
and feedback parameters about the connection's transmission
properties. As originally specified and usually implemented, most
TCB information is maintained on a per-connection basis. Some
implementations share certain TCB information across connections to
the same host [RFC2140]. Such sharing is intended to lead to better
overall transient performance, especially for numerous short-lived
and simultaneous connections, as can be used in the World Wide Web
and other applications [Be94] [Br02]. This sharing of state is
intended to help TCP connections converge to long-term behavior
(assuming stable application load, i.e., so-called "steady-state")
more quickly without affecting TCP interoperability.

This document updates RFC 2140's discussion of TCB state sharing and
provides a complete replacement for that document. This state
sharing affects only TCB initialization [RFC2140] and thus has no
effect on the long-term behavior of TCP after a connection has been
established or on interoperability. Path information shared across
SYN destination port numbers assumes that TCP segments having the
same host-pair experience the same path properties, i.e., that
traffic is not routed differently based on port numbers or other
connection parameters (also addressed further in Section 8.1). The
observations about TCB sharing in this document apply similarly to
any protocol with congestion state, including the Stream Control
Transmission Protocol (SCTP) [RFC4960] and the Datagram Congestion
Control Protocol (DCCP) [RFC4340], as well as to individual subflows
in Multipath TCP [RFC8684].

2. Conventions Used in This Document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.

The core of this document describes behavior that is already
permitted by TCP standards. As a result, this document provides
informative guidance but does not use normative language except when
quoting other documents. Normative language is used in Appendix C as
examples of requirements for future consideration.

3. Terminology

The following terminology is used frequently in this document. Items
preceded with a "+" may be part of the state maintained as TCP
connection state in the TCB of associated connections and are the
focus of sharing as described in this document. Note that terms are
used as originally introduced where possible; in some cases,
direction is indicated with a suffix (_S for send, _R for receive)
and in other cases spelled out (sendcwnd).

+cwnd: TCP congestion window size [RFC5681]

host: a source or sink of TCP segments associated with a single IP
   address

host-pair: a pair of hosts and their corresponding IP addresses

ISN: Initial Sequence Number

+MMS_R: maximum message size that can be received, the largest
   received transport payload of an IP datagram [RFC1122]

+MMS_S: maximum message size that can be sent, the largest
   transmitted transport payload of an IP datagram [RFC1122]

path: an Internet path between the IP addresses of two hosts

PCB: protocol control block, the data associated with a protocol as
   maintained by an endpoint; a TCP PCB is called a "TCB"

PLPMTUD: packetization-layer path MTU discovery, a mechanism that
   uses transport packets to discover the Path Maximum
   Transmission Unit (PMTU) [RFC4821]

+PMTU: largest IP datagram that can traverse a path [RFC1191]
   [RFC8201]

PMTUD: path-layer MTU discovery, a mechanism that relies on ICMP
   error messages to discover the PMTU [RFC1191] [RFC8201]

+RTT: round-trip time of a TCP packet exchange [RFC0793]

+RTTVAR: variation of round-trip times of a TCP packet exchange
   [RFC6298]

+rwnd: TCP receive window size [RFC5681]

+sendcwnd: TCP send-side congestion window (cwnd) size [RFC5681]

+sendMSS: TCP maximum segment size, a value transmitted in a TCP
   option that represents the largest TCP user data payload that
   can be received [RFC6691]

+ssthresh: TCP slow-start threshold [RFC5681]

TCB: TCP Control Block, the data associated with a TCP connection as
   maintained by an endpoint

TCP-AO: TCP Authentication Option [RFC5925]

TFO: TCP Fast Open option [RFC7413]

+TFO_cookie: TCP Fast Open cookie, state that is used as part of the
   TFO mechanism, when TFO is supported [RFC7413]

+TFO_failure: an indication of when TFO option negotiation failed,
   when TFO is supported

+TFOinfo: information cached when a TFO connection is established,
   which includes the TFO_cookie [RFC7413]

4. The TCP Control Block (TCB)

A TCB describes the data associated with each connection, i.e., with
each association of a pair of applications across the network. The
TCB contains at least the following information [RFC0793]:

Local process state

   pointers to send and receive buffers
   pointers to retransmission queue and current segment
   pointers to Internet Protocol (IP) PCB

Per-connection shared state

   macro-state

      connection state
      timers
      flags
      local and remote host numbers and ports
      TCP option state

   micro-state

      send and receive window state (size*, current number)
      congestion window size (sendcwnd)*
      congestion window size threshold (ssthresh)*
      max window size seen*
      sendMSS#
      MMS_S#
      MMS_R#
      PMTU#
      round-trip time and its variation#

The per-connection information is shown as split into macro-state and
micro-state, terminology borrowed from [Co91]. Macro-state describes
the protocol for establishing the initial shared state about the
connection; we include the endpoint numbers and components (timers,
flags) required upon commencement that are later used to help
maintain that state. Micro-state describes the protocol after a
connection has been established, to maintain the reliability and
congestion control of the data transferred in the connection.

We distinguish two other classes of shared micro-state that are
associated more with host-pairs than with application pairs. One
class is clearly host-pair dependent (shown above as "#", e.g.,
sendMSS, MMS_R, MMS_S, PMTU, RTT), because these parameters are
defined by the endpoint or endpoint pair (of the given example:
sendMSS, MMS_R, MMS_S, RTT) or are already cached and shared on that
basis (of the given example: PMTU [RFC1191] [RFC4821]). The other is
host-pair dependent in its aggregate (shown above as "*", e.g.,
congestion window information, current window sizes, etc.) because
these parameters depend on the total capacity between the two
endpoints.
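
The two classes can be pictured as a simple lookup from micro-state
field to sharing class. The sketch below is illustrative only: the
names SharingClass, TCB_FIELD_CLASS, and fields_in_class are
assumptions made for this example and do not come from any TCP
implementation; the field assignments follow the "#" and "*" marks in
the list above.

```python
from enum import Enum


class SharingClass(Enum):
    HOST_PAIR = "#"   # defined by the endpoint or endpoint pair
    AGGREGATE = "*"   # host-pair dependent only in aggregate


# Hypothetical mapping of micro-state fields to their sharing class,
# mirroring the "#" and "*" markers in the TCB listing.
TCB_FIELD_CLASS = {
    "sendMSS": SharingClass.HOST_PAIR,
    "MMS_S": SharingClass.HOST_PAIR,
    "MMS_R": SharingClass.HOST_PAIR,
    "PMTU": SharingClass.HOST_PAIR,
    "RTT": SharingClass.HOST_PAIR,
    "RTTVAR": SharingClass.HOST_PAIR,
    "window_size": SharingClass.AGGREGATE,
    "sendcwnd": SharingClass.AGGREGATE,
    "ssthresh": SharingClass.AGGREGATE,
    "max_window_seen": SharingClass.AGGREGATE,
}


def fields_in_class(sharing_class):
    """Return the names of all fields that belong to the given class."""
    return [name for name, cls in TCB_FIELD_CLASS.items()
            if cls is sharing_class]
```

A cache keyed by host-pair could copy HOST_PAIR fields directly, while
AGGREGATE fields need to account for all connections that share the
capacity between the two endpoints.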

Not all of the TCB state is necessarily shareable. In particular,
some TCP options are negotiated only upon request by the application
layer, so their use may not be correlated across connections. Other
options negotiate connection-specific parameters, which are similarly
not shareable. These are discussed further in Appendix B.

Finally, we exclude rwnd from further discussion because its value
should depend on the send window size, so it is already addressed by
send window sharing and is not independently affected by sharing.

5. TCB Interdependence

There are two cases of TCB interdependence. Temporal sharing occurs
when the TCB of an earlier (now CLOSED) connection to a host is used
to initialize some parameters of a new connection to that same host,
i.e., in sequence. Ensemble sharing occurs when a currently active
connection to a host is used to initialize another (concurrent)
connection to that host.

6. Temporal Sharing

The TCB data cache is accessed in two ways: it is read to initialize
new TCBs and written when more current per-host state is available.

6.1. Initialization of a New TCB

TCBs for new connections can be initialized using cached context from
past connections as follows:

   +==============+=============================+
   | Cached TCB   | New TCB                     |
   +==============+=============================+
   | old_MMS_S    | old_MMS_S or not cached (2) |
   +--------------+-----------------------------+
   | old_MMS_R    | old_MMS_R or not cached (2) |
   +--------------+-----------------------------+
   | old_sendMSS  | old_sendMSS                 |
   +--------------+-----------------------------+
   | old_PMTU     | old_PMTU (1)                |
   +--------------+-----------------------------+
   | old_RTT      | old_RTT                     |
   +--------------+-----------------------------+
   | old_RTTVAR   | old_RTTVAR                  |
   +--------------+-----------------------------+
   | old_option   | (option specific)           |
   +--------------+-----------------------------+
   | old_ssthresh | old_ssthresh                |
   +--------------+-----------------------------+
   | old_sendcwnd | old_sendcwnd                |
   +--------------+-----------------------------+

      Table 1: Temporal Sharing - TCB Initialization

(1) Note that PMTU is cached at the IP layer [RFC1191] [RFC4821].

(2) Note that some values are not cached when they are computed
    locally (MMS_R) or indicated in the connection itself (MMS_S in
    the SYN).

Table 2 gives an overview of option-specific information that can be
shared. Additional information on some specific TCP options and
sharing is provided in Appendix B.

   +=================+=================+
   | Cached          | New             |
   +=================+=================+
   | old_TFO_cookie  | old_TFO_cookie  |
   +-----------------+-----------------+
   | old_TFO_failure | old_TFO_failure |
   +-----------------+-----------------+

      Table 2: Temporal Sharing -
       Option Info Initialization

6.2. Updates to the TCB Cache

During a connection, the TCB cache can be updated based on events of
current connections and their TCBs as they progress over time, as
shown in Table 3.

   +==============+===============+=============+=================+
   | Cached TCB   | Current TCB   | When?       | New Cached TCB  |
   +==============+===============+=============+=================+
   | old_MMS_S    | curr_MMS_S    | OPEN        | curr_MMS_S      |
   +--------------+---------------+-------------+-----------------+
   | old_MMS_R    | curr_MMS_R    | OPEN        | curr_MMS_R      |
   +--------------+---------------+-------------+-----------------+
   | old_sendMSS  | curr_sendMSS  | MSSopt      | curr_sendMSS    |
   +--------------+---------------+-------------+-----------------+
   | old_PMTU     | curr_PMTU     | PMTUD (1) / | curr_PMTU       |
   |              |               | PLPMTUD (1) |                 |
   +--------------+---------------+-------------+-----------------+
   | old_RTT      | curr_RTT      | CLOSE       | merge(curr,old) |
   +--------------+---------------+-------------+-----------------+
   | old_RTTVAR   | curr_RTTVAR   | CLOSE       | merge(curr,old) |
   +--------------+---------------+-------------+-----------------+
   | old_option   | curr_option   | ESTAB       | (depends on     |
   |              |               |             | option)         |
   +--------------+---------------+-------------+-----------------+
   | old_ssthresh | curr_ssthresh | CLOSE       | merge(curr,old) |
   +--------------+---------------+-------------+-----------------+
   | old_sendcwnd | curr_sendcwnd | CLOSE       | merge(curr,old) |
   +--------------+---------------+-------------+-----------------+

             Table 3: Temporal Sharing - Cache Updates

(1) Note that PMTU is cached at the IP layer [RFC1191] [RFC4821].

Merge() is the function that combines the current and previous (old)
values and may vary for each parameter of the TCB cache. The
particular function is not specified in this document; examples
include windowed averages (mean of the past N values, for some N) and
exponential decay (new = (1-alpha)*old + alpha*curr, where curr is
the current connection's value and alpha is in the range [0..1]).
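
The two example merge functions can be sketched as follows. This is
one possible reading, not a specified algorithm: the names
merge_exponential and WindowedAverage are assumptions, the default
alpha of 0.125 is an arbitrary illustrative choice, and the
right-hand term of the exponential-decay formula is taken to be the
current connection's value.

```python
from collections import deque


def merge_exponential(old, curr, alpha=0.125):
    """Exponential decay: (1 - alpha) * old + alpha * curr.

    alpha = 0 keeps the old value; alpha = 1 replaces it with curr.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in the range [0, 1]")
    return (1.0 - alpha) * old + alpha * curr


class WindowedAverage:
    """Mean of the past N values, for some N >= 1."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("window size must be at least 1")
        # A bounded deque discards the oldest value automatically.
        self._values = deque(maxlen=n)

    def merge(self, curr):
        """Record curr and return the mean of the retained values."""
        self._values.append(curr)
        return sum(self._values) / len(self._values)
```

The exponential form needs only the single cached value, whereas the
windowed form has to keep the last N samples per parameter, which may
matter when the cache holds entries for many hosts.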

Table 4 gives an overview of option-specific information that can be
similarly shared. The TFO cookie is maintained until the client
explicitly requests it be updated as a separate event.

   +=================+=================+=======+=================+
   | Cached          | Current         | When? | New Cached      |
   +=================+=================+=======+=================+
   | old_TFO_cookie  | old_TFO_cookie  | ESTAB | old_TFO_cookie  |
   +-----------------+-----------------+-------+-----------------+
   | old_TFO_failure | old_TFO_failure | ESTAB | old_TFO_failure |
   +-----------------+-----------------+-------+-----------------+

          Table 4: Temporal Sharing - Option Info Updates
6.3. Discussion 6.3. Discussion
As noted, there is no particular benefit to caching MMS_S and MMS_R As noted, there is no particular benefit to caching MMS_S and MMS_R
as these are reported by the local IP stack. Caching sendMSS and as these are reported by the local IP stack. Caching sendMSS and
PMTU is trivial; reported values are cached (PMTU at the IP layer), PMTU is trivial; reported values are cached (PMTU at the IP layer),
and the most recent values are used. The cache is updated when the and the most recent values are used. The cache is updated when the
MSS option is received in a SYN or after PMTUD (i.e., when an ICMPv4 MSS option is received in a SYN or after PMTUD (i.e., when an ICMPv4
Fragmentation Needed [RFC1191] or ICMPv6 Packet Too Big message is Fragmentation Needed [RFC1191] or ICMPv6 Packet Too Big message is
received [RFC8201] or the equivalent is inferred, e.g., as from received [RFC8201] or the equivalent is inferred, e.g., as from
PLPMTUD [RFC4821]), respectively, so the cache always has the most PLPMTUD [RFC4821]), respectively, so the cache always has the most
recent values from any connection. For sendMSS, the cache is recent values from any connection. For sendMSS, the cache is
consulted only at connection establishment and not otherwise consulted only at connection establishment and not otherwise updated,
updated, which means that MSS options do not affect current which means that MSS options do not affect current connections. The
connections. The default sendMSS is never saved; only reported MSS default sendMSS is never saved; only reported MSS values update the
values update the cache, so an explicit override is required to cache, so an explicit override is required to reduce the sendMSS.
reduce the sendMSS. Cached sendMSS affects only data sent in the SYN Cached sendMSS affects only data sent in the SYN segment, i.e.,
segment, i.e., during client connection initiation or during during client connection initiation or during simultaneous open; the
simultaneous open; all other segment MSS are based on the value MSS of all other segments are constrained by the value updated as
updated as included in the SYN. included in the SYN.
RTT values are updated by formulae that merges the old and new RTT values are updated by formulae that merge the old and new values,
values, as noted in Section 6.2. Dynamic RTT estimation requires a as noted in Section 6.2. Dynamic RTT estimation requires a sequence
sequence of RTT measurements. As a result, the cached RTT (and its of RTT measurements. As a result, the cached RTT (and its variation)
variation) is an average of its previous value with the contents of is an average of its previous value with the contents of the
the currently active TCB for that host, when a TCB is closed. RTT currently active TCB for that host, when a TCB is closed. RTT values
values are updated only when a connection is closed. The method for are updated only when a connection is closed. The method for merging
merging old and current values needs to attempt to reduce the old and current values needs to attempt to reduce the transient
transient effects of the new connections. effects of the new connections.
The updates for RTT, RTTVAR and ssthresh rely on existing The updates for RTT, RTTVAR, and ssthresh rely on existing
information, i.e., old values. Should no such values exist, the information, i.e., old values. Should no such values exist, the
current values are cached instead. current values are cached instead.
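
The close-time update described above can be sketched as follows.
This is a minimal illustration, not a definitive implementation: it
assumes an exponentially weighted average, and the weight of 0.5, the
dictionary layout, and the names merge() and on_close() are
hypothetical choices made only for this example.

```python
from typing import Dict, Optional

def merge(old: Optional[float], curr: float, weight: float = 0.5) -> float:
    """Combine a cached value with the value from a closing connection.

    weight is the share given to the cached value (assumed, not specified).
    """
    if old is None:
        # No cached value exists yet: cache the current value unchanged.
        return curr
    return weight * old + (1.0 - weight) * curr

# Per-host cache of merged values (hypothetical layout).
cache: Dict[str, Dict[str, float]] = {}

def on_close(host: str, rtt: float, rttvar: float, ssthresh: float) -> None:
    """Update the cache for `host` when one of its connections closes."""
    entry = cache.get(host, {})
    cache[host] = {
        "rtt": merge(entry.get("rtt"), rtt),
        "rttvar": merge(entry.get("rttvar"), rttvar),
        "ssthresh": merge(entry.get("ssthresh"), ssthresh),
    }
```

A larger weight makes the cache less sensitive to the transient
behavior of any single connection, at the cost of adapting more
slowly when path properties change.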
TCP options are copied or merged depending on the details of each TCP options are copied or merged depending on the details of each
option. E.g., TFO state is updated when a connection is established option. For example, TFO state is updated when a connection is
and read before establishing a new connection. established and read before establishing a new connection.
Sections 8 and 9 discuss compatibility issues and implications of Sections 8 and 9 discuss compatibility issues and implications of
sharing the specific information listed above. Section 10 gives an sharing the specific information listed above. Section 10 gives an
overview of known implementations. overview of known implementations.
Most cached TCB values are updated when a connection closes. The Most cached TCB values are updated when a connection closes. The
exceptions are MMS_R and MMS_S, which are reported by IP [RFC1122], exceptions are MMS_R and MMS_S, which are reported by IP [RFC1122];
PMTU which is updated after Path MTU Discovery and also reported by PMTU, which is updated after Path MTU Discovery and also reported by
IP [RFC1191][RFC4821][RFC8201], and sendMSS, which is updated if the IP [RFC1191] [RFC4821] [RFC8201]; and sendMSS, which is updated if
MSS option is received in the TCP SYN header. the MSS option is received in the TCP SYN header.
Sharing sendMSS information affects only data in the SYN of the next Sharing sendMSS information affects only data in the SYN of the next
connection, because sendMSS information is typically included in connection, because sendMSS information is typically included in most
most TCP SYN segments. Caching PMTU can speed up PMTUD but, if the TCP SYN segments. Caching PMTU can speed up PMTUD but, if the
cached value is wrong, can also result in black-holing until cached value is wrong, can also result in black-holing until
corrected. Caching MMS_R and MMS_S may be of little direct value as they corrected. Caching MMS_R and MMS_S may be of little direct value as they
are reported by the local IP stack anyway. are reported by the local IP stack anyway.
The way in which other TCP option state can be shared depends on the The way in which state related to other TCP options can be shared
details of that option. E.g., TFO state includes the TCP Fast Open depends on the details of that option. For example, TFO state
Cookie [RFC7413] or, in case TFO fails, a negative TCP Fast Open includes the TCP Fast Open cookie [RFC7413] or, in case TFO fails, a
response. RFC 7413 states, "The client MUST cache negative responses negative TCP Fast Open response. RFC 7413 states,
from the server in order to avoid potential connection failures.
Negative responses include the server not acknowledging the data in
the SYN, ICMP error messages, and (most importantly) no response
(SYN-ACK) from the server at all, i.e., connection timeout." [RFC
7413]. TFOinfo is cached when a connection is established.
Other TCP option state might not be as readily cached. E.g., TCP-AO | The client MUST cache negative responses from the server in order
[RFC5925] success or failure between a host pair for a single SYN | to avoid potential connection failures. Negative responses
destination port might be usefully cached. TCP-AO success or failure | include the server not acknowledging the data in the SYN, ICMP
to other SYN destination ports on that host pair is never useful to | error messages, and (most importantly) no response (SYN-ACK) from
cache because TCP-AO security parameters can vary per service. | the server at all, i.e., connection timeout.
7. Ensemble Sharing TFOinfo is cached when a connection is established.
State related to other TCP options might not be as readily cached.
For example, TCP-AO [RFC5925] success or failure between a host-pair
for a single SYN destination port might be usefully cached. TCP-AO
success or failure to other SYN destination ports on that host-pair
is never useful to cache because TCP-AO security parameters can vary
per service.
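
The different cache scopes described above for TFO and TCP-AO state
can be sketched as follows. This is an illustration under stated
assumptions: the class OptionCache, its method names, and the choice
to key TFO state by peer address are hypothetical and are not defined
by this document or by the option specifications.

```python
from typing import Dict, Optional, Tuple

class OptionCache:
    """Toy cache of per-option state (hypothetical structure)."""

    def __init__(self) -> None:
        # TFO state: one entry per peer host (assumed key).
        self._tfo: Dict[str, Dict[str, object]] = {}
        # TCP-AO outcome: one entry per (peer host, SYN destination port).
        self._tcp_ao: Dict[Tuple[str, int], bool] = {}

    def record_tfo_cookie(self, host: str, cookie: bytes) -> None:
        self._tfo[host] = {"cookie": cookie, "failed": False}

    def record_tfo_failure(self, host: str) -> None:
        # A negative response is cached so later SYNs avoid TFO.
        self._tfo[host] = {"cookie": None, "failed": True}

    def may_use_tfo(self, host: str) -> bool:
        entry = self._tfo.get(host)
        return (entry is not None and not entry["failed"]
                and entry["cookie"] is not None)

    def record_tcp_ao(self, host: str, port: int, ok: bool) -> None:
        self._tcp_ao[(host, port)] = ok

    def tcp_ao_result(self, host: str, port: int) -> Optional[bool]:
        # Never generalize across ports: an unknown port yields None.
        return self._tcp_ao.get((host, port))
```

Keying TCP-AO results by port keeps a success or failure for one
service from influencing connections to another service on the same
host-pair.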
7. Ensemble Sharing
Sharing cached TCB data across concurrent connections requires Sharing cached TCB data across concurrent connections requires
attention to the aggregate nature of some of the shared state. For attention to the aggregate nature of some of the shared state. For
example, although MSS and RTT values can be shared by copying, it example, although MSS and RTT values can be shared by copying, it may
may not be appropriate to simply copy congestion window or ssthresh not be appropriate to simply copy congestion window or ssthresh
information; instead, the new values can be a function (f) of the information; instead, the new values can be a function (f) of the
cumulative values and the number of connections (N). cumulative values and the number of connections (N).
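
This document does not define f. Purely as an illustration, one
possible choice gives a new connection an equal share of the
aggregate. The equal-share rule, the one-segment floor, and the name
equal_share() below are assumptions made for this sketch.

```python
def equal_share(total: int, n_existing: int, mss: int) -> int:
    """Return an initial cwnd or ssthresh (bytes) for a new connection.

    total      -- sum of the value across existing connections, in bytes
    n_existing -- number of connections already sharing the path
    mss        -- segment size, used as a lower bound (assumed floor)
    """
    if n_existing < 0:
        raise ValueError("n_existing must be non-negative")
    # Split the aggregate evenly among the existing connections plus
    # the new one.
    share = total // (n_existing + 1)
    # Never start below one segment.
    return max(share, mss)
```

Whether the existing connections also give up part of their windows,
so that the aggregate stays constant, is a separate design decision
that this sketch does not address.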
7.1. Initialization of a new TCB 7.1. Initialization of a New TCB
TCBs for new connections can be initialized using cached context
from concurrent connections as follows:
ENSEMBLE SHARING - TCB Initialization
Cached TCB New TCB
------------------------------------------
old_MMS_S old_MMS_S
old_MMS_R old_MMS_R
old_sendMSS old_sendMSS
old_PMTU old_PMTU+
old_RTT old_RTT
old_RTTVAR old_RTTVAR
sum(old_ssthresh) f(sum(old_ssthresh), N)
sum(old_sendcwnd) f(sum(old_sendcwnd), N)
old_option (option specific)
+Note that PMTU is cached at the IP layer [RFC1191][RFC4821].
In the table, the cached sum() is a total across all active
connections because these parameters act in aggregate; similarly f()
is a function that updates that sum based on the new connection's
values, represented as "N".
The table below gives an overview of option-specific information
that can be similarly shared. Again, The TFO_cookie is updated upon
explicit client request, which is a separate event.
ENSEMBLE SHARING - Option Info Initialization
Cached New
------------------------------------
old_TFO_cookie old_TFO_cookie
old_TFO_failure old_TFO_failure
7.2. Updates to the TCB cache TCBs for new connections can be initialized using cached context from
concurrent connections as follows:
During a connection, the TCB cache can be updated based on changes +===================+=========================+
to concurrent connections and their TCBs, as shown below: | Cached TCB | New TCB |
+===================+=========================+
| old_MMS_S | old_MMS_S |
+-------------------+-------------------------+
| old_MMS_R | old_MMS_R |
+-------------------+-------------------------+
| old_sendMSS | old_sendMSS |
+-------------------+-------------------------+
| old_PMTU | old_PMTU (1) |
+-------------------+-------------------------+
| old_RTT | old_RTT |
+-------------------+-------------------------+
| old_RTTVAR | old_RTTVAR |
+-------------------+-------------------------+
| sum(old_ssthresh) | f(sum(old_ssthresh), N) |
+-------------------+-------------------------+
| sum(old_sendcwnd) | f(sum(old_sendcwnd), N) |
+-------------------+-------------------------+
| old_option | (option specific) |
+-------------------+-------------------------+
ENSEMBLE SHARING - Cache Updates Table 5: Ensemble Sharing - TCB Initialization
Cached TCB Current TCB when? New Cached TCB (1) Note that PMTU is cached at the IP layer [RFC1191] [RFC4821].
---------------------------------------------------------------
old_MMS_S curr_MMS_S OPEN curr_MMS_S
old_MMS_R curr_MMS_R OPEN curr_MMS_R In Table 5, the cached sum() is a total across all active connections
because these parameters act in aggregate; similarly, f() is a
function that updates that sum based on the new connection's values,
represented as "N".
old_sendMSS curr_sendMSS MSSopt curr_sendMSS Table 6 gives an overview of option-specific information that can be
similarly shared. Again, the TFO_cookie is updated upon explicit
client request, which is a separate event.
old_PMTU curr_PMTU PMTUD+ / curr_PMTU +=================+=================+
PLPMTUD+ | Cached | New |
+=================+=================+
| old_TFO_cookie | old_TFO_cookie |
+-----------------+-----------------+
| old_TFO_failure | old_TFO_failure |
+-----------------+-----------------+
old_RTT curr_RTT update rtt_update(old, curr) Table 6: Ensemble Sharing -
Option Info Initialization
old_RTTVAR curr_RTTVAR update rtt_update(old, curr) 7.2. Updates to the TCB Cache
old_ssthresh curr_ssthresh update adjust sum as appropriate During a connection, the TCB cache can be updated based on changes to
concurrent connections and their TCBs, as shown below:
old_sendcwnd curr_sendcwnd update adjust sum as appropriate +==============+===============+===========+=================+
| Cached TCB | Current TCB | When? | New Cached TCB |
+==============+===============+===========+=================+
| old_MMS_S | curr_MMS_S | OPEN | curr_MMS_S |
+--------------+---------------+-----------+-----------------+
| old_MMS_R | curr_MMS_R | OPEN | curr_MMS_R |
+--------------+---------------+-----------+-----------------+
| old_sendMSS | curr_sendMSS | MSSopt | curr_sendMSS |
+--------------+---------------+-----------+-----------------+
| old_PMTU | curr_PMTU | PMTUD+ / | curr_PMTU |
| | | PLPMTUD+ | |
+--------------+---------------+-----------+-----------------+
| old_RTT | curr_RTT | update | rtt_update(old, |
| | | | curr) |
+--------------+---------------+-----------+-----------------+
| old_RTTVAR | curr_RTTVAR | update | rtt_update(old, |
| | | | curr) |
+--------------+---------------+-----------+-----------------+
| old_ssthresh | curr_ssthresh | update | adjust sum as |
| | | | appropriate |
+--------------+---------------+-----------+-----------------+
| old_sendcwnd | curr_sendcwnd | update | adjust sum as |
| | | | appropriate |
+--------------+---------------+-----------+-----------------+
| old_option | curr_option | (depends) | (option |
| | | | specific) |
+--------------+---------------+-----------+-----------------+
old_option curr_option (depends) (option specific) Table 7: Ensemble Sharing - Cache Updates
+Note that the PMTU is cached at the IP layer [RFC1191][RFC4821]. + Note that the PMTU is cached at the IP layer [RFC1191] [RFC4821].
In the table, rtt_update() is the function used to combine old and In Table 7, rtt_update() is the function used to combine old and
current values, e.g., as a windowed average or exponentially decayed current values, e.g., as a windowed average or exponentially decayed
average. average.
The table below gives an overview of option-specific information Table 8 gives an overview of option-specific information that can be
that can be similarly shared. similarly shared.
ENSEMBLE SHARING - Option Info Updates
Cached Current when? New Cached +=================+=================+=======+=================+
---------------------------------------------------------- | Cached | Current | When? | New Cached |
old_TFO_cookie old_TFO_cookie ESTAB old_TFO_cookie +=================+=================+=======+=================+
| old_TFO_cookie | old_TFO_cookie | ESTAB | old_TFO_cookie |
+-----------------+-----------------+-------+-----------------+
| old_TFO_failure | old_TFO_failure | ESTAB | old_TFO_failure |
+-----------------+-----------------+-------+-----------------+
old_TFO_failure old_TFO_failure ESTAB old_TFO_failure Table 8: Ensemble Sharing - Option Info Updates
7.3. Discussion 7.3. Discussion
For ensemble sharing, TCB information should be cached as early as For ensemble sharing, TCB information should be cached as early as
possible, sometimes before a connection is closed. Otherwise, possible, sometimes before a connection is closed. Otherwise,
opening multiple concurrent connections may not result in TCB data opening multiple concurrent connections may not result in TCB data
sharing if no connection closes before others open. The amount of sharing if no connection closes before others open. The amount of
work involved in updating the aggregate average should be minimized, work involved in updating the aggregate average should be minimized,
but the resulting value should be equivalent to having all values but the resulting value should be equivalent to having all values
measured within a single connection. The function "rtt_update" in measured within a single connection. The function "rtt_update" in
the ensemble sharing table indicates this operation, which occurs Table 7 indicates this operation, which occurs whenever the RTT would
whenever the RTT would have been updated in the individual TCP have been updated in the individual TCP connection. As a result, the
connection. As a result, the cache contains the shared RTT cache contains the shared RTT variables, which no longer need to
variables, which no longer need to reside in the TCB. reside in the TCB.
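
A shared estimator of this kind can be sketched as follows. The
sketch assumes the smoothing rules and constants of RFC 6298 (alpha =
1/8, beta = 1/4), which is one example of an exponentially decayed
average; the class and function names and the per-host dictionary are
hypothetical.

```python
from typing import Dict, Optional

class SharedRttEstimator:
    """RTT state held in the cache and fed by every connection to a host."""

    ALPHA = 1.0 / 8.0  # gain for SRTT, as in RFC 6298
    BETA = 1.0 / 4.0   # gain for RTTVAR, as in RFC 6298

    def __init__(self) -> None:
        self.srtt: Optional[float] = None
        self.rttvar: Optional[float] = None

    def update(self, sample: float) -> None:
        """Fold in one RTT measurement (seconds) from any connection."""
        if self.srtt is None:
            # First measurement for this host.
            self.srtt = sample
            self.rttvar = sample / 2.0
            return
        # RTTVAR is updated first, using the previous SRTT.
        self.rttvar = ((1.0 - self.BETA) * self.rttvar
                       + self.BETA * abs(self.srtt - sample))
        self.srtt = (1.0 - self.ALPHA) * self.srtt + self.ALPHA * sample

_estimators: Dict[str, SharedRttEstimator] = {}

def estimator_for(host: str) -> SharedRttEstimator:
    """Return the single estimator shared by all connections to `host`."""
    return _estimators.setdefault(host, SharedRttEstimator())
```

Because every connection to the host calls update() on the same
object, the result matches what a single connection would have
computed from the same sequence of samples.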
Congestion window size and ssthresh aggregation are more complicated Congestion window size and ssthresh aggregation are more complicated
in the concurrent case. When there is an ensemble of connections, we in the concurrent case. When there is an ensemble of connections, we
need to decide how that ensemble would have shared these variables, need to decide how that ensemble would have shared these variables,
in order to derive initial values for new TCBs. in order to derive initial values for new TCBs.
Sections 8 and 9 discuss compatibility issues and implications of Sections 8 and 9 discuss compatibility issues and implications of
sharing the specific information listed above. sharing the specific information listed above.
There are several ways to initialize the congestion window in a new There are several ways to initialize the congestion window in a new
TCB among an ensemble of current connections to a host. Current TCP TCB among an ensemble of current connections to a host. Current TCP
implementations initialize it to four segments as standard [RFC3390] implementations initialize it to 4 segments as standard [RFC3390] and
and 10 segments experimentally [RFC6928]. These approaches assume 10 segments experimentally [RFC6928]. These approaches assume that
that new connections should behave as conservatively as possible. new connections should behave as conservatively as possible. The
The algorithm described in [Ba12] adjusts the initial cwnd depending algorithm described in [Ba12] adjusts the initial cwnd depending on
on the cwnd values of ongoing connections. It is also possible to the cwnd values of ongoing connections. It is also possible to use
use sharing mechanisms over long timescales to adapt TCP's initial sharing mechanisms over long timescales to adapt TCP's initial window
window automatically, as described further in Appendix C. automatically, as described further in Appendix C.
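
For reference, the two fixed initial windows cited above are defined
in bytes as upper bounds by their respective RFCs. A minimal sketch
of both formulas follows; the function names are illustrative.

```python
def initial_window_rfc3390(mss: int) -> int:
    """Upper bound on the initial window in bytes per RFC 3390."""
    return min(4 * mss, max(2 * mss, 4380))

def initial_window_rfc6928(mss: int) -> int:
    """Upper bound on the initial window in bytes per RFC 6928."""
    return min(10 * mss, max(2 * mss, 14600))
```

With an MSS of 1460 bytes, the RFC 3390 bound is 4380 bytes (3
segments), and the RFC 6928 bound is 14600 bytes (10 segments). The
4-segment figure for RFC 3390 is reached only with smaller segments,
e.g., an MSS of 536 bytes.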
8. Issues with TCB information sharing 8. Issues with TCB Information Sharing
Here, we discuss various types of problems that may arise with TCB Here, we discuss various types of problems that may arise with TCB
information sharing. information sharing.
For the congestion and current window information, the initial For the congestion and current window information, the initial values
values computed by TCB interdependence may not be consistent with computed by TCB interdependence may not be consistent with the long-
the long-term aggregate behavior of a set of concurrent connections term aggregate behavior of a set of concurrent connections between
between the same endpoints. Under conventional TCP congestion the same endpoints. Under conventional TCP congestion control, if
control, if the congestion window of a single existing connection the congestion window of a single existing connection has converged
has converged to 40 segments, two newly joining concurrent to 40 segments, two newly joining concurrent connections will assume
connections assume initial windows of 10 segments [RFC6928], and the initial windows of 10 segments [RFC6928] and the existing
current connection's window doesn't decrease to accommodate this connection's window will not decrease to accommodate this additional
additional load and connections can mutually interfere. One example load. As a consequence, the three connections can mutually
of this is seen on low-bandwidth, high-delay links, where concurrent interfere. One example of this is seen on low-bandwidth, high-delay
connections supporting Web traffic can collide because their initial links, where concurrent connections supporting Web traffic can
windows were too large, even when set at one segment. collide because their initial windows were too large, even when set
at 1 segment.
The authors of [Hu12] recommend caching ssthresh for temporal The authors of [Hu12] recommend caching ssthresh for temporal sharing
sharing only when flows are long. Some studies suggest that sharing only when flows are long. Some studies suggest that sharing ssthresh
ssthresh between short flows can deteriorate the performance of between short flows can deteriorate the performance of individual
individual connections [Hu12, Du16], although this may benefit connections [Hu12] [Du16], although this may benefit aggregate
aggregate network performance. network performance.
8.1. Traversing the same network path 8.1. Traversing the Same Network Path
TCP is sometimes used in situations where packets of the same host- TCP is sometimes used in situations where packets of the same host-
pair do not always take the same path, such as when connection- pair do not always take the same path, such as when connection-
specific parameters are used for routing (e.g., for load balancing). specific parameters are used for routing (e.g., for load balancing).
Multipath routing that relies on examining transport headers, such Multipath routing that relies on examining transport headers, such as
as ECMP and LAG [RFC7424], may not result in repeatable path ECMP and Link Aggregation Group (LAG) [RFC7424], may not result in
selection when TCP segments are encapsulated, encrypted, or altered repeatable path selection when TCP segments are encapsulated,
- for example, in some Virtual Private Network (VPN) tunnels that encrypted, or altered -- for example, in some Virtual Private Network
rely on proprietary encapsulation. Similarly, such approaches cannot (VPN) tunnels that rely on proprietary encapsulation. Similarly,
operate deterministically when the TCP header is encrypted, e.g., such approaches cannot operate deterministically when the TCP header
when using IPsec ESP (although TCB interdependence among the entire is encrypted, e.g., when using IPsec Encapsulating Security Payload
set sharing the same endpoint IP addresses should work without (ESP) (although TCB interdependence among the entire set sharing the
problems when the TCP header is encrypted). Measures to increase the same endpoint IP addresses should work without problems when the TCP
probability that connections use the same path could be applied: header is encrypted). Measures to increase the probability that
e.g., the connections could be given the same IPv6 flow label connections use the same path could be applied; for example, the
[RFC6437]. TCB interdependence can also be extended to sets of host connections could be given the same IPv6 flow label [RFC6437]. TCB
IP address pairs that share the same network path conditions, such interdependence can also be extended to sets of host IP address pairs
as when a group of addresses is on the same LAN (see Section 9). that share the same network path conditions, such as when a group of
addresses is on the same LAN (see Section 9).
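
The effect of a common flow label can be illustrated with a toy model
of hash-based path selection. The model is an assumption made for
this sketch: real routers choose their own hash inputs and functions,
and the names below are hypothetical.

```python
import hashlib
from typing import Dict, Union

Conn = Dict[str, Union[str, int]]

def path_for(conn: Conn, n_paths: int, use_flow_label: bool) -> int:
    """Pick one of n_paths equal-cost paths for a connection (toy model)."""
    if use_flow_label:
        # Hash only fields that are identical for all connections
        # that were given the same flow label.
        key = (conn["src"], conn["dst"], conn["flow_label"])
    else:
        # Hash transport ports as well, so each connection may differ.
        key = (conn["src"], conn["dst"], conn["sport"], conn["dport"])
    digest = hashlib.sha256(repr(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

a = {"src": "2001:db8::1", "dst": "2001:db8::2",
     "sport": 49152, "dport": 443, "flow_label": 0x12345}
b = dict(a, sport=49153)  # second connection, same flow label
```

In this model, path_for(a, 8, True) and path_for(b, 8, True) always
agree because the hash input is identical, whereas the port-based
variant gives no such guarantee.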
Traversing the same path is not important for host-specific Traversing the same path is not important for host-specific
information such as rwnd and TCP option state, such as TFOinfo, or information (e.g., rwnd), TCP option state (e.g., TFOinfo), or for
for information that is already cached per-host, such as path MTU. information that is already cached per-host (e.g., path MTU). When
When TCB information is shared across different SYN destination TCB information is shared across different SYN destination ports,
ports, path-related information can be incorrect; however, the path-related information can be incorrect; however, the impact of
impact of this error is potentially diminished if (as discussed this error is potentially diminished if (as discussed here) TCB
here) TCB sharing affects only the transient event of a connection sharing affects only the transient event of a connection start or if
start or if TCB information is shared only within connections to the TCB information is shared only within connections to the same SYN
same SYN destination port. destination port.
In case of Temporal Sharing, TCB information could also become In the case of temporal sharing, TCB information could also become
invalid over time, i.e., although the path remains invalid over time, i.e., although the path remains
the same, path properties have changed. Because this is similar to the same, path properties have changed. Because this is similar to
the case when a connection becomes idle, mechanisms that address the case when a connection becomes idle, mechanisms that address idle
idle TCP connections (e.g., [RFC7661]) could also be applied to TCB TCP connections (e.g., [RFC7661]) could also be applied to TCB cache
cache management, especially when TCP Fast Open is used [RFC7413]. management, especially when TCP Fast Open is used [RFC7413].
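
One simple way to limit the use of stale entries is to age them out
of the cache. The sketch below is an assumption-laden illustration:
the 600-second lifetime and the class name AgingCache are
hypothetical, and it does not implement the mechanisms of [RFC7661].

```python
import time
from typing import Callable, Dict, Optional, Tuple

class AgingCache:
    """TCB cache whose entries expire after max_age seconds."""

    def __init__(self, max_age: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_age = max_age
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def put(self, host: str, tcb_info: dict) -> None:
        # Record the time at which the information was cached.
        self._entries[host] = (self._clock(), tcb_info)

    def get(self, host: str) -> Optional[dict]:
        entry = self._entries.get(host)
        if entry is None:
            return None
        stamp, tcb_info = entry
        if self._clock() - stamp > self._max_age:
            # Too old to trust: drop it and fall back to defaults.
            del self._entries[host]
            return None
        return tcb_info
```

An implementation could instead decay cached values gradually toward
their defaults; discarding the entry is simply the most conservative
option.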
8.2. State dependence 8.2. State Dependence
There may be additional considerations to the way in which TCB There may be additional considerations to the way in which TCB
interdependence rebalances congestion feedback among the current interdependence rebalances congestion feedback among the current
connections, e.g., it may be appropriate to consider the impact of a connections. For example, it may be appropriate to consider the
connection being in Fast Recovery [RFC5681] or some other similar impact of a connection being in Fast Recovery [RFC5681] or some other
unusual feedback state, e.g., as inhibiting or affecting the similar unusual feedback state that could inhibit or affect the
calculations described herein. calculations described herein.
8.3. Problems with sharing based on IP address 8.3. Problems with Sharing Based on IP Address
It can be wrong to share TCB information between TCP connections on It can be wrong to share TCB information between TCP connections on
the same host as identified by the IP address if an IP address is the same host as identified by the IP address if an IP address is
assigned to a new host (e.g., IP address spinning, as is used by assigned to a new host (e.g., IP address spinning, as is used by ISPs
ISPs to inhibit running servers). It can be wrong if Network Address to inhibit running servers). It can be wrong if Network Address
(and Port) Translation (NA(P)T) [RFC2663] or any other IP sharing Translation (NAT) [RFC2663], Network Address and Port Translation
mechanism is used. Such mechanisms are less likely to be used with (NAPT) [RFC2663], or any other IP sharing mechanism is used. Such
IPv6. Other methods to identify a host could also be considered to mechanisms are less likely to be used with IPv6. Other methods to
make correct TCB sharing more likely. Moreover, some TCB information identify a host could also be considered to make correct TCB sharing
is about dominant path properties rather than the specific host. IP more likely. Moreover, some TCB information is about dominant path
addresses may differ, yet the relevant part of the path may be the properties rather than the specific host. IP addresses may differ,
same. yet the relevant part of the path may be the same.
9. Implications 9. Implications
There are several implications to incorporating TCB interdependence There are several implications to incorporating TCB interdependence
in TCP implementations. First, it may reduce the need for in TCP implementations. First, it may reduce the need for
application-layer multiplexing for performance enhancement application-layer multiplexing for performance enhancement [RFC7231].
[RFC7231]. Protocols like HTTP/2 [RFC7540] avoid connection Protocols like HTTP/2 [RFC7540] avoid connection re-establishment
reestablishment costs by serializing or multiplexing a set of per- costs by serializing or multiplexing a set of per-host connections
host connections across a single TCP connection. This avoids TCP's across a single TCP connection. This avoids TCP's per-connection
per-connection OPEN handshake and also avoids recomputing the MSS, OPEN handshake and also avoids recomputing the MSS, RTT, and
RTT, and congestion window values. By avoiding the so-called "slow- congestion window values. By avoiding the so-called "slow-start
start restart", performance can be optimized [Hu01]. TCB restart", performance can be optimized [Hu01]. TCB interdependence
interdependence can provide the "slow-start restart avoidance" of can provide the "slow-start restart avoidance" of multiplexing,
multiplexing, without requiring a multiplexing mechanism at the without requiring a multiplexing mechanism at the application layer.
application layer.
Like the initial version of this document [RFC2140], this update's Like the initial version of this document [RFC2140], this update's
approach to TCB interdependence focuses on sharing a set of TCBs by approach to TCB interdependence focuses on sharing a set of TCBs by
updating the TCB state to reduce the impact of transients when updating the TCB state to reduce the impact of transients when
connections begin, end, or otherwise significantly change state. connections begin, end, or otherwise significantly change state.
Other mechanisms have since been proposed to continuously share Other mechanisms have since been proposed to continuously share
information between all ongoing communication (including information between all ongoing communication (including
connectionless protocols), updating the congestion state during any connectionless protocols) and update the congestion state during any
congestion-related event (e.g., timeout, loss confirmation, etc.) congestion-related event (e.g., timeout, loss confirmation, etc.)
[RFC3124]. By dealing exclusively with transients, the approach in [RFC3124]. By dealing exclusively with transients, the approach in
this document is more likely to exhibit the same "steady-state" behavior this document is more likely to exhibit the same "steady-state" behavior
as unmodified, independent TCP connections. as unmodified, independent TCP connections.
9.1. Layering 9.1. Layering
TCB interdependence pushes some of the TCP implementation from the TCB interdependence pushes some of the TCP implementation from its
traditional transport layer (in the ISO model), to the network typical placement solely within the transport layer (in the ISO
layer. This acknowledges that some state is in fact per-host-pair or model) to the network layer. This acknowledges that some components
can be per-path as indicated solely by that host-pair. Transport of state are, in fact, per-host-pair or can be per-path as indicated
protocols typically manage per-application-pair associations (per solely by that host-pair. Transport protocols typically manage per-
stream), and network protocols manage per-host-pair and path application-pair associations (per stream), and network protocols
associations (routing). Round-trip time, MSS, and congestion manage per-host-pair and path associations (routing). Round-trip
information could be more appropriately handled at the network time, MSS, and congestion information could be more appropriately
layer, aggregated among concurrent connections, and shared across handled at the network layer, aggregated among concurrent
connection instances [RFC3124]. connections, and shared across connection instances [RFC3124].
An earlier version of RTT sharing suggested implementing RTT state An earlier version of RTT sharing suggested implementing RTT state at
at the IP layer, rather than at the TCP layer. Our observations the IP layer rather than at the TCP layer. Our observations describe
describe sharing state among TCP connections, which avoids some of sharing state among TCP connections, which avoids some of the
the difficulties in an IP-layer solution. One such problem of an IP difficulties in an IP-layer solution. One such problem of an IP-
layer solution is determining the correspondence between packet layer solution is determining the correspondence between packet
exchanges using IP header information alone, where such exchanges using IP header information alone, where such
correspondence is needed to compute RTT. Because TCB sharing correspondence is needed to compute RTT. Because TCB sharing
computes RTTs inside the TCP layer using TCP header information, it computes RTTs inside the TCP layer using TCP header information, it
can be implemented more directly and simply than at the IP layer. can be implemented more directly and simply than at the IP layer.
This is a case where information should be computed at the transport This is a case where information should be computed at the transport
layer but could be shared at the network layer. layer but could be shared at the network layer.

9.2. Other Possibilities

Per-host-pair associations are not the limit of these techniques. It
is possible that TCBs could be similarly shared between hosts on a
subnet or within a cluster, because the predominant path can be
subnet-subnet rather than host-host. Additionally, TCB
interdependence can be applied to any protocol with congestion state,
including SCTP [RFC4960] and DCCP [RFC4340], as well as to individual
subflows in Multipath TCP [RFC8684].

There may be other information that can be shared between concurrent
connections. For example, knowing that another connection has just
tried to expand its window size and failed, a connection may not
attempt to do the same for some period. The idea is that existing
TCP implementations infer the behavior of all competing connections,
including those within the same host or subnet. One possible
optimization is to make that implicit feedback explicit, via extended
information associated with the endpoint IP address and its TCP
implementation, rather than per-connection state in the TCB.
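
The window-expansion example above can be illustrated with a minimal
sketch. It is not taken from any implementation. The class name
EndpointHints, its methods, and the one-second hold-off period are
hypothetical and are chosen only to show per-endpoint (rather than
per-connection) feedback.

```python
class EndpointHints:
    """Hypothetical per-endpoint store of explicit feedback.

    State is keyed by the remote endpoint's IP address, not by
    connection, so all connections to that endpoint can consult it.
    """

    def __init__(self, holdoff=1.0):
        # holdoff (seconds) is an assumed tuning value, not a
        # number taken from the document.
        self.holdoff = holdoff
        self._last_failure = {}  # endpoint address -> failure time

    def record_failed_expansion(self, endpoint, now):
        """A connection tried to grow its window and failed."""
        self._last_failure[endpoint] = now

    def may_expand(self, endpoint, now):
        """True if no recent failure is known for this endpoint."""
        last = self._last_failure.get(endpoint)
        return last is None or (now - last) >= self.holdoff


hints = EndpointHints(holdoff=1.0)
hints.record_failed_expansion("192.0.2.1", now=10.0)
sibling_may_grow = hints.may_expand("192.0.2.1", now=10.5)
later_may_grow = hints.may_expand("192.0.2.1", now=11.5)
```

In this sketch, a sibling connection asking 0.5 seconds after the
failure is told to wait, while one asking 1.5 seconds later is not.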

This document focuses on sharing TCB information at connection
initialization. Subsequent to RFC 2140, there have been numerous
approaches that attempt to coordinate ongoing state across concurrent
connections, both within TCP and other congestion-reactive protocols,
which are summarized in [Is18]. These approaches are more complex to
implement, and their comparison to steady-state TCP equivalence can
be more difficult to establish, sometimes intentionally (i.e., they
sometimes intend to provide a different kind of "fairness" than
emerges from TCP operation).

10. Implementation Observations

The observation that some TCB state is host-pair specific rather than
application-pair dependent is not new and is a common engineering
decision in layered protocol implementations. Although now
deprecated, T/TCP [RFC1644] was the first to propose using caches in
order to maintain TCB states (see Appendix A).

Table 9 describes the current implementation status for TCB temporal
sharing in Windows as of December 2020, Apple variants (macOS, iOS,
iPadOS, tvOS, and watchOS) as of January 2021, Linux kernel version
5.10.3, and FreeBSD 12. Ensemble sharing is not yet implemented.

+==============+=========================================+
| TCB data     | Status                                  |
+==============+=========================================+
| old_MMS_S    | Not shared                              |
+--------------+-----------------------------------------+
| old_MMS_R    | Not shared                              |
+--------------+-----------------------------------------+
| old_sendMSS  | Cached and shared in Apple, Linux (MSS) |
+--------------+-----------------------------------------+
| old_PMTU     | Cached and shared in Apple, FreeBSD,    |
|              | Windows (PMTU)                          |
+--------------+-----------------------------------------+
| old_RTT      | Cached and shared in Apple, FreeBSD,    |
|              | Linux, Windows                          |
+--------------+-----------------------------------------+
| old_RTTVAR   | Cached and shared in Apple, FreeBSD,    |
|              | Windows                                 |
+--------------+-----------------------------------------+
| old_TFOinfo  | Cached and shared in Apple, Linux,      |
|              | Windows                                 |
+--------------+-----------------------------------------+
| old_sendcwnd | Not shared                              |
+--------------+-----------------------------------------+
| old_ssthresh | Cached and shared in Apple, FreeBSD*,   |
|              | Linux*                                  |
+--------------+-----------------------------------------+
| TFO failure  | Cached and shared in Apple              |
+--------------+-----------------------------------------+

          Table 9: KNOWN IMPLEMENTATION STATUS

* Note: In FreeBSD, new ssthresh is the mean of curr_ssthresh and
  its previous value if a previous value exists; in Linux, the
  calculation depends on state and is max(curr_cwnd/2, old_ssthresh)
  in most cases.

In Table 9, "Apple" refers to all Apple OSes, i.e., macOS (desktop/
laptop), iOS (phone), iPadOS (tablet), tvOS (video player), and
watchOS (smart watch), which all share the same Internet protocol
stack.
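
The two ssthresh update rules in the note to the implementation
status table can be written out as a short sketch. This is an
illustration of the arithmetic only, not code from either kernel.
The function names are hypothetical, values are assumed to be
integer counts (e.g., of segments), and storing the current value
unchanged when no previous FreeBSD value exists is an assumption.

```python
def freebsd_style_ssthresh(curr_ssthresh, cached_ssthresh=None):
    """Mean of the current and the previously cached value.

    Assumption: with no previous value, the current value is
    stored as is. Integer division is also an assumption.
    """
    if cached_ssthresh is None:
        return curr_ssthresh
    return (curr_ssthresh + cached_ssthresh) // 2


def linux_style_ssthresh(curr_cwnd, old_ssthresh):
    """max(curr_cwnd/2, old_ssthresh), the rule used in most cases."""
    return max(curr_cwnd // 2, old_ssthresh)


mean_example = freebsd_style_ssthresh(20, cached_ssthresh=10)
max_example = linux_style_ssthresh(curr_cwnd=30, old_ssthresh=10)
```

Both rules damp the effect of a single connection on the cached
value: the first by averaging, the second by never lowering the
cached ssthresh.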

11. Changes Compared to RFC 2140

This document updates the description of TCB sharing in RFC 2140 and
its associated impact on existing and new connection state, providing
a complete replacement for that document [RFC2140]. It clarifies the
previous description and terminology and extends the mechanism to its
impact on new protocols and mechanisms, including Multipath TCP, Fast
Open, PLPMTUD, NAT, and the TCP Authentication Option.

The detailed impact on TCB state addresses TCB parameters with
greater specificity. It separates the way MSS is used in both send
and receive directions, it separates the way both of these MSS values
differ from sendMSS, it adds both path MTU and ssthresh, and it
addresses the impact on state associated with TCP options.

New sections have been added to address compatibility issues and
implementation observations. The relation of this work to T/TCP has
been moved to Appendix A (which describes the history of TCB
sharing), partly to reflect the deprecation of that protocol.

Appendix C has been added to discuss the potential to use temporal
sharing over long timescales to adapt TCP's initial window
automatically, avoiding the need to periodically revise a single
global constant value.

Finally, this document updates and significantly expands the
referenced literature.

12. Security Considerations

These presented implementation methods do not have additional
ramifications for direct (connection-aborting or information-
injecting) attacks on individual connections. Individual
connections, whether using sharing or not, also may be susceptible to
denial-of-service attacks that reduce performance or completely deny
connections and transfers if not otherwise secured.

TCB sharing may create additional denial-of-service attacks that
affect the performance of other connections by polluting the cached
information. This can occur across any set of connections in which
the TCB is shared, between connections in a single host, or between
hosts if TCB sharing is implemented within a subnet (see
"Implications" (Section 9)). Some shared TCB parameters are used
only to create new TCBs; others are shared among the TCBs of ongoing
connections. New connections can join the ongoing set, e.g., to
optimize send window size among a set of connections to the same
host. PMTU is defined as shared at the IP layer and is already
susceptible in this way.

Options in client SYNs can be easier to forge than complete, two-way
connections. As a result, their values may not be safely
incorporated in shared values until after the three-way handshake
completes.
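
One way to respect the SYN-option caution above is to hold option
values aside until the handshake completes. The following is a
minimal sketch under that assumption. The class SharedOptionCache and
its method names are hypothetical and do not correspond to any real
stack's interface.

```python
class SharedOptionCache:
    """Hypothetical cache that commits SYN options only once the
    three-way handshake for that connection has completed."""

    def __init__(self):
        self._shared = {}   # peer address -> committed option values
        self._pending = {}  # connection id -> (peer, SYN options)

    def on_syn(self, conn_id, peer, options):
        # Values from a bare SYN may be forged, so they are only
        # staged here and are not yet visible to other connections.
        self._pending[conn_id] = (peer, dict(options))

    def on_established(self, conn_id):
        # The handshake completed, so the staged values are merged
        # into the shared state for that peer.
        entry = self._pending.pop(conn_id, None)
        if entry is not None:
            peer, options = entry
            self._shared.setdefault(peer, {}).update(options)

    def on_abort(self, conn_id):
        # The handshake never completed; discard the staged values.
        self._pending.pop(conn_id, None)

    def lookup(self, peer):
        return dict(self._shared.get(peer, {}))


cache = SharedOptionCache()
cache.on_syn(1, "192.0.2.1", {"mss": 1460})
before_handshake = cache.lookup("192.0.2.1")
cache.on_established(1)
after_handshake = cache.lookup("192.0.2.1")
```

Values staged by a connection that never completes its handshake are
dropped by on_abort and never reach the shared state.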

Attacks on parameters used only for initialization affect only the
transient performance of a TCP connection. For short connections,
the performance ramification can approach that of a denial-of-service
attack. For example, if an application changes its TCB to have a
false and small window size, subsequent connections will experience
performance degradation until their window grows appropriately.

TCB sharing reuses and mixes information from past and current
connections. Although reusing information could create a potential
for fingerprinting to identify hosts, the mixing reduces that
potential. There has been no evidence of fingerprinting based on
this technique, and it is currently considered safe in that regard.
Further, information about the performance of a TCP connection has
not been considered as private.

13. IANA Considerations

This document has no IANA actions.
14. References
14.1. Normative References
[RFC793] Postel, J., "Transmission Control Protocol," Network
Working Group RFC-793/STD-7, ISI, Sept. 1981.
[RFC1122] Braden, R. (ed), "Requirements for Internet Hosts --
Communication Layers", RFC-1122, Oct. 1989.
[RFC1191] Mogul, J., Deering, S., "Path MTU Discovery," RFC 1191,
Nov. 1990.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC4821] Mathis, M., Heffner, J., "Packetization Layer Path MTU
Discovery," RFC 4821, Mar. 2007.
[RFC5681] Allman, M., Paxson, V., Blanton, E., "TCP Congestion
Control," RFC 5681 (Standards Track), Sep. 2009.
[RFC6298] Paxson, V., Allman, M., Chu, J., Sargent, M., "Computing
TCP's Retransmission Timer," RFC 6298, June 2011.
[RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., Jain, A., "TCP Fast
Open", RFC 7413, Dec. 2014.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", RFC 8174, May 2017.
[RFC8201] McCann, J., Deering. S., Mogul, J., Hinden, R. (Ed.),
"Path MTU Discovery for IP version 6," RFC 8201, Jul.
2017.
14.2. Informative References
[Al10] Allman, M., "Initial Congestion Window Specification",
(work in progress), draft-allman-tcpm-bump-initcwnd-00,
Nov. 2010.
[Ba12] Barik, R., Welzl, M., Ferlin, S., Alay, O., " LISA: A
Linked Slow-Start Algorithm for MPTCP", IEEE ICC, Kuala
Lumpur, Malaysia, May 23-27 2016.
[Ba20] Bagnulo, M., Briscoe, B., "ECN++: Adding Explicit
Congestion Notification (ECN) to TCP Control Packets",
draft-ietf-tcpm-generalized-ecn-07, Feb. 2021.
[Be94] Berners-Lee, T., et al., "The World-Wide Web,"
Communications of the ACM, V37, Aug. 1994, pp. 76-82.
[Br94] Braden, B., "T/TCP -- Transaction TCP: Source Changes for
Sun OS 4.1.3,", Release 1.0, USC/ISI, September 14, 1994.
[Br02] Brownlee, N., Claffy, K., "Understanding Internet Traffic
Streams: Dragonflies and Tortoises", IEEE Communications
Magazine p110-117, 2002.
[Co91] Comer, D., Stevens, D., Internetworking with TCP/IP, V2,
Prentice-Hall, NJ, 1991.
[Du16] Dukkipati, N., Yuchung C., Amin V., "Research Impacting
the Practice of Congestion Control." ACM SIGCOMM CCR
(editorial), on-line post, July 2016.
[FreeBSD] FreeBSD source code, Release 2.10, http://www.freebsd.org/
[Hu01] Hughes, A., Touch, J., Heidemann, J., "Issues in Slow-
Start Restart After Idle", draft-hughes-restart-00
(expired), Dec. 2001.
[Hu12] Hurtig, P., Brunstrom, A., "Enhanced metric caching for
short TCP flows," 2012 IEEE International Conference on
Communications (ICC), Ottawa, ON, 2012, pp. 1209-1213.
[IANA] IANA TCP Parameters (options) registry,
https://www.iana.org/assignments/tcp-parameters
[Is18] Islam, S., Welzl, M., Hiorth, K., Hayes, D., Armitage, G.,
Gjessing, S., "ctrlTCP: Reducing Latency through Coupled,
Heterogeneous Multi-Flow TCP Congestion Control," Proc.
IEEE INFOCOM Global Internet Symposium (GI) workshop (GI
2018), Honolulu, HI, April 2018.
[Ja88] Jacobson, V., Karels, M., "Congestion Avoidance and
Control", Proc. Sigcomm 1988.
[RFC1644] Braden, R., "T/TCP -- TCP Extensions for Transactions
Functional Specification," RFC-1644, July 1994.
[RFC1379] Braden, R., "Transaction TCP -- Concepts," RFC-1379,
September 1992.
[RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
Retransmit, and Fast Recovery Algorithms", RFC2001
(Standards Track), Jan. 1997.
[RFC2140] Touch, J., "TCP Control Block Interdependence", RFC 2140,
April 1997.
[RFC2414] Allman, M., Floyd, S., Partridge, C., "Increasing TCP's
Initial Window", RFC 2414 (Experimental), Sept. 1998.
[RFC2663] Srisuresh, P., Holdrege, M., "IP Network Address
Translator (NAT) Terminology and Considerations", RFC-
2663, August 1999.
[RFC3390] Allman, M., Floyd, S., Partridge, C., "Increasing TCP's
Initial Window," RFC 3390, Oct. 2002.
[RFC3124] Balakrishnan, H., Seshan, S., "The Congestion Manager,"
RFC 3124, June 2001.
[RFC4340] Kohler, E., Handley, M., Floyd, S., "Datagram Congestion
Control Protocol (DCCP)," RFC 4340, Mar. 2006.
[RFC4960] Stewart, R., (Ed.), "Stream Control Transmission
Protocol," RFC4960, Sept. 2007.
[RFC5925] Touch, J., Mankin, A., Bonica, R., "The TCP Authentication
Option," RFC 5925, June 2010.
[RFC6437] Amante, S., Carpenter, B., Jiang, S., Rajajalme, J., "IPv6
Flow Label Specification," RFC 6437, Nov. 2011.
[RFC6691] Borman, D., "TCP Options and Maximum Segment Size (MSS),"
RFC 6691, July 2012.
[RFC6928] Chu, J., Dukkipati, N., Cheng, Y., Mathis, M., "Increasing
TCP's Initial Window," RFC 6928, Apr. 2013.
[RFC7231] Fielding, R., Reshke, J., Eds., "HTTP/1.1 Semantics and
Content," RFC-7231, June 2014.
[RFC7323] Borman, D., Braden, B., Jacobson, V., Scheffenegger, R.,
(Ed.), "TCP Extensions for High Performance," RFC 7323,
Sept. 2014.
[RFC7424] Krishnan, R., Yong, L., Ghanwani, A., So, N., Khasnabish,
B., "Mechanisms for Optimizing Link Aggregation Group
(LAG) and Equal-Cost Multipath (ECMP) Component Link
Utilization in Networks", RFC 7424, Jan. 2015
[RFC7540] Belshe, M., Peon, R., Thomson, M., "Hypertext Transfer
Protocol Version 2 (HTTP/2)", RFC 7540, May 2015.
[RFC7661] Fairhurst, G., Sathiaseelan, A., Secchi, R., "Updating TCP
to Support Rate-Limited Traffic", RFC 7661, Oct. 2015.
[RFC8684] Ford, A., Raiciu, C., Handley, M., Bonaventure, O.,
Paasch, C., "TCP Extensions for Multipath Operation with
Multiple Addresses," RFC 8684, Mar. 2020.

14. References

14.1. Normative References

[RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
           RFC 793, DOI 10.17487/RFC0793, September 1981,
           <https://www.rfc-editor.org/info/rfc793>.

[RFC1122]  Braden, R., Ed., "Requirements for Internet Hosts -
           Communication Layers", STD 3, RFC 1122,
           DOI 10.17487/RFC1122, October 1989,
           <https://www.rfc-editor.org/info/rfc1122>.

[RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
           DOI 10.17487/RFC1191, November 1990,
           <https://www.rfc-editor.org/info/rfc1191>.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119,
           DOI 10.17487/RFC2119, March 1997,
           <https://www.rfc-editor.org/info/rfc2119>.

[RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
           Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007,
           <https://www.rfc-editor.org/info/rfc4821>.

[RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
           Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
           <https://www.rfc-editor.org/info/rfc5681>.

[RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
           "Computing TCP's Retransmission Timer", RFC 6298,
           DOI 10.17487/RFC6298, June 2011,
           <https://www.rfc-editor.org/info/rfc6298>.

[RFC7413]  Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP
           Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014,
           <https://www.rfc-editor.org/info/rfc7413>.

[RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
           2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
           May 2017, <https://www.rfc-editor.org/info/rfc8174>.

[RFC8201]  McCann, J., Deering, S., Mogul, J., and R. Hinden, Ed.,
           "Path MTU Discovery for IP version 6", STD 87, RFC 8201,
           DOI 10.17487/RFC8201, July 2017,
           <https://www.rfc-editor.org/info/rfc8201>.

14.2. Informative References

[Al10]     Allman, M., "Initial Congestion Window Specification",
           Work in Progress, Internet-Draft,
           draft-allman-tcpm-bump-initcwnd-00, 15 November 2010,
           <https://datatracker.ietf.org/doc/html/draft-allman-tcpm-bump-initcwnd-00>.

[Ba12]     Barik, R., Welzl, M., Ferlin, S., and O. Alay, "LISA: A
           linked slow-start algorithm for MPTCP", IEEE ICC,
           DOI 10.1109/ICC.2016.7510786, May 2016,
           <https://doi.org/10.1109/ICC.2016.7510786>.

[Ba20]     Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit
           Congestion Notification (ECN) to TCP Control Packets",
           Work in Progress, Internet-Draft,
           draft-ietf-tcpm-generalized-ecn-07, 16 February 2021,
           <https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-generalized-ecn-07>.

[Be94]     Berners-Lee, T., Cailliau, C., Luotonen, A., Nielsen, H.,
           and A. Secret, "The World-Wide Web", Communications of the
           ACM V37, pp. 76-82, DOI 10.1145/179606.179671, August
           1994, <https://doi.org/10.1145/179606.179671>.

[Br02]     Brownlee, N. and KC. Claffy, "Understanding Internet
           traffic streams: dragonflies and tortoises", IEEE
           Communications Magazine, pp. 110-117,
           DOI 10.1109/MCOM.2002.1039865, 2002,
           <https://doi.org/10.1109/MCOM.2002.1039865>.

[Br94]     Braden, B., "T/TCP -- Transaction TCP: Source Changes for
           Sun OS 4.1.3", USC/ISI Release 1.0, September 1994.

[Co91]     Comer, D. and D. Stevens, "Internetworking with TCP/IP",
           ISBN 10: 0134685059, ISBN 13: 9780134685052, 1991.

[Du16]     Dukkipati, N., Cheng, Y., and A. Vahdat, "Research
           Impacting the Practice of Congestion Control", Computer
           Communication Review, The ACM SIGCOMM newsletter, July
           2016.

[FreeBSD]  FreeBSD, "The FreeBSD Project",
           <https://www.freebsd.org/>.

[Hu01]     Hughes, A., Touch, J., and J. Heidemann, "Issues in TCP
           Slow-Start Restart After Idle", Work in Progress,
           Internet-Draft, draft-hughes-restart-00, December 2001,
           <https://datatracker.ietf.org/doc/html/draft-hughes-restart-00>.

[Hu12]     Hurtig, P. and A. Brunstrom, "Enhanced metric caching for
           short TCP flows", IEEE International Conference on
           Communications, DOI 10.1109/ICC.2012.6364516, 2012,
           <https://doi.org/10.1109/ICC.2012.6364516>.

[IANA]     IANA, "Transmission Control Protocol (TCP) Parameters",
           <https://www.iana.org/assignments/tcp-parameters>.

[Is18]     Islam, S., Welzl, M., Hiorth, K., Hayes, D., Armitage, G.,
           and S. Gjessing, "ctrlTCP: Reducing latency through
           coupled, heterogeneous multi-flow TCP congestion control",
           IEEE INFOCOM 2018 - IEEE Conference on Computer
           Communications Workshops (INFOCOM WKSHPS),
           DOI 10.1109/INFCOMW.2018.8406887, April 2018,
           <https://doi.org/10.1109/INFCOMW.2018.8406887>.

[Ja88]     Jacobson, V. and M. Karels, "Congestion Avoidance and
           Control", SIGCOMM Symposium proceedings on Communications
           architectures and protocols, November 1988.

[RFC1379]  Braden, R., "Extending TCP for Transactions -- Concepts",
           RFC 1379, DOI 10.17487/RFC1379, November 1992,
           <https://www.rfc-editor.org/info/rfc1379>.

[RFC1644]  Braden, R., "T/TCP -- TCP Extensions for Transactions
           Functional Specification", RFC 1644, DOI 10.17487/RFC1644,
           July 1994, <https://www.rfc-editor.org/info/rfc1644>.

[RFC2001]  Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
           Retransmit, and Fast Recovery Algorithms", RFC 2001,
           DOI 10.17487/RFC2001, January 1997,
           <https://www.rfc-editor.org/info/rfc2001>.

[RFC2140]  Touch, J., "TCP Control Block Interdependence", RFC 2140,
           DOI 10.17487/RFC2140, April 1997,
           <https://www.rfc-editor.org/info/rfc2140>.

[RFC2414]  Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's
           Initial Window", RFC 2414, DOI 10.17487/RFC2414, September
           1998, <https://www.rfc-editor.org/info/rfc2414>.

[RFC2663]  Srisuresh, P. and M. Holdrege, "IP Network Address
           Translator (NAT) Terminology and Considerations",
           RFC 2663, DOI 10.17487/RFC2663, August 1999,
           <https://www.rfc-editor.org/info/rfc2663>.

[RFC3124]  Balakrishnan, H. and S. Seshan, "The Congestion Manager",
           RFC 3124, DOI 10.17487/RFC3124, June 2001,
           <https://www.rfc-editor.org/info/rfc3124>.

[RFC3390]  Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's
           Initial Window", RFC 3390, DOI 10.17487/RFC3390, October
           2002, <https://www.rfc-editor.org/info/rfc3390>.

[RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
           Congestion Control Protocol (DCCP)", RFC 4340,
           DOI 10.17487/RFC4340, March 2006,
           <https://www.rfc-editor.org/info/rfc4340>.

[RFC4960]  Stewart, R., Ed., "Stream Control Transmission Protocol",
           RFC 4960, DOI 10.17487/RFC4960, September 2007,
           <https://www.rfc-editor.org/info/rfc4960>.

[RFC5925]  Touch, J., Mankin, A., and R. Bonica, "The TCP
           Authentication Option", RFC 5925, DOI 10.17487/RFC5925,
           June 2010, <https://www.rfc-editor.org/info/rfc5925>.

[RFC6437]  Amante, S., Carpenter, B., Jiang, S., and J. Rajahalme,
           "IPv6 Flow Label Specification", RFC 6437,
           DOI 10.17487/RFC6437, November 2011,
           <https://www.rfc-editor.org/info/rfc6437>.

[RFC6691]  Borman, D., "TCP Options and Maximum Segment Size (MSS)",
           RFC 6691, DOI 10.17487/RFC6691, July 2012,
           <https://www.rfc-editor.org/info/rfc6691>.

[RFC6928]  Chu, J., Dukkipati, N., Cheng, Y., and M. Mathis,
           "Increasing TCP's Initial Window", RFC 6928,
           DOI 10.17487/RFC6928, April 2013,
           <https://www.rfc-editor.org/info/rfc6928>.

[RFC7231]  Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer
           Protocol (HTTP/1.1): Semantics and Content", RFC 7231,
           DOI 10.17487/RFC7231, June 2014,
           <https://www.rfc-editor.org/info/rfc7231>.

[RFC7323]  Borman, D., Braden, B., Jacobson, V., and R.
           Scheffenegger, Ed., "TCP Extensions for High Performance",
           RFC 7323, DOI 10.17487/RFC7323, September 2014,
           <https://www.rfc-editor.org/info/rfc7323>.

[RFC7424]  Krishnan, R., Yong, L., Ghanwani, A., So, N., and B.
           Khasnabish, "Mechanisms for Optimizing Link Aggregation
           Group (LAG) and Equal-Cost Multipath (ECMP) Component Link
           Utilization in Networks", RFC 7424, DOI 10.17487/RFC7424,
           January 2015, <https://www.rfc-editor.org/info/rfc7424>.

[RFC7540]  Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext
           Transfer Protocol Version 2 (HTTP/2)", RFC 7540,
           DOI 10.17487/RFC7540, May 2015,
           <https://www.rfc-editor.org/info/rfc7540>.

[RFC7661]  Fairhurst, G., Sathiaseelan, A., and R. Secchi, "Updating
           TCP to Support Rate-Limited Traffic", RFC 7661,
           DOI 10.17487/RFC7661, October 2015,
           <https://www.rfc-editor.org/info/rfc7661>.

[RFC8684]  Ford, A., Raiciu, C., Handley, M., Bonaventure, O., and C.
           Paasch, "TCP Extensions for Multipath Operation with
           Multiple Addresses", RFC 8684, DOI 10.17487/RFC8684, March
           2020, <https://www.rfc-editor.org/info/rfc8684>.

15. Acknowledgments

The authors would like to thank Praveen Balasubramanian for
information regarding TCB sharing in Windows, Christoph Paasch for
information regarding TCB sharing in Apple OSes, and Yuchung Cheng,
Lars Eggert, Ilpo Jarvinen, and Michael Scharf for comments on
earlier versions of the draft, as well as members of the TCPM WG.

Earlier revisions of this work received funding from a collaborative
research project between the University of Oslo and Huawei
Technologies Co., Ltd. and were partly supported by USC/ISI's Postel
Center.

Authors' Addresses

Joe Touch
Manhattan Beach, CA 90266
USA
Phone: +1 (310) 560-0334
Email: touch@strayalpha.com

Michael Welzl
University of Oslo
PO Box 1080 Blindern
Oslo N-0316
Norway
Phone: +47 22 85 24 20
Email: michawe@ifi.uio.no

Safiqul Islam
University of Oslo
PO Box 1080 Blindern
Oslo N-0316
Norway
Phone: +47 22 84 08 37
Email: safiquli@ifi.uio.no

Appendix A. TCB Sharing History

T/TCP proposed using caches to maintain TCB information across
instances (temporal sharing), e.g., smoothed RTT, RTT variation,
congestion-avoidance threshold, and MSS [RFC1644]. These values were
in addition to connection counts used by T/TCP to accelerate data
delivery prior to the full three-way handshake during an OPEN. The
goal was to aggregate TCB components where they reflect one
association -- that of the host-pair rather than artificially
separating those components by connection.

At least one T/TCP implementation saved the MSS and aggregated the
RTT parameters across multiple connections but omitted caching the
congestion window information [Br94], as originally specified in
[RFC1379]. Some T/TCP implementations immediately updated MSS when
the TCP MSS header option was received [Br94], although this was not
addressed specifically in the concepts or functional specification
[RFC1379] [RFC1644]. In later T/TCP implementations, RTT values were
updated only after a CLOSE, which does not benefit concurrent
sessions.

Temporal sharing of cached TCB data was originally implemented in the
Sun OS 4.1.3 T/TCP extensions [Br94] and the FreeBSD port of same
[FreeBSD]. As mentioned before, only the MSS and RTT parameters were
cached, as originally specified in [RFC1379]. Later discussion of
T/TCP suggested including congestion control parameters in this
cache; for example, Section 3.1 of [RFC1644] hints at initializing
the congestion window to the old window size.
Appendix B: TCP Option Sharing and Caching Appendix B. TCP Option Sharing and Caching
In addition to the options that can be cached and shared, this memo In addition to the options that can be cached and shared, this memo
also lists known TCP options [IANA] for which state is unsafe to be also lists known TCP options [IANA] for which state is unsafe to be
kept. This list is not intended to be authoritative or exhaustive. kept. This list is not intended to be authoritative or exhaustive.
Obsolete (unsafe to keep state): Obsolete (unsafe to keep state):
ECHO Echo
ECHO REPLY Echo Reply
PO Conn permitted Partial Order Connection Permitted
PO service profile Partial Order Service Profile
CC CC
CC.NEW CC.NEW
CC.ECHO CC.ECHO
Alt CS req TCP Alternate Checksum Request
Alt CS data TCP Alternate Checksum Data
No state to keep: No state to keep:
EOL End of Option List (EOL)
NOP No-Operation (NOP)
WS Window Scale (WS)
SACK SACK
TS Timestamps (TS)
MD5 MD5 Signature Option
TCP-AO TCP Authentication Option (TCP-AO)
EXP1 RFC3692-style Experiment 1
EXP2 RFC3692-style Experiment 2
Unsafe to keep state: Unsafe to keep state:
Skeeter (DH exchange, known to be vulnerable) Skeeter (DH exchange, known to be vulnerable)
Bubba (DH exchange, known to be vulnerable) Bubba (DH exchange, known to be vulnerable)
Trailer CS Trailer Checksum Option
SCPS capabilities SCPS capabilities
S-NACK Selective Negative Acknowledgements (S-NACK)
Records boundaries Records Boundaries
Corruption experienced Corruption experienced
SNAP SNAP
TCP Compression TCP Compression Filter
Quickstart response Quick-Start Response
UTO User Timeout Option (UTO)
MPTCP negotiation success (see below for negotiation failure) Multipath TCP (MPTCP) negotiation success (see below for
negotiation failure)
TFO negotiation success (see below for negotiation failure) TCP Fast Open (TFO) negotiation success (see below for negotiation
failure)
Safe but optional to keep state: Safe but optional to keep state:
MPTCP negotiation failure (to avoid negotiation retries) Multipath TCP (MPTCP) negotiation failure (to avoid negotiation
retries)
MSS Maximum Segment Size (MSS)
TFO negotiation failure (to avoid negotiation retries) TCP Fast Open (TFO) negotiation failure (to avoid negotiation
retries)
Safe and necessary to keep state: Safe and necessary to keep state:
TFO cookie (if TFO succeeded in the past) TCP Fast Open (TFO) Cookie (if TFO succeeded in the past)
Appendix C: Automating the Initial Window in TCP over Long Timescales Appendix C. Automating the Initial Window in TCP over Long Timescales
C.1. Introduction C.1. Introduction
Temporal sharing, as described earlier in this document, builds on Temporal sharing, as described earlier in this document, builds on
the assumption that multiple consecutive connections between the the assumption that multiple consecutive connections between the same
same host pair are somewhat likely to be exposed to similar host-pair are somewhat likely to be exposed to similar environment
environment characteristics. The stored information can become less characteristics. The stored information can become less accurate
accurate over time and suitable precautions should take this ageing over time and suitable precautions should take this aging into
into consideration (this is discussed further in section 8.1). consideration (this is discussed further in Section 8.1). However,
However, there are also cases where it can make sense to track these there are also cases where it can make sense to track these values
values over longer periods, observing properties of TCP connections over longer periods, observing properties of TCP connections to
to gradually influence evolving trends in TCP parameters. This gradually influence evolving trends in TCP parameters. This appendix
appendix describes an example of such a case. describes an example of such a case.
TCP's congestion control algorithm uses an initial window value TCP's congestion control algorithm uses an initial window value (IW)
(IW), both as a starting point for new connections and as an upper both as a starting point for new connections and as an upper limit
limit for restarting after an idle period [RFC5681][RFC7661]. This for restarting after an idle period [RFC5681] [RFC7661]. This value
value has evolved over time, originally one maximum segment size has evolved over time; it was originally 1 maximum segment size (MSS)
(MSS), and increased to the lesser of four MSS or 4,380 bytes and increased to the lesser of 4 MSSs or 4,380 bytes [RFC3390]
[RFC3390][RFC5681]. For a typical Internet connection with a maximum [RFC5681]. For a typical Internet connection with a maximum
transmission unit (MTU) of 1500 bytes, this permits three segments transmission unit (MTU) of 1500 bytes, this permits 3 segments of
of 1,460 bytes each. 1,460 bytes each.
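As an illustration of the arithmetic above, the upper bound in [RFC3390] can be written as min(4*MSS, max(2*MSS, 4380 bytes)). The following is a minimal sketch of that calculation. The function name is hypothetical, and the MSS is derived under the assumption of 20-byte IPv4 and 20-byte TCP headers with no options.

```python
def rfc3390_initial_window(mss: int) -> int:
    """Upper bound on the initial window, in bytes, per RFC 3390."""
    return min(4 * mss, max(2 * mss, 4380))


# Assumption: 1500-byte MTU minus 20-byte IPv4 and 20-byte TCP headers.
mss = 1500 - 40
iw_bytes = rfc3390_initial_window(mss)
iw_segments = iw_bytes // mss  # whole segments that fit in the window
```

For the 1,460-byte MSS assumed here, the 4,380-byte term is the one that limits the window.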
The IW value was originally implied in the original TCP congestion The IW value was originally implied in the original TCP congestion
control description and documented as a standard in 1997 control description and documented as a standard in 1997 [RFC2001]
[RFC2001][Ja88]. The value was updated in 1998 experimentally and [Ja88]. The value was updated in 1998 experimentally and moved to
moved to the standards track in 2002 [RFC2414][RFC3390]. In 2013, it the Standards Track in 2002 [RFC2414] [RFC3390]. In 2013, it was
was experimentally increased to 10 [RFC6928]. experimentally increased to 10 [RFC6928].
This appendix discusses how TCP can objectively measure when an IW This appendix discusses how TCP can objectively measure when an IW is
is too large, and that such feedback should be used over long too large and that such feedback should be used over long timescales
timescales to adjust the IW automatically. The result should be to adjust the IW automatically. The result should be safer to deploy
safer to deploy and might avoid the need to repeatedly revisit IW and might avoid the need to repeatedly revisit IW over time.
over time.
Note that this mechanism attempts to make the IW more adaptive over Note that this mechanism attempts to make the IW more adaptive over
time. It can increase the IW beyond that which is currently time. It can increase the IW beyond that which is currently
recommended for widescale deployment, and so its use should be recommended for wide-scale deployment, so its use should be carefully
carefully monitored. monitored.
C.2. Design Considerations C.2. Design Considerations
TCP's IW value has existed statically for over two decades, so any TCP's IW value has existed statically for over two decades, so any
solution to adjusting the IW dynamically should have similarly solution to adjusting the IW dynamically should have similarly
stable, non-invasive effects on the performance and complexity of stable, non-invasive effects on the performance and complexity of
TCP. In order to be fair, the IW should be similar for most machines TCP. In order to be fair, the IW should be similar for most machines
on the public Internet. Finally, a desirable goal is to develop a on the public Internet. Finally, a desirable goal is to develop a
self-correcting algorithm, so that IW values that cause network self-correcting algorithm so that IW values that cause network
problems can be avoided. To that end, we propose the following problems can be avoided. To that end, we propose the following
design goals: design goals:
o Impart little to no impact to TCP in the absence of loss, i.e., * Impart little to no impact to TCP in the absence of loss, i.e., it
it should not increase the complexity of default packet should not increase the complexity of default packet processing in
processing in the normal case. the normal case.
o Adapt to network feedback over long timescales, avoiding values * Adapt to network feedback over long timescales, avoiding values
that persistently cause network problems. that persistently cause network problems.
o Decrease the IW in the presence of sustained loss of IW segments, * Decrease the IW in the presence of sustained loss of IW segments,
as determined over a number of different connections. as determined over a number of different connections.
o Increase the IW in the absence of sustained loss of IW segments, * Increase the IW in the absence of sustained loss of IW segments,
as determined over a number of different connections. as determined over a number of different connections.
o Operate conservatively, i.e., tend towards leaving the IW the * Operate conservatively, i.e., tend towards leaving the IW the same
same in the absence of sufficient information, and give greater in the absence of sufficient information, and give greater
consideration to IW segment loss than IW segment success. consideration to IW segment loss than IW segment success.
We expect that, without other context, a good IW algorithm will We expect that, without other context, a good IW algorithm will
converge to a single value, but this is not required. An endpoint converge to a single value, but this is not required. An endpoint
with additional context or information, or deployed in a constrained with additional context or information, or deployed in a constrained
environment, can always use a different value. In particular, environment, can always use a different value. In particular,
information from previous connections, or sets of connections with a information from previous connections, or sets of connections with a
similar path, can already be used as context for such decisions (as similar path, can already be used as context for such decisions (as
noted in the core of this document). noted in the core of this document).
However, if a given IW value persistently causes packet loss during However, if a given IW value persistently causes packet loss during
the initial burst of packets, it is clearly inappropriate and could the initial burst of packets, it is clearly inappropriate and could
be inducing unnecessary loss in other competing connections. This be inducing unnecessary loss in other competing connections. This
might happen for sites behind very slow boxes with small buffers, might happen for sites behind very slow boxes with small buffers,
which may or may not be the first hop. which may or may not be the first hop.
C.3. Proposed IW Algorithm C.3. Proposed IW Algorithm
Below is a simple description of the proposed IW algorithm. It Below is a simple description of the proposed IW algorithm. It
relies on the following parameters: relies on the following parameters:
o MinIW = 3 MSS or 4,380 bytes (as per [RFC3390]) * MinIW = 3 MSS or 4,380 bytes (as per [RFC3390])
o MaxIW = 10 MSS (as per [RFC6928]) * MaxIW = 10 MSS (as per [RFC6928])
o MulDecr = 0.5 * MulDecr = 0.5
o AddIncr = 2 MSS * AddIncr = 2 MSS
o Threshold = 0.05 * Threshold = 0.05
We assume that the minimum IW (MinIW) should be as currently We assume that the minimum IW (MinIW) should be as currently
specified as standard [RFC3390]. The maximum IW can be set to a specified as standard [RFC3390]. The maximum IW (MaxIW) can be set
fixed value (we suggest using the experimental and now somewhat de- to a fixed value (we suggest using the experimental and now somewhat
facto standard in [RFC6928]) or set based on a schedule if trusted de facto standard in [RFC6928]) or set based on a schedule if trusted
time references are available [Al10]; here we prefer a fixed value. time references are available [Al10]; here, we prefer a fixed value.
We also propose to use an AIMD algorithm, with increase and We also propose to use an Additive Increase Multiplicative Decrease
decreases as noted. (AIMD) algorithm, with increase and decreases as noted.
Although these parameters are somewhat arbitrary, their initial Although these parameters are somewhat arbitrary, their initial
values are not important except that the algorithm is AIMD and the values are not important except that the algorithm is AIMD and the
MaxIW should not exceed that recommended for other systems on the MaxIW should not exceed that recommended for other systems on the
Internet (here we selected the current de-facto standard rather than Internet (here, we selected the current de facto standard rather than
the actual standard). Current proposals, including default current the actual standard). Current proposals, including default current
operation, are degenerate cases of the algorithm below for given operation, are degenerate cases of the algorithm below for given
parameters - notably MulDecr = 1.0 and AddIncr = 0 MSS, thus parameters, notably MulDecr = 1.0 and AddIncr = 0 MSS, thus disabling
disabling the automatic part of the algorithm. the automatic part of the algorithm.
The proposed algorithm is as follows: The proposed algorithm is as follows:
1. On boot: 1. On boot:
IW = MaxIW; # assume this is in bytes, and indicates an integer IW = MaxIW; # assume this is in bytes and indicates an integer
multiple of 2 MSS (an even number to support ACK compression) # multiple of 2 MSS (an even number to support
# ACK compression)
2. Upon starting a new connection: 2. Upon starting a new connection:
CWND = IW; CWND = IW;
conncount++; conncount++;
IWnotchecked = 1; # true IWnotchecked = 1; # true
3. During a connection's SYN-ACK processing, if SYN-ACK includes ECN 3. During a connection's SYN-ACK processing, if SYN-ACK includes ECN
(as similarly addressed in Sec 5 of ECN++ for TCP [Ba20]), treat (as similarly addressed in Section 5 of ECN++ for TCP [Ba20]),
as if the IW is too large: treat as if the IW is too large:
if (IWnotchecked && (synackecn == 1)) { if (IWnotchecked && (synackecn == 1)) {
losscount++; losscount++;
IWnotchecked = 0; # never check again IWnotchecked = 0; # never check again
} }
4. During a connection, if retransmission occurs, check the seqno of 4. During a connection, if retransmission occurs, check the seqno of
the outgoing packet (in bytes) to see if the resent segment fixes the outgoing packet (in bytes) to see if the re-sent segment
an IW loss: fixes an IW loss:
if (Retransmitting && IWnotchecked && ((seqno - ISN) < IW)) { if (Retransmitting && IWnotchecked && ((seqno - ISN) < IW)) {
losscount++; losscount++;
IWnotchecked = 0; # never do this entire "if" again IWnotchecked = 0; # never do this entire "if" again
} else { } else {
IWnotchecked = 0; # you're beyond the IW so stop checking IWnotchecked = 0; # you're beyond the IW so stop checking
} }
5. Once every 1000 connections, as a separate process (i.e., not as 5. Once every 1000 connections, as a separate process (i.e., not as
part of processing a given connection): part of processing a given connection):
if (conncount > 1000) { if (conncount > 1000) {
if (losscount/conncount > Threshold) { if (losscount/conncount > Threshold) {
# the number of connections with errors is too high # the number of connections with errors is too high
IW = IW * MulDecr; IW = IW * MulDecr;
} else { } else {
IW = IW + AddIncr; IW = IW + AddIncr;
} }
} }
As presented, this algorithm can yield a false positive when the As presented, this algorithm can yield a false positive when the
sequence number wraps around, e.g., the code might increment sequence number wraps around, e.g., the code might increment
losscount in step 4 when no loss occurred or fail to increment losscount in step 4 when no loss occurred or fail to increment
losscount when a loss did occur. This can be avoided using either losscount when a loss did occur. This can be avoided using either
PAWS [RFC7323] context or internal extended sequence number Protection Against Wrapped Sequences (PAWS) [RFC7323] context or
representations (as in TCP-AO [RFC5925]). Alternately, false internal extended sequence number representations (as in TCP
positives can be tolerated because they are expected to be Authentication Option (TCP-AO) [RFC5925]). Alternately, false
infrequent and thus will not significantly impact the algorithm. positives can be tolerated because they are expected to be infrequent
and thus will not significantly impact the algorithm.
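The five steps above can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not a reference implementation. The class and method names, the 1,460-byte MSS, the clamping of the IW to [MinIW, MaxIW], and the reset of the counters after each evaluation are assumptions that do not appear in the pseudocode. The sketch also assumes step 4 runs only when a retransmission occurs, and it compares against the IW that was in effect when the connection started.

```python
from dataclasses import dataclass

MSS = 1460             # assumed segment size in bytes
MIN_IW = 3 * MSS       # MinIW
MAX_IW = 10 * MSS      # MaxIW
MUL_DECR = 0.5         # MulDecr
ADD_INCR = 2 * MSS     # AddIncr
THRESHOLD = 0.05       # Threshold
SEQ_MOD = 2 ** 32      # TCP sequence numbers are 32 bits wide


@dataclass
class Connection:
    isn: int                     # initial sequence number
    iw: int                      # IW in effect when the connection started
    iw_not_checked: bool = True  # IWnotchecked


class AutoIW:
    def __init__(self) -> None:
        self.iw = MAX_IW         # step 1: on boot
        self.conncount = 0
        self.losscount = 0

    def open(self, isn: int) -> Connection:
        """Step 2: start a new connection with CWND = IW."""
        self.conncount += 1
        return Connection(isn=isn, iw=self.iw)

    def on_synack(self, conn: Connection, ecn_marked: bool) -> None:
        """Step 3: an ECN-marked SYN-ACK counts as an IW loss."""
        if conn.iw_not_checked and ecn_marked:
            self.losscount += 1
            conn.iw_not_checked = False

    def on_retransmit(self, conn: Connection, seqno: int) -> None:
        """Step 4: called only when a segment is retransmitted."""
        if not conn.iw_not_checked:
            return
        # Modular subtraction tolerates an ISN close to 2**32.
        offset = (seqno - conn.isn) % SEQ_MOD
        if offset < conn.iw:
            self.losscount += 1
        conn.iw_not_checked = False  # checked once, never again

    def evaluate(self) -> None:
        """Step 5: periodic adjustment, run outside packet processing."""
        if self.conncount <= 1000:
            return
        if self.losscount / self.conncount > THRESHOLD:
            self.iw = int(self.iw * MUL_DECR)
        else:
            self.iw = self.iw + ADD_INCR
        # Assumptions: clamp to [MinIW, MaxIW] and restart the counters.
        self.iw = max(MIN_IW, min(MAX_IW, self.iw))
        self.conncount = 0
        self.losscount = 0
```

The modular subtraction only handles an ISN near the top of the 32-bit space. A connection that sends more than 4 GB can still produce the wraparound false positive described above unless extended sequence numbers are kept.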
A number of additional constraints need to be imposed if this A number of additional constraints need to be imposed if this
mechanism is implemented to ensure that it defaults to values that mechanism is implemented to ensure that it defaults to values that
comply with current Internet standards, is conservative in how it comply with current Internet standards, is conservative in how it
extends those values, and returns to those values in the absence of extends those values, and returns to those values in the absence of
positive feedback (i.e., success). To that end, we recommend the positive feedback (i.e., success). To that end, we recommend the
following list of example constraints: following list of example constraints:
>> The automatic IW algorithm MUST initialize MaxIW to a value no * The automatic IW algorithm MUST initialize MaxIW to a value no larger
larger than the currently recommended Internet default, in the than the currently recommended Internet default in the absence of
absence of other context information. other context information.
Thus, if there are too few connections to make a decision or if Thus, if there are too few connections to make a decision or if
there is otherwise insufficient information to increase the IW, then there is otherwise insufficient information to increase the IW,
the MaxIW defaults to the current recommended value. then the MaxIW defaults to the current recommended value.
>> An implementation MAY allow the MaxIW to grow beyond the * An implementation MAY allow the MaxIW to grow beyond the currently
currently recommended Internet default, but not more than 2 segments recommended Internet default but not more than 2 segments per
per calendar year. calendar year.
Thus, if an endpoint has a persistent history of successfully Thus, if an endpoint has a persistent history of successfully
transmitting IW segments without loss, then it is allowed to probe transmitting IW segments without loss, then it is allowed to probe
the Internet to determine if larger IW values have similar success. the Internet to determine if larger IW values have similar
This probing is limited and requires a trusted time source, success. This probing is limited and requires a trusted time
otherwise the MaxIW remains constant. source; otherwise, the MaxIW remains constant.
>> An implementation MUST adjust the IW based on loss statistics at * An implementation MUST adjust the IW based on loss statistics at
least once every 1000 connections. least once every 1000 connections.
An endpoint needs to be sufficiently reactive to IW loss. An endpoint needs to be sufficiently reactive to IW loss.
>> An implementation MUST decrease the IW by at least one MSS when * An implementation MUST decrease the IW by at least 1 MSS when
indicated during an evaluation interval. indicated during an evaluation interval.
An endpoint that detects loss needs to decrease its IW by at least An endpoint that detects loss needs to decrease its IW by at least
one MSS, otherwise it is not participating in an automatic reactive 1 MSS; otherwise, it is not participating in an automatic reactive
algorithm. algorithm.
>> An implementation MUST increase by no more than 2 MSS per * An implementation MUST increase by no more than 2 MSSs per
evaluation interval. evaluation interval.
An endpoint that does not experience IW loss needs to probe the An endpoint that does not experience IW loss needs to probe the
network incrementally. network incrementally.
>> An implementation SHOULD use an IW that is an integer multiple of * An implementation SHOULD use an IW that is an integer multiple of
2 MSS. 2 MSSs.
The IW should remain a multiple of 2 MSS segments, to enable The IW should remain a multiple of 2 MSS segments to enable
efficient ACK compression without incurring unnecessary timeouts. efficient ACK compression without incurring unnecessary timeouts.
>> An implementation MUST decrease the IW if more than 95% of * An implementation MUST decrease the IW if more than 95% of
connections have IW losses. connections have IW losses.
Again, this is to ensure an implementation is sufficiently reactive. Again, this is to ensure an implementation is sufficiently
reactive.
>> An implementation MAY group IW values and statistics within * An implementation MAY group IW values and statistics within
subsets of connections. Such grouping MAY use any information about subsets of connections. Such grouping MAY use any information
connections to form groups except loss statistics. about connections to form groups except loss statistics.
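Several of the example constraints above bound how far a single evaluation may move the IW. The following sketch shows one way to apply them to a proposed new value. The function name and its parameters are hypothetical. The choices to round down to a multiple of 2 MSS and to let the MinIW floor take priority over that rounding are assumptions made for illustration.

```python
def constrain_iw_update(current_iw: int, proposed_iw: int, mss: int,
                        min_iw: int, max_iw: int) -> int:
    """Bound one evaluation-interval change to the IW (all values in bytes)."""
    if proposed_iw < current_iw:
        # A decrease must be at least 1 MSS.
        new_iw = min(proposed_iw, current_iw - mss)
    elif proposed_iw > current_iw:
        # An increase must be no more than 2 MSS.
        new_iw = min(proposed_iw, current_iw + 2 * mss)
    else:
        new_iw = current_iw
    # Keep the IW an integer multiple of 2 MSS (rounding down is assumed).
    step = 2 * mss
    new_iw = (new_iw // step) * step
    # Assumption: the MinIW floor and MaxIW ceiling win over the rounding.
    return max(min_iw, min(max_iw, new_iw))
```

With an MSS of 1,460 bytes, MinIW of 3 MSS, and MaxIW of 10 MSS, halving a 10 MSS window proposes 5 MSS, which this sketch rounds down to 4 MSS (5,840 bytes).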
There are some TCP connections which might not be counted at all, There are some TCP connections that might not be counted at all, such
such as those to/from loopback addresses, or those within the same as those to/from loopback addresses or those within the same subnet
subnet as that of a local interface (for which congestion control is as that of a local interface (for which congestion control is
sometimes disabled anyway). This may also include connections that sometimes disabled anyway). This may also include connections that
terminate before the IW is full, i.e., as a separate check at the terminate before the IW is full, i.e., as a separate check at the
time of the connection closing. time of the connection closing.
The period over which the IW is updated is intended to be a long The period over which the IW is updated is intended to be a long
timescale, e.g., a month or so, or 1,000 connections, whichever is timescale, e.g., a month or so, or 1,000 connections, whichever is
longer. An implementation might check the IW once a month, and longer. An implementation might check the IW once a month and simply
simply not update the IW or clear the connection counts in months not update the IW or clear the connection counts in months where the
where the number of connections is too small. number of connections is too small.
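The "whichever is longer" rule above can be read as requiring both conditions before an update. A minimal sketch of that reading follows. The function name, the 30-day default, and the use of "at least 1,000" connections are assumptions chosen for illustration.

```python
from datetime import timedelta


def should_evaluate(elapsed: timedelta, conncount: int,
                    min_period: timedelta = timedelta(days=30),
                    min_connections: int = 1000) -> bool:
    """True once both the time period and the connection count are reached."""
    return elapsed >= min_period and conncount >= min_connections
```

An implementation that checks monthly would call this once per check and leave both the IW and the connection counts unchanged when it returns False.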
C.4. Discussion C.4. Discussion
There are numerous parameters to the above algorithm that are There are numerous parameters to the above algorithm that are
compliant with the given requirements; this is intended to allow compliant with the given requirements; this is intended to allow
variation in configuration and implementation while ensuring that variation in configuration and implementation while ensuring that all
all such algorithms are reactive and safe. such algorithms are reactive and safe.
This algorithm continues to assume segments because that is the This algorithm continues to assume segments because that is the basis
basis of most TCP implementations. It might be useful to consider of most TCP implementations. It might be useful to consider revising
revising the specifications to allow byte-based congestion given the specifications to allow byte-based congestion given sufficient
sufficient experience. experience.
The algorithm checks for IW losses only during the first IW after a The algorithm checks for IW losses only during the first IW after a
connection start; it does not check for IW losses elsewhere the IW connection start; it does not check for IW losses elsewhere the IW is
is used, e.g., during slow-start restarts. used, e.g., during slow-start restarts.
>> An implementation MAY detect IW losses during slow-start restarts * An implementation MAY detect IW losses during slow-start restarts
in addition to losses during the first IW of a connection. In this in addition to losses during the first IW of a connection. In
case, the implementation MUST count each restart as a "connection" this case, the implementation MUST count each restart as a
for the purposes of connection counts and periodic rechecking of the "connection" for the purposes of connection counts and periodic
IW value. rechecking of the IW value.
False positives can occur during some kinds of segment reordering, False positives can occur during some kinds of segment reordering,
e.g., that might trigger spurious retransmissions even without a e.g., that might trigger spurious retransmissions even without a true
true segment loss. These are not expected to be sufficiently common segment loss. These are not expected to be sufficiently common to
to dominate the algorithm and its conclusions. dominate the algorithm and its conclusions.
This mechanism does require additional per-connection state, which This mechanism does require additional per-connection state, which is
is currently common in some implementations, and is useful for other currently common in some implementations and is useful for other
reasons (e.g., the ISN is used in TCP-AO [RFC5925]). The mechanism reasons (e.g., the ISN is used in TCP-AO [RFC5925]). The mechanism
also benefits from persistent state kept across reboots, as would be in this appendix also benefits from persistent state kept across
other state sharing mechanisms (e.g., TCP Control Block Sharing per reboots, which would also be useful to other state sharing mechanisms
the main body of this document). (e.g., TCP Control Block Sharing per the main body of this document).
The receive window (rwnd) is not involved in this calculation. The The receive window (rwnd) is not involved in this calculation. The
size of rwnd is determined by receiver resources and provides space size of rwnd is determined by receiver resources and provides space
to accommodate segment reordering. It is not involved with to accommodate segment reordering. Also, rwnd is not involved with
congestion control, which is the focus of this document and its congestion control, which is the focus of the way this appendix
management of the IW. manages the IW.
C.5. Observations C.5. Observations
The IW may not converge to a single, global value. It also may not The IW may not converge to a single global value. It also may not
converge at all, but rather may oscillate by a few MSS as it converge at all but rather may oscillate by a few MSSs as it
repeatedly probes the Internet for larger IWs and fails. Both repeatedly probes the Internet for larger IWs and fails. Both
properties are consistent with TCP behavior during each individual properties are consistent with TCP behavior during each individual
connection. connection.
This mechanism assumes that losses during the IW are due to IW size. This mechanism assumes that losses during the IW are due to IW size.
Persistent errors that drop packets for other reasons - e.g., OS Persistent errors that drop packets for other reasons, e.g., OS bugs,
bugs, can cause false positives. Again, this is consistent with can cause false positives. Again, this is consistent with TCP's
TCP's basic assumption that loss is caused by congestion and basic assumption that loss is caused by congestion and requires
requires backoff. This algorithm treats the IW of new connections as backoff. This algorithm treats the IW of new connections as a long-
a long-timescale backoff system. timescale backoff system.
Acknowledgments
The authors would like to thank Praveen Balasubramanian for
information regarding TCB sharing in Windows; Christoph Paasch for
information regarding TCB sharing in Apple OSs; Yuchung Cheng, Lars
Eggert, Ilpo Jarvinen, and Michael Scharf for comments on earlier
draft versions of this document; as well as members of the TCPM WG.
Earlier revisions of this work received funding from a collaborative
research project between the University of Oslo and Huawei
Technologies Co., Ltd. and were partly supported by USC/ISI's Postel
Center.
Authors' Addresses
Joe Touch
Manhattan Beach, CA 90266
United States of America
Phone: +1 (310) 560-0334
Email: touch@strayalpha.com
Michael Welzl
University of Oslo
PO Box 1080 Blindern
N-0316 Oslo
Norway
Phone: +47 22 85 24 20
Email: michawe@ifi.uio.no
Safiqul Islam
University of Oslo
PO Box 1080 Blindern
Oslo N-0316
Norway
Phone: +47 22 84 08 37
Email: safiquli@ifi.uio.no
 End of changes. 316 change blocks. 
1057 lines changed or deleted 1009 lines changed or added
