Transmission Control Protocol (TCP)

The Transmission Control Protocol (TCP) is a connection-oriented reliable protocol. It provides a reliable transport service between pairs of processes executing on End Systems (or endpoints) using the network layer service provided by the IP protocol.

TCP provides a reliable, flow-controlled transport service. This is much more complex than the service provided by UDP, which offers only a Best Effort service. To implement the service, TCP uses a number of protocol timers that ensure reliable and synchronised communication between the two End Systems.

For most networks, approximately 90% of current traffic uses the TCP transport service. This is changing slowly, but TCP remains a critical network protocol. It is used by applications such as telnet, the World Wide Web (WWW), ftp and electronic mail. The transport header contains a Service Access Point (SAP), the port number, which indicates the protocol that is being used (e.g. 23 = Telnet; 25 = Mail; 69 = TFTP; 80 = WWW (http)). The port numbers associated with these services generally have the same value as those used for UDP services (a full list of all port numbers is provided in the reference at the end of this page).

TCP providing reliable data for the file transfer protocol over an IP network using Ethernet

Each TCP connection starts with one endpoint sending a SYN packet and the other replying with a SYN-ACK packet. A final ACK is sent in response to the SYN-ACK. This agrees the initial sequence number to be used and also allows the two endpoints to negotiate any parameters that they wish to use. The figure below shows this exchange.


--SYN-->


<--SYN+ACK--


--ACK-->

The exchange of SYN and ACK packets at the start of a TCP connection

Transmission of Data

TCP is stream-oriented, that is, TCP protocol entities exchange streams of data. Once a connection has been set up, data is sent in TCP Segments, the Protocol Data Units of the transport layer. Each TCP Segment is sent in the payload of a single IP packet.

When all data has been sent, the endpoint sends a FIN packet to terminate (close) the connection.

A TCP Sender takes data (e.g. created by an application and sent via the Sockets programming interface), and creates TCP Segments. The maximum size of a TCP Segment is defined by the TCP Maximum Segment Size (MSS). Hence if the sender has D bytes of data to send, this results in D/MSS TCP Segments, rounded up to the next whole number. The last TCP Segment carries any remaining data and is often smaller than the MSS.
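The segmentation rule above can be sketched in a few lines of Python. This is only an illustration: the function name segment_sizes and the example values (10,000 bytes of data, an MSS of 1410 bytes) are assumptions chosen for this page, not part of any TCP implementation.

```python
def segment_sizes(data_len, mss):
    """Return the payload size (in bytes) of each TCP Segment needed for data_len bytes."""
    if data_len < 0 or mss <= 0:
        raise ValueError("data_len must be >= 0 and mss must be > 0")
    full_segments, remainder = divmod(data_len, mss)
    sizes = [mss] * full_segments      # full-sized Segments
    if remainder:
        sizes.append(remainder)        # the last Segment carries the remaining data
    return sizes


# Hypothetical example: D = 10,000 bytes and MSS = 1410 bytes.
sizes = segment_sizes(10_000, 1410)
print(len(sizes), "segments; last segment carries", sizes[-1], "bytes")
```

The number of Segments is D/MSS rounded up, and only the final Segment can be smaller than the MSS.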

A sliding window controls the maximum number of in-flight TCP Segments at any one time. When the sender has sent all the segments that are allowed, it stops sending and waits. The TCP Sender keeps a copy of the data sent in each TCP Segment, in case it needs to resend it (i.e. it retransmits the data later if it detects loss).

Note: All protocol fields in the TCP packet header, and the internal variables in TCP, work in terms of bytes in the stream. Hence, cwnd is also measured in bytes. However, it is often helpful to ignore the bytes and simply talk of Segments being full-sized and numbered 1, 2, 3, etc., rather than 1410 bytes, 2820 bytes, 4230 bytes, etc. This simplification is used in the descriptions on this page.

Flow Control

When a TCP Receiver acknowledges received TCP Segments, it passes the received data to the application at the receiver. If that application is busy, the TCP Receiver may build a queue of data waiting to be delivered. When this queue reaches a threshold, the receiver can use the receiver window (rwin) field in the TCP header of an ACK packet to tell the sender to stop sending new Segments.

Controlling the Maximum Sending Rate

The TCP Sender uses a window, called the congestion window, cwnd, to control the amount of data that it sends before it needs to receive an acknowledgement packet from the remote receiver. The maximum rate at which a TCP transport can send at any time is calculated by:
Transport layer packet rate = (cwnd)/(MSS x RTT)

Where:
cwnd = the sender congestion window
MSS = the Maximum Segment Size (the largest size of TCP Segment that can be sent in the IP payload)
RTT = the observed round trip time for the path between the sender and receiver.
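As a worked example of the formula above, the following Python sketch computes the maximum segment rate for one set of values. The function name and the numbers (a cwnd of four 1410-byte Segments and an RTT of 100 ms) are illustrative assumptions only.

```python
def max_segment_rate(cwnd_bytes, mss_bytes, rtt_seconds):
    """Maximum transport-layer packet rate in Segments per second: cwnd / (MSS x RTT)."""
    return cwnd_bytes / (mss_bytes * rtt_seconds)


# Hypothetical values: cwnd = 4 full-sized Segments of 1410 bytes, RTT = 100 ms.
rate = max_segment_rate(cwnd_bytes=4 * 1410, mss_bytes=1410, rtt_seconds=0.1)
print(f"about {rate:.0f} segments per second")
```

With these values, four Segments can be sent in each 100 ms round trip, which is about 40 Segments per second.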


--Segment(1),Segment(2),Segment(3),Segment(4)-->


<--ACK(2),ACK(4)--


Each received ACK covers two received segments allowing new data to be sent


--Segment(5),Segment(6),Segment(7),Segment(8)-->


<--ACK(6),ACK(8)--


--FIN-->


<--ACK--


<--FIN--


--ACK-->

The transmission of 8*MSS Data in 8 TCP Segments and their acknowledgment

A TCP Receiver sends packets back to the TCP Sender when it receives TCP Segments. These are called ACK packets, although in reality they are simply a TCP header that may or may not also be followed by data. Each ACK carries an acknowledgement number corresponding to the last TCP Segment that was received in sequence. When a Segment is lost, the ACK number stays at the value of the last in-sequence TCP Segment received.

All TCP acknowledgment values are cumulative, i.e. an ACK packet acknowledges reception of all the segments up to the acknowledgement number. A receiver therefore does not necessarily need to send an ACK packet for every received TCP Segment. It is common to send an ACK packet only when two full-sized segments have been received. This halves the number of ACK packets.
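The cumulative acknowledgment behaviour described above can be sketched as a small receiver model in Python. This is a simplification using the segment numbering of this page: the function name acks_for is hypothetical, and the rule that a segment arriving out of order triggers an immediate ACK is an assumption based on common TCP receiver behaviour rather than something stated above.

```python
def acks_for(arrivals, ack_every=2):
    """Return the cumulative ACK numbers a receiver sends for segments arriving in this order."""
    received = set()
    highest_in_seq = 0   # last segment received in sequence
    pending = 0          # in-sequence segments not yet covered by an ACK
    acks = []
    for seg in arrivals:
        received.add(seg)
        before = highest_in_seq
        while highest_in_seq + 1 in received:
            highest_in_seq += 1
        if highest_in_seq - before != 1:
            # Out-of-order arrival (or a gap being filled): ACK at once (assumed behaviour).
            acks.append(highest_in_seq)
            pending = 0
        else:
            pending += 1
            if pending >= ack_every:
                acks.append(highest_in_seq)   # one ACK covers ack_every segments
                pending = 0
    return acks


print(acks_for([1, 2, 3, 4, 5, 6, 7, 8]))   # no loss
print(acks_for([1, 2, 4, 5, 6]))            # segment 3 lost in the network
```

In the second call, segment 3 never arrives, so the acknowledgment number stays at 2 for every later segment.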

When there is no more data to send, the sender and receiver each set the FIN flag in the TCP header. A FIN is sent by both the sender and receiver, because TCP allows segments to be sent in either direction.

Loss Detection and Retransmission

When the TCP Sender receives an ACK packet, it knows that the acknowledged Segment has completed its journey across the network path, and that it will never need to retransmit that TCP Segment. This usually allows the TCP Sender to send new data.

TCP uses a timer to ensure each Segment is actually delivered successfully to the receiver. Each time the TCP Sender sends a Segment and its retransmission timer is not currently running, it starts the timer. The timeout value is set to the longest time that the TCP Sender expects to wait for the receiver to respond. It stops this timer when it receives an acknowledgement for a previously sent segment, but it then restarts the timer if there are still segments that have been sent and have not yet been acknowledged. If the timer ever expires, the sender deduces that packets might have been lost. It then retransmits the data in the oldest unacknowledged Segment, sending an extra copy of the TCP Segment. Afterwards, it restarts the timer, to ensure that if the retransmission fails, the retransmitted Segment is itself retransmitted.


cwnd=4

--Segment(1),Segment(2),Segment(3),X-->


<--ACK(2),ACK(3)--


Timeout, and retransmission starting at Segment 4


--Segment(4),Segment(5),Segment(6)-->


...

The re-transmission of TCP Segment 4 after a timeout showing that Segment 4 was not received at the TCP Receiver. This lost segment is denoted X in the above figure.
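The timer rules described above can be sketched as a small Python class. This is a minimal model, not a real TCP implementation: the class and method names are hypothetical, time is passed in explicitly as integer milliseconds, and the timeout is a fixed value (real stacks derive it from RTT measurements).

```python
class RetransmissionTimer:
    """Minimal model of one retransmission timer shared by all Segments in flight."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.expires_at = None   # None means the timer is not running
        self.unacked = []        # segment numbers sent but not yet acknowledged, oldest first

    def on_send(self, seg, now_ms):
        self.unacked.append(seg)
        if self.expires_at is None:              # start the timer only if it is not running
            self.expires_at = now_ms + self.timeout_ms

    def on_ack(self, ack, now_ms):
        remaining = [s for s in self.unacked if s > ack]
        if len(remaining) < len(self.unacked):   # the ACK covered new data
            self.unacked = remaining
            # Restart while Segments remain unacknowledged, otherwise stop the timer.
            self.expires_at = now_ms + self.timeout_ms if remaining else None

    def on_tick(self, now_ms):
        """Return the segment number to retransmit if the timer has expired, else None."""
        if self.expires_at is not None and now_ms >= self.expires_at:
            self.expires_at = now_ms + self.timeout_ms   # guard the retransmission too
            return self.unacked[0]                       # oldest unacknowledged Segment
        return None


# Replay of the figure above: Segments 1-4 are sent and Segment 4 is lost.
timer = RetransmissionTimer(timeout_ms=1000)
for seg in (1, 2, 3, 4):
    timer.on_send(seg, now_ms=0)
timer.on_ack(2, now_ms=100)
timer.on_ack(3, now_ms=200)
print(timer.on_tick(now_ms=1000))   # timer restarted at 200 ms, so not yet expired
print(timer.on_tick(now_ms=1200))   # expired: Segment 4 is retransmitted
```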

A TCP Sender also uses the Fast Retransmit and Fast Recovery methods to detect lost Segments. This works by observing the acknowledgement value in each received ACK. Since the sender knows that the TCP Receiver sends ACKs when it receives new data, it assumes that when the acknowledgement number contained in the ACK does not increase (i.e. the receiver confirms delivery of the same sequence number, rather than new data), this is a sign that Segments have been lost. When three duplicate ACKs are received, this is used as a trigger for retransmission, without waiting for the timer to expire.


cwnd=8


--Segment(1),Segment(2),X,Segment(4),Segment(5),Segment(6),Segment(7),Segment(8)-->


<--ACK(2),ACK(2),ACK(2)--
    

Fast Retransmit and Recovery for Segment 3


After detecting loss, the cwnd is halved and the result is saved in SSthresh: cwnd=8/2=4

SSthresh=cwnd=4


--Segment(3), ...-->


...

The re-transmission of TCP Segment 3 using Fast Retransmit and Fast Recovery. This lost segment is denoted X in the above figure.
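The duplicate-ACK rule illustrated in the figure above can be sketched in Python as follows. This is a simplified model in units of Segments: the class name is hypothetical, the lower bound of two Segments on SSthresh is an assumption taken from common practice, and the temporary window inflation that real Fast Recovery performs is left out.

```python
DUP_ACK_THRESHOLD = 3   # three duplicate ACKs trigger a retransmission


class FastRetransmitSender:
    """Counts duplicate ACKs and halves cwnd when loss is inferred (units: Segments)."""

    def __init__(self, cwnd):
        self.cwnd = cwnd
        self.ssthresh = float("inf")
        self.last_ack = 0
        self.dup_acks = 0

    def on_ack(self, ack):
        """Return the segment number to retransmit, or None."""
        if ack > self.last_ack:          # new data acknowledged
            self.last_ack = ack
            self.dup_acks = 0
            return None
        self.dup_acks += 1               # same acknowledgment number seen again
        if self.dup_acks == DUP_ACK_THRESHOLD:
            self.ssthresh = max(self.cwnd // 2, 2)   # assumed floor of 2 Segments
            self.cwnd = self.ssthresh
            return self.last_ack + 1     # the first Segment the receiver is missing
        return None


sender = FastRetransmitSender(cwnd=8)
# ACK(2) arrives once as a normal ACK, then three more times as duplicates.
results = [sender.on_ack(a) for a in (2, 2, 2, 2)]
print(results, sender.cwnd, sender.ssthresh)
```

Only the third duplicate returns a segment number (Segment 3), and both cwnd and SSthresh end at 4, matching the figure.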

Note: Modern TCP stacks also implement a range of more sophisticated mechanisms, including Selective ACK (SACK), Tail Loss Probe (TLP), Proportional Rate Reduction (PRR) and many other improvements that increase processing efficiency or transfer rate, or reduce end-to-end network latency.

The Sending Rate and Network Congestion

Transports not only need to work reliably, they must also avoid starving the other flows that share the resources (e.g. packet buffers) along the path that they use. Unlike a hub/repeater, which processes each Ethernet frame one at a time, a bridge or router contains a buffer that can be used to hold packets/frames when the arrival rate at the ingress interface exceeds the rate at the sending interface. Hence, these network devices buffer data internally, and the buffers available have a finite size.

When the capacity of a buffer is exceeded, the network device will silently discard (drop) packets/frames, and it is said to be congested. Devices could be implemented to drop packets from the front or the back of the queue (or potentially to use some other algorithm). The device does not notify the sender that a packet has been lost. (That would also consume network capacity!)

A router that has a full buffer will therefore discard excess packets. This includes packets sent by control protocols, such as routing and other operational protocols essential to the proper working of the network. The loss of control packets could result in congestion collapse, a case where a router can do no useful work and queues/delays build to a maximum. This can be avoided by using congestion control, and is often prevented by using a separate set of buffers for only the network control packets. This special queue is often assigned a higher priority, so the network control packets are sent before any other buffered packets.

Congestion Control

At the start of a connection, a TCP sender has no knowledge of the capacity of a path, and therefore of a suitable rate at which to send. The TCP congestion window (cwnd) is the key tool to control transmission, and it is automatically adapted by an algorithm known as congestion control. To fully utilise the capacity along a path with a certain RTT, the transport needs to determine an appropriate volume of bytes in flight, based on the product of the available capacity and the path RTT. The resulting sending rate is determined by the congestion controller function adapting the cwnd.

In TCP, the congestion controller uses two complementary functions: Slow Start (SS) and Congestion Avoidance (CA).
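The product of capacity and RTT mentioned above can be illustrated with a short calculation in Python. The function name and the path values (10 Mbit/s of available capacity, a 100 ms RTT and an MSS of 1410 bytes) are hypothetical and chosen only for this example.

```python
def bytes_in_flight_needed(capacity_bits_per_s, rtt_seconds):
    """Volume of data (bytes) that must be in flight to fill the path: capacity x RTT."""
    return capacity_bits_per_s * rtt_seconds / 8   # divide by 8 to convert bits to bytes


# Hypothetical path: 10 Mbit/s available capacity and a 100 ms RTT.
target_cwnd = bytes_in_flight_needed(10_000_000, 0.1)
print(f"target cwnd is about {target_cwnd:.0f} bytes, "
      f"or about {target_cwnd / 1410:.1f} full-sized segments")
```

A cwnd much smaller than this value leaves capacity unused, while a much larger one fills buffers along the path and risks loss.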

TCP Slow Start was designed to begin at a conservative rate, but to quickly ramp up the rate (cwnd) to fill the available capacity. In Slow Start, the TCP sender increases its congestion window by one segment for each segment that is acknowledged, and it continues to do this while there are no packet losses. This roughly doubles the cwnd each RTT, leading to exponential growth of the sending rate.


cwnd=4; SSthresh = infinity

--Segment(1),Segment(2),Segment(3),Segment(4)-->


<--ACK(2),ACK(4)--


After receiving these acknowledgements covering 4 received segments, cwnd=4+4=8
                

--Segment(5),Segment(6),Segment(7),Segment(8),Segment(9),Segment(10),Segment(11),Segment(12)-->


<--ACK(6),ACK(8),...,ACK(12)--

The transmission of 12*MSS Data in 12 TCP Segments and their acknowledgment

In the figure above, each segment acknowledged in the slow start phase results in the cwnd being increased by one segment. This allows two packets to be sent (one because the ACK packet arrived, and one because the cwnd was increased). The TCP specification recommends sending an ACK packet for every other received full-sized TCP Segment. Each ACK packet that covers two segments (e.g. ACK(2) cumulatively acknowledges all the bytes carried in Segment(1) and Segment(2)) acknowledges the successful transmission of two segments and also increases the cwnd by two. Hence, it allows four new TCP Segments to be sent.

The Slow Start phase stops when the cwnd reaches the slow start threshold, SSthresh. This is initially set to a pre-defined constant, but if congestion is detected, it is set to half of the current cwnd. The sender then enters the Congestion Avoidance phase.
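The growth of cwnd during Slow Start can be sketched per round trip in Python. This is a simplified model in units of Segments that assumes no loss and that every Segment sent in one RTT is acknowledged within that RTT; the function name is hypothetical, and growth simply stops at SSthresh because Congestion Avoidance is not modelled here.

```python
def slow_start_rounds(initial_cwnd, ssthresh, rounds):
    """Return cwnd (in Segments) at the start of each RTT during Slow Start, assuming no loss."""
    cwnd = initial_cwnd
    history = [cwnd]
    for _ in range(rounds):
        acked = cwnd                        # every Segment sent this RTT is acknowledged
        cwnd = min(cwnd + acked, ssthresh)  # +1 Segment per Segment acknowledged
        history.append(cwnd)
    return history


print(slow_start_rounds(4, float("inf"), 3))   # doubling each RTT
print(slow_start_rounds(4, 6, 2))              # growth stops at SSthresh = 6
```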

In the CA phase, the cwnd is grown linearly, as seen in the following figure. That is, each segment acknowledged increases the cwnd by MSS*MSS/cwnd bytes (with cwnd measured in bytes), which is 1/cwnd of a segment when cwnd is counted in segments. Put another way, this results in an increase of cwnd by one full segment size each time a cwnd of data is acknowledged.

As in the slow start phase, if a timeout occurs in congestion avoidance, the SSthresh is set to half the cwnd and the window size is set to the restart window. The connection then returns to the slow start phase.

The goal of the CA phase is to keep the flow in equilibrium, or put another way, the congestion controller seeks to maintain the cwnd close to the available capacity of the bottleneck link along the path between the sender and the receiver. CA has two functions: seeking available capacity and reacting to congestion. This results in a pattern with cycles of increasing the cwnd slowly and decreasing it rapidly. The term Additive Increase Multiplicative Decrease, AIMD, is used to describe this pattern.
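The linear growth in Congestion Avoidance can be checked numerically with a short Python sketch. It assumes that cwnd is held in bytes and that the per-segment increase is MSS*MSS/cwnd bytes, which is the form commonly given for TCP congestion control; the function name and the example window of six 1410-byte Segments are illustrative assumptions.

```python
def ca_increase(cwnd_bytes, mss):
    """Return the new cwnd (bytes) after one Segment is acknowledged in Congestion Avoidance."""
    return cwnd_bytes + mss * mss / cwnd_bytes


MSS = 1410                  # hypothetical Segment size in bytes
cwnd = 6 * MSS              # start with a window of six full-sized Segments
start = cwnd
for _ in range(6):          # one window (six Segments) of data is acknowledged
    cwnd = ca_increase(cwnd, MSS)

print(f"cwnd grew by {cwnd - start:.0f} bytes, i.e. {(cwnd - start) / MSS:.2f} of a segment")
```

The growth over one window is slightly less than one full Segment, because cwnd itself increases a little with each acknowledgment.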
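The AIMD pattern can be illustrated with a toy model in Python. Everything here is an assumption made for illustration: cwnd is counted in Segments, the path is given a fixed capacity in Segments per RTT, loss is taken to occur in any RTT where cwnd exceeds that capacity, and the function name is hypothetical.

```python
def aimd_trace(cwnd, capacity, rtts):
    """Return cwnd (in Segments) at the start of each RTT under a toy AIMD model."""
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd > capacity:            # congestion: loss is detected in this RTT
            cwnd = max(cwnd // 2, 1)   # multiplicative decrease
        else:
            cwnd += 1                  # additive increase: one Segment per RTT
    return trace


print(aimd_trace(cwnd=6, capacity=8, rtts=8))
```

The trace rises by one Segment per RTT, is halved once the window exceeds the assumed capacity, and then begins to climb again.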

The following figure shows the result when cwnd is greater than or equal to SSthresh. This considers an example where a connection has an SSthresh of 6 segments, which would normally be a result of previously detecting congestion when cwnd was 12. For ease of reading, the first packet sent is numbered segment 1, and 12 segments are considered. Segments 1-4 are sent, and transmission pauses because the initial cwnd was only 4. An ACK for the first 2 segments allows two new segments to be sent and also causes the sender to grow the cwnd by two, one for each segment acknowledged.

After this first acknowledgment is received, the sender enters CA. Each received ACK allows a new segment to be sent. The cwnd does not allow an additional segment to be sent until a further cwnd of segments has been acknowledged (i.e. acknowledgments covering 6 received segments). Reception of the ACK packet for segment 7 allows the sender to then increase cwnd by one additional segment (i.e., cwnd=6+1=7). By this time, however, no more data is pending transmission in this example, and the sender simply waits for all acknowledgments to be received.


cwnd=4; SSthresh = 6 

--Segment(1),Segment(2),Segment(3),Segment(4)-->


<--ACK(2),ACK(4)--


cwnd is less than the value of SSthresh, so Slow Start is used to increase cwnd by 2 for ACK(2)


cwnd=4+2=6 (6 new Segments can be sent)
                

--Segment(5),Segment(6),Segment(7),Segment(8),Segment(9),Segment(10)-->


<--ACK(6)--
                

2 more acknowledged segments allow 2 new segments, and only two remain to be sent:
                

--Segment(11),Segment(12)-->


<--ACK(8),ACK(10)--


cwnd=6+1=7 (6 acknowledged segments cause cwnd to increase by 1 MSS in CA)


<--ACK(12)--

The transmission of 12*MSS Data in 12 TCP Segments when SSthresh is set to 6 Segments

TCP Senders must treat a loss of all feedback (e.g., no acknowledgments received for several RTTs) as an indication of potential congestion collapse, and must therefore stop sending new data and restart Slow Start. If a sender experiences a timeout because, after several RTTs, it still has not received an acknowledgement for any of the segments it has sent, the sender also adjusts the SSthresh to reflect the current value of the cwnd. It then re-initialises the cwnd to the value of the restart window (usually the same as the initial cwnd). This causes the congestion controller to re-enter the Slow Start phase and exponentially grow the cwnd each RTT until it reaches the new SSthresh (i.e. a value below the capacity that the flow was using at the last timeout/loss).

Note: Senders should actually use the Flight_Size, rather than cwnd, to adjust SSthresh after a loss.
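The reaction to a timeout described above can be sketched in Python. This is an illustration in units of Segments: the function name is hypothetical, the floor of two Segments on SSthresh is an assumption taken from common practice, and the example values (12 Segments in flight and a restart window of 4) are chosen to match the earlier example on this page.

```python
def on_timeout(flight_size, restart_window):
    """Return (new_cwnd, new_ssthresh) in Segments after a retransmission timeout."""
    new_ssthresh = max(flight_size // 2, 2)   # half the data in flight, assumed floor of 2
    new_cwnd = restart_window                 # Slow Start begins again from this window
    return new_cwnd, new_ssthresh


# Hypothetical example: 12 Segments were in flight when the timer expired.
cwnd, ssthresh = on_timeout(flight_size=12, restart_window=4)
print(f"cwnd={cwnd}, SSthresh={ssthresh}")
```

The result, a cwnd of 4 and an SSthresh of 6, is the starting state of the Congestion Avoidance example shown earlier.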

See also:

UDP (and description of how ports work)

Example Packet Decodes
Congestion Control

Some Standards Documents:

TCP is defined by a number of RFCs published by the Internet Society.

Jacobson, V., Braden, R., and D. Borman, "TCP Extensions for High Performance", RFC 1323, May 1992.

Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP Selective Acknowledgment Options", RFC 2018, October 1996.

Allman, M., Paxson, V., and W. Stevens, "TCP Congestion Control", RFC 2581, April 1999.

Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's Initial Window", RFC 3390, October 2002.

Floyd, S., Henderson, T., and A. Gurtov, "The NewReno Modification to TCP's Fast Recovery Algorithm", RFC 3782, April 2004.

Braden, R., Borman, D., and C. Partridge, "Computing the Internet Checksum", RFC 1071, September 1988.

Duke, M., Braden, R., Eddy, W., and E. Blanton, "A Roadmap for Transmission Control Protocol (TCP) Specification Documents", RFC 4614, September 2006.

Eddy, W., Ed., "Transmission Control Protocol (TCP)", STD 7, RFC 9293, August 2022.

See also:

List of Assigned TCP Port Numbers

Gorry Fairhurst - Date: 16/01/2023