06. HTTP Semantics and Syntax
HTTP is very human-readable, which means you can build a server by looking at examples instead of the specification. However, this approach results in buggy toy code, and you won’t learn much. So you need to consult the specification — a series of RFC documents.
6.1 High-Level Structures
Let’s review what you already know from the introductory chapter:
- An HTTP request message consists of:
  - The method, which is a verb such as GET or POST.
  - The URI.
  - A list of header fields, which is a list of key-value pairs.
  - A payload body, which follows the request header. Special case: GET and HEAD have no payload.
- An HTTP response consists of:
  - A status code, mostly to indicate whether the request was successful.
  - A list of header fields.
  - An optional payload body.
These things are mostly the same from HTTP/1.0 to HTTP/3.
HTTP/1.0 200 OK
Age: 525410
Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Thu, 20 Oct 2020 11:11:11 GMT
Etag: "1234567890+gzip+ident"
Last-Modified: Thu, 20 Oct 2019 11:11:11 GMT
Vary: Accept-Encoding
Content-Length: 1256
Connection: close

<!doctype html>
<!-- omitted -->
6.2 Content-Length
HTTP semantics are mostly about interpreting header fields, which is described in RFC 9110. Try reading it yourself.
The most important header fields are Content-Length and Transfer-Encoding, because they determine the length of an HTTP message, which is the most important function of a protocol.
The Length of the HTTP Header
Both a request and a response consist of 2 parts: header + body. They are separated by an empty line. A line ends with '\r\n'. So the header ends with '\r\n\r\n', including the empty line. That’s how we determine the length of the header.
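The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `split_header` is made up for this example, and it assumes the received bytes are accumulated in a single buffer.

```python
HEADER_END = b"\r\n\r\n"  # last header line's CRLF + the empty line

def split_header(buf: bytes):
    """Return (header, rest) if buf holds a complete header, else None."""
    idx = buf.find(HEADER_END)
    if idx < 0:
        return None  # incomplete: the caller should read more data
    end = idx + len(HEADER_END)
    return buf[:end], buf[end:]
```

Returning `None` for an incomplete header lets the caller keep reading from the socket and try again with a larger buffer.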
The Length of the HTTP Body
The length of the body is complicated because there are 3 ways to determine it. The first way is to use Content-Length, which contains the length of the body.
Some ancient HTTP/1.0 software doesn’t use Content-Length, so the body is just the rest of the connection data: the parser reads the socket to EOF and that’s the body. This is the second way to determine the body length. This way is problematic because you cannot tell if the connection ended prematurely.
6.3 Chunked Transfer Encoding
Generate and Send Data on the Fly
The third way is to use Transfer-Encoding: chunked instead of Content-Length. This is called chunked transfer encoding. It can mark the end of the payload without knowing its size in advance.
This allows the server to send the response while generating it on the fly. This use case is called streaming. An example is displaying real-time logs to the client without waiting for the process to finish.
Another Layer of Protocol
How does chunked encoding work? As the sender, we don’t know the total payload length, but we do know the portion of the payload we have. So we can send it in a mini-message format called a “chunk”. And a special chunk marks the end of the stream.
The receiver parses the byte stream into chunks and consumes the data, until the special chunk is received. Here is a concrete example:
4\r\nHTTP\r\n6\r\nserver\r\n0\r\n\r\n
It is parsed into 3 chunks:
4\r\nHTTP\r\n
6\r\nserver\r\n
0\r\n\r\n
You can easily guess how this works. Chunks start with the size of the data, and a 0-sized chunk marks the end of the stream.
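Here is a minimal sketch of the sending side in Python. The function names are made up for this example. One detail comes from RFC 9112 rather than from the example above: the chunk size is written in hexadecimal.

```python
def encode_chunk(data: bytes) -> bytes:
    # chunk = size in hexadecimal, CRLF, the data, CRLF
    return b"%x\r\n" % len(data) + data + b"\r\n"

def encode_chunked(parts):
    """Turn an iterable of byte strings into a stream of chunks."""
    for part in parts:
        if part:  # skip empty parts: a 0-sized chunk would end the stream
            yield encode_chunk(part)
    yield b"0\r\n\r\n"  # the special chunk that marks the end
```

Because `encode_chunked` is a generator, each chunk can be written to the socket as soon as the corresponding piece of the payload is produced.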
Chunks Are Not Messages
Note that the chunk data boundaries are just side effects. These chunks are not represented to the application as individual messages; the application still sees the payload as a byte stream.
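The receiving side can be sketched like this. It is a simplified illustration, not a complete parser: `decode_chunked` is a made-up name, and it assumes the whole chunked body is already in memory and ignores chunk extensions and trailer fields.

```python
def decode_chunked(buf: bytes) -> bytes:
    """Decode a complete chunked body into the payload bytes."""
    body = bytearray()
    pos = 0
    while True:
        eol = buf.find(b"\r\n", pos)
        if eol < 0:
            raise ValueError("incomplete chunk size line")
        # The size is hexadecimal. Note: int() is more lenient than the
        # RFC grammar; a real parser should validate the digits itself.
        size = int(buf[pos:eol].split(b";")[0], 16)
        pos = eol + 2
        if size == 0:
            return bytes(body)  # the 0-sized chunk ends the stream
        data_end = pos + size
        if buf[data_end:data_end + 2] != b"\r\n":
            raise ValueError("chunk data is not followed by CRLF")
        body += buf[pos:data_end]
        pos = data_end + 2
```

The function returns one byte string; the chunk boundaries are gone, which is exactly how the application is supposed to see the payload.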
6.4 Ambiguities in HTTP
The Happy Cases of Body Length
In summary, this is how to determine the length of the payload body (if the HTTP method allows the payload body, e.g., POST or PUT).
- If Transfer-Encoding: chunked is present, parse chunks.
- If Content-Length: number is valid, the length is known.
- If neither field is present, use the rest of the connection data as the payload. Per RFC 9112, this applies to responses only; a request with neither field has no body.
There are also special cases, such as GET and HEAD, and the 304 (Not Modified) status code, which make HTTP not easy to implement.
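The happy cases can be sketched as a small decision function. Everything here is illustrative: `body_framing` and its return values are made up, `headers` is assumed to map lowercased field names to string values, and the special cases (GET, HEAD, 304) are left out. Two choices are assumptions of this sketch: it rejects a message that carries both fields instead of guessing, and it follows the RFC 9112 rule that only a response is read to EOF, while a request with neither field has no body.

```python
def body_framing(headers: dict, is_response: bool):
    """Return ("chunked", None), ("length", n), or ("eof", None)."""
    te = headers.get("transfer-encoding")
    cl = headers.get("content-length")
    if te is not None and cl is not None:
        # Assumption of this sketch: refuse instead of picking one field.
        raise ValueError("both Transfer-Encoding and Content-Length")
    if te is not None:
        if te.strip().lower() != "chunked":
            raise ValueError("unsupported Transfer-Encoding")
        return ("chunked", None)
    if cl is not None:
        if not (cl.isascii() and cl.isdigit()):
            raise ValueError("invalid Content-Length")
        return ("length", int(cl))
    # Neither field: read a response to EOF; a request has no body.
    return ("eof", None) if is_response else ("length", 0)
```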
Mind the Nasty Cases
You may wonder what happens if both header fields are present, as there is no clear way to interpret this. This kind of ambiguity is a source of security exploits known as “HTTP request smuggling”.
Another ambiguity is the nonexistent payload body for the GET request. What if the request includes Content-Length? Should the server ignore the field or forbid the field? What about Content-Length: 0?
Also, should the server or client even allow users to mess with the Content-Length and Transfer-Encoding fields at all? There are many discussions on the Internet, and although the RFC tried to enumerate the cases, different implementations handle them differently.
An exercise for the reader: If you are designing a new protocol, how do you avoid ambiguities like this?
6.5 HTTP Message Format
RFC 9112 describes exactly how bits are transmitted over the network.
Read the BNF Language
The HTTP message format is described in a language called BNF (more precisely ABNF, the augmented variant defined in RFC 5234). Go to the “2. Message” section in RFC 9112 and you will see things like this:
HTTP-message = start-line CRLF
               *( field-line CRLF )
               CRLF
               [ message-body ]
start-line   = request-line / status-line
This says: An HTTP message is either a request message or a response message. A message starts with either a request line or a status line, followed by multiple header fields, then an empty line, then the optional payload body. Lines are separated by CRLF, which is the ASCII string '\r\n'. The BNF language is much more concise and less ambiguous than English.
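To see how a grammar rule turns into code, here is a rough sketch that classifies a start-line. The two rules it relies on are not quoted above; they are taken from RFC 9112: `request-line = method SP request-target SP HTTP-version` and `status-line = HTTP-version SP status-code SP [ reason-phrase ]`. The function name and the returned dictionary are made up for this example, and the checks are deliberately loose.

```python
def parse_start_line(line: bytes) -> dict:
    """Classify a start-line (without its CRLF)."""
    parts = line.split(b" ", 2)
    if len(parts) < 2:
        raise ValueError("bad start-line")
    if parts[0].startswith(b"HTTP/"):
        # status-line: a 3-digit status code, then an optional reason
        if len(parts[1]) != 3 or not parts[1].isdigit():
            raise ValueError("bad status code")
        reason = parts[2] if len(parts) == 3 else b""
        return {"kind": "response", "version": parts[0],
                "status": int(parts[1]), "reason": reason}
    # request-line: method, request target, HTTP version
    if len(parts) != 3 or not parts[2].startswith(b"HTTP/"):
        raise ValueError("bad request-line")
    return {"kind": "request", "method": parts[0],
            "target": parts[1], "version": parts[2]}
```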
HTTP Header Fields
field-line = field-name ":" OWS field-value OWS
The header field name and value are separated by a colon, but the rules for field name and value are defined in RFC 9110 instead.
field-name    = token
token         = 1*tchar
tchar         = "!" / "#" / "$" / "%" / "&" / "'" / "*"
              / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
              / DIGIT / ALPHA
              ; any VCHAR, except delimiters
OWS           = *( SP / HTAB )
              ; optional whitespace
field-value   = *field-content
field-content = field-vchar
                [ 1*( SP / HTAB / field-vchar ) field-vchar ]
field-vchar   = VCHAR / obs-text
obs-text      = %x80-FF
This is the general rule for field name and value. SP, HTAB, and VCHAR refer to space, tab, and printable ASCII characters, respectively. Some characters are forbidden in header fields, especially CR and LF.
Some header fields have additional rules for interpretation, such as comma-separated values or quoted strings. For now, we can just leave them as they are until we need them.
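The field-line rule maps to code fairly directly. The following is a sketch under some assumptions: the name `parse_field_line` is made up, the `token` rule is rewritten as a regular expression, and the value check only rejects CR, LF, and NUL instead of implementing the full `field-value` grammar. Lowercasing the name relies on the fact that field names are case-insensitive (RFC 9110).

```python
import re

# token = 1*tchar, written as a regular expression
TOKEN = re.compile(rb"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")

def parse_field_line(line: bytes):
    """Parse one field-line (without its CRLF) into (name, value)."""
    name, sep, value = line.partition(b":")
    if not sep or not TOKEN.fullmatch(name):
        raise ValueError("bad field name")
    value = value.strip(b" \t")  # drop the OWS around the value
    if b"\r" in value or b"\n" in value or b"\0" in value:
        raise ValueError("forbidden character in field value")
    return name.lower(), value
```

Note that whitespace is not allowed between the field name and the colon, which the `token` check enforces.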
The HTTP specification is very large, and this chapter only covers the most important bits of implementing an HTTP server, which we will do in the next chapter.
6.6 Common Header Fields
Many header fields are either interpreted by applications or used by optional HTTP features, and are not immediately relevant to our implementation. You can become familiar with them by inspecting HTTP headers in browser dev tools.
Header Field | By (C = client, S = server) | Description |
---|---|---|
Content-Length: 60 | C/S | Discussed. |
Transfer-Encoding: chunked | C/S | Discussed. |
Accept: text/html | C | For negotiating content types. |
Content-Type: text/html | S | Content type. |
Accept-Encoding: gzip | C | For negotiating content compression. |
Content-Encoding: gzip | S | Compressed response. |
Vary: Accept-Encoding | S | Tell proxies about content negotiations. |
Authorization: Basic dTpw | C | Authorization by username and password. |
Cache-Control: no-cache | C/S | Affect caching behavior. |
Age: 60 | S | How long is the item cached by the proxy? |
Set-Cookie: k=v | S | HTTP cookie. |
Cookie: k=v | C | HTTP cookie. |
Date | C/S | Not very useful. |
Expect: 100-continue | C | An obscure feature. |
Host: example.com | C | The host name of the URL. |
Last-Modified | S | For cache validation and 304 Not Modified. |
If-Modified-Since | C | Validate Last-Modified. |
ETag: "abcd" | S | For cache validation and 304 Not Modified. |
If-None-Match: "abcd" | C | Validate ETag. |
Range: bytes=10- | C | Range request. Get a portion of the response. |
Content-Range: bytes 10-59/60 | S | Range response. |
Accept-Ranges: bytes | S | Indicate that range requests are allowed. |
Referer: http://foo.com/ | C | Where is the user from? |
Transfer-Encoding: gzip | S | An alternative way to achieve compression. |
TE: gzip | C | For negotiating Transfer-Encoding. |
Trailer: Foo | C/S | Obscure feature: Header fields after payload. |
User-Agent: Foo | C | Client software. |
Server: Foo | S | Server software. |
Upgrade: websocket | C/S | Create WebSockets. |
Access-Control-* | S | For cross-origin resource sharing (CORS). |
Origin | C | CORS. |
Location: http://bar.com/ | S | For 3xx redirections. |
6.7 HTTP Methods
Read-Only Methods
The 2 most important HTTP methods are GET and POST. Why do we need different HTTP methods? Besides the obvious fact that a POST request can carry a payload where a GET cannot, it is also a good idea to separate read-only operations from write operations. You can use GET for read-only operations and POST for the rest.
A read-only method is called a “safe” method. There are 3 commonly used safe methods:
- GET.
- HEAD, like GET but without the response body.
- OPTIONS, rarely used, for identifying allowed request methods and CORS related things.
Cacheability
One reason for separating read-only operations from write operations is that read-only operations are generally cacheable. On the other hand, it makes no sense to cache write operations as they are state-changing.
However, the rules for cacheability are more complicated than just the HTTP method.
- GET and HEAD are considered to be cacheable methods. But OPTIONS is not, as it is for special purposes.
- The status code also affects cacheability.
- The Cache-Control header can affect cacheability.
- POST is usually not cacheable, unless an obscure header field (Content-Location) is used and certain cache directives are present.
- Different implementations have different cacheability rules.
CRUD and Resources
You may have wondered why there are so many HTTP methods. Wouldn’t just GET and POST suffice? In fact, that’s what many applications do. More methods were added to HTTP because people imagined HTTP as a protocol for managing “resources”. For example, a forum user can manipulate their posts as resources:
- Create a post via PUT.
- Read a post via GET.
- Update a post via PATCH.
- Delete a post via DELETE.
These 4 operations (create, read, update, delete) are often referred to as CRUD.
Idempotence
But why add CRUD as HTTP methods? A forum user may also move a post to another forum. Should HTTP also include a MOVE method? Mirroring arbitrary English verbs is not a good reason to define HTTP methods. One of the better reasons is to define the idempotence of operations.
An idempotent operation is one that can be repeated with the same effect. This means that you can safely retry the operation until it succeeds. For example, if you rm a file over SSH and the connection breaks before you see the result, the state of the file is unknown to you, but you can always blindly rm it again (if it’s really the same file):
- If the operation is bound to fail, it will probably fail again.
- If the previous rm failed but this one succeeds, your intention is just fulfilled.
- If the previous rm succeeded, there is no harm in doing it again.
An idempotent operation over HTTP can still result in a different status code, just like the return code of rm.
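The retry argument can be simulated without a network. Everything in this sketch is hypothetical: an in-memory dictionary stands in for the server state, `flaky` simulates a connection that breaks after the work is done, and the 204 and 404 status codes are just one plausible choice for a delete operation.

```python
posts = {42: "hello"}  # hypothetical server-side state

def delete_post(post_id: int) -> int:
    """Idempotent: the final state is the same however often it runs."""
    if post_id in posts:
        del posts[post_id]
        return 204  # deleted just now
    return 404      # already gone

def flaky(operation, lost_responses: int):
    """Wrap an operation so the first few responses are lost AFTER the work is done."""
    remaining = lost_responses
    def call(*args):
        nonlocal remaining
        result = operation(*args)
        if remaining > 0:
            remaining -= 1
            raise ConnectionError("response lost")
        return result
    return call

def retry(call, *args, attempts: int = 3):
    """Blindly repeat the call until a response arrives."""
    for _ in range(attempts):
        try:
            return call(*args)
        except ConnectionError:
            continue
    raise ConnectionError("gave up")

status = retry(flaky(delete_post, 1), 42)
```

The first attempt deletes the post but its response is lost; the retry reports that the post is already gone. The status code differs, but the resulting state is the one the client wanted.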
Idempotence in HTTP:
- Read-only methods (GET and HEAD) are obviously idempotent.
- PUT and DELETE are idempotent, as are overwriting and deleting files.
- POST and PATCH are NOT defined as idempotent. They may or may not be.
Idempotence in browsers:
- If you submit a <form> via POST and then refresh the page, the browser will warn you against resubmitting the potentially non-idempotent form.
- HTML forms are limited to GET and POST. You need AJAX to use the idempotent write methods (PUT and DELETE).
But this still doesn’t answer the puzzle of why there are so many verbs, because HTTP could just add 1 more method for idempotent writes instead of 3 (PATCH, PUT, DELETE). In fact, there may be no strong reason for apps to use them all.
Comparison of HTTP Methods
A summary of general-purpose HTTP methods.
Verb | Safe | Idempotent | Cacheable | <form> | CRUD | Req body | Res body |
---|---|---|---|---|---|---|---|
GET | Yes | Yes | Yes | Yes | read | No | Yes |
HEAD | Yes | Yes | Yes | No | read | No | No |
POST | No | No | No* | Yes | - | Yes | Yes |
PATCH | No | No | No* | No | update | Yes | May |
PUT | No | Yes | No | No | create | Yes | May |
DELETE | No | Yes | No | No | delete | May | May |
Note (*): Cacheable POST or PATCH is possible, but rarely supported.
6.8 Discussion: Text vs. Binary
HTTP is designed in a way that you can send requests from telnet, so you can learn it by poking around. However, textual protocols have downsides.
Text is Often Ambiguous
One downside is that human-readable formats are often less machine-readable, because they are more flexible than necessary. Consider the way HTTP payload length is determined:
- The common cases are Content-Length and Transfer-Encoding.
- There are special cases for some HTTP methods and status codes.
- There are cases that cause different interpretations, e.g., request smuggling.
HTTP is a simple protocol, where simple means it’s easy to look at. Writing code for it is not simple because there are too many rules for interpreting it, and the rules still leave you with ambiguities.
Text is More Work & Error-Prone
Another downside is that dealing with text is a lot more work. To properly handle text strings, you need to know their length first, which is often determined by delimiters. The extra work of looking for delimiters is the cost of human-readable formats.
It’s also error-prone; in C programming, null-terminated strings (0-delimited) have caused many security exploits.
HTTP/2 is binary and more complex than HTTP/1.1, but parsing the protocol is still easier because you don’t have to deal with elements of unknown length.
6.9 Discussion: Delimiters
Serialization Errors in Delimited Data
Delimiters are everywhere in textual protocols. For example, in HTTP …
- Lines in the header are delimited by CRLF.
- The header and body are delimited by an empty line.
One problem with delimiters is that the data cannot contain the delimiter itself. Failure to enforce this rule can lead to some injection exploits.
If a malicious client can trick a buggy server into emitting a header field value with CRLF in it, and the header field is the last field, then the payload body starts with the part of the field value that the attacker controls. This is called “HTTP response splitting”.
A proper HTTP server/client must forbid CRLF in header fields as there is no way to encode them. However, this is not true for many generic data formats. For example, JSON uses {}[],: to delimit elements, but a JSON string can contain arbitrary characters, so strings are quoted to avoid ambiguity with delimiters. But the quotes themselves are also delimiters, so escape sequences are needed to encode quotes.
This is why you need a JSON library to produce JSON instead of concatenating strings together. And HTTP is less well defined and more complicated than JSON, so pay attention to the specifications.
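Both points can be illustrated with a short sketch. The function `format_header_field` is a made-up helper, not part of any library; it simply refuses values containing CR or LF. The JSON half uses Python’s standard `json` module to contrast string concatenation with proper serialization.

```python
import json

def format_header_field(name: str, value: str) -> str:
    """Serialize one header field, refusing the delimiter in the data."""
    if "\r" in value or "\n" in value:
        raise ValueError("CR or LF in header field value")
    return f"{name}: {value}\r\n"

# JSON: the data contains the quote delimiter.
text = 'say "hi"'
naive = '{"msg": "' + text + '"}'    # concatenation: quotes left unescaped
proper = json.dumps({"msg": text})   # the library escapes the quotes
```

Parsing `naive` with `json.loads` fails because the unescaped quotes end the string early, while `proper` round-trips to the original text.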
Length-Prefixed Data in Binary Protocols
Delimiters in text are used to separate elements. In binary protocols and formats, a better and simpler alternative is to use length-prefixed data, that is, to specify the length of the element before the element data. Some examples are:
- The chunked transfer encoding. Although the length itself is still delimited.
- The WebSocket frame format. No delimiters at all.
- HTTP/2. Frame-based.
- The MessagePack serialization format. Some kind of binary JSON.
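As a rough illustration of length-prefixed data, here is a hypothetical frame format: a 4-byte big-endian length followed by the payload. It is not the WebSocket or HTTP/2 frame format, just the simplest instance of the idea, and the function names are made up.

```python
import struct

def encode_frame(payload: bytes) -> bytes:
    # 4-byte big-endian length, then the payload itself
    return struct.pack(">I", len(payload)) + payload

def decode_frames(buf: bytes):
    """Return (list of complete payloads, leftover bytes)."""
    frames = []
    pos = 0
    while len(buf) - pos >= 4:
        (size,) = struct.unpack_from(">I", buf, pos)
        if len(buf) - pos - 4 < size:
            break  # the payload has not fully arrived yet
        frames.append(buf[pos + 4 : pos + 4 + size])
        pos += 4 + size
    return frames, buf[pos:]
```

The payload can contain any bytes, including CRLF, because the parser never searches the data for a delimiter; it only reads a fixed-size length and then skips ahead.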