By Pavel Panchekha

Share under CC-BY-SA.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

How Browsers Download Web Pages

Series

This is post 1 of the Let's Build a Web Browser series.

The primary goal of a web browser is to show the user some information identified by a URL. So, how does a browser get information out of a URL like http://example.org/index.html? In short (and a networking class will cover this in more detail), the browser parses the URL, connects to a server over the Internet, sends that server a request, receives a reply, and finally shows that reply to the user.

Parsing the URL

The first thing a browser does with a URL is to parse it, which means that the URL is split into parts. Here those parts are:

  • The scheme, here http
  • The host, here example.org
  • The path, here /index.html

These parts play different roles: the host tells the browser who to get the information from, the scheme tells it how, and the path is something the browser tells the host to explain what information it wants. There are also optional parts to the URL. Sometimes, like in http://localhost:8080/, there is a port, which you can think of as telling you which door to the host's house to use; the default is 80.[1] Sometimes there is also something tacked onto the end, a fragment like #section or a query string like ?s=term.

In Python, there's a library called urllib.parse that can do this URL parsing for you. However, I'm trying to avoid using libraries here, so let's write a bad version ourselves. We'll start with the scheme—our browser only supports http, so we just need to check that the URL starts with http:// and then strip that off:

assert url.startswith("http://")
url = url[len("http://"):]

Next, the host and port come before the first /, while the path is that slash and everything after it:

hostport, pathfragment = url.split("/", 1) if "/" in url else (url, "")
host, port = hostport.rsplit(":", 1) if ":" in hostport else (hostport, "80")
port = int(port)  # sockets need the port as a number, not a string
path, fragment = ("/" + pathfragment).rsplit("#", 1) if "#" in pathfragment else ("/" + pathfragment, None)

Here I'm using the rsplit(s, n) function, which splits a string by s, starting from the end, at most n times. Note that both ports and fragments are optional.
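To see these steps working together, here is a minimal sketch that wraps them in a function and compares the result with the standard library's urllib.parse.urlsplit. The name split_url and the sample URL are my own choices for illustration; the sketch uses an empty remainder when the URL has no slash and converts the port to an integer so the two results can be compared directly.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Hypothetical helper: hand-rolled parsing of an http:// URL."""
    assert url.startswith("http://")
    url = url[len("http://"):]
    # Host and optional port come before the first "/".
    hostport, pathfragment = url.split("/", 1) if "/" in url else (url, "")
    host, port = hostport.rsplit(":", 1) if ":" in hostport else (hostport, "80")
    # The fragment, if any, follows the last "#".
    if "#" in pathfragment:
        path, fragment = ("/" + pathfragment).rsplit("#", 1)
    else:
        path, fragment = "/" + pathfragment, None
    return host, int(port), path, fragment

sample = "http://localhost:8080/index.html#section"
ours = split_url(sample)
theirs = urlsplit(sample)
# Both approaches should agree on every part of this sample URL.
assert ours == (theirs.hostname, theirs.port, theirs.path, theirs.fragment)
```

The library handles many cases this sketch does not (query strings, user names, IPv6 hosts), so the agreement only holds for simple URLs like the sample.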

Further reading: The syntax of URLs is defined in RFC 3986, which is pretty readable.

Further coding: Implement the full standard, including encodings for reserved characters.

Communicating with the host

With the URL parsed, a browser must connect to the host, explain what information it wants, and receive the host's reply.

Connecting to the host

First, a browser needs to find the host on the Internet and make a connection.

Usually, the browser asks the operating system to make the connection for it. The OS then talks to a DNS server, which converts a host name like example.org into an IP address like 93.184.216.34.[2] Then the OS decides which hardware is best for communicating with that IP address (say, wireless or wired) using what is called a routing table, and uses that hardware to send a sort of greeting to that IP address, to the specific port at that IP address that the browser indicated. Then there's a driver inside the OS that communicates with that hardware and sends signals on a wire or whatever.[3] On the other side of that wire (or those airwaves) is a series of routers,[4] which each send your message in the direction they think will take it toward that IP address.[5] Anyway, the point of this is that the browser tells the OS, hey, put me in touch with example.org on port 80, and it does.
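The name-to-address step can also be requested from Python. The following is only a sketch: resolve is a hypothetical helper name, and it restricts itself to IPv4 addresses to match the example address above. It uses socket.getaddrinfo, which asks the operating system to do the lookup.

```python
import socket

def resolve(host, port=80):
    """Hypothetical helper: return the IPv4 addresses the OS finds for host."""
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return [sockaddr[0] for _family, _type, _proto, _canon, sockaddr in infos]

# An address that is already numeric needs no DNS query and comes back unchanged.
print(resolve("127.0.0.1"))
# With a network connection, resolve("example.org") would consult DNS; the
# addresses returned depend on when and where you run it.
```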

On many systems, you can set up this kind of connection manually using the telnet program. For example, if you execute:

telnet example.org 80

it will tell you

Trying 93.184.216.34...
Connected to example.org.
Escape character is '^]'.

This means that the OS converted example.org to the IP address of 93.184.216.34 and was able to connect to it.[6]

You can then type into the console and press enter to say stuff to example.org.

Requesting information from the host

Once it's connected, the browser explains to the host what information it is looking for. In our case, the browser must do that explanation using the HTTP protocol, and it must explain to the host that it is looking for /index.html. In HTTP, this request looks like this:

GET /index.html HTTP/1.0
Host: example.org

Here, the word GET means that the browser would like to receive information,[7] then comes the path, and finally there is the word HTTP/1.0, which tells the host that the browser speaks version 1.0 of HTTP. There are several versions of HTTP, at least 0.9, 1.0, 1.1, and 2.0. The later standards add a variety of useful features, like virtual hosts, cookies, referrers, and so on, but in the interest of simplicity our browser will ignore them.

After the first line, each line contains a header, which has a name (like Host) and a value (like example.org). Different headers mean different things; the Host header, for example, tells the host who you think it is. This is useful when the same IP address corresponds to multiple host names (for example, example.com and example.org). There are lots of other headers one could send, but let's stick to just Host for now. Finally, after the headers are done, you need to enter one blank line; that tells the host that you are done with headers.

Enter all this into telnet and see what happens. Remember to add one more blank line after the line that begins with Host.
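The request typed into telnet can also be assembled as a string, which makes its three parts (request line, headers, blank line) explicit. This is a sketch under a few assumptions: build_request is a hypothetical name, and it ends each line with "\r\n", which is what the HTTP specification calls for, although many servers also accept a bare newline.

```python
def build_request(host, path):
    """Hypothetical helper: the text of a minimal HTTP/1.0 GET request."""
    lines = [
        "GET {} HTTP/1.0".format(path),  # request line: method, path, version
        "Host: {}".format(host),         # the only header we send
        "",                              # blank line: end of headers
    ]
    # Join with "\r\n" and add a final "\r\n" so the blank line is terminated too.
    return "\r\n".join(lines) + "\r\n"

print(repr(build_request("example.org", "/index.html")))
```

Building the request from a list of lines makes it harder to forget the final blank line, since it appears as its own entry.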

Our own Telnet

So far we've communicated with another computer using telnet. But it turns out that telnet is quite a simple program, and we can do the same programmatically, without starting another program and typing into it.

To communicate with another computer, the operating system provides a feature called "sockets". When you want to talk to other computers (either to tell them something, or to wait for them to tell you something), you create a socket, and then that socket can be used to send information back and forth. Sockets come in a few different kinds, because there are multiple ways to talk to other computers:

  • A socket has an address family, which tells you how to find the other computer. Address families have names that begin with AF. We want AF_INET, but for example AF_BLUETOOTH is another.
  • A socket has a type, which describes the sort of conversation that's going to happen. Types have names that begin with SOCK. We want SOCK_STREAM, which means each computer can send arbitrary amounts of data over, but there's also SOCK_DGRAM, in which case they send each other packets of some fixed size.[8]
  • A socket has a protocol, which describes the steps by which the two computers will establish a connection. Protocols have names that depend on the address family, but we want IPPROTO_TCP.

By picking all of these options, we can create a socket like so:

import socket
s = socket.socket(family=socket.AF_INET, type=socket.SOCK_STREAM, proto=socket.IPPROTO_TCP)

Once you have a socket, you need to tell it to connect to the other computer. For that, you need the host and the port. Note that there are two parentheses in the connect call: connect takes a single argument, and that argument is a pair of a host and a port. This is because different address families have different numbers of arguments.

s.connect(("example.org", 80))

Finally, once you've made the connection, you can send it some data using the send method.

s.send(b"GET /index.html HTTP/1.0\nHost: example.org\n\n")

When you send data, it's important to remember that you are sending raw bits and bytes: it doesn't have to be text—though in this case it is—it could be images or video instead. That's why here I have a letter b in front of the string of data: that tells Python that I mean the bits and bytes that represent the text I typed in, not the text itself, which you can tell because it has type bytes not str:

type("asdf") # -> <class 'str'>
type(b"asdf") # -> <class 'bytes'>

If you forget that letter b, you will get some error about str versus bytes. You can turn a str into bytes by calling the .encode("ascii") method on it.[9]
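A short sketch of that conversion in both directions, including what happens when the text contains a character that ascii cannot represent. The sample strings are my own.

```python
text = "GET /index.html HTTP/1.0"
data = text.encode("ascii")           # str -> bytes
assert isinstance(data, bytes)
assert data.decode("ascii") == text   # bytes -> str, a round trip

try:
    "café".encode("ascii")            # "é" has no ASCII representation
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
```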

Also be careful with the text you type in here. It's very important to put two newlines \n at the end, so that you send that blank line. If you forget that, the other computer will keep waiting on you to send that newline, and you'll keep waiting on it to answer you. Computers are dumb.

You'll notice that the send call returns a number, in this case 44. That tells you how many bytes of data you sent to the other computer; if, say, your network connection failed midway through sending the data, you might want to know how much you sent before the connection failed.
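Because send may accept only part of the data, the usual pattern is to call it in a loop until everything has gone out. The sketch below shows that loop; send_all is a hypothetical name, and Python sockets already provide a built-in sendall method that does the same job. To keep the example runnable without a network, it uses socket.socketpair, which returns two sockets that are already connected to each other.

```python
import socket

def send_all(sock, data):
    """Hypothetical helper: call send until every byte has gone out."""
    total = 0
    while total < len(data):
        sent = sock.send(data[total:])  # may be fewer bytes than requested
        total += sent
    return total

# Two already-connected sockets stand in for the browser and the host.
left, right = socket.socketpair()
request = b"GET /index.html HTTP/1.0\nHost: example.org\n\n"
count = send_all(left, request)
received = right.recv(1024)
left.close()
right.close()
print(count, received == request)
```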

The host's reply

If you look at your telnet session, you should see that the other computer's response starts with this line:

HTTP/1.0 200 OK

That tells you that the host confirms that it, too, speaks HTTP/1.0, and that it found your request to be "OK" (which has a corresponding numeric code of 200). You may be familiar with 404 Not Found. That's something the server could say instead of 200 OK, or it could even say 403 Forbidden or 500 Internal Server Error. There are lots of these codes, and they have a pretty neat organization scheme:

  • The 100s are informational messages
  • The 200s mean you were successful
  • The 300s mean you need to do a follow-up action (usually to follow a redirect)
  • The 400s mean you sent a bad request
  • The 500s mean the server handled the request badly

Note the genius of having two sets of error codes (400s and 500s): which one you get tells you who the server thinks is at fault, the browser (400s) or the server (500s). You can find a full list of the different codes on Wikipedia.
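Since the class of a status code is just its hundreds digit, the scheme above fits in a small lookup. This is an illustrative sketch; status_class and its wording are my own.

```python
def status_class(code):
    """Hypothetical helper: describe a status code by its hundreds digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "follow-up action needed",
        4: "bad request (browser at fault)",
        5: "server error (server at fault)",
    }
    # Integer division drops the last two digits: 404 // 100 == 4.
    return classes.get(code // 100, "unknown")

for code in (200, 301, 404, 500):
    print(code, status_class(code))
```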

After the 200 OK line, the server sends its own headers. When I did this, I got these headers (but yours may differ):

Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Mon, 25 Feb 2019 16:49:28 GMT
Etag: "1541025663+ident"
Expires: Mon, 04 Mar 2019 16:49:28 GMT
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Server: ECS (sec/96EC)
Vary: Accept-Encoding
X-Cache: HIT
Content-Length: 1270
Connection: close

There is a lot here, including information about the information you are requesting (Content-Type, Content-Length, and Last-Modified), information about the server (Server, X-Cache), information about how long the browser should cache this information (Cache-Control, Etag), and a bunch of random other information. Let's move on for now.

After the headers there is a blank line, and then there is a bunch of HTML code. Your browser knows that it is HTML because of the Content-Type header, which says that it is text/html.[10] That HTML code is the body of the server's reply.

Let's try to do this programmatically. We can read the response using:

response = s.makefile("rb").read().decode("ascii")

Here s.makefile("rb") is the file-like object corresponding to what the other computer said on the socket s, and we call read on it to get that output. That output is returned as bytes,[11] which I am instructing Python to turn into a string using the ascii encoding (an encoding being a method of associating numbers to letters). It would be more correct to use ascii only to decode the headers, and then to parse the charset declaration in the Content-Type header to determine what encoding to use for the body. That's what real browsers do; they even guess at an encoding if there isn't a charset declaration, and when they guess wrong you see those ugly � characters or some strange áççêñ£ß. For simplicity, though, let's stick to ascii, which will raise an error if there are any strange characters.
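As a rough sketch of the more correct approach, the charset can be pulled out of a Content-Type value with a little string handling. The helper name charset_of is hypothetical, it falls back to ascii to match the choice made in this post, and it ignores the many odd cases real browsers handle.

```python
def charset_of(content_type, default="ascii"):
    """Hypothetical helper: pull the charset out of a Content-Type value."""
    # Parameters follow the media type, separated by semicolons.
    for param in content_type.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset":
            return value.strip('"').lower()
    return default

raw_body = "caf\u00e9".encode("utf-8")   # stand-in for bytes read from a socket
encoding = charset_of("text/html; charset=UTF-8")
print(raw_body.decode(encoding))
```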

Let's split the response into pieces. The first line is the status line, then the headers, and then the body:

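# Servers normally end lines with "\r\n"; normalize to "\n" so the splits work either way.
response = response.replace("\r\n", "\n")
head, body = response.split("\n\n", 1)
lines = head.split("\n")
status = lines[0]
headers = dict(line.split(": ", 1) for line in lines[1:])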

For the headers, I split each line at the first colon and make a dictionary (a key-value map) of header name to header value.
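Here is the same splitting applied to a canned response, so it can be tried without a network connection. The response text is made up for illustration. The sketch also takes the status line apart into its version, numeric code, and explanation, and it first converts "\r\n" line endings to "\n", since servers normally send the former.

```python
# A canned response, so this runs without a network connection.
response = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\nContent-Length: 12\r\n\r\n<p>Hello</p>"

# Normalize "\r\n" line endings to "\n" before splitting.
response = response.replace("\r\n", "\n")
head, body = response.split("\n\n", 1)
lines = head.split("\n")
# Split at most twice: the explanation may itself contain spaces ("Not Found").
version, code, explanation = lines[0].split(" ", 2)
headers = dict(line.split(": ", 1) for line in lines[1:])

assert version == "HTTP/1.0" and code == "200"
print(explanation, headers["Content-Type"], body)
```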

Further reading: Many common (and uncommon) HTTP headers are described on Wikipedia.

Further coding: Instead of using read to get the whole response, go line by line using readline. Then, instead of using the ascii codec for the whole response, parse the headers once you've received all of them and use those headers to determine the encoding for the body.

Displaying the HTML

The HTML code that the server sent us defines the content you see in your browser window when you go to http://example.org/index.html. I'll be talking much, much more about HTML in future posts, but for now let me keep it very simple.

In HTML, there are tags and text. Each tag starts with a < and ends with a >; generally speaking, tags tell you what kind of thing some content is, while text is the actual content.[12] Most tags come in pairs of a start and an end tag; for example, the title of the page is enclosed in a pair of tags: <title> and </title>. Each tag, inside the angle brackets, has a tag name (like title here), and then optionally a space followed by attributes, and its pair has a / followed by the tag name (and no attributes). Some tags do not have pairs, because they don't surround text, they just carry information. For example, on http://example.org/index.html, there is the tag:

<meta charset="utf-8" />

This tag once again repeats that the character set with which to interpret the page body is utf-8. Sometimes, tags that don't have pairs end in a slash, but not always, because web developers aren't always so diligent.

The most important HTML tag is called <body> (with its pair, </body>). Between these tags is the content of the page; outside of these tags is various information about the page, like the aforementioned title, information about how the page should look (<style> and </style>), and metadata using the aforementioned <meta/> tag.

So, to create our very very simple web browser, let's take the page HTML and print all the text in it (but not the tags):[13]

in_angle = False
for c in body:
    if c == "<":
        in_angle = True
    elif c == ">":
        in_angle = False
    elif not in_angle:
        print(c, end="")

This code is pretty complex. It goes through the response body character by character, and it has two states: in_angle, when it is currently between a pair of angle brackets, and not in_angle. When the current character is an angle bracket, it switches between those states; otherwise, if it is not inside a tag, it prints the current character.[14]
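To check the two-state logic on a small input, the same loop can collect characters into a string instead of printing them. The function name text_of and the sample HTML are my own; this is a sketch for experimenting, not a replacement for the loop above.

```python
def text_of(html):
    """Hypothetical helper: the same two-state loop, collecting text instead of printing."""
    out = []
    in_angle = False
    for c in html:
        if c == "<":
            in_angle = True       # entering a tag
        elif c == ">":
            in_angle = False      # leaving a tag
        elif not in_angle:
            out.append(c)         # ordinary text outside any tag
    return "".join(out)

print(text_of("<p>Hello, <b>world</b>!</p>"))
```

Returning a string rather than printing makes the behavior easy to compare against an expected result.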

Summary

This post went from an empty file to a rudimentary web browser that can:

  • Parse the URL http://example.org/index.html into a host, a port, a path, and a fragment.
  • Connect to that host at that port using sockets
  • Send an HTTP request to that host, including a Host header
  • Split the HTTP response into a status line, headers, and a body
  • Print the text (and not the tags) in the body

Yes, this is still more of a command-line tool than a web browser, but what we have already has some of the core capabilities of a browser.

Assignment

Collect the code samples given in this post into a file and separate the code into three functions:

parse(url)
takes in a string URL and returns a host string, a numeric port, a path string, and a fragment string. The path should include the initial slash, and the fragment should not include the initial #.
request(host, port, path)
takes in a host, a port, and a path; connects to the host/port using sockets; sends it an HTTP request (including the Host header); splits the response into a status line, headers, and a body; checks that the status line starts with HTTP/1.0 and has the status code 200;[15] and then returns the headers as a dictionary and the body as a string.
show(body)
prints the text, but not the tags, in an HTML document

It should be possible to string these functions together like so:

import sys
host, port, path, fragment = parse(sys.argv[1])
headers, body = request(host, port, path)
show(body)

This code uses the sys library to read the first argument (sys.argv[1]) from the command line to use as a URL.

Finally, make some improvements to the code:

  • Along with Host, send the User-Agent header in the request function. Its value can be whatever you want—it identifies your browser to the host.
  • Add support for the file:// scheme to parse. Unlike http://, the file protocol has an empty host and port, because it always refers to a path on your local computer. You will need to modify parse to return the scheme as an extra output, which will be either http or file.
  • Add support for the file:// scheme to request. Instead of using sockets for file:// URLs, you will use open to open a file and read from it. When you do that, there won't be headers.
  • Only show the text of an HTML document between <body> and </body>. This will avoid printing the title and various style information. You will need to add additional variables in_body and tag to that loop, to track whether or not you are between body tags and to keep around the tag name when inside a tag.
  • Add content type support to show: use the Content-Type header to determine the content type, and if it isn't text/html, just show the whole document instead of stripping out tags and only showing text in the <body>.

Footnotes:

1

Numbers below 1024 are front doors and those above 1024 are back doors.

2

On some systems, you can run dig +short example.org to do this conversion yourself.

3

I'm skipping steps here. On wires you first have to wrap communications in ethernet frames, on wireless you have to do even more. I'm trying to be brief.

4

Or a switch, or an access point, there are a lot of possibilities, but eventually there is a router.

5

They may also record where the message came from so they can forward the reply back, especially in the case of NATs.

6

The escape character line is just something telnet-specific.

7

It could say POST if it intended to send information, plus there are some other obscure options.

8

DGRAM stands for "datagram"; think of it like a postcard.

9

Well, to be more precise, you need to call encode and then tell it the character encoding that your string should use. This is a complicated topic. I'm using ascii here because it will throw an error if you try anything funny like a non-English-language character. In the real world, you need to be more careful about character encodings.

10

The charset bit of that header tells the browser which character encoding to use for the content of the page. Ideally, we would read the response as bytes, decode only the headers using the ascii character encoding, and then use the charset to decode the content of the page. That's a lot of work which I'm skipping here.

11

Because I used "rb" as the argument to makefile. If you don't pass an argument, or you pass rt, Python would guess how to convert the response from bytes to str. I'm doing it this way to be clearer about what is going on.

12

That said, some tags, like img, are content, not information about it.

13

If this example causes Python to produce a SyntaxError pointing to the end on the last line, it is likely because you are running Python 2 instead of Python 3. These posts assume Python 3.

14

The end argument tells Python not to print a newline after the character, which it otherwise would.

15

The status text like OK can actually be anything and is just there for humans, not for machines