Pavel Panchekha

By

Share under CC-BY-SA.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

How Browsers Construct Document Trees

Series

This is post 4 of the Let's Build a Web Browser series.

So far, our web browser does layout directly from the stream of HTML tokens (open tags, close tags, and text). It's worked for now, but HTML has a tree structure that is important for correctly drawing borders, menus, and lists, let alone for using CSS code to change how a web page looks. So let's change our browser to parse HTML into a tree, and do layout on that tree instead of on the tokens directly.

Table of Contents

A tree of nodes

Right now, our browser understands web pages as a flat sequence of tokens—either text or tag. However, HTML in fact has more structure than this: each open tag has a corresponding close tag, and the stuff between those two is "contained" within that tag: every open tag must be closed before closing its parent.

So, the tags form a tree, and every pair of open and close tags forms a node of that tree. Of course, the tree also has to contain another kind of node: text. We call the two types of nodes element nodes and text nodes; there isn't additional structure besides those two.1 [1 Well, full HTML has all sorts of other stuff, but we're skipping it for now.] An element node contains any number of either element or text nodes. Here's a data structure for these two node types:2 [2 Tracking the parent will turn out to be useful shortly.]

class ElementNode:
    def __init__(self, parent, tagname):
        self.tag = tagname
        self.children = []
        self.parent = parent

class TextNode:
    def __init__(self, parent, text):
        self.text = text
        self.parent = parent

The goal is now filling in these Node structures from the list of tokens that we have. The overall idea is pretty simple: keep track of the node that we're in, create a new one when we see an open tag, and go back to its parent when we see a close tag:

We'll start by tracking a "current node", which at first will be a dummy object (in Python, None) to indicate the fact that we haven't started any nodes yet:

current = None
for tok in tokens:
    # ...
return current

Now for every token, we need to figure out if it is text, an open tag, or a close tag, and do the appropriate thing. Text tokens are the easiest:

new = TextNode(current, tok.text)
current.children.append(new)

Meanwhile, for Tag tokens, we can figure out whether it is an open or close tag by looking to see if it starts with a slash. For open tags, we need to create a new ElementNode and add it as a child to the current node. We then "enter" that node by making it current:

new = ElementNode(current, tok.tag)
if current is not None: current.children.append(new)
current = new

Finally, close tags instead go up from the current node to its parent:

if current.parent is not None:
    current = current.parent

Here I'm testing current.parent because of a subtlety with the root of the tree. The first thing in an HTML document is always the <html> open tag, so it's safe to start current with the dummy value None: it'll immediately be replaced by an ElementNode. But then when we reach the </html> close tag, we don't want to walk up to the parent, which would cause current to take on that dummy None value again: if that happened, we wouldn't be able to get back to the nodes we just created! So instead I do nothing, so that moments later we exit the for loop and return the nodes we've created.

HTML attributes

HTML tags not only have tag names but also attributes, which are added to open tags to give additional information about that element. For example, the lang attribute tells you what language the text in that attribute is in, while the href attribute on a tags gives the URL that the link points to (href stands for "hypertext reference"). Attributes look like this:

<tagname attrname=attrvalue attrname=attrvalue>

You can any number of attributes in an open tag; the names are alphabetic and case-insensitive, while the values can be anything and can even include spaces if you surround them by quotes.

To add attributes to our parser, we'll need to extend ElementNode to store them:

class ElementNode:
    def __init__(self, parent, tagname):
        # ...
        self.attributes = {}

Right now, the tagname passed to an ElementNode contains both the tag name and all the attributes. A quick and dirty way to separate them is to split on whitespace:

self.tag, *attrs = tagname.split(" ")

This syntax tells Python to split tagname on whitespace and to store the first bit into self.tag while the rest are collected into the list attrs. Now, we can go through attrs extracting each attribute-value pair:

for attr in attrs:
    out = attr.split("=", 1)
    name = out[0]
    val = out[1].strip("\"") if len(out) > 1 else ""
    self.attributes[name.lower()] = val

Here the attribute name is split from the attribute value by looking for the first equal sign, and then if the value has quotes on either side, those are stripped off. The name is made lowercase before adding it to self.attributes because attribute names are case-insensitive.

Handling exceptions and errors

Now, you might notice that the HTML parser I've written does confusing, sort-of arbitrary things when tags are left unclosed, or when the wrong tag is closed in the wrong order. Real HTML documents usually have all sorts of mistakes like that, let alone the parts of HTML that are standardized but which our parser does not handle. So real HTML parsers have all sorts of tricks to guess at what the user meant and somehow return a tree anyway.

For example, some HTML tags serve as both an open and a close tag (they are called self-closing tags). The <br> tag, which inserts a line break, is one such tag. However, right now, our parser doesn't know that, and will just keep looking for a </br> tag that never arrives. Self-closing tags sometimes end in a slash (like <br/>) but in general they don't have to: <br> is self-closing no matter how you write it, as are other self-closing tags like meta and link.

To support self-closing tags, we just need to modify the part of the parser that creates a new node:

new = ElementNode(current, tok.tag)
if current is not None: current.children.append(new)
current = new

Here, the first line creates the element and the second makes it a child of its parent. Then, the third one "enters" the new node, in effect opening it. For a self-closing tag, we just need to avoid doing that:

if new.tag not in ["br", "link", "meta"]:
    current = new

Note that I'm checking new.tag, not tok.tag. That's important, because link and meta usually have attributes.

Now is a good time, by the way, to implement <br> in layout; it works pretty much the same way that <p> does, ending the current line by resetting x and incrementing y.

Self-closing tags are at least a normal part of the standard. However, sometimes people forget close tags. For example, you might have a <p> inside a <section> and then close the </section> without closing the </p>. Because these errors are so common, browsers try to automatically fix them so they do the "right thing". The way they actually do that is pretty complicated (there are all sorts of different types of tags and so on) but let's at least implement a simplified version. When a closing tag is encountered, we will walk up the tree until we find a tag that it closes. We'll again be modifying the parser:

tag = tok.tag[1:]
node = current
while node is not None and node.tag != tag:
    node = node.parent
if not node and current.parent is not None:
    current = current.parent
elif node.parent is not None:
    current = node.parent

Here, we start by taking tok.tag and stripping off the initial slash. Then, we walk up the tree of nodes until we find a node with a matching tag, and set current to its parent (thus closing it). Now, there's a chance we never find the closing tag, like if someone mistypes a closing tag. In that case, we guess that you intended to close the currently-open tag and close it.

Finally, the page author might not have closed all the tags they opened, so at the end of the parser current might point somewhere in the middle of the tree. We can fix that with a simple for loop just before the return statement.

while current.parent:
    current = current.parent
return current

HTML parsers actually have to follow a messy and complex thing called the HTML 5 parsing algorithm, which spells out in detail how to handle user errors, like the handling we've been doing here. Our implementation here is quite limited, but it gives you some sense of how browsers handle user errors.

Layout from a tree

In the previous post, we had our browser process web pages with the following pipeline:

url → [ parse ] → host, port, path → [ lex ] → tokens → [ layout ] → displaylist → [ show ]

Now we want to insert parsing between lex and layout, so layout will now take in not a list of tokens but a tree of nodes. To handle the tree of nodes, layout will have to be a recursive function. But changing it from a single for loop to a recursive function will be tricky because that for loop stores a lot of state: the current x and y position, the current bold and italic state, and even that tricky terminal_space variable that took up so much time in the last post. Plus, the layout function appends render commands to a display_list variable, and that is also part of the state.

Before, this state was just a set of local variables in the layout function. Now that we have a recursive function, we'll need that state to live outside of layout. I'll define a simple class for it:

class State:
    def __init__(self):
        self.x = 13
        self.y = 13
        self.bold = False
        self.italic = False
        self.terminal_space = True
        self.dl = []

state = State()

Setting a field like state.x is now the replacement for setting a variable like x.

Inside the body of this function, we're going to need to:

  1. Do all the things the old layout function did for the start tag, updating state with the results
  2. Recursively call layout for each child, updating state with the results
  3. Do all the things the old layout function did for the end tag, updating state with the results

Plus, we'll need specialized handling for text nodes, since they don't have open or close tags, and since they don't have children. I'm going to handle all of these cases in the main layout function, and call out to helper functions to do the actual layout:

def layout(node):
    if isinstance(node, ElementNode):
        layout_open(node)
        for child in node.children:
            layout(child)
        layout_close(node)
    else:
        layout_text(node)

The functions layout_open, layout_close, and layout_text contain the bits and pieces of the old layout function, and look something like this:3 [3 If you implemented earlier exercises like support for <a>, <small>, and <big>, you will likely have more parts to the state.]

def layout_open(node):
    if node.tag == "b":
        state.bold = True
    elif node.tag == "i":
        state.italic = True

When you call layout (in show) you'll need so supply an initial state, like this:

layout(nodes)
display_list = state.display_list

Summary

In this post, we taught our browser to understand HTML as a structured tree, not just a flat list of tokens, and we've updated layout to be a recursive tree traversal instead of a linear pass through the document. This has also refactored layout into three functions: layout_open, layout_close, and layout_text. While the changes to the browser seem slight, this new, structured understanding of HTML sets us up to make layout much more structured in the next post.

Exercises

  • HTML documents sometimes start with "doctype declarations". These look like tags that begin with <!doctype and end with >, and are always the first token in a document. Skip these doctype declarations in the parser.
  • Update the HTML lexer and parser to support comments. Comments in HTML begin with <!-- and end with -->, so they look a little bit like tags but actually aren't tags at all and can contain any text, including something that looks open or close tags. Comments should be a new token type and add a new node type to the tree, which layout should ignore.
  • Update the HTML lexer or parser to support entities. Entities in HTML begin an ampersand &, then contain letters and numbers, and end with a semicolon ";"; they resolve to particular characters. Implement support for &amp;, &lt;, &gt;, and &quot;, which expand to ampersand, less than, greater than, and the quotation mark. Should you handle entities in the lexer, the parser, or in an earlier pass? Consider that attribute values can contain entities.
  • For some tags, it doesn't make sense to have one inside the other. For example, it's not clear what it would mean for one paragraph to contain another, so the most common reason for this to happen in a web page is that someone forgot a close tag. Change the parser so that a document like <p>hello<p>world</p> results in two sibling nodes instead of one paragraph inside another. This change should only affect <p> tags directly inside other <p> tags.
  • The attribute parser in this post doesn't handle attributes correctly when the attribute value contains a space (in which case it has to be quoted). Update the attribute parser to handle this case correctly.

Footnotes:

1

Well, full HTML has all sorts of other stuff, but we're skipping it for now.

2

Tracking the parent will turn out to be useful shortly.

3

If you implemented earlier exercises like support for <a>, <small>, and <big>, you will likely have more parts to the state.