build-your-own.org > Books > From Source Code To Machine Code
EBook·Paperback
⟵ prev Contents next ⟶

🔥 My other Book: Build Your Own Redis

02. A Simple Calculator

2.1 Introduction

The first step in building a compiler or interpreter is to parse the source code into tree structures. Most parsing can be done recursively, and you can parse most computer languages once you learn how to do it.

Parsing is the boring part of a compiler, and most production programming languages require a lot of labor work to parse. Fortunately, there is a type of grammar that can be parsed with minimal effort: the S-expression (sexpr). We’ll use it to save the time.

2.2 The Sexpr

An sexpr is either a nested list enclosed in parentheses, or an indivisible element called an “atom”. Some examples:

;; atom
a
123
;; list
(a)
(a b c d)
;; nested list
(a ((b) c))

It naturally represents a tree, and most parsing is converting a string representation into a tree structure, so why not just use the sexpr for syntax? Especially when the sexpr is simpler and easier to parse than alternatives.

Besides the obvious tree structure, sexpr-based syntax often uses prefix operators instead of infix operators. Consider the following cases:

Infix operators:

1 + 2 * 3

Prefix operators:

(+ 1 (* 2 3))

Infix operators are complicated by different operator precedence — a problem that doesn’t exist with prefix operators. Prefix operators are also easier to handle when we need further parsing. That’s why I use sexpr in this book. If you like challenges, you can try using an alternative grammar, such as a C-like grammar with infix operators.

We’ll implement the following operators:

  1. Binary operators: (op a b).

    Arithmetics:

    • (+ 1 2)
    • (- 1 2)
    • (* 1 2)
    • (/ 1 2)

    Comparisons:

    • (lt 1 2), less than.
    • (gt 1 2), greater than.
    • (eq 1 2), equal.
    • (ne 1 2), not equal.

    Booleans:

    • (and a b)
    • (or a b)
  2. Unary operators:

    • (- a), negative.
    • (not a), boolean not.
  3. Conditional: (? expr-cond expr-then expr-else).

    e.g., (? (lt 1 2) "yes" "no") yields “yes”.

  4. (print a b c)

2.3 Start Coding

An sexpr will be parsed into either a Python list or a Python string. For example, the sexpr (a b (+ 1 2)) will be parsed into:

["a", "b", ["+", ["val", 1], ["val", 2]]]

Note that numbers or strings in the sexpr are wrapped by the ["val", x] structure unlike the symbols a or b, this is explained later.

The function parse_expr take the input string and the current offset (idx) into the input as arguments. It advances the offset during parsing until an error is encountered or the sexpr is terminated.

To parse an sexpr, the first step is to determine whether it’s an atom or a list, this is done by looking at the first non-whitespace character.

def parse_expr(s: str, idx: int):
    idx = skip_space(s, idx)
    if s[idx] == '(':
        # a list
        ...
    elif s[idx] == ')':
        raise Exception('bad parenthesis')
    else:
        # an atom
        ...

If it’s a list, parse recursively until the closing parenthesis is met.

def parse_expr(s: str, idx: int):
    idx = skip_space(s, idx)
    if s[idx] == '(':
        # a list
        idx += 1
        l = []
        while True:
            idx = skip_space(s, idx)
            if idx >= len(s):
                raise Exception('unbalanced parenthesis')
            if s[idx] == ')':
                idx += 1
                break

            idx, v = parse_expr(s, idx)
            l.append(v)
        return idx, l
    elif s[idx] == ')':
        raise Exception('bad parenthesis')
    else:
        # an atom
        start = idx
        while idx < len(s) and (not s[idx].isspace()) and s[idx] not in '()':
            idx += 1
        if start == idx:
            raise Exception('empty program')
        return idx, parse_atom(s[start:idx])

Any non-whitespace character is considered part of an atom.

def skip_space(s, idx):
    while idx < len(s) and s[idx].isspace():
        idx += 1
    return idx

Since we are going to evaluate the parsed sexpr, it makes sense to extract values such as numbers and strings from atoms. The “val” string at the list head is for distinguishing parsed values from other atoms (symbol).

# bool, number, string or a symbol
def parse_atom(s):
    # TODO: actually implement this
    import json
    try:
        return ['val', json.loads(s)]
    except json.JSONDecodeError:
        return s

The above function is just a placeholder, we’ll ditch it in later chapters.

As we can see, the pattern of parsing is:

  1. Look at what’s at the beginning and decide on the next step.
  2. Recurse if necessary.

Even for more complicated languages such as C or Java, their grammar can be parsed in this way. Although this might not be obvious to you at this point.

And finally, we need to check that the input is fully exhausted:

def pl_parse(s):
    idx, node = parse_expr(s, 0)
    idx = skip_space(s, idx)
    if idx < len(s):
        raise ValueError('trailing garbage')
    return node

2.4 Expression Evaluation

Evaluating an sexpr is dead simple:

  1. Look at the first item in the list and decide what the operation is.
  2. Evaluate the rest of the items recursively.
  3. Perform the operation.
def pl_eval(node):
    if len(node) == 0:
        raise ValueError('empty list')

    # bool, number, string and etc
    if len(node) == 2 and node[0] == 'val':
        return node[1]

    # binary operators
    import operator
    binops = {
        '+': operator.add,
        '-': operator.sub,
        '*': operator.mul,
        '/': operator.truediv,
        'eq': operator.eq,
        'ne': operator.ne,
        'ge': operator.ge,
        'gt': operator.gt,
        'le': operator.le,
        'lt': operator.lt,
        'and': operator.and_,
        'or': operator.or_,
    }
    if len(node) == 3 and node[0] in binops:
        op = binops[node[0]]
        return op(pl_eval(node[1]), pl_eval(node[2]))

    # unary operators
    unops = {
        '-': operator.neg,
        'not': operator.not_,
    }
    if len(node) == 2 and node[0] in unops:
        op = unops[node[0]]
        return op(pl_eval(node[1]))

    # conditionals
    if len(node) == 4 and node[0] == '?':
        _, cond, yes, no = node
        if pl_eval(cond):
            return pl_eval(yes)
        else:
            return pl_eval(no)

    # print
    if node[0] == 'print':
        return print(*(pl_eval(val) for val in node[1:]))

    raise ValueError('unknown expression')

It’s basically mapping operators to Python operators. The expression evaluator can be used as an (unconventional) calculator.

def test_eval():
    def f(s):
        return pl_eval(pl_parse(s))
    assert f('1') == 1
    assert f('(+ 1 3)') == 4
    assert f('(? (lt 1 3) "yes" "no")') == "yes"
    assert f('(print 1 2 3)') is None

We have implemented a simple calculator, which is the basis of the next chapter project — a programming language with variables, control flows, and functions.

( Report an Error | Ask a Question) @ build-your-own.org


⟵ prev Contents next ⟶