R1. Parsing Regular Expressions (Beginner)

01. Introduction

What you will learn:

  • Regex is not magic.
  • Trees and structured data.
  • Recursion and parsing techniques.

Regex features you will implement:

  • ab, concatenation.

    Match the subexpression a, then b.

  • a|b, alternation.

    Match either a or b.

  • a*, a+, a{n}, a{x,y}, repetition.

    Repeat a zero or many times, at least one time, n times, and x to y times.

  • (a), subexpression.

    Quote a subexpression. Expressions can be nested.

  • x, ., match a character.

    Match a literal character, match an arbitrary character.

02. Problem Analysis

A regex is a piece of string. The string represents a structure. Often we are not working on the string but the structure it represents. The process of converting a string into a structure is called “parsing”.

The structure of a regex is recursive. Recursion implies hierarchy and substructures that are similar to the structure itself. It might not be obvious to you how regex is recursive. Let’s start with something you are familiar with.

The formula (1+2)+3 is recursive, the second + has two subformulas: 1+2 and 3.

1+2+3 is equivalent to (1+2)+3, only the parentheses are implicit.

1+2×3 is implicitly 1+(2×3), the × takes high priority than the +.

Now back to the regex. abc is implicitly ((a)b)c, just treat the invisible concatenation operator as the ×, like ((a)×b)×c.

Likewise, a|bc is a|(bc), like a+(b×c).

The concatenation and the alternation are “infix” operators, their subexpressions are the left and right of the operator. The repetition operator is a “postfix” operator, like the factorial (n!).

Three operators, three priorities. The postfix operator is the highest one, followed by the concatenation, then the alternation.

To parse a regex with recursion, the operator of highest priority must be parsed at the lowest hierarchy (deepest). The lowest level of a regex is either a single character or a quoted subexpression. At one level higher, the repetition operator is added (there might be none). Then the concatenation, then the alternation.

03. Start Coding

Firstly, we decide the shapes of our parsed structures. They are trees, obviously.

assert re_parse('') is None
assert re_parse('.') == 'dot'
assert re_parse('a') == 'a'
assert re_parse('ab') == ('cat', 'a', 'b')
assert re_parse('a|b') == ('split', 'a', 'b')
assert re_parse('a+') == ('repeat', 'a', 1, float('inf'))
assert re_parse('a{3,6}') == ('repeat', 'a', 3, 6)
assert re_parse('a|bc') == ('split', 'a', ('cat', 'b', 'c'))

The highest level of parsing is the alternation operator.

Arguments:

  • r: the regex string.
  • idx: the current parsing position.
# a|b|c...
def parse_split(r, idx):
    idx, prev = parse_concat(r, idx)
    while idx < len(r):
        if r[idx] == ')':
            # return to the outer parse_node
            break
        assert r[idx] == '|', 'BUG'
        idx, node = parse_concat(r, idx + 1)
        prev = ('split', prev, node)
    return idx, prev

At a glance, The parse_split takes subexpressions from the parse_concat and creates the “split” node.

The parse_concat is one level lower. It is very similar to the parse_split.

# abc...
def parse_concat(r, idx):
    prev = None
    while idx < len(r):
        if r[idx] in '|)':
            # return to the outer parse_split or parse_node
            break
        idx, node = parse_node(r, idx)
        if prev is None:
            prev = node
        else:
            prev = ('cat', prev, node)
    # when the prev is still None, it denotes the empty string
    return idx, prev

Both the parse_split and the parse_concat stop at the “)”. A “)” normally implies that the parsing is inside an outer quoted subexpression, the parsing reached the end of the subexpression and must be returned to the outer parser. Likewise, the parse_concat stops at the “|” to return to the outer parse_split.

Descending into the parse_node. It parses either a single character or a quoted subexpression. Then it might add a repetition operator.

# parse a single element
def parse_node(r, idx):
    ch = r[idx]
    idx += 1
    assert ch not in '|)'
    if ch == '(':
        idx, node = parse_split(r, idx)
        if idx < len(r) and r[idx] == ')':
            idx += 1
        else:
            raise Exception('unbalanced parenthesis')
    elif ch == '.':
        node = 'dot'
    elif ch in '*+{':
        raise Exception('nothing to repeat')
    else:
        node = ch

    idx, node = parse_postfix(r, idx, node)
    return idx, node

Note that the parse_split call is recursing back to the highest hierarchy.

The parse_postfix is a lot longer. But it’s just some boring sequential code.

# a*, a+, a{x}, a{x,}, a{x,y}
def parse_postfix(r, idx, node):
    if idx == len(r) or r[idx] not in '*+{':
        return idx, node

    ch = r[idx]
    idx += 1
    if ch == '*':
        rmin, rmax = 0, float('inf')
    elif ch == '+':
        rmin, rmax = 1, float('inf')
    else:
        # the first number inside the {}
        idx, i = parse_int(r, idx)
        if i is None:
            raise Exception('expect int')
        rmin = rmax = i
        # the optional second number
        if idx < len(r) and r[idx] == ',':
            idx, j = parse_int(r, idx + 1)
            rmax = j if (j is not None) else float('inf')
        # close the brace
        if idx < len(r) and r[idx] == '}':
            idx += 1
        else:
            raise Exception('unbalanced brace')

    # sanity checks
    if rmax < rmin:
        raise Exception('min repeat greater than max repeat')
    if rmin > RE_REPEAT_LIMIT:
        raise Exception('the repetition number is too large')

    node = ('repeat', node, rmin, rmax)
    return idx, node
def parse_int(r, idx):
    save = idx
    while idx < len(r) and r[idx].isdigit():
        idx += 1
    return idx, int(r[save:idx]) if save != idx else None

The actual interface:

def re_parse(r):
    idx, node = parse_split(r, 0)
    if idx != len(r):
        # parsing stopped at a bad ")"
        raise Exception('unexpected ")"')
    return node

04. Closing Remarks

We have learned how to parse regex with recursion. That knowledge is also useful for parsing other stuff.

Although parsing often occupies a significant number of pages in compiler construction books, lots of real-world languages can be parsed with just recursion, and often the simplest way to parse something is through handcrafted recursion code.

The next chapter will teach you how to execute the parsed regex via “backtracking”, which also uses recursion.