# R1. Parsing Regular Expressions (Beginner)

### 01. Introduction

What you will learn:

• Regex is not magic.
• Trees and structured data.
• Recursion and parsing techniques.

Regex features you will implement:

• `ab`, concatenation.

Match the subexpression `a`, then `b`.

• `a|b`, alternation.

Match either `a` or `b`.

• `a*`, `a+`, `a{n}`, `a{x,y}`, repetition.

Repeat `a` zero or many times, at least one time, `n` times, and `x` to `y` times.

• `(a)`, subexpression.

Quote a subexpression. Expressions can be nested.

• `x`, `.`, match a character.

Match a literal character, match an arbitrary character.

### 02. Problem Analysis

A regex is a piece of string. The string represents a structure. Often we are not working on the string but the structure it represents. The process of converting a string into a structure is called “parsing”.

The structure of a regex is recursive. Recursion implies hierarchy and substructures that are similar to the structure itself. It might not be obvious to you how regex is recursive. Let’s start with something you are familiar with.

The formula `(1+2)+3` is recursive, the second `+` has two subformulas: `1+2` and `3`.

`1+2+3` is equivalent to `(1+2)+3`, only the parentheses are implicit.

`1+2×3` is implicitly `1+(2×3)`, the `×` takes high priority than the `+`.

Now back to the regex. `abc` is implicitly `((a)b)c`, just treat the invisible concatenation operator as the `×`, like `((a)×b)×c`.

Likewise, `a|bc` is `a|(bc)`, like `a+(b×c)`.

The concatenation and the alternation are “infix” operators, their subexpressions are the left and right of the operator. The repetition operator is a “postfix” operator, like the factorial (`n!`).

Three operators, three priorities. The postfix operator is the highest one, followed by the concatenation, then the alternation.

To parse a regex with recursion, the operator of highest priority must be parsed at the lowest hierarchy (deepest). The lowest level of a regex is either a single character or a quoted subexpression. At one level higher, the repetition operator is added (there might be none). Then the concatenation, then the alternation.

### 03. Start Coding

Firstly, we decide the shapes of our parsed structures. They are trees, obviously.

``````assert re_parse('') is None
assert re_parse('.') == 'dot'
assert re_parse('a') == 'a'
assert re_parse('ab') == ('cat', 'a', 'b')
assert re_parse('a|b') == ('split', 'a', 'b')
assert re_parse('a+') == ('repeat', 'a', 1, float('inf'))
assert re_parse('a{3,6}') == ('repeat', 'a', 3, 6)
assert re_parse('a|bc') == ('split', 'a', ('cat', 'b', 'c'))``````

The highest level of parsing is the alternation operator.

Arguments:

• `r`: the regex string.
• `idx`: the current parsing position.
``````# a|b|c...
def parse_split(r, idx):
idx, prev = parse_concat(r, idx)
while idx < len(r):
if r[idx] == ')':
break
assert r[idx] == '|', 'BUG'
idx, node = parse_concat(r, idx + 1)
prev = ('split', prev, node)
return idx, prev``````

At a glance, The `parse_split` takes subexpressions from the `parse_concat` and creates the “split” node.

The `parse_concat` is one level lower. It is very similar to the `parse_split`.

``````# abc...
def parse_concat(r, idx):
prev = None
while idx < len(r):
if r[idx] in '|)':
break
idx, node = parse_node(r, idx)
if prev is None:
prev = node
else:
prev = ('cat', prev, node)
# when the prev is still None, it denotes the empty string
return idx, prev``````

Both the `parse_split` and the `parse_concat` stop at the “`)`”. A “`)`” normally implies that the parsing is inside an outer quoted subexpression, the parsing reached the end of the subexpression and must be returned to the outer parser. Likewise, the `parse_concat` stops at the “`|`” to return to the outer `parse_split`.

Descending into the `parse_node`. It parses either a single character or a quoted subexpression. Then it might add a repetition operator.

``````# parse a single element
def parse_node(r, idx):
ch = r[idx]
idx += 1
assert ch not in '|)'
if ch == '(':
idx, node = parse_split(r, idx)
if idx < len(r) and r[idx] == ')':
idx += 1
else:
raise Exception('unbalanced parenthesis')
elif ch == '.':
node = 'dot'
elif ch in '*+{':
raise Exception('nothing to repeat')
else:
node = ch

idx, node = parse_postfix(r, idx, node)
return idx, node``````

Note that the `parse_split` call is recursing back to the highest hierarchy.

The `parse_postfix` is a lot longer. But it’s just some boring sequential code.

``````# a*, a+, a{x}, a{x,}, a{x,y}
def parse_postfix(r, idx, node):
if idx == len(r) or r[idx] not in '*+{':
return idx, node

ch = r[idx]
idx += 1
if ch == '*':
rmin, rmax = 0, float('inf')
elif ch == '+':
rmin, rmax = 1, float('inf')
else:
# the first number inside the {}
idx, i = parse_int(r, idx)
if i is None:
raise Exception('expect int')
rmin = rmax = i
# the optional second number
if idx < len(r) and r[idx] == ',':
idx, j = parse_int(r, idx + 1)
rmax = j if (j is not None) else float('inf')
# close the brace
if idx < len(r) and r[idx] == '}':
idx += 1
else:
raise Exception('unbalanced brace')

# sanity checks
if rmax < rmin:
raise Exception('min repeat greater than max repeat')
if rmin > RE_REPEAT_LIMIT:
raise Exception('the repetition number is too large')

node = ('repeat', node, rmin, rmax)
return idx, node``````
``````def parse_int(r, idx):
save = idx
while idx < len(r) and r[idx].isdigit():
idx += 1
return idx, int(r[save:idx]) if save != idx else None``````

The actual interface:

``````def re_parse(r):
idx, node = parse_split(r, 0)
if idx != len(r):
# parsing stopped at a bad ")"
raise Exception('unexpected ")"')
return node``````

### 04. Closing Remarks

We have learned how to parse regex with recursion. That knowledge is also useful for parsing other stuff.

Although parsing often occupies a significant number of pages in compiler construction books, lots of real-world languages can be parsed with just recursion, and often the simplest way to parse something is through handcrafted recursion code.

The next chapter will teach you how to execute the parsed regex via “backtracking”, which also uses recursion.