R1. Parsing Regular Expressions (Beginner)
01. Introduction
What you will learn:
- Regex is not magic.
- Trees and structured data.
- Recursion and parsing techniques.
Regex features you will implement:
ab, concatenation.Match the subexpression
a, thenb.a|b, alternation.Match either
aorb.a*,a+,a{n},a{x,y}, repetition.Repeat
azero or many times, at least one time,ntimes, andxtoytimes.(a), subexpression.Quote a subexpression. Expressions can be nested.
x,., match a character.Match a literal character, match an arbitrary character.
02. Problem Analysis
A regex is a piece of string. The string represents a structure. Often we are not working on the string but the structure it represents. The process of converting a string into a structure is called “parsing”.
The structure of a regex is recursive. Recursion implies hierarchy and substructures that are similar to the structure itself. It might not be obvious to you how regex is recursive. Let’s start with something you are familiar with.
The formula (1+2)+3 is recursive, the second
+ has two subformulas: 1+2 and
3.
1+2+3 is equivalent to (1+2)+3, only the
parentheses are implicit.
1+2×3 is implicitly 1+(2×3), the
× takes high priority than the +.
Now back to the regex. abc is implicitly
((a)b)c, just treat the invisible concatenation
operator as the ×, like ((a)×b)×c.
Likewise, a|bc is a|(bc), like
a+(b×c).
The concatenation and the alternation are “infix” operators, their
subexpressions are the left and right of the operator. The repetition
operator is a “postfix” operator, like the factorial
(n!).
Three operators, three priorities. The postfix operator is the highest one, followed by the concatenation, then the alternation.
To parse a regex with recursion, the operator of highest priority must be parsed at the lowest hierarchy (deepest). The lowest level of a regex is either a single character or a quoted subexpression. At one level higher, the repetition operator is added (there might be none). Then the concatenation, then the alternation.
03. Start Coding
Firstly, we decide the shapes of our parsed structures. They are trees, obviously.
assert re_parse('') is None
assert re_parse('.') == 'dot'
assert re_parse('a') == 'a'
assert re_parse('ab') == ('cat', 'a', 'b')
assert re_parse('a|b') == ('split', 'a', 'b')
assert re_parse('a+') == ('repeat', 'a', 1, float('inf'))
assert re_parse('a{3,6}') == ('repeat', 'a', 3, 6)
assert re_parse('a|bc') == ('split', 'a', ('cat', 'b', 'c'))The highest level of parsing is the alternation operator.
Arguments:
r: the regex string.idx: the current parsing position.
# a|b|c...
def parse_split(r, idx):
idx, prev = parse_concat(r, idx)
while idx < len(r):
if r[idx] == ')':
# return to the outer parse_node
break
assert r[idx] == '|', 'BUG'
idx, node = parse_concat(r, idx + 1)
prev = ('split', prev, node)
return idx, prevAt a glance, The parse_split takes subexpressions from
the parse_concat and creates the “split” node.
The parse_concat is one level lower. It is very similar
to the parse_split.
# abc...
def parse_concat(r, idx):
prev = None
while idx < len(r):
if r[idx] in '|)':
# return to the outer parse_split or parse_node
break
idx, node = parse_node(r, idx)
if prev is None:
prev = node
else:
prev = ('cat', prev, node)
# when the prev is still None, it denotes the empty string
return idx, prevBoth the parse_split and the parse_concat
stop at the “)”. A “)” normally implies that
the parsing is inside an outer quoted subexpression, the parsing reached
the end of the subexpression and must be returned to the outer parser.
Likewise, the parse_concat stops at the “|” to
return to the outer parse_split.
Descending into the parse_node. It parses either a
single character or a quoted subexpression. Then it might add a
repetition operator.
# parse a single element
def parse_node(r, idx):
ch = r[idx]
idx += 1
assert ch not in '|)'
if ch == '(':
idx, node = parse_split(r, idx)
if idx < len(r) and r[idx] == ')':
idx += 1
else:
raise Exception('unbalanced parenthesis')
elif ch == '.':
node = 'dot'
elif ch in '*+{':
raise Exception('nothing to repeat')
else:
node = ch
idx, node = parse_postfix(r, idx, node)
return idx, nodeNote that the parse_split call is recursing back to the
highest hierarchy.
The parse_postfix is a lot longer. But it’s just some
boring sequential code.
# a*, a+, a{x}, a{x,}, a{x,y}
def parse_postfix(r, idx, node):
if idx == len(r) or r[idx] not in '*+{':
return idx, node
ch = r[idx]
idx += 1
if ch == '*':
rmin, rmax = 0, float('inf')
elif ch == '+':
rmin, rmax = 1, float('inf')
else:
# the first number inside the {}
idx, i = parse_int(r, idx)
if i is None:
raise Exception('expect int')
rmin = rmax = i
# the optional second number
if idx < len(r) and r[idx] == ',':
idx, j = parse_int(r, idx + 1)
rmax = j if (j is not None) else float('inf')
# close the brace
if idx < len(r) and r[idx] == '}':
idx += 1
else:
raise Exception('unbalanced brace')
# sanity checks
if rmax < rmin:
raise Exception('min repeat greater than max repeat')
if rmin > RE_REPEAT_LIMIT:
raise Exception('the repetition number is too large')
node = ('repeat', node, rmin, rmax)
return idx, nodedef parse_int(r, idx):
save = idx
while idx < len(r) and r[idx].isdigit():
idx += 1
return idx, int(r[save:idx]) if save != idx else NoneThe actual interface:
def re_parse(r):
idx, node = parse_split(r, 0)
if idx != len(r):
# parsing stopped at a bad ")"
raise Exception('unexpected ")"')
return node04. Closing Remarks
We have learned how to parse regex with recursion. That knowledge is also useful for parsing other stuff.
Although parsing often occupies a significant number of pages in compiler construction books, lots of real-world languages can be parsed with just recursion, and often the simplest way to parse something is through handcrafted recursion code.
The next chapter will teach you how to execute the parsed regex via “backtracking”, which also uses recursion.