R1. Parsing Regular Expressions (Beginner)
01. Introduction
What you will learn:
- Regex is not magic.
- Trees and structured data.
- Recursion and parsing techniques.
Regex features you will implement:
ab
, concatenation.Match the subexpression
a
, thenb
.a|b
, alternation.Match either
a
orb
.a*
,a+
,a{n}
,a{x,y}
, repetition.Repeat
a
zero or many times, at least one time,n
times, andx
toy
times.(a)
, subexpression.Quote a subexpression. Expressions can be nested.
x
,.
, match a character.Match a literal character, match an arbitrary character.
02. Problem Analysis
A regex is a piece of string. The string represents a structure. Often we are not working on the string but the structure it represents. The process of converting a string into a structure is called “parsing”.
The structure of a regex is recursive. Recursion implies hierarchy and substructures that are similar to the structure itself. It might not be obvious to you how regex is recursive. Let’s start with something you are familiar with.
The formula (1+2)+3
is recursive, the second
+
has two subformulas: 1+2
and
3
.
1+2+3
is equivalent to (1+2)+3
, only the
parentheses are implicit.
1+2×3
is implicitly 1+(2×3)
, the
×
takes high priority than the +
.
Now back to the regex. abc
is implicitly
((a)b)c
, just treat the invisible concatenation
operator as the ×
, like ((a)×b)×c
.
Likewise, a|bc
is a|(bc)
, like
a+(b×c)
.
The concatenation and the alternation are “infix” operators, their
subexpressions are the left and right of the operator. The repetition
operator is a “postfix” operator, like the factorial
(n!
).
Three operators, three priorities. The postfix operator is the highest one, followed by the concatenation, then the alternation.
To parse a regex with recursion, the operator of highest priority must be parsed at the lowest hierarchy (deepest). The lowest level of a regex is either a single character or a quoted subexpression. At one level higher, the repetition operator is added (there might be none). Then the concatenation, then the alternation.
03. Start Coding
Firstly, we decide the shapes of our parsed structures. They are trees, obviously.
assert re_parse('') is None
assert re_parse('.') == 'dot'
assert re_parse('a') == 'a'
assert re_parse('ab') == ('cat', 'a', 'b')
assert re_parse('a|b') == ('split', 'a', 'b')
assert re_parse('a+') == ('repeat', 'a', 1, float('inf'))
assert re_parse('a{3,6}') == ('repeat', 'a', 3, 6)
assert re_parse('a|bc') == ('split', 'a', ('cat', 'b', 'c'))
The highest level of parsing is the alternation operator.
Arguments:
r
: the regex string.idx
: the current parsing position.
# a|b|c...
def parse_split(r, idx):
= parse_concat(r, idx)
idx, prev while idx < len(r):
if r[idx] == ')':
# return to the outer parse_node
break
assert r[idx] == '|', 'BUG'
= parse_concat(r, idx + 1)
idx, node = ('split', prev, node)
prev return idx, prev
At a glance, The parse_split
takes subexpressions from
the parse_concat
and creates the “split” node.
The parse_concat
is one level lower. It is very similar
to the parse_split
.
# abc...
def parse_concat(r, idx):
= None
prev while idx < len(r):
if r[idx] in '|)':
# return to the outer parse_split or parse_node
break
= parse_node(r, idx)
idx, node if prev is None:
= node
prev else:
= ('cat', prev, node)
prev # when the prev is still None, it denotes the empty string
return idx, prev
Both the parse_split
and the parse_concat
stop at the “)
”. A “)
” normally implies that
the parsing is inside an outer quoted subexpression, the parsing reached
the end of the subexpression and must be returned to the outer parser.
Likewise, the parse_concat
stops at the “|
” to
return to the outer parse_split
.
Descending into the parse_node
. It parses either a
single character or a quoted subexpression. Then it might add a
repetition operator.
# parse a single element
def parse_node(r, idx):
= r[idx]
ch += 1
idx assert ch not in '|)'
if ch == '(':
= parse_split(r, idx)
idx, node if idx < len(r) and r[idx] == ')':
+= 1
idx else:
raise Exception('unbalanced parenthesis')
elif ch == '.':
= 'dot'
node elif ch in '*+{':
raise Exception('nothing to repeat')
else:
= ch
node
= parse_postfix(r, idx, node)
idx, node return idx, node
Note that the parse_split
call is recursing back to the
highest hierarchy.
The parse_postfix
is a lot longer. But it’s just some
boring sequential code.
# a*, a+, a{x}, a{x,}, a{x,y}
def parse_postfix(r, idx, node):
if idx == len(r) or r[idx] not in '*+{':
return idx, node
= r[idx]
ch += 1
idx if ch == '*':
= 0, float('inf')
rmin, rmax elif ch == '+':
= 1, float('inf')
rmin, rmax else:
# the first number inside the {}
= parse_int(r, idx)
idx, i if i is None:
raise Exception('expect int')
= rmax = i
rmin # the optional second number
if idx < len(r) and r[idx] == ',':
= parse_int(r, idx + 1)
idx, j = j if (j is not None) else float('inf')
rmax # close the brace
if idx < len(r) and r[idx] == '}':
+= 1
idx else:
raise Exception('unbalanced brace')
# sanity checks
if rmax < rmin:
raise Exception('min repeat greater than max repeat')
if rmin > RE_REPEAT_LIMIT:
raise Exception('the repetition number is too large')
= ('repeat', node, rmin, rmax)
node return idx, node
def parse_int(r, idx):
= idx
save while idx < len(r) and r[idx].isdigit():
+= 1
idx return idx, int(r[save:idx]) if save != idx else None
The actual interface:
def re_parse(r):
= parse_split(r, 0)
idx, node if idx != len(r):
# parsing stopped at a bad ")"
raise Exception('unexpected ")"')
return node
04. Closing Remarks
We have learned how to parse regex with recursion. That knowledge is also useful for parsing other stuff.
Although parsing often occupies a significant number of pages in compiler construction books, lots of real-world languages can be parsed with just recursion, and often the simplest way to parse something is through handcrafted recursion code.
The next chapter will teach you how to execute the parsed regex via “backtracking”, which also uses recursion.