R2. Backtracking Regex (Beginner)
01. Introduction
What you will learn:
- Breaking down a problem into subproblems.
- Searching and backtracking.
We have learned how to parse regexes into tree structures in the last chapter:
assert re_parse('') is None
assert re_parse('.') == 'dot'
assert re_parse('a') == 'a'
assert re_parse('ab') == ('cat', 'a', 'b')
assert re_parse('a|b') == ('split', 'a', 'b')
assert re_parse('a+') == ('repeat', 'a', 1, float('inf'))
assert re_parse('a{3,6}') == ('repeat', 'a', 3, 6)
assert re_parse('a|bc') == ('split', 'a', ('cat', 'b', 'c'))
From the tree structure, we can start building a regex engine. The method used in this chapter is called “backtracking”. Backtracking is not the most efficient method for implementing regex, other methods will be introduced in later chapters.
02. Problem Analysis
Often the first step of problem-solving is breaking down the problem into smaller pieces or subproblems. The nature of the tree structure made this even more obvious.
- Concatenation,
ab
: Execute the subexpressiona
, then execute theb
from the result position of thea
. - Alternation,
a|b
: Combine the results from botha
andb
.
For concatenations, the first subexpression can generate any number of candidate matches. And the second subexpression depends on the results of the first subexpression, if the second subexpression fails to find a match from a particular position of the results of the first expression, the search will start again from another position if there is any. This is what “backtracking” means.
For repetitions, the situation is the same as concatenations.
In summary:
- A subexpression generates a list of candidates from a given string position.
- The concatenation operator: for each result from the first subexpression, execute and pass the results from the second subexpression.
- The repetition operator is the same as the concatenation operator.
- Union the results from alternation subexpressions.
03. Start Coding
The code is short:
# the backtracking method is pretty much brute force.
# can also add some caching to eliminate duplicated idx.
def match_backtrack(node, text, idx):
if node is None:
yield idx # empty string
elif node == 'dot':
if idx < len(text):
yield idx + 1
elif isinstance(node, str):
assert len(node) == 1 # single char
if idx < len(text) and text[idx] == node:
yield idx + 1
elif node[0] == 'cat':
# the `yield from` is equivalent to:
# for idx1 in match_backtrack_concat(node, text, idx):
# yield idx1
yield from match_backtrack_concat(node, text, idx)
elif node[0] == 'split':
yield from match_backtrack(node[1], text, idx)
yield from match_backtrack(node[2], text, idx)
elif node[0] == 'repeat':
yield from match_backtrack_repeat(node, text, idx)
else:
assert not 'reachable'
For people unfamiliar with Python, the yield
statement
suspends the execution of itself and passes a value back to the caller,
and the caller is usually looping for the results. The
yield from
is a shorthand for looping and passing through
everything.
Every time a match is found, the result is passed to the caller via
yield
. The caller can then do something with it, and ask
for the next result (by continuing the iteration).
The back-and-forth transfer of control makes backtracking code
trivial to write in Python. However, if this is your first time trying
to understand backtracking, it is advised to do this without the
yield
statement and explicitly managing backtracking states
— an exercise for people not using Python as well.
The code for concatenation:
def match_backtrack_concat(node, text, idx):
= set()
met for idx1 in match_backtrack(node[1], text, idx):
if idx1 in met:
continue # duplication
met.add(idx1)yield from match_backtrack(node[2], text, idx1)
Note that I added a hash set to skip duplicated results of the first subexpression.
The code for repetition is longer:
def match_backtrack_repeat(node, text, idx):
= node
_, node, rmin, rmax = min(rmax, RE_REPEAT_LIMIT)
rmax # the output is buffered and reversed later
= []
output if rmin == 0:
# don't have to match anything
output.append(idx)# positions from the previous step
= {idx}
start # try every possible repetition number
for i in range(1, rmax + 1):
= set()
found for idx1 in start:
for idx2 in match_backtrack(node, text, idx1):
found.add(idx2)if i >= rmin:
output.append(idx2)# TODO: bail out if the only match is of zero-length
if not found:
break
= found
start # repetition is greedy, output the most repetitive match first.
yield from reversed(output)
Repetition is handled by executing the subexpression again and again until it can no longer find a match. Each step starts with the results of the previous step. And the start positions are deduplicated as well.
Another note is that regex repetition is “greedy”, we need to find the match with maximum repetition and output that first.
The function for testing whether a string matches a regex fully:
def re_full_match_bt(node, text):
for idx in match_backtrack(node, text, 0):
# idx is the size of the matched prefix
if idx == len(text):
# NOTE: the greedy aspect of regexes seems to be irrelevant
# if we are only accepting the fully matched text.
return True
return False
Finding test cases is an exercise for readers.
04. Closing Remarks
We have learned how to match a regex via backtracking. From there we can add more regex features.
However, the backtracking approach is rather brute force, it can still perform duplicated searches. We can probably add some caching to mitigate this; but don’t be bothered with this, in the next chapter you will be exploring a new method: Nondeterministic Finite Automaton (NFA).