R2. Backtracking Regex (Beginner)

01. Introduction

What you will learn:

  • Breaking down a problem into subproblems.
  • Searching and backtracking.

We have learned how to parse regexes into tree structures in the last chapter:

assert re_parse('') is None
assert re_parse('.') == 'dot'
assert re_parse('a') == 'a'
assert re_parse('ab') == ('cat', 'a', 'b')
assert re_parse('a|b') == ('split', 'a', 'b')
assert re_parse('a+') == ('repeat', 'a', 1, float('inf'))
assert re_parse('a{3,6}') == ('repeat', 'a', 3, 6)
assert re_parse('a|bc') == ('split', 'a', ('cat', 'b', 'c'))

From the tree structure, we can start building a regex engine. The method used in this chapter is called “backtracking”. Backtracking is not the most efficient method for implementing regex, other methods will be introduced in later chapters.

02. Problem Analysis

Often the first step of problem-solving is breaking down the problem into smaller pieces or subproblems. The nature of the tree structure made this even more obvious.

  • Concatenation, ab: Execute the subexpression a, then execute the b from the result position of the a.
  • Alternation, a|b: Combine the results from both a and b.

For concatenations, the first subexpression can generate any number of candidate matches. And the second subexpression depends on the results of the first subexpression, if the second subexpression fails to find a match from a particular position of the results of the first expression, the search will start again from another position if there is any. This is what “backtracking” means.

For repetitions, the situation is the same as concatenations.

In summary:

  1. A subexpression generates a list of candidates from a given string position.
  2. The concatenation operator: for each result from the first subexpression, execute and pass the results from the second subexpression.
  3. The repetition operator is the same as the concatenation operator.
  4. Union the results from alternation subexpressions.

03. Start Coding

The code is short:

# the backtracking method is pretty much brute force.
# can also add some caching to eliminate duplicated idx.
def match_backtrack(node, text, idx):
    if node is None:
        yield idx   # empty string
    elif node == 'dot':
        if idx < len(text):
            yield idx + 1
    elif isinstance(node, str):
        assert len(node) == 1   # single char
        if idx < len(text) and text[idx] == node:
            yield idx + 1
    elif node[0] == 'cat':
        # the `yield from` is equivalent to:
        # for idx1 in match_backtrack_concat(node, text, idx):
        #     yield idx1
        yield from match_backtrack_concat(node, text, idx)
    elif node[0] == 'split':
        yield from match_backtrack(node[1], text, idx)
        yield from match_backtrack(node[2], text, idx)
    elif node[0] == 'repeat':
        yield from match_backtrack_repeat(node, text, idx)
    else:
        assert not 'reachable'

For people unfamiliar with Python, the yield statement suspends the execution of itself and passes a value back to the caller, and the caller is usually looping for the results. The yield from is a shorthand for looping and passing through everything.

Every time a match is found, the result is passed to the caller via yield. The caller can then do something with it, and ask for the next result (by continuing the iteration).

The back-and-forth transfer of control makes backtracking code trivial to write in Python. However, if this is your first time trying to understand backtracking, it is advised to do this without the yield statement and explicitly managing backtracking states — an exercise for people not using Python as well.

The code for concatenation:

def match_backtrack_concat(node, text, idx):
    met = set()
    for idx1 in match_backtrack(node[1], text, idx):
        if idx1 in met:
            continue    # duplication
        met.add(idx1)
        yield from match_backtrack(node[2], text, idx1)

Note that I added a hash set to skip duplicated results of the first subexpression.

The code for repetition is longer:

def match_backtrack_repeat(node, text, idx):
    _, node, rmin, rmax = node
    rmax = min(rmax, RE_REPEAT_LIMIT)
    # the output is buffered and reversed later
    output = []
    if rmin == 0:
        # don't have to match anything
        output.append(idx)
    # positions from the previous step
    start = {idx}
    # try every possible repetition number
    for i in range(1, rmax + 1):
        found = set()
        for idx1 in start:
            for idx2 in match_backtrack(node, text, idx1):
                found.add(idx2)
                if i >= rmin:
                    output.append(idx2)
        # TODO: bail out if the only match is of zero-length
        if not found:
            break
        start = found
    # repetition is greedy, output the most repetitive match first.
    yield from reversed(output)

Repetition is handled by executing the subexpression again and again until it can no longer find a match. Each step starts with the results of the previous step. And the start positions are deduplicated as well.

Another note is that regex repetition is “greedy”, we need to find the match with maximum repetition and output that first.

The function for testing whether a string matches a regex fully:

def re_full_match_bt(node, text):
    for idx in match_backtrack(node, text, 0):
        # idx is the size of the matched prefix
        if idx == len(text):
            # NOTE: the greedy aspect of regexes seems to be irrelevant
            #       if we are only accepting the fully matched text.
            return True
    return False

Finding test cases is an exercise for readers.

04. Closing Remarks

We have learned how to match a regex via backtracking. From there we can add more regex features.

However, the backtracking approach is rather brute force, it can still perform duplicated searches. We can probably add some caching to mitigate this; but don’t be bothered with this, in the next chapter you will be exploring a new method: Nondeterministic Finite Automaton (NFA).