R3. Graph Traversal and Regex (Intermediate)

01. Introduction

What you will learn:

  • An alternative way of breaking down the problem.
  • Graph traversal.
  • Nondeterministic finite automatons (NFA).

We have learned how to execute regexes via backtracking in the last chapter. Now we’ll ditch backtracking and explore a more efficient method.

As said in the last chapter, the first step is breaking down problems into subproblems. This time we do it differently, instead of breaking down the regex, we break down the input string.

02. The Graph Representation of Regex

Any regex can be represented as a graph whose links between nodes are characters from input strings.

ab

start --a--> node --b--> end

a|b

       +--a--+
       |     |
start -+     +-> end
       |     |
       +--b--+

a+

start ----a--+--> end
       ^     |
       |     |
       +-----+

There are 2 kinds of links:

  1. Links can only be passed with a certain input character.
  2. “Free” links that can be passed without consuming an input character.

For a given input string, the task is to determine whether the end can be reached. There are too many ways to traverse a graph, especially when the graph contains loops. Let’s start with a smaller problem.

Given the first character of the input string, start from the “start” node, what’s the possible position you can travel to? (There might be multiple positions because of split paths and loops.) Record all possible positions, they are now the starting positions of the second character. We now have broke down the problem into step-by-step graph traversals!

We walk from one set of positions to another set of positions, this technique is called “nondeterministic finite automaton” (NFA) in textbooks.

03. Enforcing Constraints in NFA

The number of repetitions in a regex can be constrained, such as a{3,6}. We’ll add a special kind of node to enforce this.

start --> door_in --> subgraph --> door_out --> end
             ^                         |
             |                         |
             +-------------------------+

The door_in is free and leads only to the repetition subgraph. It also separates the subgraph from the start node.

The door_out is a special node — when you reach it, a counter is increased, and it checks the counter against the constraints:

  1. If it is below the minimum repetition, go back to the door_in.
  2. If it is equal to the maximum repetition, exit to the end and discard that counter.
  3. Otherwise, it can go to both the door_in and the end.

The “positions” mentioned before are no longer just node references, as we have added counters too:

(node_id, set_of_counters)

04. Converting Trees to Graphs

Our analysis is done. Let’s start coding. The first step is converting trees to graphs.

# Build a graph from a node.
# The graph is entered/exited via the start/end node.
# The id2node is a mapping from integer IDs to the graph nodes.
# The graph node is either a list of links or a special "boss" node.
def nfa_make(node, start, end, id2node):
    if node is None:
        start.append((None, end))
    elif node == 'dot':
        start.append(('dot', end))
    elif isinstance(node, str):
        start.append((node, end))
    elif node[0] == 'cat':
        # connect the two subgraphs via a middle node
        middle = []
        id2node[id(middle)] = middle
        nfa_make(node[1], start, middle, id2node)
        nfa_make(node[2], middle, end, id2node)
    elif node[0] == 'split':
        # connect with both subgraphs
        nfa_make(node[1], start, end, id2node)
        nfa_make(node[2], start, end, id2node)
    elif node[0] == 'repeat':
        nfa_make_repeat(node, start, end, id2node)
    else:
        assert not 'reachable'

This is easily done with recursion. The id2node is a map, it maps a node ID to the node itself, we’ll refer to nodes by their IDs later.

The special node for repetition. Let’s call it the “boss” node.

def nfa_make_repeat(node, start, end, id2node):
    # unpack
    _, node, rmin, rmax = node
    rmax = min(rmax, RE_REPEAT_LIMIT)
    # the door_in only leads to the subgraph,
    # it is necessary for repetitions to work.
    door_in = []
    # the door_out leads to either the door_in or the end.
    door_out = ('boss', door_in, end, rmin, rmax)
    id2node[id(door_in)] = door_in
    id2node[id(door_out)] = door_out
    # the subgraph between the door_in and the door_out
    nfa_make(node, door_in, door_out, id2node)
    # links from the start node
    start.append((None, door_in))
    if rmin == 0:
        start.append((None, end))

05. Traversing the Graph

This gives you a high-level overview:

def re_full_match_nfa(node, text):
    # build the graph
    start, end = [], []
    id2node = {id(start): start, id(end): end}
    nfa_make(node, start, end, id2node)
    # explore and expand the initial position set
    node_set = {(id(start), ())}
    nfa_expand(node_set, id2node)
    for ch in text:
        # move to the next position set
        node_set = nfa_step(node_set, ch, id2node)
        nfa_expand(node_set, id2node)
    return (id(end), ()) in node_set

Remember there are 2 kinds of links:

  1. The nfa_step consumes an input character and finds the next possible position set.
  2. The nfa_expand traverses free links.

Listing the nfa_step:

# the next position set from the input character
def nfa_step(node_set, ch, id2node):
    assert len(ch) == 1
    next_nodes = set()
    for node, kv in node_set:
        node = id2node[node]
        # only normal nodes since bosses were handled by the nfa_expand.
        assert not isinstance(node, tuple), 'unexpected boss'
        for cond, dst in node:
            if cond == 'dot' or cond == ch:
                next_nodes.add((id(dst), kv))
    return next_nodes

The node ID is used instead of the node itself, this is for the hash set.

The nfa_expand is more complicated.

# expand the position set via free links and boss nodes
def nfa_expand(node_set, id2node):
    start = list(node_set)
    while start:
        new_nodes = []
        for node, kv in start:
            node = id2node[node]
            if isinstance(node, tuple) and node[0] == 'boss':
                # a boss, replace it with the outcome
                node_set.remove((id(node), kv))
                for dst, kv in nfa_boss(node, kv):
                    new_nodes.append((id(dst), kv))
            else:
                # explore new nodes via free links
                for cond, dst in node:
                    if cond is None:
                        new_nodes.append((id(dst), kv))

        # newly added nodes will be used for the next iteration
        start = []
        for state in new_nodes:
            if state not in node_set:
                node_set.add(state)
                start.append(state)

At a high level, it enumerates each free link, then adds their destinations to the position set. The newly discovered nodes can lead to even more free links, thus, we need to repeat this in a loop until no more nodes can be reached via free links.

It also handles the boss node since the boss gives free links too.

def nfa_boss(node, kv):
    _, door_in, end, rmin, rmax = node
    key = id(door_in)   # this is unique for identifying the boss
    kv, cnt = kv_increase(kv, key)
    if cnt < rmax:
        # repeat the level again
        yield (door_in, kv)
    if rmin <= cnt <= rmax:
        # pass the level
        yield (end, kv_delete(kv, key))

The counter set is just a sorted tuple of key-value pairs:

def kv_increase(kv, key):
    kv = dict(kv)
    val = kv.get(key, 0) + 1
    kv[key] = val
    return tuple(sorted(kv.items())), val

def kv_delete(kv, key):
    return tuple((k, v) for k, v in kv if k != key)

06. Closing Remarks

One can learn further by comparing different methods of doing things. Let’s compare the backtracking and the graph approach:

  1. How the problem is reduced? Backtracking breaks down the regex into subexpressions. NFA breaks down the input string into step-by-step graph traversals.
  2. The order of search. Backtracking is a deep-first search (DFS) on subexpressions. NFA is a breath-first search (BFS) on graphs.

Some lessons can be learned from this:

  1. Looking for different ways of breaking down the problem.
  2. Why is the NFA more efficient? DFS vs BFS?

Further reading on regex: https://swtch.com/~rsc/regexp/