R3. Graph Traversal and Regex (Intermediate)
01. Introduction
What you will learn:
- An alternative way of breaking down the problem.
- Graph traversal.
- Nondeterministic finite automatons (NFA).
We have learned how to execute regexes via backtracking in the last chapter. Now we’ll ditch backtracking and explore a more efficient method.
As said in the last chapter, the first step is breaking down problems into subproblems. This time we do it differently, instead of breaking down the regex, we break down the input string.
02. The Graph Representation of Regex
Any regex can be represented as a graph whose links between nodes are characters from input strings.
ab
start --a--> node --b--> end
a|b
+--a--+
| |
start -+ +-> end
| |
+--b--+
a+
start ----a--+--> end
^ |
| |
+-----+
There are 2 kinds of links:
- Links can only be passed with a certain input character.
- “Free” links that can be passed without consuming an input character.
For a given input string, the task is to determine whether the end can be reached. There are too many ways to traverse a graph, especially when the graph contains loops. Let’s start with a smaller problem.
Given the first character of the input string, start from the “start” node, what’s the possible position you can travel to? (There might be multiple positions because of split paths and loops.) Record all possible positions, they are now the starting positions of the second character. We now have broke down the problem into step-by-step graph traversals!
We walk from one set of positions to another set of positions, this technique is called “nondeterministic finite automaton” (NFA) in textbooks.
03. Enforcing Constraints in NFA
The number of repetitions in a regex can be constrained, such as
a{3,6}
. We’ll add a special kind of node to enforce
this.
start --> door_in --> subgraph --> door_out --> end
^ |
| |
+-------------------------+
The door_in
is free and leads only to the repetition
subgraph. It also separates the subgraph from the start node.
The door_out
is a special node — when you reach it, a
counter is increased, and it checks the counter against the
constraints:
- If it is below the minimum repetition, go back to the
door_in
. - If it is equal to the maximum repetition, exit to the end and discard that counter.
- Otherwise, it can go to both the
door_in
and the end.
The “positions” mentioned before are no longer just node references, as we have added counters too:
(node_id, set_of_counters)
04. Converting Trees to Graphs
Our analysis is done. Let’s start coding. The first step is converting trees to graphs.
# Build a graph from a node.
# The graph is entered/exited via the start/end node.
# The id2node is a mapping from integer IDs to the graph nodes.
# The graph node is either a list of links or a special "boss" node.
def nfa_make(node, start, end, id2node):
if node is None:
None, end))
start.append((elif node == 'dot':
'dot', end))
start.append((elif isinstance(node, str):
start.append((node, end))elif node[0] == 'cat':
# connect the two subgraphs via a middle node
= []
middle id(middle)] = middle
id2node[1], start, middle, id2node)
nfa_make(node[2], middle, end, id2node)
nfa_make(node[elif node[0] == 'split':
# connect with both subgraphs
1], start, end, id2node)
nfa_make(node[2], start, end, id2node)
nfa_make(node[elif node[0] == 'repeat':
nfa_make_repeat(node, start, end, id2node)else:
assert not 'reachable'
This is easily done with recursion. The id2node
is a
map, it maps a node ID to the node itself, we’ll refer to nodes by their
IDs later.
The special node for repetition. Let’s call it the “boss” node.
def nfa_make_repeat(node, start, end, id2node):
# unpack
= node
_, node, rmin, rmax = min(rmax, RE_REPEAT_LIMIT)
rmax # the door_in only leads to the subgraph,
# it is necessary for repetitions to work.
= []
door_in # the door_out leads to either the door_in or the end.
= ('boss', door_in, end, rmin, rmax)
door_out id(door_in)] = door_in
id2node[id(door_out)] = door_out
id2node[# the subgraph between the door_in and the door_out
nfa_make(node, door_in, door_out, id2node)# links from the start node
None, door_in))
start.append((if rmin == 0:
None, end)) start.append((
05. Traversing the Graph
This gives you a high-level overview:
def re_full_match_nfa(node, text):
# build the graph
= [], []
start, end = {id(start): start, id(end): end}
id2node
nfa_make(node, start, end, id2node)# explore and expand the initial position set
= {(id(start), ())}
node_set
nfa_expand(node_set, id2node)for ch in text:
# move to the next position set
= nfa_step(node_set, ch, id2node)
node_set
nfa_expand(node_set, id2node)return (id(end), ()) in node_set
Remember there are 2 kinds of links:
- The
nfa_step
consumes an input character and finds the next possible position set. - The
nfa_expand
traverses free links.
Listing the nfa_step
:
# the next position set from the input character
def nfa_step(node_set, ch, id2node):
assert len(ch) == 1
= set()
next_nodes for node, kv in node_set:
= id2node[node]
node # only normal nodes since bosses were handled by the nfa_expand.
assert not isinstance(node, tuple), 'unexpected boss'
for cond, dst in node:
if cond == 'dot' or cond == ch:
id(dst), kv))
next_nodes.add((return next_nodes
The node ID is used instead of the node itself, this is for the hash set.
The nfa_expand
is more complicated.
# expand the position set via free links and boss nodes
def nfa_expand(node_set, id2node):
= list(node_set)
start while start:
= []
new_nodes for node, kv in start:
= id2node[node]
node if isinstance(node, tuple) and node[0] == 'boss':
# a boss, replace it with the outcome
id(node), kv))
node_set.remove((for dst, kv in nfa_boss(node, kv):
id(dst), kv))
new_nodes.append((else:
# explore new nodes via free links
for cond, dst in node:
if cond is None:
id(dst), kv))
new_nodes.append((
# newly added nodes will be used for the next iteration
= []
start for state in new_nodes:
if state not in node_set:
node_set.add(state) start.append(state)
At a high level, it enumerates each free link, then adds their destinations to the position set. The newly discovered nodes can lead to even more free links, thus, we need to repeat this in a loop until no more nodes can be reached via free links.
It also handles the boss node since the boss gives free links too.
def nfa_boss(node, kv):
= node
_, door_in, end, rmin, rmax = id(door_in) # this is unique for identifying the boss
key = kv_increase(kv, key)
kv, cnt if cnt < rmax:
# repeat the level again
yield (door_in, kv)
if rmin <= cnt <= rmax:
# pass the level
yield (end, kv_delete(kv, key))
The counter set is just a sorted tuple of key-value pairs:
def kv_increase(kv, key):
= dict(kv)
kv = kv.get(key, 0) + 1
val = val
kv[key] return tuple(sorted(kv.items())), val
def kv_delete(kv, key):
return tuple((k, v) for k, v in kv if k != key)
06. Closing Remarks
One can learn further by comparing different methods of doing things. Let’s compare the backtracking and the graph approach:
- How the problem is reduced? Backtracking breaks down the regex into subexpressions. NFA breaks down the input string into step-by-step graph traversals.
- The order of search. Backtracking is a deep-first search (DFS) on subexpressions. NFA is a breath-first search (BFS) on graphs.
Some lessons can be learned from this:
- Looking for different ways of breaking down the problem.
- Why is the NFA more efficient? DFS vs BFS?
Further reading on regex: https://swtch.com/~rsc/regexp/