Given a set of regular expressions, determine the corresponding NFA (non-deterministic finite automaton)!
Before I can parse a text, it first has do be translated into tokens. Tokens group the stream of input characters into usable chunks. Typical tokens are parentheses, identifiers, numbers or keywords. I desire to convert a stream of characters into a stream of tokens. This step is called lexical analysis. The executing automaton is called lexer. The class of language this automaton can handle are regular expressions.
They are classes of character sequences. Each class can be described using the following operators:
- Single character:
a
- Concatenation:
ab
(searching ana
followed by ab
) - Repetition:
a*
=a
oraa
oraaa
… (finite sequence) - Alternation:
aa|bb
=aa
orbb
Combinations of these operators are also possible, like (0|1)+.(0|1)+
for binary decimal strings, like 0.0
, 1010.110
...
Basically it is a certain class of state machine. A deterministic finite automaton (DFA) starts at one state. Each state has n
outgoing edges, where n
is the number of distinct input symbols. Some states are marked as accepting states. If the state machine holds at an accepting state, a token could be emitted. Non-determinism just says that the edges could occur multiple times for one state or that no symbol must be read to change the state (ε transitions).
Cycles represent state. Double cycles are accepting states. Transitions are labelled with input symbols. The input symbol ε
is the EMPTY
input (no character).
public interface IAutomaton
{
int StartState { get; }
int StateCount { get; }
ISet<int> AcceptingStates { get; }
IReadOnlyDictionary<int, ITransitionTargets> TransitionsBySource { get; }
}
public interface ITransitionTargets: ILookup<ICharSet, int>
{
bool Contains(char c);
bool ContainsEpsilon();
}
public interface ICharSet : IEnumerable<CharRange>, IComparable<ICharSet>
{
int Length { get; }
bool Includes(char c);
bool Includes(char from, char to);
bool Excludes(char c);
bool Excludes(char from, char to);
}
To convert a regular expression into a NFA, you just have to find an automaton representation for each construction rule (concatenation, repetition and alternation). Let’s do this!
Here is an automaton for the regular expression (aa|b)*
.
We can eliminate them by converting the NFA to an DFA.
Instead of starting with the start state, we start with the epsilon closure of the start state. That are the states that are reachable via epsilon transition from the start. Each combination of NFA states (possibly 2^|states|
), stands for a DFA state.
For every input character we advance from one DFA state to the next, always expanding the NFA states with an epsilon closure to a set of states representing the resulting DFA state. By doing this we eliminate all epsilon transitions – isn’t that great?!
The advantage of an DFA is that we can run simulations faster. The disadvantage is the building time and its space. In worst case they could go up to O(2^n)
, where n
is the length of the regular expression.
By applying this natural language algorithm, we get the following automaton.
DFA state | NFA states |
---|---|
0 | {0, 1, 2, 7, 9, 11} |
1 | {0, 1, 2, 7, 9, 10, 11, 12} |
2 | {3, 4, 6} |
3 | {0, 1, 2, 5, 7, 8, 9, 11, 12} |
Can we not reduce some states here!? The answer is yes. The algorithm is quiet simple... but expensive (O(n^2), where n is the number of states). We are searching all equivalence classes of the given states. Two states are equal, if both are accepting or both are denying and all their transition’s target states for the same input are pairwise equal (recursion). You can start the recursion by distinguishing all denying from accepting states.
Equal? | 0 | 1 | 2 |
---|---|---|---|
3 | e1 | e1 | no |
2 | no | no | - |
1 | e1 | - | - |
We have found one equivalence class e1
. So, at the end we can merge the three states 0
, 1
, 3
to a new DFA state 0
and 2
becomes 1
. This is the resulting DFA (quite small, huh?):
Great the automata resulting from a regular expression are complete and minimal. Now let’s construct the lexer!
The task now is to transform a text with a set of regular expressions into ordered set of tokens. So, the trivial approach is a for loop over the set of regular expressions until one does match.
- Alfred Aho et. al. – Compilers
- Wikipedia articles
- Github repository
- GraphViz used for generating diagrams