We use a Java library called dk.brics.automaton to analyse regexes and generate valid (and invalid for violation) strings based on them. It works by representing the regex as a finite state machine. It might be worth reading about state machines for those who aren't familiar: https://en.wikipedia.org/wiki/Finite-state_machine. Consider the following regex: ABC[a-z]?(A|B)
. It would be represented by the following state machine:
The states (circular nodes) represent a string that may (green) or may not (white) be valid. For example, s0
is the empty string, s5
is ABCA
.
The transitions represent adding another character to the string. The characters allowed by a transition may be a range (as with [a-z]
). A state does not store the string it represents itself, but is defined by the ordered list of transitions that lead to that state.
Another project that also uses dk.brics.automaton in a similar way to us might be useful to look at for further study: https://github.com/mifmif/Generex.
Other than the fact that we can use the state machine to generate strings, the main benefit that we get from using this library are:
- Finding the intersection of two regexes, used when there are multiple regex constraints on the same field.
- Finding the complement of a regex, which we use for generating invalid regexes for violation.
Due to the way that the generator computes textual data internally the generation of strings is not deterministic and may output valid values in a different order with each generation run.
dk.brics.automaton doesn't support start and end anchors ^
& $
and instead matches the entire word as if the anchors were always present. For some of our use cases though it may be that we want to match the regex in the middle of a string somewhere, so we have two versions of the regex constraint - matchingRegex and containingRegex. If containingRegex
is used then we simply add a .*
to the start and end of the regex before passing it into the automaton. Any ^
or $
characters passed at the start or end of the string respectively are removed, as the automaton will treat them as literal characters.
The automaton represents the state machine using the following types:
Transition
State
A transition holds the following properties and are represented as lines in the above graph
min: char
- The minimum permitted character that can be emitted at this positionmax: char
- The maximum permitted character that can be emitted at this positionto: State[]
- TheState
s that can follow this transition
In the above A
looks like:
property | initial | [a-z] |
---|---|---|
min | A | a |
max | A | z |
to | 1 state, s1 |
1 state, s4 |
A state holds the following properties and are represented as circles in the above graph
accept: boolean
- is this a termination state, can string production stop here?transitions: HashSet<Transition>
- which transitions, if any, follow this statenumber: int
- the number of this stateid: int
- the id of this state (not sure what this is used for)
In the above s0
looks like:
property | initial | s3 |
---|---|---|
accept | false | false |
transitions | 1 transition, → A |
2 transitions: → [a-z] → `A |
number | 4 | 0 |
id | 49 | 50 |
The automaton can represent the state machine in a textual representation such as:
initial state: 4
state 0 [reject]:
a-z -> 3
A-B -> 1
state 1 [accept]:
state 2 [reject]:
B -> 5
state 3 [reject]:
A-B -> 1
state 4 [reject]:
A -> 2
state 5 [reject]:
C -> 0
This shows each state and each transition in the automaton, lines 2-4 show the State
as shown in the previous section.
Lines 10-11 show the transition 'A' as shown in the prior section.
The pathway through the automaton is:
- transition to state 4 (because the initial - empty string state - is rejected/incomplete)
- add an 'A' (state 4)
- transition to state 2 (because the current state "A" is rejected/incomplete)
- add a 'B' (state 2)
- transition to state 5 (because the current state "AB" is rejected/incomplete)
- add a 'C' (state 5)
- transition to state 0 (because the current state "ABC" is rejected/incomplete)
- either
- add a letter between
a..z
(state 0, transition 1) - transition to state 3 (because the current state "ABCa" is rejected/incomplete)
- add either 'A' or 'B' (state 3)
- transition to state 1 (because the current state "ABCaA" is rejected/incomplete)
- current state is accepted so exit with the current string "ABCaA"
- add a letter between
- or
- add either 'A' or 'B' (state 0, transition 2)
- transition to state 1 (because the current state "ABCA" is rejected/incomplete)
- current state is accepted so exit with the current string "ABCA"
- either
The generator does not support generating strings above the Basic Unicode plane (Plane 0). Using regexes that match characters above the basic plane may lead to unexpected behaviour.