experimental: Add eager outputs, similar to endids but eagerly matched.
When combining several unanchored regexes it becomes VERY expensive
to handle combinations of matches via the end state -- essentially,
the whole reachable DFA gets separate matching and non-matching copies
for each pattern, leading to a DFA whose size is proportional to the
number of *possible combinations* of matches. With eager outputs,
we can set a flag for matching as we reach the end of the original
pattern (before looping back and possibly also matching other patterns),
which keeps the state count from blowing up in fsm_determinise.
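
To make the idea concrete, here is a minimal hand-written sketch (not libfsm output, and not how `fsm_determinise` represents anything internally). It assumes two hypothetical unanchored patterns, `ab` (id 0) and `cd` (id 1), and a made-up function name `toy_match_eager`. The matcher needs only three states because the set of patterns matched so far lives in a flag word rather than in the state itself:

```c
#include <stdint.h>

/* Toy illustration only: patterns "ab" (id 0) and "cd" (id 1), both
 * unanchored. The names and ids here are hypothetical. */
static uint32_t
toy_match_eager(const char *s)
{
	enum { S_NONE, S_SAW_A, S_SAW_C } state = S_NONE;
	uint32_t matched = 0;	/* one bit per pattern id */

	for (; *s != '\0'; s++) {
		const char c = *s;

		/* Eager output: record the match as soon as the end of a
		 * pattern is reached, then keep scanning. */
		if (state == S_SAW_A && c == 'b') { matched |= 1u << 0; }
		if (state == S_SAW_C && c == 'd') { matched |= 1u << 1; }

		/* Progress tracking does not depend on what has matched. */
		state = (c == 'a') ? S_SAW_A
		    : (c == 'c') ? S_SAW_C
		    : S_NONE;
	}
	return matched;
}
```

Reporting the same information only through the end state would require a copy of these three states for each of the four possible values of `matched`, and that factor doubles with every pattern added.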

To see how much difference this makes, the test tests/eager_output/run7
combines 26 different patterns. It should finish very quickly (~50 msec,
just now). Try running it with `env FORCE_ENDIDS=N` for N increasing from
4 to 26. Around 10-11 it will start taking several seconds, and memory
usage will roughly double with each step.

This PR adds `fsm_union_repeated_pattern_group`, a variant of
`fsm_union_array` that combines a set of DFAs into a single NFA, but
correctly handles a mix of anchored and unanchored ends without the
state count blowing up. It currently needs flags passed in for each fsm
indicating whether the start and/or end are anchored, and there is a
hacky special case that removes self-edges from states with eager
outputs and instead connects them to a single overall unanchored end
loop. I haven't yet figured out how to handle this properly in the
general case, but it works for this specific use case, provided all the
DFAs are combined at once. (Combining multiple DFAs each produced by
determinising fsm_union_repeated_pattern_group's result probably won't
work correctly.) I have tried detecting and ignoring those edges in
fsm_determinise, after epsilon removal, but so far it either still
causes the graph size to blow up or subtly breaks something else.

This is still experimental, and the code generation for `-lc` here is
quite hacky -- it expects the caller to define a `FSM_SET_EAGER_OUTPUT`
macro, since the code generation interface doesn't define where the
match info will go yet. A later PR will add a new code generation mode
with better support for eager outputs, and I plan to eventually
integrate this better with rx, AMBIG_MULTIPLE, and so on.
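
As a rough sketch of what a caller-side definition could look like: the commit message only says the caller must define `FSM_SET_EAGER_OUTPUT`, so the single integer argument, the `eager_matches` bitset, and the `fake_generated_step` stand-in below are all assumptions for illustration, not the actual interface.

```c
#include <stdint.h>

/* Hypothetical caller-side storage: one bit per eager output id. */
static uint64_t eager_matches;

/* ASSUMPTION: the generated code invokes the macro with one integer id.
 * Ids that do not fit in the 64-bit set are ignored in this sketch. */
#define FSM_SET_EAGER_OUTPUT(id)                                  \
	do {                                                      \
		if ((id) < 64) {                                  \
			eager_matches |= (uint64_t)1 << (id);     \
		}                                                 \
	} while (0)

/* Stand-in for a transition in generated code that reaches the end of
 * a pattern; real `-lc` output will look different. */
static void
fake_generated_step(unsigned id)
{
	FSM_SET_EAGER_OUTPUT(id);
}
```

The macro would need to be defined before the generated code is compiled, and the caller would reset `eager_matches` before each new input.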

(This squashes down a couple of false starts.)
silentbicycle committed Oct 10, 2024
1 parent 796c6ab commit 8c918bb
Showing 43 changed files with 2,632 additions and 38 deletions.
1 change: 1 addition & 0 deletions Makefile
@@ -118,6 +118,7 @@ SUBDIR += tests/equals
 SUBDIR += tests/subtract
 SUBDIR += tests/detect_required
 SUBDIR += tests/determinise
+SUBDIR += tests/eager_output
 SUBDIR += tests/endids
 SUBDIR += tests/epsilons
 SUBDIR += tests/fsm