experimental: Add eager outputs, similar to endids but eagerly matched. · fastly/libfsm@8c918bb

When combining several unanchored regexes it becomes VERY expensive
to handle combinations of matches via the end state -- essentially,
the whole reachable DFA gets separate matching and non-matching copies
for each pattern, leading to a DFA whose size is proportional to the
number of *possible combinations* of matches. With eager outputs,
we can set a flag for matching as we reach the end of the original
pattern (before looping back and possibly also matching other patterns),
which keeps the state count from blowing up in fsm_determinise.

To see how much difference this makes, the test tests/eager_output/run7
combines 26 different patterns. It should finish very quickly (~50 msec,
just now). Try running it with `env FORCE_ENDIDS=N` for N increasing from
4 to 26. Around 10-11 it will start taking several seconds, and memory
usage will roughly double with each step.

This PR adds `fsm_union_repeated_pattern_group`, a variant of
`fsm_union_array` that combines a set of DFAs into a single NFA, but
correctly handles a mix of anchored and unanchored ends without the
state count blowing up. It currently needs flags passed in for each fsm
indicating whether the start and/or end are anchored, and there is a
hacky special case that removes self-edges from states with eager
outputs and instead connects them to a single overall unanchored end
loop. I haven't yet figured out how to handle this properly in the
general case, but it works for this specific use case, provided all the
DFAs are combined at once. (Combining multiple DFAs each produced by
determinising fsm_union_repeated_pattern_group's result probably won't
work correctly.) I have tried detecting and ignoring those edges in
fsm_determinise, after epsilon removal, but so far either it still
causes the graph size to blow up or subtly breaks something else.

This is still experimental, and the code generation for `-lc` here is
quite hacky -- it expects the caller to define a `FSM_SET_EAGER_OUTPUT`
macro, since the code generation interface doesn't define where the
match info will go yet. A later PR will add a new code generation mode
with better support for eager outputs, and I plan to eventually
integrate this better with rx, AMBIG_MULTIPLE, and so on.

(This squashes down a couple of false starts.)

silentbicycle committed Oct 10, 2024

1 parent 796c6ab commit 8c918bb

Makefile

@@ -118,6 +118,7 @@ SUBDIR += tests/equals
     SUBDIR += tests/subtract
     SUBDIR += tests/detect_required
     SUBDIR += tests/determinise
+    SUBDIR += tests/eager_output
     SUBDIR += tests/endids
     SUBDIR += tests/epsilons
     SUBDIR += tests/fsm
