Commit
experimental: Add eager outputs, similar to endids but eagerly matched.
When combining several unanchored regexes it becomes VERY expensive to handle combinations of matches via the end state -- essentially, the whole reachable DFA gets separate matching and non-matching copies for each pattern, leading to a DFA whose size is proportional to the number of *possible combinations* of matches. With eager outputs, we can set a flag for matching as we reach the end of the original pattern (before looping back and possibly also matching other patterns), which keeps the state count from blowing up in fsm_determinise.

To see how much difference this makes, the test tests/eager_output/run7 combines 26 different patterns. It should finish very quickly (~50 msec, just now). Try running it with `env FORCE_ENDIDS=N` for N increasing from 4 to 26. Around 10-11 it will start taking several seconds, and memory usage will roughly double with each step.

This PR adds `fsm_union_repeated_pattern_group`, a variant of `fsm_union_array` that combines a set of DFAs into a single NFA, but correctly handles a mix of anchored and unanchored ends without the state count blowing up. It currently needs flags passed in for each fsm indicating whether the start and/or end are anchored, and there is a hacky special case that removes self-edges from states with eager outputs and instead connects them to a single overall unanchored end loop. I haven't yet figured out how to handle this properly in the general case, but it works for this specific use case, provided all the DFAs are combined at once. (Combining multiple DFAs each produced by determinising fsm_union_repeated_pattern_group's result probably won't work correctly.)

I have tried detecting and ignoring those edges in fsm_determinise, after epsilon removal, but so far either it still causes the graph size to blow up or subtly breaks something else.
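A toy model of why the state count doubles per pattern, as a rough sketch rather than libfsm code. It assumes N hypothetical patterns, where pattern i is the unanchored regex for the single letter i. The names `step_end_state`, `step_eager`, and `count_reachable_states` are made up for this illustration. In the end-state style the DFA state has to remember the set of patterns matched so far, so every subset is a distinct reachable state. In the eager style the match is recorded as a side effect and the state just loops back.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy transition function (hypothetical, not a libfsm interface):
 * takes the current state and an input letter, may record eager
 * outputs in *outputs, and returns the next state. */
typedef uint32_t step_fn(uint32_t state, unsigned letter, uint32_t *outputs);

/* End-state style: the state number IS the set of patterns matched so
 * far, so seeing letter i moves to a different copy of the automaton. */
static uint32_t
step_end_state(uint32_t state, unsigned letter, uint32_t *outputs)
{
	(void)outputs; /* matches are only read off the final state */
	return state | (UINT32_C(1) << letter);
}

/* Eager style: flag the match immediately, then stay in the same
 * looping state; the match set lives outside the automaton. */
static uint32_t
step_eager(uint32_t state, unsigned letter, uint32_t *outputs)
{
	*outputs |= UINT32_C(1) << letter;
	return state;
}

/* Breadth-first search from state 0 over an alphabet of n letters,
 * returning the number of distinct states reached (0 on alloc failure). */
static size_t
count_reachable_states(step_fn *step, unsigned n)
{
	assert(n < 32); /* states are 32-bit bitmasks in this toy model */

	size_t limit = (size_t)1 << n;
	unsigned char *seen = calloc(limit, 1);
	uint32_t *queue = malloc(limit * sizeof *queue);
	size_t head = 0, tail = 0;
	uint32_t outputs = 0;

	if (seen == NULL || queue == NULL) {
		free(seen);
		free(queue);
		return 0;
	}

	seen[0] = 1;
	queue[tail++] = 0;

	while (head < tail) {
		uint32_t s = queue[head++];
		for (unsigned letter = 0; letter < n; letter++) {
			uint32_t t = step(s, letter, &outputs);
			if (!seen[t]) {
				seen[t] = 1;
				queue[tail++] = t;
			}
		}
	}

	(void)outputs; /* only the state count matters here */
	free(seen);
	free(queue);
	return tail; /* every reached state was enqueued exactly once */
}
```

Under these assumptions, `count_reachable_states(step_end_state, 10)` visits 1024 states, while `count_reachable_states(step_eager, 10)` visits 1. Real patterns are not this regular, so this only matches the doubling-per-step behaviour described above in spirit.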
This is still experimental, and the code generation for `-lc` here is quite hacky -- it expects the caller to define a `FSM_SET_EAGER_OUTPUT` macro, since the code generation interface doesn't define where the match info will go yet. A later PR will add a new code generation mode with better support for eager outputs, and I plan to eventually integrate this better with rx, AMBIG_MULTIPLE, and so on.

(This squashes down a couple of false starts.)
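For illustration only, here is roughly how a caller might supply that macro. This is a sketch under loud assumptions: the macro's argument list (a single output id) is a guess, and `toy_generated_match` is a hand-written stand-in for generated code, not actual `-lc` output. `struct eager_outputs`, `current_outputs`, and `match_all` are likewise hypothetical names. The stand-in reports unanchored "ab" as output 0 and unanchored "c" as output 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Caller-side storage for match info (hypothetical). */
struct eager_outputs {
	uint64_t bits; /* bit i set => output id i was flagged */
};

static struct eager_outputs *current_outputs;

/* ASSUMPTION: the macro receives the output id as its only argument.
 * The real argument list is not specified in the commit message. */
#define FSM_SET_EAGER_OUTPUT(id) \
	(current_outputs->bits |= UINT64_C(1) << (id))

/* Hand-written stand-in for generated code: flags each output as soon
 * as its pattern ends, then keeps scanning for the other pattern. */
static void
toy_generated_match(const char *s)
{
	enum { S0, S1 } state = S0; /* S1: the previous character was 'a' */

	assert(current_outputs != NULL);

	for (; *s != '\0'; s++) {
		switch (state) {
		case S0:
			if (*s == 'a') {
				state = S1;
			} else if (*s == 'c') {
				FSM_SET_EAGER_OUTPUT(1);
			}
			break;
		case S1:
			if (*s == 'b') {
				FSM_SET_EAGER_OUTPUT(0); /* "ab" just ended */
				state = S0;
			} else if (*s != 'a') {
				if (*s == 'c') {
					FSM_SET_EAGER_OUTPUT(1);
				}
				state = S0;
			}
			break;
		}
	}
}

/* Caller wrapper: point the macro at a fresh result, run the matcher,
 * and return the set of flagged output ids as a bitmask. */
static uint64_t
match_all(const char *s)
{
	struct eager_outputs out = { 0 };

	current_outputs = &out;
	toy_generated_match(s);
	current_outputs = NULL;

	return out.bits;
}
```

Routing the result through a file-scope pointer is about as hacky as the macro itself; it is only here because the generated code does not yet say where match info should go.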