Make DFA match utf8 rather than char #59
Comments
A simple way to implement this would be to replace
(state S1 switches to S2 on character 'ö')
This becomes
(state S1 switches to a new intermediate state Sn on byte 0xC3, which then switches to S2 on byte 0xB6)
A problem with this is that the number of states will grow in proportion to the length of a matched string's or char's UTF-8 encoding, rather than the number of chars. Maybe that's not too bad though, and I don't know if it's possible to avoid new states.
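To make the idea concrete, here is a minimal sketch of splitting one char transition into a chain of byte transitions. It is not taken from this project's code: the function name `expand_char_edge`, the edge tuples, and the `S100` numbering for intermediate states are all made up for illustration.

```python
from itertools import count

def expand_char_edge(src, ch, dst, fresh_state):
    """Replace the edge src --ch--> dst with a chain of byte-labelled edges.

    Returns a list of (from_state, byte, to_state) tuples. `fresh_state` is a
    hypothetical callback that hands out a new intermediate state each call.
    """
    encoded = ch.encode("utf-8")
    edges = []
    current = src
    # Every byte except the last one leads to a brand-new intermediate state.
    for byte in encoded[:-1]:
        nxt = fresh_state()
        edges.append((current, byte, nxt))
        current = nxt
    # The last byte of the encoding lands in the original target state.
    edges.append((current, encoded[-1], dst))
    return edges

ids = count(100)  # hypothetical numbering for intermediate states
edges = expand_char_edge("S1", "ö", "S2", lambda: f"S{next(ids)}")
for from_state, byte, to_state in edges:
    print(f"{from_state} --0x{byte:02X}--> {to_state}")
```

An n-byte character adds n - 1 intermediate states here, which is the growth described above. ASCII characters encode to a single byte and add none.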
How would we implement this for character classes, e.g.
Also, a character class can occur multiple times in a lexer, and we probably don't want to duplicate this for each occurrence (or do we?).
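One possible direction for small character classes, sketched under the assumption that the class is given as an explicit set of chars: build the byte transitions as a trie, so chars that share leading bytes also share intermediate states. The name `expand_class` and the table layout are hypothetical. Large classes would need range-based splitting rather than enumerating every char, which this sketch does not attempt.

```python
from itertools import count

def expand_class(src, chars, dst, fresh_state):
    """Build byte transitions accepting the UTF-8 encoding of any char in `chars`.

    Returns {state: {byte: next_state}}. Chars with a common byte prefix reuse
    the same intermediate states. `fresh_state` is a hypothetical callback
    that hands out a new state on each call.
    """
    table = {src: {}}
    for ch in sorted(chars):
        encoded = ch.encode("utf-8")
        state = src
        for byte in encoded[:-1]:
            row = table.setdefault(state, {})
            if byte not in row:
                # First char with this prefix: allocate the intermediate state.
                row[byte] = fresh_state()
            state = row[byte]
        # UTF-8 is prefix-free, so a final byte never collides with a prefix edge.
        table.setdefault(state, {})[encoded[-1]] = dst
    return table

ids = count()
table = expand_class("S1", "äöü", "S2", lambda: f"T{next(ids)}")
```

With `äöü`, all three chars start with byte 0xC3, so the class needs one intermediate state instead of three. This only shares states within a single occurrence of the class. It does not address reusing the same class across several places in a lexer.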
Good point, I think regexes like
Currently we can't really reuse DFAs (or states), as each DFA/state will have a different next state or semantic action (i.e. continuation). For example, if we have "match A then do B" and "match A then do C", the "match A" machines cannot be reused, as each machine will do something different after a successful match. I guess we should keep the DFA and NFA the same and find another way. Do you know how re2 does this? Btw, for runtime performance, I think this change will probably be a micro-optimization compared to DFA minimization (#38). We should do that first. It will also reduce code size.
This would improve performance, as no utf8 decoding is necessary.
This is what re2 does too.