
Make DFA match utf8 rather than char #59

Open
seanyoung opened this issue Dec 8, 2022 · 3 comments
Labels
perf Related to runtime performance

Comments

@seanyoung (Contributor)

This would improve performance, as no UTF-8 decoding would be necessary.

This is what re2 does too.

@osa1 (Owner) commented Dec 12, 2022

A simple way to implement this would be to replace chars in the DFA (and in the NFA and code generator) with u8. When a char is encoded as multiple bytes, introduce new states to match the whole encoding. E.g. if I have this transition:

S1 --('ö')--> S2

(state S1 switches to S2 on character 'ö')

This becomes

S1 --(0xC3)--> Sn --(0xB6)--> S2

(state S1 switches to a new intermediate step Sn on byte 0xC3, which then switches to S2 on byte 0xB6)

A problem with this is that the number of states will grow in proportion to the length of a matched string's UTF-8 encoding, rather than its number of chars. Maybe that's not too bad, though, and I don't know if it's possible to avoid the new states.
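The splitting step above can be sketched roughly as follows. This is a minimal, self-contained sketch with hypothetical types (`StateId`, `ByteEdge` are not lexgen's actual API), using `char::encode_utf8` to get the byte sequence and allocating one fresh intermediate state per non-final byte:

```rust
type StateId = usize;

// One byte-labelled edge: (from, byte, to).
type ByteEdge = (StateId, u8, StateId);

/// Replace the transition `from --(c)--> to` with a chain of byte
/// transitions, allocating fresh intermediate states from `next_state`.
fn split_char_transition(
    from: StateId,
    c: char,
    to: StateId,
    next_state: &mut StateId,
) -> Vec<ByteEdge> {
    let mut buf = [0u8; 4];
    let bytes = c.encode_utf8(&mut buf).as_bytes();
    let n = bytes.len();
    let mut edges = Vec::new();
    let mut src = from;
    for (i, &b) in bytes.iter().enumerate() {
        // The last byte goes to the original target; earlier bytes
        // go to a freshly allocated intermediate state.
        let dst = if i == n - 1 {
            to
        } else {
            let s = *next_state;
            *next_state += 1;
            s
        };
        edges.push((src, b, dst));
        src = dst;
    }
    edges
}

fn main() {
    let mut next_state = 3; // states 1 and 2 already exist
    let edges = split_char_transition(1, 'ö', 2, &mut next_state);
    // 'ö' is 0xC3 0xB6 in UTF-8, so we get S1 --(0xC3)--> S3 --(0xB6)--> S2.
    assert_eq!(edges, vec![(1, 0xC3, 3), (3, 0xB6, 2)]);
    println!("{:?}", edges);
}
```

A single-byte (ASCII) char falls out of the same code path with no intermediate states, since the first byte is also the last.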

@osa1 added the perf label Dec 12, 2022
@seanyoung (Contributor, Author)

How would we implement this for character classes, e.g. $$XID_Start? If we convert the whole table to a UTF-8 DFA, it will be huge. But will it be larger than the char table we have now?

Also, a character class can occur multiple times in a lexer, and we probably don't want to duplicate its states for each occurrence (or do we?).

@osa1 (Owner) commented Dec 12, 2022

> How would we implement this for character classes, e.g. $$XID_Start? If we convert the whole table to a UTF-8 DFA, it will be huge. But will it be larger than the char table we have now?

Good point. I think regexes like $$XID_Start will probably make compile times unacceptably slow and generate huge code.

> Also, a character class can occur multiple times in a lexer, and we probably don't want to duplicate its states for each occurrence (or do we?).

Currently we can't really reuse DFAs (or states), as each DFA/state will have a different next state or semantic action (i.e. continuation).

For example, if we have "match A then do B" and "match A then do C", the "match A" machines cannot be reused, as each machine does something different after a successful match.
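The sharing problem above can be sketched with a toy state representation (hypothetical types, not lexgen's actual data structures): because the continuation is baked into each accepting state, two machines that both "match A" still need distinct states.

```rust
// What to do after a successful match (the "continuation").
#[derive(Clone, Copy, Debug, PartialEq)]
enum SemanticAction {
    B,
    C,
}

// A toy DFA state: one transition on 'A' (an index into the state
// vector) and an optional semantic action run on acceptance.
struct State {
    on_a: Option<usize>,
    accept: Option<SemanticAction>,
}

fn main() {
    let states = vec![
        // Machine 1: "match A then do B" (states 0 and 1).
        State { on_a: Some(1), accept: None },
        State { on_a: None, accept: Some(SemanticAction::B) },
        // Machine 2: "match A then do C" (states 2 and 3).
        State { on_a: Some(3), accept: None },
        State { on_a: None, accept: Some(SemanticAction::C) },
    ];
    // States 1 and 3 have identical transition behaviour but carry
    // different continuations, so the two machines cannot be merged.
    assert_ne!(states[1].accept, states[3].accept);
}
```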


I guess we should keep the DFA and NFA the same and find another way. Do you know how re2 does this?

Btw, for runtime performance, I think this change will probably be a micro-optimization compared to DFA minimization (#38). We should do that first. It will also reduce code size.
