
Make DFA match utf8 rather than char #59

Open
seanyoung opened this issue Dec 8, 2022 · 3 comments
Labels
perf Related to runtime performance

Comments

@seanyoung (Contributor)

This would improve performance, as no UTF-8 decoding would be necessary.

This is what re2 does too.

@osa1 (Owner) commented Dec 12, 2022

A simple way to implement this would be to replace chars in the DFA (and in the NFA and code generator) with u8. When a char is encoded as multiple bytes, introduce new states to match the whole encoding. E.g. if I have this transition:

S1 --('ö')--> S2

(state S1 switches to S2 on character 'ö')

This becomes

S1 --(0xC3)--> Sn --(0xB6)--> S2

(state S1 switches to a new intermediate step Sn on byte 0xC3, which then switches to S2 on byte 0xB6)

A problem with this is that the number of states will grow in proportion to the length of a matched string's UTF-8 encoding, rather than its number of chars. Maybe that's not too bad, though, and I don't know if it's possible to avoid the new states.
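The splitting step above can be sketched roughly as follows. This is a minimal, self-contained sketch with hypothetical types (`StateId`, `ByteEdge` are not lexgen's actual API), using `char::encode_utf8` to get the byte sequence and allocating one fresh intermediate state per non-final byte:

```rust
type StateId = usize;

// One byte-labelled edge: (from, byte, to).
type ByteEdge = (StateId, u8, StateId);

/// Replace the transition `from --(c)--> to` with a chain of byte
/// transitions, allocating fresh intermediate states from `next_state`.
fn split_char_transition(
    from: StateId,
    c: char,
    to: StateId,
    next_state: &mut StateId,
) -> Vec<ByteEdge> {
    let mut buf = [0u8; 4];
    let bytes = c.encode_utf8(&mut buf).as_bytes();
    let n = bytes.len();
    let mut edges = Vec::new();
    let mut src = from;
    for (i, &b) in bytes.iter().enumerate() {
        // The last byte goes to the original target; earlier bytes
        // go to a freshly allocated intermediate state.
        let dst = if i == n - 1 {
            to
        } else {
            let s = *next_state;
            *next_state += 1;
            s
        };
        edges.push((src, b, dst));
        src = dst;
    }
    edges
}

fn main() {
    let mut next_state = 3; // states 1 and 2 already exist
    let edges = split_char_transition(1, 'ö', 2, &mut next_state);
    // 'ö' is 0xC3 0xB6 in UTF-8, so we get S1 --(0xC3)--> S3 --(0xB6)--> S2.
    assert_eq!(edges, vec![(1, 0xC3, 3), (3, 0xB6, 2)]);
    println!("{:?}", edges);
}
```

A single-byte (ASCII) char falls out of the same code path with no intermediate states, since the first byte is also the last.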

@osa1 added the perf label Dec 12, 2022
@seanyoung (Contributor, Author)

How would we implement this for character classes, e.g. $$XID_Start? If we convert the whole table to a UTF-8 DFA, it will be huge. But will it be larger than the char table we have now?

Also, a character class can occur multiple times in a lexer, and we probably don't want to duplicate its states for each occurrence (or do we?).

@osa1 (Owner) commented Dec 12, 2022

> How would we implement this for character classes, e.g. $$XID_Start? If we convert the whole table to a UTF-8 DFA, it will be huge. But will it be larger than the char table we have now?

Good point. I think regexes like $$XID_Start will probably make compile times unacceptably slow and generate huge code.

> Also, a character class can occur multiple times in a lexer, and we probably don't want to duplicate its states for each occurrence (or do we?).

Currently we can't really reuse DFAs (or states), as each DFA/state will have a different next state or semantic action (i.e. continuation).

For example, if we have "match A then do B" and "match A then do C", the "match A" machines cannot be reused, as each machine does something different after a successful match.
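The sharing problem above can be sketched with a toy state representation (hypothetical types, not lexgen's actual data structures): because the continuation is baked into each accepting state, two machines that both "match A" still need distinct states.

```rust
// What to do after a successful match (the "continuation").
#[derive(Clone, Copy, Debug, PartialEq)]
enum SemanticAction {
    B,
    C,
}

// A toy DFA state: one transition on 'A' (an index into the state
// vector) and an optional semantic action run on acceptance.
struct State {
    on_a: Option<usize>,
    accept: Option<SemanticAction>,
}

fn main() {
    let states = vec![
        // Machine 1: "match A then do B" (states 0 and 1).
        State { on_a: Some(1), accept: None },
        State { on_a: None, accept: Some(SemanticAction::B) },
        // Machine 2: "match A then do C" (states 2 and 3).
        State { on_a: Some(3), accept: None },
        State { on_a: None, accept: Some(SemanticAction::C) },
    ];
    // States 1 and 3 have identical transition behaviour but carry
    // different continuations, so the two machines cannot be merged.
    assert_ne!(states[1].accept, states[3].accept);
}
```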


I guess we should keep the DFA and NFA the same and find another way. Do you know how re2 does this?

Btw, for runtime performance, I think this change will probably be a micro-optimization compared to DFA minimization (#38). We should do that first. It will also reduce code size.
