Replies: 5 comments 16 replies
-
Sure.
May I suggest starting by creating a test that actually involves 32-bit values?
… On 11 March 2022 at 19:59, Terence Parr ***@***.***> wrote:
Hi @ericvergnaud @jcking @marcospassos @mike-lischke @KvanTTT!
Now that all targets other than Java use integer arrays, perhaps it's time to move away from the 16-bit encoding used now in favor of 32-bit integers. That gives the advantage that ATNs are not limited in size, and there are a number of bugs filed for this. Java would have to be treated specially, but the idea is simple: remove the masking of numbers with 0xFFFF and avoid the use of two 16-bit numbers to represent 32-bit values.
@KvanTTT has started on various variations of this, but a new PR with his 32-bit ATN unit tests shouldn't be too hard. This would require a small change to each target's deserialization to use int32 rather than int16 (although I think most just use a plain integer, but Go is using uint16) and to avoid packing two 16-bit values into a 32-bit number for Unicode code points above 16 bits.
Shall we do this for 4.10? While we are doing work on the serialization, it seems like we should fix this and then stop thinking about it.
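To make the proposal concrete, here is a minimal sketch of the two things being removed: masking with 0xFFFF and splitting a 32-bit value across two 16-bit words. The helper names and the low-word-first order are assumptions chosen for illustration; this is not ANTLR's actual serializer code.

```python
# Hypothetical helpers for illustration only; not ANTLR's serializer.

def mask_16bit(value):
    """Keep only the low 16 bits, as a 16-bit encoding would."""
    return value & 0xFFFF

def pack_16bit(values):
    """Store each value as two 16-bit words (low word first; order is an assumption)."""
    words = []
    for v in values:
        words.append(v & 0xFFFF)          # low 16 bits
        words.append((v >> 16) & 0xFFFF)  # high 16 bits
    return words

def unpack_16bit(words):
    """Rebuild the original values from pairs of 16-bit words."""
    return [words[i] | (words[i + 1] << 16) for i in range(0, len(words), 2)]

code_point = 0x1F600           # a Unicode code point above 16 bits
words = pack_16bit([code_point])
restored = unpack_16bit(words)

# Masking alone silently wraps large numbers: 0x10000 becomes 0.
wrapped = mask_16bit(0x10000)
```

With plain 32-bit integers, neither step is needed: the value is stored and read back as a single array element.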
-
@parrt Can you give a bit more context? I don't know what you mean. Maybe point to a place in the code to allow following the discussion?
-
Using 32-bit integers doubles the size of the output ATN compared to 16-bit. Is this OK?
I think so, because in 4.10 we've changed the ATN format anyway.
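As a rough illustration of the size question, the sketch below packs the same values as fixed-width 16-bit and 32-bit integers using Python's `struct` module. The sample values and the `serialized_size` helper are assumptions for illustration. How an ATN is actually emitted differs per target, so this only shows the raw fixed-width arithmetic.

```python
import struct

def serialized_size(values, fmt):
    """Bytes needed to store values with a fixed-width struct format code."""
    return len(struct.pack("<%d%s" % (len(values), fmt), *values))

values = list(range(1000))            # hypothetical ATN data, all below 0x10000
size16 = serialized_size(values, "H") # unsigned 16-bit: 2 bytes per value
size32 = serialized_size(values, "I") # unsigned 32-bit: 4 bytes per value
```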
-
Wow. OK, Java now supports 32-bit ATNs. It handles @KvanTTT's large generated lexer and the case where the token type is 0xFFFF. :) The remaining issue, I think, other than the cleanup you guys suggest, is to handle the shift-by-two thing for Java so that the modified UTF-8 string encoding in the class files does not make 0 and -1 (0xFFFF) inefficient to represent. I think the easiest answer is to do the bump-by-2 trick unless the 32-bit encoding kicks in. See #3591
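For readers unfamiliar with the bump-by-2 trick, here is a small sketch of why it helps. In Java's modified UTF-8, the char 0 takes two bytes and chars from 0x0800 to 0xFFFF take three, while 1 to 0x7F take one. Adding 2 (modulo 2^16) moves 0 and 0xFFFF into the one-byte range. The function names are hypothetical, and the code models only the byte counts, not Java's actual class-file writer.

```python
def modified_utf8_len(ch):
    """Bytes used by one 16-bit char in Java's modified UTF-8."""
    if ch == 0:
        return 2   # NUL is written as the two-byte form 0xC0 0x80
    if ch <= 0x7F:
        return 1
    if ch <= 0x7FF:
        return 2
    return 3       # 0x0800..0xFFFF

def bump(ch):
    """Shift a 16-bit value up by 2 before writing (wraps around at 0x10000)."""
    return (ch + 2) & 0xFFFF

def unbump(ch):
    """Undo the shift after reading."""
    return (ch - 2) & 0xFFFF

plain = [modified_utf8_len(v) for v in (0, 0xFFFF)]
bumped = [modified_utf8_len(bump(v)) for v in (0, 0xFFFF)]
```

The cost is that the two values just below 0xFFFF wrap to 0 and 1 only if they occur, and every reader has to subtract 2 again, which is the extra handling being weighed in this thread.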
-
I'm thinking that the bump-by-two feature isn't really a big problem. Yes, it creates a bigger class file, but that has all kinds of symbol table stuff and other junk as well. When it is loaded by some fast C code, it unpacks the UTF-8 into 16-bit char strings, so we don't have a memory problem there. I think I will carve out the bump-by-2 stuff. We can revisit it later if for some reason it's a problem.
-