32-bit ATN Serialization #3580

parrt · 2022-03-11T18:59:34Z

parrt
Mar 11, 2022
Maintainer

Hi @ericvergnaud @jcking @marcospassos @mike-lischke @KvanTTT!

Now that all targets other than Java use integer arrays, perhaps it's time to move away from the 16-bit encoding used now in favor of 32-bit integers. That gives the advantage that ATNs are not limited in size, and there are a number of bugs filed for this. Java would have to be treated specially, but the idea is simple: remove the masking of numbers with 0xFFFF and avoid the use of two 16-bit numbers to represent 32-bit values.

@KvanTTT has started on various variations of this, but a new PR with his 32-bit ATN unit tests shouldn't be too hard. This would require a small change to each target's deserialization to use int32 rather than int16 (although I think most just use integer, but Go is using uint16) and to avoid packing two 16-bit values into a 32-bit number for Unicode code points above 16 bits.

Shall we do this for 4.10? While we are doing work on the serialization, it seems like we should fix this and then stop thinking about it.
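To make the proposal concrete, here is a minimal Java sketch of the difference between the two schemes. The class and method names (`AtnReadSketch`, `packAs16`, `readPacked32`, `readDirect`) and the low-word-first ordering are assumptions for illustration only; they are not taken from the ANTLR code base.

```java
class AtnReadSketch {
    // 16-bit style (hypothetical): a 32-bit value is split across two
    // 16-bit words, low word first in this sketch.
    static char[] packAs16(int value) {
        return new char[] { (char) (value & 0xFFFF), (char) (value >>> 16) };
    }

    // Reassemble the 32-bit value from the two 16-bit words.
    static int readPacked32(char[] data, int offset) {
        return data[offset] | (data[offset + 1] << 16);
    }

    // 32-bit style: each array element already holds the whole value,
    // so no masking or recombining is needed.
    static int readDirect(int[] data, int offset) {
        return data[offset];
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // a Unicode code point above 0xFFFF
        char[] packed = packAs16(codePoint);
        int[] direct = { codePoint };
        // Both paths recover the same value; the second needs no decoding step.
        System.out.println(readPacked32(packed, 0) == readDirect(direct, 0)); // prints true
    }
}
```

With an `int[]`, the deserializer for each target would only need the `readDirect` path.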

ericvergnaud · 2022-03-11T23:25:39Z

ericvergnaud
Mar 11, 2022
Maintainer

Sure. May I suggest starting by creating a test that actually involves 32 bits?


2 replies

parrt Mar 11, 2022
Maintainer Author

I was thinking the same. @KvanTTT created some of these already in one of his PRs.

KvanTTT Mar 14, 2022

Yes, it exists here: https://github.com/antlr/antlr4/pull/3546/files#diff-859a1117c3f13396162959e824858c01d179a61f3280ab64fd68c9c9490b2729R71

mike-lischke · 2022-03-12T10:21:10Z

mike-lischke
Mar 12, 2022

@parrt Can you give a bit more context? I don't know what you mean. Maybe point to a place in the code to allow following the discussion?

7 replies

mike-lischke Mar 13, 2022

Aha, right, I remember that part of the code. Can we also change it so that we get more type safety? Currently it's just a bunch of numbers that are arbitrarily interpreted as a count, a state number, etc. Instead, it should be a stream of structures with well-known fields and types.

A possible implementation could be like the PNG chunk structure: there's a base record in each chunk with the overall length of that chunk and an indicator of what type of chunk it is. The good thing about this is that you can place the chunks in any order and easily add new chunks at any time without using a UUID or version field and so on. Implementations that don't know how to handle a chunk can just jump over it (because of that length field).
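A rough Java sketch of the chunk idea described above, assuming a made-up layout of `[type][length][payload]` in 32-bit integers. The chunk type constants and all names here are hypothetical and do not correspond to any existing ANTLR format.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class ChunkSketch {
    // Hypothetical chunk type tags, invented for this example.
    static final int TYPE_STATES = 1;
    static final int TYPE_RULES = 2;
    static final int TYPE_FUTURE = 99; // a chunk type this reader does not understand

    // Append one chunk: [type:int32][length:int32][payload: length int32 values].
    static void writeChunk(ByteBuffer out, int type, int... payload) {
        out.putInt(type);
        out.putInt(payload.length);
        for (int v : payload) {
            out.putInt(v);
        }
    }

    // Walk all chunks and return the types this reader understood, in order.
    static List<Integer> readKnownChunkTypes(ByteBuffer in) {
        List<Integer> seen = new ArrayList<>();
        while (in.hasRemaining()) {
            int type = in.getInt();
            int length = in.getInt();
            if (type == TYPE_STATES || type == TYPE_RULES) {
                seen.add(type); // a real reader would decode the payload here
            }
            // Known or not, the length field says where the next chunk starts.
            in.position(in.position() + length * Integer.BYTES);
        }
        return seen;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(256);
        writeChunk(buf, TYPE_RULES, 0, 5);          // chunks may come in any order
        writeChunk(buf, TYPE_FUTURE, 7, 7, 7);      // skipped by this reader
        writeChunk(buf, TYPE_STATES, 1, 2, 3, 4);
        buf.flip();
        System.out.println(readKnownChunkTypes(buf)); // prints [2, 1]
    }
}
```

The two header integers per chunk are the overhead that the size concern later in the thread refers to.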

parrt Mar 13, 2022
Maintainer Author

A very interesting idea, @mike-lischke! I suppose that we could create Java objects that represent this serialized form (though of course we already have the full ATN as a set of state objects), and then the individual target templates could generate C++ structs or whatever. Many targets may not have such useful critters as structs, though.

Given that I hope we don't have to mess with the ATN serialization ever again, the fastest and lowest risk solution is to tweak what we have. I will make an attempt and then have you folks take a look but it should not be very hard.

KvanTTT Mar 14, 2022

> A possible implementation could be like the PNG chunk structure

I'm afraid it could increase the size of the serialized output because of the type information that has to be serialized. Moreover, unlike PNG, there is no compression for the ATN.

mike-lischke Mar 14, 2022

> Given that I hope we don't have to mess with the ATN serialization ever again, the fastest and lowest risk solution is to tweak what we have. I will make an attempt and then have you folks take a look but it should not be very hard.

OK, if that's a one-time thing then this is probably the way to go.

lppedd Dec 19, 2023

> Java does not properly handle static initialized integer arrays

Where can I find more info on this topic? I didn't know there was an issue about static initialization.

@ericvergnaud I found this discussion by following what you told me yesterday, thanks!

KvanTTT · 2022-03-14T07:48:01Z

KvanTTT
Mar 14, 2022

Using 32-bit integers doubles the size of the output ATN compared to 16-bit. Is this OK?

> shall we do this for 4.10? while we are doing work on the serialization seems like we should fix this and then stop thinking about it.

I think so, because in 4.10 we've changed the ATN format anyway.

4 replies

parrt Mar 18, 2022
Maintainer Author

Yep, it will increase the size, but based on the numbers we've seen it'll be something like 12k -> 24k for a big grammar like SQL. That's probably a reasonable trade to get a simplified ATN that doesn't do any encoding at all. If this becomes an issue in the future, we can always add the high-bit encoding trick. I also like the idea of using -1 instead of trying to encode Token.EOF as 0xFFFF or something.
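One possible reading of the "high-bit encoding trick" is a variable-length scheme: small values take a single 16-bit word, and larger ones take two words, with the top bit of the first word acting as a "second word follows" flag. The Java sketch below is written under that assumption; the names, the exact bit layout, and the treatment of -1 are illustrative choices, not a description of what ANTLR ships.

```java
import java.util.ArrayList;
import java.util.List;

class HighBitSketch {
    // Encode 32-bit values into 16-bit words (each stored in an Integer here).
    static List<Integer> encode(int[] values) {
        List<Integer> words = new ArrayList<>();
        for (int v : values) {
            if (v == -1) {                          // sentinel, e.g. an EOF marker
                words.add(0xFFFF);
                words.add(0xFFFF);
            } else if (v >= 0 && v <= 0x7FFF) {     // fits in 15 bits: one word
                words.add(v);
            } else if (v > 0x7FFF && v < 0x7FFF_FFFF) {
                words.add((v >>> 16) | 0x8000);     // high 15 bits plus flag bit
                words.add(v & 0xFFFF);              // low 16 bits
            } else {
                // 0x7FFFFFFF would collide with the -1 sentinel; other negatives are unsupported.
                throw new IllegalArgumentException("cannot encode " + v);
            }
        }
        return words;
    }

    // Reverse of encode: the flag bit decides whether to consume one or two words.
    static int[] decode(List<Integer> words) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < words.size()) {
            int w = words.get(i++);
            if ((w & 0x8000) == 0) {
                out.add(w);
            } else {
                int low = words.get(i++);
                out.add(w == 0xFFFF && low == 0xFFFF ? -1 : ((w & 0x7FFF) << 16) | low);
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] values = { 0, 5, 0x7FFF, 0x8000, 70000, -1 };
        List<Integer> words = encode(values);
        // Six values need nine 16-bit words here, versus twelve with plain 32-bit storage.
        System.out.println(words.size()); // prints 9
    }
}
```

Since most ATN values are small, a scheme like this would keep the output close to the 16-bit size, at the cost of bringing back a decoding step in every target.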

parrt Mar 18, 2022
Maintainer Author

I'm going to try building this today

parrt Mar 18, 2022
Maintainer Author

Almost done!!!

parrt Mar 19, 2022
Maintainer Author

See #3591

parrt · 2022-03-21T01:23:01Z

parrt
Mar 21, 2022
Maintainer Author

Wow. OK, Java now supports 32-bit ATNs. It handles @KvanTTT's large generated lexer and the case where the token type is 0xFFFF. :) The remaining issue, I think, other than the cleanup you guys suggest, is to handle the shift-by-two thing for Java so that the modified UTF-8 string encoding in the class files does not make 0 and 1 inefficient to represent. I think the easiest answer is to do the bump-by-2 trick unless the 32-bit encoding kicks in. See #3591
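For context on the shift-by-two idea: string constants in class files are stored in Java's modified UTF-8, where the char 0 takes two bytes and chars from 0x800 up (including 0xFFFF) take three, while 1 through 0x7F take one. The sketch below measures this with `DataOutputStream.writeUTF`, which uses the same encoding, and shows how adding 2 modulo 0x10000 moves 0 and 0xFFFF into the single-byte range. The helper names `bump` and `unbump` are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class BumpByTwoSketch {
    // Bytes needed for one char in modified UTF-8, measured via writeUTF.
    static int modifiedUtf8Length(char c) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeUTF(String.valueOf(c));
            return bytes.size() - 2; // writeUTF prepends a 2-byte length field
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Shift a 16-bit value up by 2, wrapping around at 0x10000.
    static char bump(int value) {
        return (char) ((value + 2) & 0xFFFF);
    }

    // Undo the shift when reading the data back.
    static int unbump(char c) {
        return (c - 2) & 0xFFFF;
    }

    public static void main(String[] args) {
        System.out.println(modifiedUtf8Length((char) 0));      // prints 2
        System.out.println(modifiedUtf8Length((char) 0xFFFF)); // prints 3
        System.out.println(modifiedUtf8Length(bump(0)));       // prints 1
        System.out.println(modifiedUtf8Length(bump(0xFFFF)));  // prints 1
    }
}
```

The saving only affects the size of the class file on disk, since the loaded string holds 16-bit chars either way.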

3 replies

mike-lischke Mar 21, 2022

Can I see that large generated Lexer @KvanTTT? I'd like to compare it with stuff I have.

parrt Mar 21, 2022
Maintainer Author

I emailed it to you.

KvanTTT Mar 21, 2022

Sure, sorry for the delay. It's also available as a gist: https://gist.github.com/KvanTTT/301b9d4d97e8b0a74ab1a7d6c0dee715 (or it can be generated using the getAtnStatesSizeMoreThan65535Descriptor test)

parrt · 2022-03-21T16:28:19Z

parrt
Mar 21, 2022
Maintainer Author

I'm thinking that the bump-by-two feature isn't really a big problem. Yes, it creates a bigger class file, but that has all kinds of symbol table stuff and other junk as well. When it is loaded by some fast C code, it unpacks the UTF-8 into 16-bit char strings, so we don't have a memory problem there. I think I will carve out the bump-by-2 stuff. We can revisit it later if for some reason it's a problem.

0 replies
