Tweaked the fast-avro schema fingerprinting logic #508
It used to leverage the Avro fingerprinting, which is based on the "parsing canonical form" of the schema, but this ignores properties such as which class to use for String deserialization. This is a problem in cases where we wish to deserialize the same schema both with and without Java Strings, since the cache will return the same deserializer for both, ultimately leading to class cast issues. The new approach is to simply take the hash code of the full string of the schema.
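The failure mode described above can be reproduced in miniature without Avro. The sketch below is only an illustration: `ToySchema`, `buildDeserializer`, and `collides` are hypothetical names, and a `StringBuilder` stands in for a non-`String` `CharSequence`. It shows that a cache keyed on a form that drops the string-class property hands back the wrong deserializer, while a key built from the full text does not.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class FingerprintCollisionSketch {
    // Hypothetical stand-in for a schema: a canonical part plus a
    // string-class property that the canonical part does not mention.
    static final class ToySchema {
        final String canonicalForm;
        final boolean useJavaString;

        ToySchema(String canonicalForm, boolean useJavaString) {
            this.canonicalForm = canonicalForm;
            this.useJavaString = useJavaString;
        }

        // Full text includes the property, so the two variants differ.
        String fullText() {
            return canonicalForm + "|java-string=" + useJavaString;
        }
    }

    // Builds a toy deserializer: String when requested, StringBuilder otherwise.
    static Function<String, Object> buildDeserializer(ToySchema schema) {
        if (schema.useJavaString) {
            return payload -> payload;
        }
        return payload -> new StringBuilder(payload);
    }

    // Returns true when the second lookup receives the deserializer that was
    // cached for the first schema, i.e. a non-String value for a String schema.
    static boolean collides(boolean keyOnFullText) {
        Map<String, Function<String, Object>> cache = new HashMap<>();
        ToySchema withoutString = new ToySchema("{\"type\":\"string\"}", false);
        ToySchema withString = new ToySchema("{\"type\":\"string\"}", true);

        for (ToySchema s : new ToySchema[] {withoutString, withString}) {
            String key = keyOnFullText ? s.fullText() : s.canonicalForm;
            cache.computeIfAbsent(key, k -> buildDeserializer(s));
        }
        String lookupKey = keyOnFullText ? withString.fullText() : withString.canonicalForm;
        Object value = cache.get(lookupKey).apply("hello");
        return !(value instanceof String);
    }

    public static void main(String[] args) {
        System.out.println("canonical key collides: " + collides(false));
        System.out.println("full-text key collides: " + collides(true));
    }
}
```

In the real code a caller expecting a `String` would cast the returned value and hit a `ClassCastException`; the sketch only reports the type mismatch.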
- schemaId = SchemaNormalization.parsingFingerprint64(schema);
+ schemaId = schema.toString().hashCode();
Two reasons to avoid this approach:
- Schema#toString() is not stable. Long story, but it even has bugs under old versions of Avro (see AVRO-702).
- String#hashCode() isn't necessarily stable either.
We should probably normalize the schema first: you can use AvroUtilSchemaNormalization and write an AvscWriterPlugin (see the FieldLevelPlugin example; something close to that should work for your use case) to include the junk JSON you want. Here's a test case you can follow as an example.
However, you can also just use AvscWriter to serialize the schema if you don't need to normalize.
What do you mean by stable? If you mean that it will always yield the same value regardless of Avro version or JVM vendor/version, then that doesn't really matter. We only need this id to be stable within the scope of one runtime of the JVM.
That being said, I can still make it stable in case this ends up being used beyond one runtime in the future.
FYI @dg-builder, I have made the ID generation stable in my latest commit.
Thanks for reviewing, @dg-builder!
It used to leverage the Avro fingerprinting, which is based on the "parsing canonical form" of the schema, but this ignores properties such as which class to use for String deserialization. This is a problem in cases where we wish to deserialize the same schema both with and without Java Strings, since the cache will return the same deserializer for both, ultimately leading to class cast issues. The new approach is to simply take the MD5 hash of the full string of the schema.
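For illustration, an MD5-based ID over the full schema text could look roughly like the sketch below. This is not necessarily what the commit does: the class and method names are hypothetical, and folding the first 8 bytes of the 16-byte digest into a `long` is an assumption made here to get a numeric ID. The input is whatever string form of the schema is chosen (for example the output of `schema.toString()`).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class SchemaIdSketch {
    // Hashes the full schema text with MD5 and keeps the first 8 bytes
    // of the digest as a big-endian long (an assumption of this sketch).
    static long md5Id(String schemaText) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(schemaText.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available on this JVM", e);
        }
    }

    public static void main(String[] args) {
        // Two texts that differ only in a string-class property get distinct IDs.
        String plain = "{\"type\":\"string\"}";
        String javaString = "{\"type\":\"string\",\"avro.java.string\":\"String\"}";
        System.out.println(Long.toHexString(md5Id(plain)));
        System.out.println(Long.toHexString(md5Id(javaString)));
    }
}
```

The digest depends only on the bytes of the input, so the ID is reproducible across runs as long as the schema text itself is produced consistently, which is the concern raised in the review about `Schema#toString()`.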