Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tweaked the fast-avro schema fingerprinting logic #508

Merged

Conversation

FelixGV
Copy link
Collaborator

@FelixGV FelixGV commented Aug 17, 2023

It used to leverage the Avro fingerprinting, which is based on the "parsing canonical form" of the schema, but this ignores properties such as which class to use for String deserialization. This is a problem in cases where we wish to deserialize the same schema both with and without Java Strings, since the cache will return the same deserializer for both, ultimately leading to class cast issues.

The new approach is to simply take the hash code of the full string of the schema.

It used to leverage the Avro fingerprinting, which is based on the
"parsing canonical form" of the schema, but this ignores properties
such as which class to use for String deserialization. This is a
problem in cases where we wish to deserialize the same schema both
with and without Java Strings, since the cache will return the same
deserializer for both, ultimately leading to class cast issues.

The new approach is to simply take the hash code of the full string
of the schema.
schemaId = SchemaNormalization.parsingFingerprint64(schema);
schemaId = schema.toString().hashCode();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two reasons to avoid this approach:

  1. Schema#toString() is not stable. Long story, but it even has bugs under old versions of avro (see AVRO-702)
  2. String#hashCode() isn't necessarily stable either.

We should probably normalize the schema first (you can use AvroUtilSchemaNormalization and write a AvscWriterPlugin (see the FieldLevelPlugin example - something close to that should work for your use case) to include the junk json you want. Here's a test case you can follow as an example).

However, you can also just use AvscWriter to serialize the schema if you don't need to normalize.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by stable? If you mean that it will always yield the same value regardless of Avro version or JVM vendor/version, then that doesn't really matter. We only need this id to be stable within the scope of one runtime of the JVM.

That being said, I can still make it stable in case this ends up being used beyond one runtime in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @dg-builder, I have made the ID generation stable in my latest commit.

@FelixGV FelixGV merged commit 60fa32a into linkedin:master Aug 18, 2023
3 checks passed
@FelixGV
Copy link
Collaborator Author

FelixGV commented Aug 18, 2023

Thanks for reviewing @dg-builder !

flowenol pushed a commit to RTBHOUSE/avro-util that referenced this pull request Aug 23, 2023
It used to leverage the Avro fingerprinting, which is based on the
"parsing canonical form" of the schema, but this ignores properties
such as which class to use for String deserialization. This is a
problem in cases where we wish to deserialize the same schema both
with and without Java Strings, since the cache will return the same
deserializer for both, ultimately leading to class cast issues.

The new approach is to simply take the MD5 hash of the full string
of the schema.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants