Tweaked the fast-avro schema fingerprinting logic #508

FelixGV · 2023-08-17T15:43:10Z

It used to leverage the Avro fingerprinting, which is based on the "parsing canonical form" of the schema, but this ignores properties such as which class to use for String deserialization. This is a problem in cases where we wish to deserialize the same schema both with and without Java Strings, since the cache will return the same deserializer for both, ultimately leading to class cast issues.

The new approach is to simply take the hash code of the full string of the schema.
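The collision described above can be sketched without Avro. This is a toy illustration, not the fast-avro code: `canonicalKey` is a hypothetical stand-in for a fingerprint computed over the parsing canonical form (it simply strips the `avro.java.string` property before hashing), and the cached values are labels rather than real deserializers.

```java
import java.util.HashMap;
import java.util.Map;

class FingerprintCollisionSketch {
  static final String PLAIN = "{\"type\":\"string\"}";
  static final String WITH_JAVA_STRING =
      "{\"type\":\"string\",\"avro.java.string\":\"String\"}";

  // Hypothetical stand-in for a canonical-form fingerprint: the
  // "avro.java.string" property is dropped before hashing.
  static int canonicalKey(String schemaJson) {
    String stripped = schemaJson.replaceAll(
        ",\\s*\"avro\\.java\\.string\"\\s*:\\s*\"[^\"]*\"", "");
    return stripped.hashCode();
  }

  // Registers a labelled placeholder "deserializer" for each schema variant,
  // then looks up the entry for the Java String variant.
  static String lookupJavaStringVariant() {
    Map<Integer, String> cache = new HashMap<>();
    cache.putIfAbsent(canonicalKey(PLAIN), "produces Utf8");
    cache.putIfAbsent(canonicalKey(WITH_JAVA_STRING), "produces java.lang.String");
    return cache.get(canonicalKey(WITH_JAVA_STRING));
  }

  public static void main(String[] args) {
    System.out.println(lookupJavaStringVariant());
  }
}
```

Because both variants map to the same key, the lookup for the Java String variant returns the entry registered for the plain variant, which is the situation that ends in a class cast error.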


dg-builder · 2023-08-17T16:37:31Z

fastserde/avro-fastserde/src/main/java/com/linkedin/avro/fastserde/Utils.java

-      schemaId = SchemaNormalization.parsingFingerprint64(schema);
+      schemaId = schema.toString().hashCode();


Two reasons to avoid this approach:

1. Schema#toString() is not stable. Long story, but it even has bugs under old versions of Avro (see AVRO-702).
2. String#hashCode() isn't necessarily stable either.

We should probably normalize the schema first. You can use AvroUtilSchemaNormalization and write an AvscWriterPlugin to include the junk JSON you want (see the FieldLevelPlugin example; something close to that should work for your use case). Here's a test case you can follow as an example.

However, you can also just use AvscWriter to serialize the schema if you don't need to normalize.

FelixGV · 2023-08-17

What do you mean by stable? If you mean that it will always yield the same value regardless of Avro version or JVM vendor/version, then that doesn't really matter. We only need this ID to be stable within the scope of one run of the JVM.

That being said, I can still make it stable in case this ends up being used beyond one runtime in the future.

FYI @dg-builder, I have made the ID generation stable in my latest commit.

FelixGV · 2023-08-18T11:33:43Z

Thanks for reviewing, @dg-builder!

It used to leverage the Avro fingerprinting, which is based on the "parsing canonical form" of the schema, but this ignores properties such as which class to use for String deserialization. This is a problem in cases where we wish to deserialize the same schema both with and without Java Strings, since the cache will return the same deserializer for both, ultimately leading to class cast issues. The new approach is to simply take the MD5 hash of the full string of the schema.
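As an illustration of the approach described above, the following is a minimal sketch of deriving an ID from the MD5 digest of the schema's full JSON text, using only the JDK. It is not the code merged in this PR: the class and method names are hypothetical, the schema is passed in as a plain string rather than an Avro `Schema`, and folding the digest down to a `long` by taking its first 8 bytes is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class SchemaIdSketch {
  // Hypothetical helper: hashes the full schema text with MD5 and keeps the
  // first 8 of the 16 digest bytes, read big-endian, as the ID.
  static long stableSchemaId(String schemaJson) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5")
          .digest(schemaJson.getBytes(StandardCharsets.UTF_8));
      return ByteBuffer.wrap(digest).getLong();
    } catch (NoSuchAlgorithmException e) {
      // MD5 is one of the algorithms every Java platform must provide.
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  public static void main(String[] args) {
    String plain = "{\"type\":\"string\"}";
    String withJavaString = "{\"type\":\"string\",\"avro.java.string\":\"String\"}";
    System.out.println(Long.toHexString(stableSchemaId(plain)));
    System.out.println(Long.toHexString(stableSchemaId(withJavaString)));
  }
}
```

Since the digest depends only on the input bytes, the same schema text yields the same ID across JVM runs. That only helps if the schema-to-string step is itself stable, which is the concern raised in the review above.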

dg-builder reviewed Aug 17, 2023


Made the id generation stable across runtimes.

cf8bea2

FelixGV requested a review from dg-builder August 18, 2023 00:00

dg-builder approved these changes Aug 18, 2023


FelixGV merged commit 60fa32a into linkedin:master Aug 18, 2023
3 checks passed
