Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fastserde] Logical types are now supported for avro version 1.9+ #516

Merged

Conversation

krisso-rtb
Copy link
Contributor

The goal of this PR is to allow fastserde (de-)serializing records with logical-types.

I realize it's a quite big change so we did our best to verify it works fine... and no new bugs are added :)


Commits guide:

  1. Adding unit tests to verify LogicalTypes support. - TDD approach, and failing tests are marked with @Ignored
  2. minor code cleanup - self explanatory
  3. Generated classes - BEFORE - temporarily adding all generated classes before the main changes
  4. modelData (GenericData or SpecificData instance) passed to FastSerdeBase - should be part of the main commit (5.) however it's extracted as separate commit to make the main commit cleaner
  5. [fast-avro] Logical types are now supported by fast-avro in version 1.9+ - the heart of this PR, tests from (1.) are unignored.
  6. Generated fast-avro classes AFTER (logical-types related). - temporary commit to see how main commit affected generated classes
  7. Generated fast-avro classes AFTER (side effects of supporting logical-types). - same as above, could be considered as "side effects"
  8. Generated classes moved back to .gitignore - drops changes from all temporary commits (3, 6 and 7)

Due to limitations of current Gradle's config it's impossible to generate Avro classes with custom logicalTypes. To verify this PR works fine in that case as well I created Unit Test in external (dedicated) project:

which uses dependency built locally from this PR's branch:

<dependency>
    <groupId>com.linkedin.avroutil1</groupId>
    <artifactId>avro-fastserde</artifactId>
    <version>0.2.3-SNAPSHOT</version>
</dependency>

@codecov-commenter
Copy link

codecov-commenter commented Sep 14, 2023

Codecov Report

Attention: 30 lines in your changes are missing coverage. Please review.

Comparison is base (917b1b9) 45.83% compared to head (995dd6d) 45.77%.

Additional details and impacted files
@@             Coverage Diff              @@
##             master     #516      +/-   ##
============================================
- Coverage     45.83%   45.77%   -0.06%     
+ Complexity     4440     4438       -2     
============================================
  Files           398      403       +5     
  Lines         28040    28070      +30     
  Branches       4622     4622              
============================================
- Hits          12851    12850       -1     
- Misses        13634    13663      +29     
- Partials       1555     1557       +2     
Files Coverage Δ
.../linkedin/avroutil1/compatibility/AvroAdapter.java 42.30% <ø> (ø)
...vroutil1/compatibility/avro110/Avro110Adapter.java 63.33% <0.00%> (-0.54%) ⬇️
...vroutil1/compatibility/avro111/Avro111Adapter.java 63.33% <0.00%> (-0.54%) ⬇️
.../avroutil1/compatibility/avro14/Avro14Adapter.java 53.48% <0.00%> (-0.51%) ⬇️
...bility/avro14/backports/GenericDatumWriterExt.java 0.00% <0.00%> (ø)
.../avroutil1/compatibility/avro15/Avro15Adapter.java 57.93% <0.00%> (-0.51%) ⬇️
...bility/avro15/backports/GenericDatumReaderExt.java 0.00% <0.00%> (ø)
...bility/avro15/backports/GenericDatumWriterExt.java 0.00% <0.00%> (ø)
.../avroutil1/compatibility/avro16/Avro16Adapter.java 57.64% <0.00%> (-0.51%) ⬇️
...bility/avro16/backports/GenericDatumReaderExt.java 0.00% <0.00%> (ø)
... and 5 more

... and 3 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@FelixGV FelixGV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the big code drop! The commit breakdown definitely helps navigate the change more easily. It LGTM overall. I left some minor comments and questions.

}

protected <T extends GenericData> void fixConversionsIfAvro19(T modelData) {
if (AvroCompatibilityHelperCommon.getRuntimeAvroVersion() == AvroVersion.AVRO_1_9) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this issue specific to 1.9 or should the code be run in later versions as well? Should this condition be AvroCompatibilityHelperCommon.getRuntimeAvroVersion().laterThan(AvroVersion.AVRO_1_8) instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I revisited this method and new TimeConversions.TimeMillisConversion() is not needed (will be removed in fix-commit).

Other 3 conversions

new Conversions.DecimalConversion()
new TimeConversions.TimeMicrosConversion()
new TimeConversions.TimestampMicrosConversion()

must be manually added because they're missing. They're actually "lost" when downgrading avro from 1.10.2 to 1.10.1:

@@ -444,6 +464,61 @@ public JClass findStringClass(Schema schema) {
return outputClass;
}

boolean logicalTypeEnabled(Schema schema) {
// return modelData != null && schema != null && Utils.isLogicalTypeSupported() && schema.getLogicalType() != null;
// TODO uuid in 1.10 no supported ??
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, weird... I see UUIDs mentioned in the spec as far back as 1.9... but maybe in the actual implementation it's not properly supported until 1.11? IDK... this is the kind of minute detail @radai-rosenblatt would know :D ...

Copy link
Contributor Author

@krisso-rtb krisso-rtb Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, you reminded me something.
I did an experiment to see how generated Avro classes change when avro version is downgraded:

private java.util.UUID uuidField;   // avro 1.11.1
private java.lang.String uuidField; // avro 1.11.0

from the same .avsc file.
(just take a look at diff for FastSerdeLogicalTypesTest1.java)

Copy link
Collaborator

@FelixGV FelixGV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for simplifying the code in the latest commit.

I have a high-level question about the strategy for this work...

Given that which supported logical type varies by Avro version (even by patch versions, unfortunately...), I wonder what is the ideal way to approach this?

I believe the approach you took in this PR is to take all of the logical types supported in the latest version of Avro, and if the runtime is Avro 1.9 or later, then use all of these logical types (I assume this is the case because of the missingConversions that you are registering into the modelData...). First of all, is this the right understanding of the current implementation? If it is, then I have some concerns, namely:

  1. If using a SpecificRecord class that was generated in a version of Avro older than the latest, which is missing some of the logical types, then would we not have a class cast exception when trying to put an object of the latest logical type into a field that was generated to not have such type?
  2. If using a GenericRecord, in theory we can put anything in any fields since it's just an array of Objects internally... however, during the period of time when fast-avro is "warming up" (i.e. before the serde compilation finishes), then we'll be falling back on vanilla avro... so does that mean the user will see older types for a brief period of time and later on see the latest logical types?

Both of the above behaviors seem wrong... so I wonder if we should instead do some runtime introspection to figure out what the right type to use is, at the time of the serde code gen, and then use the right type accordingly in the generated code. For a SpecificRecord serde, the right type could be figured out by using reflection to introspect the fields of the SpecificRecord class. For GenericRecord, perhaps we could synthetically generate a payload based on the schema, deserialize it with vanilla Avro, and introspect what type each field is...

It is troublesome, but I think it would avoid the issues mentioned above.

Alternatively, we could say that fast-avro only supports logical types on the most recent of all Avro versions, and for that version it supports all the types... This would be roughly equivalent to the current PR, but we would just bump the minimum Avro version to activate the logical type behavior... Easier to implement, though technically still incorrect in terms of Avro 1.9 and 1.10, but at least it's easier to explain to users what to expect.

Maybe I'm twisting myself into knots over this and it need not be this complicated... IDK. What do you think?

@krisso-rtb
Copy link
Contributor Author

krisso-rtb commented Sep 20, 2023

Thanks for simplifying the code in the latest commit.

I have a high-level question about the strategy for this work...

Given that which supported logical type varies by Avro version (even by patch versions, unfortunately...), I wonder what is the ideal way to approach this?

I believe the approach you took in this PR is to take all of the logical types supported in the latest version of Avro, and if the runtime is Avro 1.9 or later, then use all of these logical types (I assume this is the case because of the missingConversions that you are registering into the modelData...). First of all, is this the right understanding of the current implementation? If it is, then I have some concerns, namely:

  1. If using a SpecificRecord class that was generated in a version of Avro older than the latest, which is missing some of the logical types, then would we not have a class cast exception when trying to put an object of the latest logical type into a field that was generated to not have such type?
  2. If using a GenericRecord, in theory we can put anything in any fields since it's just an array of Objects internally... however, during the period of time when fast-avro is "warming up" (i.e. before the serde compilation finishes), then we'll be falling back on vanilla avro... so does that mean the user will see older types for a brief period of time and later on see the latest logical types?

Both of the above behaviors seem wrong... so I wonder if we should instead do some runtime introspection to figure out what the right type to use is, at the time of the serde code gen, and then use the right type accordingly in the generated code. For a SpecificRecord serde, the right type could be figured out by using reflection to introspect the fields of the SpecificRecord class. For GenericRecord, perhaps we could synthetically generate a payload based on the schema, deserialize it with vanilla Avro, and introspect what type each field is...

It is troublesome, but I think it would avoid the issues mentioned above.

Alternatively, we could say that fast-avro only supports logical types on the most recent of all Avro versions, and for that version it supports all the types... This would be roughly equivalent to the current PR, but we would just bump the minimum Avro version to activate the logical type behavior... Easier to implement, though technically still incorrect in terms of Avro 1.9 and 1.10, but at least it's easier to explain to users what to expect.

Maybe I'm twisting myself into knots over this and it need not be this complicated... IDK. What do you think?

  1. I assumed it's the responsibility of fast-avro user to provide correct modelData (which is either GenericData or SpecificData). If specific record class has been generated years ago then we must provide specificData with no conversions (or simply null) to FastSpecificDatumReader.
    When it comes to (de)serialization, conversion classes can't be deducted from the schema. That's why we need modelData. What's more - implementation of custom conversion class may change between these two points in time:
  • generation of Avro specific record class
  • actual (de)serialization
    Providing correct modelData is as important as providing correct schema.
  1. During "warm up" period we use standard Avro (de)serializers which also receive modelData, e.g.
protected FastDeserializer<T> getRegularAvroImpl(Schema writerSchema, Schema readerSchema, GenericData modelData) {
  return new FastSerdeCache.FastDeserializerWithAvroGenericImpl<>(writerSchema, readerSchema, modelData);
}

so they use the same conversion classes. It's implicitly tested but just in case I will explicitly add these classes to LogicalTypesTestBase as well:

  • ColdGenericDatumReader
  • ColdSpecificDatumReader
  • FastDeserializerWithAvroGenericImpl
  • FastDeserializerWithAvroSpecificImpl
  • FastSerializerWithAvroGenericImpl
  • FastSerializerWithAvroSpecificImpl

The hacks/tricks in unit tests, represented by fixConversionsIfAvro19() method, come from the fact generated Avro classes are inconsistent.
E.g. FastSerdeLogicalTypesTest1 in Avro 1.9 has field for "logicalType": "time-micros" but corresponding conversion is not added to MODEL$

static {
    MODEL$.addLogicalTypeConversion(new org.apache.avro.data.TimeConversions.DateConversion());
    MODEL$.addLogicalTypeConversion(new org.apache.avro.data.TimeConversions.TimestampMillisConversion());
    MODEL$.addLogicalTypeConversion(new org.apache.avro.data.TimeConversions.TimeMillisConversion());
  }
  ...
private java.time.LocalTime timeMicrosField;

@krisso-rtb krisso-rtb force-pushed the origin/logical-types-in-fast-avro-v3 branch 2 times, most recently from 62d55fe to 07ec099 Compare September 27, 2023 17:15
@krisso-rtb krisso-rtb force-pushed the origin/logical-types-in-fast-avro-v3 branch from 07ec099 to 995dd6d Compare September 27, 2023 17:41
@krisso-rtb
Copy link
Contributor Author

krisso-rtb commented Sep 27, 2023

I pushed some updates:

  • branch rebase
  • new unit test to cover default values (just in case, no issues found)
  • previous code-review fixes were added to corresponding commits

We ran quite a few benchmarks to see how fast-avro works with logical types.
As part of the benchmark we run verification to confirm (de-)serialization with fast-serde returns exactly the same results as with vanilla readers/writers. Very similar to the unit tests added in this PR.
For the records with no logical-types we also compared benchmark with fast-serde from master - very much the same.
I collected representative results here:

If we could somehow help with this PR please let us know.

Copy link
Collaborator

@FelixGV FelixGV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing to push on this. I discussed it offline with @radai-rosenblatt, and we have a minor concern, but I don't think it should be a blocker given that this new capability is opt-in.

The concern is that there is maybe too much trust that the users will do the right thing. In our internal ecosystem, we usually cannot make assumptions about which version of Avro a SR was generated from and need to be able to deal with all of them. Furthermore, some SR may be embedded in libraries and mixed with other libraries or main application code and each of these sources of code may have been touched by different versions of Avro. We think there may be ways to make the work begun here even more robust by introspecting types at serde code gen time, rather than relying on the user to pass in the right mappings in their modelData... but I don't think that should block this from being merged, since as I said, it is opt-in.

It is great to see support for logical types finally added 😄 🎉 ...

@FelixGV FelixGV merged commit 73acd3c into linkedin:master Sep 27, 2023
2 checks passed
@krisso-rtb krisso-rtb deleted the origin/logical-types-in-fast-avro-v3 branch September 27, 2023 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants