So when the column sizes are equal, there is no need to rewrite, no matter whether the schemas are compatible?
---
@danny0405 @ThinkerLei
I thought about it again. `sameCols.size() == colNamesFromWriteSchema.size()` only happens in the following scenario: the table has new columns, while the old columns have not been changed (no rename or type change).
In this case `sameCols.size() == colNamesFromWriteSchema.size()` holds, and writeSchema is equivalent to a pruned readSchema.
However, some versions of Avro, such as Avro 1.8.x, may report errors when using a pruned schema to read Avro files (Avro 1.10.x has no such problem).
Therefore, even if `sameCols.size() == colNamesFromWriteSchema.size()`, we still need to check the compatibility of the read and write schemas. If they are compatible, we can use this writeSchema directly to read the Avro data.
Maybe we can use logic along the following lines to avoid unnecessary rewrites.
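(The snippet that originally accompanied this comment did not survive in the thread, so here is a rough sketch, not the original code, of what such a check could look like using Avro's `SchemaCompatibility` API. The variable names just mirror the discussion above, and the helper method itself is hypothetical.)

```java
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class RewriteCheckSketch {
  // Hypothetical helper; sameCols / colNamesFromWriteSchema / readSchema /
  // writeSchema mirror the names used in the comment above.
  static boolean needRewrite(List<String> sameCols,
                             List<String> colNamesFromWriteSchema,
                             Schema readSchema, Schema writeSchema) {
    if (sameCols.size() == colNamesFromWriteSchema.size()) {
      // writeSchema is equivalent to a pruned readSchema here, but some Avro
      // versions (e.g. 1.8.x) can still fail on pruned reads, so verify
      // read/write compatibility first. Per the comment's intent, writeSchema
      // plays the role of the (pruned) reader schema; Avro's API takes the
      // reader schema first, then the writer schema.
      SchemaCompatibility.SchemaPairCompatibility compat =
          SchemaCompatibility.checkReaderWriterCompatibility(writeSchema, readSchema);
      if (compat.getType() == SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE) {
        // Compatible: read with writeSchema directly and skip the rewrite.
        return false;
      }
    }
    return true;
  }
}
```

Which schema ends up in each argument position depends on how the surrounding code path actually uses them; the point is only that a compatibility check, rather than a column count alone, gates the rewrite.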
---
@danny0405
This actually raises an additional question.
Currently, when we read a MOR table, we pass the full schema when reading the Avro logs. Even if we only query one column, if the table has, say, 100 Avro log files, using the full schema to read the data and build the BitCaskDiskMap consumes a lot of memory, and the performance is poor.
Now that our version of Avro has been upgraded to 1.10.x, we can pass a pruned schema directly when reading the logs. This makes both the speed and the memory consumption of reading logs and building the spillable map much better.
Forgive me that I cannot paste test screenshots due to company information security reasons, but here are the numbers for Presto reading Hudi logs:
- Full schema: `Total size in bytes of MemoryBasedMap in ExternalSpillableMap => 712,956,000`, final query time: 35672 ms
- Pruned schema: `Total size in bytes of MemoryBasedMap in ExternalSpillableMap => 45,500,000`, final query time: 13373 ms
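(For context, at the Avro level "passing a pruned schema" means constructing the datum reader with the block's writer schema plus a projected reader schema. A minimal sketch, assuming Avro 1.10+ and hypothetical variable names:)

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class PrunedAvroRead {
  // Decodes one Avro-encoded record, materializing only the fields present in
  // prunedSchema; writerSchema is the full schema the bytes were written with.
  static GenericRecord readProjected(byte[] payload, Schema writerSchema,
                                     Schema prunedSchema) throws IOException {
    // A reader built with separate writer/reader schemas performs the
    // projection during decoding, so dropped columns are never materialized.
    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<>(writerSchema, prunedSchema);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
    return reader.read(null, decoder);
  }
}
```

Records held in the spillable map are then only as wide as the query projection, which is where the memory savings above come from.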
---
Good point for optimization. We introduced some changes like the dynamic read schema based on the write schema in release 1.x for the `HoodieFileGroupReader`, but I'm not sure whether it is applied automatically for all the read paths, cc @yihua for confirming this. Anyway, I think we should have such an optimization in the 0.x branch and master for the legacy `HoodieMergedLogRecordReader`, which will still be beneficial to engines like Flink and Hive. @xiarixiaoyao do you have interest in contributing this?
---
This was the logic I initially fixed. Do I still need to make changes based on this PR? cc @danny0405 @xiarixiaoyao
---
Yeah, let's change it back; we'd better have some test cases.
---
will try
---
@xiarixiaoyao This info is valuable. Basically, using a pruned schema to read Avro records is supported on Avro 1.10 and above, but not on lower versions. I see that Spark 3.2+ and all Flink versions use Avro 1.10+. So for these integrations, and others that rely on Avro 1.10+, we should use the pruned schema to read log records to improve performance. I'll check the new file group reader.