
[HUDI-8340] Fixing functional index record generation using spark distributed computation #12127

Open · wants to merge 9 commits into base: master

Conversation

lokeshj1703
Contributor

Change Logs

Fixing functional index record generation using spark distributed computation.

Impact

NA

Risk level (write none, low, medium or high below)

low

Documentation Update

NA

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@lokeshj1703 lokeshj1703 force-pushed the HUDI-8340-fixingFunctionalIndexUpdates2 branch from bc6dbd7 to 1426d09 Compare October 18, 2024 19:15
@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Oct 18, 2024
  try {
-   return (GenericRecord) payload.getInsertValue(schema).get();
+   return (GenericRecord) (r.getData() instanceof GenericRecord ? r.getData()
+       : ((HoodieRecordPayload) r.getData()).getInsertValue(schema, new Properties()).get());
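For illustration, the conversion pattern above can be sketched with hypothetical stand-in types (these are not the real Hudi classes). The point under discussion is that the payload branch should receive the driver-side write-config Properties rather than a fresh empty one:

```java
import java.util.Optional;
import java.util.Properties;

// Hypothetical stand-in for HoodieRecordPayload, for illustration only.
interface PayloadSketch {
  Optional<Object> getInsertValue(String schema, Properties props);
}

public class RecordConversionSketch {
  // Mirrors the diff above: return the data directly when it is already a
  // materialized record, otherwise deserialize via the payload, threading the
  // driver-side props through instead of constructing new Properties().
  static Object toRecord(Object data, String schema, Properties writeConfigProps) {
    if (!(data instanceof PayloadSketch)) { // stand-in for "instanceof GenericRecord"
      return data;
    }
    return ((PayloadSketch) data).getInsertValue(schema, writeConfigProps).get();
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("hoodie.example.key", "v"); // hypothetical property key
    PayloadSketch payload =
        (schema, p) -> Optional.of("decoded with " + p.stringPropertyNames().size() + " props");
    System.out.println(toRecord("already-a-record", "schema", props));
    System.out.println(toRecord(payload, "schema", props));
  }
}
```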
Contributor Author:

I passed empty properties here. How do we usually pass properties to this function?

Contributor:

Not sure I get your question. We pass it in from the driver only (hoodieWriteConfig.getProps()).

Contributor Author:

Fixed it

@nsivabalan (Contributor) left a comment:

Can you test this once on a real cluster, just to ensure we don't run into a NotSerializableException by any chance.
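Without cluster access, one local sanity check for this concern is to round-trip the captured objects through Java serialization, the same mechanism Spark uses when shipping closures to executors. A minimal sketch (not Hudi code; the property key is illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.util.Properties;

public class SerializabilityCheckSketch {
  // Serializes an object the way Spark would when shipping a closure, so a
  // NotSerializableException surfaces locally instead of on a real cluster.
  static int serializedSize(Object obj) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(obj);
    }
    return bos.size();
  }

  public static void main(String[] args) throws IOException {
    // java.util.Properties is Serializable, so capturing
    // hoodieWriteConfig.getProps() in a task is safe from that standpoint.
    Properties props = new Properties();
    props.setProperty("hoodie.table.name", "t1");
    System.out.println("props serialize to " + serializedSize(props) + " bytes");

    // A plain Object is not Serializable and would fail the same way on a cluster.
    try {
      serializedSize(new Object());
    } catch (NotSerializableException e) {
      System.out.println("caught NotSerializableException as expected");
    }
  }
}
```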

}
})
.map(converterToRow::apply)
// .map(row -> RowFactory.create(path, row))
Contributor:

Can we remove L223 (the commented-out line)?

Contributor Author:

Addressed

@lokeshj1703 (Contributor Author):

Tested the functional index in the Spark shell.

@nsivabalan (Contributor):

@hudi-bot run azure

@nsivabalan (Contributor):

@lokeshj1703 :

Can you test this once on a real cluster, just to ensure we don't run into a NotSerializableException by any chance.

did you try this out?


@Ignore

Contributor:

I remember some Hive sync related test was failing, and hence the entire class is disabled. Unless you know you fixed the Hive sync test, can you revert the unintended changes?

@nsivabalan (Contributor):

nsivabalan commented Oct 19, 2024

Pushed an update to address my own comments.

@codope @lokeshj1703: do we know whether, for the bloom filter based index, we need to maintain stats only for the base file, or for the log files as well?
Maybe we can take it up as a follow-up; I do not want to expand the scope of this patch.

I ask because one of the innermost methods, HoodieMetadataPayload.createBloomFilterMetadataRecord, accepts only a base file. But in general, we have made col stats, and functional index stats with the col stats index type, maintained for every file (base file and log file).

@lokeshj1703 (Contributor Author):

@lokeshj1703 :

Can you test this once on a real cluster, just to ensure we don't run into a NotSerializableException by any chance.

did you try this out?

Yes, I tried it out in the Spark shell.

@nsivabalan nsivabalan force-pushed the HUDI-8340-fixingFunctionalIndexUpdates2 branch from a567a5c to c623adc Compare October 19, 2024 17:44
@@ -433,7 +433,7 @@ private boolean initializeFromFilesystem(String initializationTime, List<Metadat
}
ValidationUtils.checkState(functionalIndexPartitionsToInit.size() == 1, "Only one functional index at a time is supported for now");
partitionName = functionalIndexPartitionsToInit.iterator().next();
-fileGroupCountAndRecordsPair = initializeFunctionalIndexPartition(partitionName);
+fileGroupCountAndRecordsPair = initializeFunctionalIndexPartition(partitionName, commitTimeForPartition);
Contributor:

Be consistent on using commit or instant: commitTimeForPartition and instantTime.

@@ -537,27 +537,41 @@ private Pair<Integer, HoodieData<HoodieRecord>> initializeBloomFiltersPartition(
return Pair.of(fileGroupCount, records);
}

-protected abstract HoodieData<HoodieRecord> getFunctionalIndexRecords(List<Pair<String, FileSlice>> partitionFileSlicePairs,
+protected abstract HoodieData<HoodieRecord> getFunctionalIndexRecords(List<Pair<String, Pair<String, Long>>> partitionFilePathPairs,
Contributor:

Could we rename partitionFilePathPairs to include the size, or use StoragePathInfo instead of Pair<String, Long>, to maintain readability?
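One way to act on this suggestion, sketched with a hypothetical value type (StoragePathInfo is the existing alternative the reviewer mentions; this class is illustrative only), is a small named carrier so the path-plus-size pairing is self-describing instead of an anonymous Pair<String, Long>:

```java
import java.util.Objects;

// Hypothetical value type illustrating the readability suggestion: a named
// file-path-plus-size carrier rather than Pair<String, Long>.
public final class FilePathAndSize {
  private final String path;
  private final long size;

  public FilePathAndSize(String path, long size) {
    this.path = Objects.requireNonNull(path);
    this.size = size;
  }

  public String getPath() {
    return path;
  }

  public long getSize() {
    return size;
  }

  @Override
  public String toString() {
    return "FilePathAndSize{path=" + path + ", size=" + size + "}";
  }
}
```

A signature like getFunctionalIndexRecords(List<Pair<String, FilePathAndSize>> partitionFilePathPairs, ...) would then read unambiguously.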

@hudi-bot

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build


5 participants