
[HUDI-8340] Fixing functional index record generation using spark distributed computation #12127

Open · wants to merge 9 commits into base: master

Conversation

lokeshj1703
Contributor

Change Logs

Fixing functional index record generation using spark distributed computation.

Impact

NA

Risk level (write none, low, medium or high below)

low

Documentation Update

NA

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@lokeshj1703 lokeshj1703 force-pushed the HUDI-8340-fixingFunctionalIndexUpdates2 branch from bc6dbd7 to 1426d09 Compare October 18, 2024 19:15
@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Oct 18, 2024
  try {
-   return (GenericRecord) payload.getInsertValue(schema).get();
+   return (GenericRecord) (r.getData() instanceof GenericRecord ? r.getData()
+       : ((HoodieRecordPayload) r.getData()).getInsertValue(schema, new Properties()).get());
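For illustration, the conversion pattern above can be sketched with hypothetical stand-in types (these are not the real Hudi classes). The point under discussion is that the payload branch should receive the driver-side write-config Properties rather than a fresh empty one:

```java
import java.util.Optional;
import java.util.Properties;

// Hypothetical stand-in for HoodieRecordPayload, for illustration only.
interface PayloadSketch {
  Optional<Object> getInsertValue(String schema, Properties props);
}

public class RecordConversionSketch {
  // Mirrors the diff above: return the data directly when it is already a
  // materialized record, otherwise deserialize via the payload, threading the
  // driver-side props through instead of constructing new Properties().
  static Object toRecord(Object data, String schema, Properties writeConfigProps) {
    if (!(data instanceof PayloadSketch)) { // stand-in for "instanceof GenericRecord"
      return data;
    }
    return ((PayloadSketch) data).getInsertValue(schema, writeConfigProps).get();
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("hoodie.example.key", "v"); // hypothetical property key
    PayloadSketch payload =
        (schema, p) -> Optional.of("decoded with " + p.stringPropertyNames().size() + " props");
    System.out.println(toRecord("already-a-record", "schema", props));
    System.out.println(toRecord(payload, "schema", props));
  }
}
```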
Contributor Author:

I passed empty properties here. How do we usually pass properties to this function?

Contributor:

Not sure I get your question. We pass it in from the driver only (hoodieWriteConfig.getProps()).

Contributor Author:

Fixed it

@nsivabalan (Contributor) left a comment:

Can you test this once on a real cluster, just to ensure we don't run into a NotSerializableException by any chance.
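Without cluster access, one local sanity check for this concern is to round-trip the captured objects through Java serialization, the same mechanism Spark uses when shipping closures to executors. A minimal sketch (not Hudi code; the property key is illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.util.Properties;

public class SerializabilityCheckSketch {
  // Serializes an object the way Spark would when shipping a closure, so a
  // NotSerializableException surfaces locally instead of on a real cluster.
  static int serializedSize(Object obj) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(obj);
    }
    return bos.size();
  }

  public static void main(String[] args) throws IOException {
    // java.util.Properties is Serializable, so capturing
    // hoodieWriteConfig.getProps() in a task is safe from that standpoint.
    Properties props = new Properties();
    props.setProperty("hoodie.table.name", "t1");
    System.out.println("props serialize to " + serializedSize(props) + " bytes");

    // A plain Object is not Serializable and would fail the same way on a cluster.
    try {
      serializedSize(new Object());
    } catch (NotSerializableException e) {
      System.out.println("caught NotSerializableException as expected");
    }
  }
}
```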

}
})
.map(converterToRow::apply)
// .map(row -> RowFactory.create(path, row))
Contributor:

Can we remove L223 (the commented-out line)?

Contributor Author:

Addressed

@lokeshj1703 (Contributor Author):

Tested the functional index in the Spark shell.

@nsivabalan (Contributor):

@hudi-bot run azure

@nsivabalan (Contributor):

@lokeshj1703 :

Can you test this once on a real cluster, just to ensure we don't run into a NotSerializableException by any chance.

did you try this out?


@Ignore

Contributor:

I remember some Hive sync related test was failing, and hence the entire class is disabled. Unless you know you fixed the Hive sync test, can you revert the unintended changes?

@nsivabalan (Contributor):

nsivabalan commented Oct 19, 2024

Pushed an update to address my own comments.

@codope @lokeshj1703: do we know whether, for the bloom filter based index, we need to maintain stats only for the base file, or for the log files as well?
Maybe we can take it up as a follow-up; I do not want to expand the scope of this patch.

I ask because one of the innermost methods, HoodieMetadataPayload.createBloomFilterMetadataRecord, accepts only a base file. But in general, we have made col stats, and functional index stats with the col stats index type, maintained for every file (base file and log file).

@lokeshj1703 (Contributor Author):

@lokeshj1703 :

Can you test this once on a real cluster, just to ensure we don't run into a NotSerializableException by any chance.

did you try this out?

Yes, I tried it out in the Spark shell.

@nsivabalan nsivabalan force-pushed the HUDI-8340-fixingFunctionalIndexUpdates2 branch from a567a5c to c623adc Compare October 19, 2024 17:44
@@ -433,7 +433,7 @@ private boolean initializeFromFilesystem(String initializationTime, List<Metadat
}
ValidationUtils.checkState(functionalIndexPartitionsToInit.size() == 1, "Only one functional index at a time is supported for now");
partitionName = functionalIndexPartitionsToInit.iterator().next();
-fileGroupCountAndRecordsPair = initializeFunctionalIndexPartition(partitionName);
+fileGroupCountAndRecordsPair = initializeFunctionalIndexPartition(partitionName, commitTimeForPartition);
Contributor:

Be consistent on using commit or instant: commitTimeForPartition and instantTime.

@@ -537,27 +537,41 @@ private Pair<Integer, HoodieData<HoodieRecord>> initializeBloomFiltersPartition(
return Pair.of(fileGroupCount, records);
}

-protected abstract HoodieData<HoodieRecord> getFunctionalIndexRecords(List<Pair<String, FileSlice>> partitionFileSlicePairs,
+protected abstract HoodieData<HoodieRecord> getFunctionalIndexRecords(List<Pair<String, Pair<String, Long>>> partitionFilePathPairs,
Contributor:

Could we rename partitionFilePathPairs to include the size, or use StoragePathInfo instead of Pair<String, Long>, to maintain readability?
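One way to act on this suggestion, sketched with a hypothetical value type (StoragePathInfo is the existing alternative the reviewer mentions; this class is illustrative only), is a small named carrier so the path-plus-size pairing is self-describing instead of an anonymous Pair<String, Long>:

```java
import java.util.Objects;

// Hypothetical value type illustrating the readability suggestion: a named
// file-path-plus-size carrier rather than Pair<String, Long>.
public final class FilePathAndSize {
  private final String path;
  private final long size;

  public FilePathAndSize(String path, long size) {
    this.path = Objects.requireNonNull(path);
    this.size = size;
  }

  public String getPath() {
    return path;
  }

  public long getSize() {
    return size;
  }

  @Override
  public String toString() {
    return "FilePathAndSize{path=" + path + ", size=" + size + "}";
  }
}
```

A signature like getFunctionalIndexRecords(List<Pair<String, FilePathAndSize>> partitionFilePathPairs, ...) would then read unambiguously.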

@hudi-bot

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build


5 participants