[HUDI-6787] Integrate the new file group reader with Hive query engine #10422

Merged: 43 commits into apache:master on Jun 9, 2024

Conversation

@jonvex (Contributor) commented Dec 28, 2023:

Change Logs

Replace the existing Hive read logic with the new file group reader.

HoodieFileGroupReader is the generic file group reader implementation intended to be used by all engines. I created HoodieFileGroupReaderRecordReader, which implements RecordReader. HoodieFileGroupReaderRecordReader uses HoodieFileGroupReader with HiveHoodieReaderContext to read file groups (COW, MOR, bootstrap) with the Hive/Hadoop engine.
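
For illustration only, here is a rough sketch (not the actual HoodieFileGroupReaderRecordReader) of the wrapping pattern described above: a Hadoop RecordReader<NullWritable, ArrayWritable> that delegates to an iterator of merged rows produced by an engine-agnostic file group reader. The class name and the records iterator are hypothetical stand-ins.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.RecordReader;

// Sketch: adapts an iterator of merged rows (assumed to come from the file group reader)
// to the Hadoop mapred RecordReader contract that Hive consumes.
public class FileGroupBackedRecordReaderSketch implements RecordReader<NullWritable, ArrayWritable> {
  private final Iterator<ArrayWritable> records; // hypothetical: rows emitted by the file group reader
  private long position = 0;

  public FileGroupBackedRecordReaderSketch(Iterator<ArrayWritable> records) {
    this.records = records;
  }

  @Override
  public boolean next(NullWritable key, ArrayWritable value) throws IOException {
    if (!records.hasNext()) {
      return false;
    }
    // Copy the next merged row into the reusable value container Hive hands us.
    value.set(records.next().get());
    position++;
    return true;
  }

  @Override
  public NullWritable createKey() {
    return NullWritable.get();
  }

  @Override
  public ArrayWritable createValue() {
    return new ArrayWritable(Writable.class);
  }

  @Override
  public long getPos() {
    return position;
  }

  @Override
  public float getProgress() {
    return 0.0f; // progress reporting omitted in this sketch
  }

  @Override
  public void close() throws IOException {
    // The real reader would also close the underlying file group reader here.
  }
}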

Impact

Hive will be more maintainable.

Risk level (write none, low, medium or high below)

High
Need to do lots of testing.

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@jonvex (Contributor Author) commented Jan 4, 2024:

@vinothchandar

@vinothchandar self-assigned this on Jan 5, 2024.
@vinothchandar changed the title from "[HUDI-6787] Implement the HoodieFileGroupReader API for Hive" to "[WIP] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive" on Jan 8, 2024.
@vinothchandar (Member) left a comment:

Looks very promising.

packaging/bundle-validation/validate.sh (outdated; resolved)
@@ -116,6 +117,7 @@ public void setUp() {
hadoopConf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
baseJobConf = new JobConf(hadoopConf);
baseJobConf.set(HoodieMemoryConfig.MAX_DFS_STREAM_BUFFER_SIZE.key(), String.valueOf(1024 * 1024));
baseJobConf.set(HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key(), "false");
Member:

Why "false"?

Contributor Author (@jonvex):

This test directly creates the record reader instead of reading through the file format:

HoodieRealtimeRecordReader recordReader = new HoodieRealtimeRecordReader(split, jobConf, reader);

Looking back at this, I think I might be able to update this test to use the new fg reader.

Contributor Author (@jonvex):

Tried again and still couldn't get them working.

Option<ClosableIterator<T>> skeletonFileIterator = requiredFields.getLeft().isEmpty() ? Option.empty() :
Option.of(readerContext.getFileRecordIterator(baseFile.getHadoopPath(), 0, baseFile.getFileLen(),
createSchemaFromFields(allFields.getLeft()), createSchemaFromFields(requiredFields.getLeft()), hadoopConf));
Option<Pair<ClosableIterator<T>,Schema>> dataFileIterator =
Member:

It's cool that we are able to add a new engine without many changes to this class.

if (!(split instanceof FileSplit) || !checkTableIsHudi(split, job)) {
return super.getRecordReader(split, job, reporter);
}
if (supportAvroRead && HoodieColumnProjectionUtils.supportTimestamp(job)) {
Member:

Note to self: dig into these.

}

public static List<String> getPartitionFieldNames(JobConf jobConf) {
String partitionFields = jobConf.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, "");
Contributor:

@xicm Can you help confirm this part?


@Override
public String getRecordKey(Schema recordSchema, Option<BaseKeyGenerator> keyGeneratorOpt) {
throw new UnsupportedOperationException("Not supported for HoodieHiveRecord");
Contributor:

Why is this not supported?

Contributor Author (@jonvex):

It wasn't needed for reading, so I didn't implement it.

Member:

Shall we implement getRecordKey? It might be useful later. I guess we only need to call getValue() for the record key field, isn't it?


@Override
public HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) {
throw new UnsupportedOperationException("Not supported for HoodieHiveRecord");
Contributor:

Why is this not supported?

Contributor Author (@jonvex):

It wasn't needed for reading, so I didn't implement it.

public class HoodieHiveRecordMerger implements HoodieRecordMerger {
@Override
public Option<Pair<HoodieRecord, Schema>> merge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, TypedProperties props) throws IOException {
ValidationUtils.checkArgument(older.getRecordType() == HoodieRecord.HoodieRecordType.HIVE);
Contributor:

Not sure why we need to override the merge logic?

Contributor Author (@jonvex):

I just copied what HoodieSparkRecordMerger does.
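
For reference, a rough latest-wins sketch of the merge contract quoted above; this is not the actual HoodieHiveRecordMerger (which, per this thread, mirrors HoodieSparkRecordMerger), and the package imports are assumptions about the usual Hudi layout.

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.common.util.ValidationUtils;
import org.apache.hudi.common.util.collection.Pair;

public class LatestWinsMergeSketch {
  public static Option<Pair<HoodieRecord, Schema>> merge(HoodieRecord older, Schema oldSchema,
                                                         HoodieRecord newer, Schema newSchema,
                                                         TypedProperties props) throws IOException {
    // Both sides must be Hive-typed records, matching the check in the quoted snippet.
    ValidationUtils.checkArgument(older.getRecordType() == HoodieRecord.HoodieRecordType.HIVE);
    ValidationUtils.checkArgument(newer.getRecordType() == HoodieRecord.HoodieRecordType.HIVE);
    // Commit-time semantics: the record from the newer slice wins. The real merger
    // would also need to handle deletes and ordering values.
    return Option.of(Pair.of(newer, newSchema));
  }
}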

TypeInfoUtils.getTypeInfosFromTypeString(columnTypeList.get(i).getQualifiedName()).get(0)));

StructTypeInfo rowTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(columnNameList, columnTypeList);
ArrayWritableObjectInspector objectInspector = new ArrayWritableObjectInspector(rowTypeInfo);
Contributor:

@xicm @xiarixiaoyao can you help confirm the correctness?

Contributor:

There may be compatibility issues between Hive 2 and Hive 3 for DATE and TIMESTAMP types.

Contributor:

> There may be compatibility issues between Hive 2 and Hive 3 for DATE and TIMESTAMP types.

I think Hive will handle this itself.

Contributor Author (@jonvex):

FYI this is pretty much a copy of

Contributor:

I tested with Hive 3; this works well.

@Override
public RecordReader<NullWritable, ArrayWritable> getRecordReader(final InputSplit split, final JobConf job,
final Reporter reporter) throws IOException {

if (HoodieFileGroupReaderRecordReader.useFilegroupReader(job)) {
try {
@xicm (Contributor) commented Jan 29, 2024:

We need to confirm that the values of "hive.io.file.readcolumn.names" and "hive.io.file.readcolumn.ids" in the JobConf contain the partition fields; if not, Hive 3 partition queries return null. See #7355.
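
As a minimal sketch of that check (illustrative only; the PR itself relies on the existing projection utilities discussed further down, and the helper name here is made up), one could verify the read-column names include the partition fields before handing the JobConf to the reader:

import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.hadoop.mapred.JobConf;

public class ReadColumnCheckSketch {
  // Appends any missing partition field names to hive.io.file.readcolumn.names.
  public static void ensurePartitionFieldsProjected(JobConf job, String[] partitionFields) {
    Set<String> readColumns = new LinkedHashSet<>();
    for (String col : job.get("hive.io.file.readcolumn.names", "").split(",")) {
      if (!col.isEmpty()) {
        readColumns.add(col);
      }
    }
    for (String field : partitionFields) {
      if (!field.isEmpty()) {
        readColumns.add(field);
      }
    }
    // "hive.io.file.readcolumn.ids" would have to be kept in sync as well.
    job.set("hive.io.file.readcolumn.names", String.join(",", readColumns));
  }
}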

Contributor Author (@jonvex):

In createRequestedSchema(Schema tableSchema, JobConf jobConf), I remove the partition fields from the read columns if the Parquet file doesn't contain them. Does that help?

Contributor:

The problem is that Parquet files in plain Hive tables don't contain the partition fields, while Hudi Parquet files do.

}
}, split, job, reporter);
} else {
return new HoodieFileGroupReaderRecordReader(super::getRecordReader, split, job, reporter);
Contributor:

I tested with Hive 3.1.2; partition queries return an empty result with this reader.

Contributor:

We can move

new SchemaEvolutionContext(split, job).doEvolutionForParquetFormat();
if (LOG.isDebugEnabled()) {
LOG.debug("EMPLOYING DEFAULT RECORD READER - " + split);
}
HoodieRealtimeInputFormatUtils.addProjectionField(job, job.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, "").split("/"));
to the top of this method.

Contributor Author (@jonvex):

I added your suggestion. Could you please let me know if that fixes the issue? Thanks!

Contributor:

Fixed.

partitionColumns = Arrays.stream(partitionColString.split(",")).collect(Collectors.toSet());
}
//if they are actually written to the file, then it is ok to read them from the file
tableSchema.getFields().forEach(f -> partitionColumns.remove(f.name().toLowerCase(Locale.ROOT)));
Contributor:

I'm confused, will partitionColumns always be empty?

Contributor Author (@jonvex):

partitionColumns won't be empty if any of the partition columns are not written to the file. tableSchema only has the columns that are written to the file. This is the case in the docker demo (https://hudi.apache.org/docs/docker_demo#step-3-sync-with-hive). If you look at the data in https://github.com/apache/hudi/tree/master/docker/demo/data, there is no field named "dt".
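
To make that concrete, here is a small illustrative sketch (class and method names are made up) of the filtering shown in the snippet above: start from Hive's declared partition columns and drop the ones that are physically present in the table schema, leaving only the columns that must be filled in from the partition path.

import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.avro.Schema;

public class PartitionColumnFilterSketch {
  public static Set<String> partitionColumnsNotInFile(String partitionColString, Schema tableSchema) {
    Set<String> partitionColumns = Arrays.stream(partitionColString.split(","))
        .filter(c -> !c.isEmpty())
        .map(c -> c.toLowerCase(Locale.ROOT))
        .collect(Collectors.toSet());
    // If a partition column is actually written to the file, it can be read from the file directly.
    tableSchema.getFields().forEach(f -> partitionColumns.remove(f.name().toLowerCase(Locale.ROOT)));
    // Whatever remains (e.g., "dt" in the docker demo) has to be synthesized from the partition path.
    return partitionColumns;
  }
}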

@danny0405 (Contributor):

@bvaradar Can you help with the review of the Hive-related code?

@bvaradar (Contributor) commented Feb 7, 2024:

> @bvaradar Can you help with the review of the Hive-related code?

Yes @danny0405. Will review this PR.

@jonvex (Contributor Author) commented Feb 7, 2024:

Thanks @bvaradar, we all appreciate it!

@bvaradar (Contributor) left a comment:

Overall looks good to me. @jonvex: What Hive versions are we targeting/testing?

}

@Override
public String getRecordKey(ArrayWritable record, Schema schema) {
Contributor:

Isn't this method already defined in the same way in the base class?

Contributor Author (@jonvex):

In the base class it just uses the meta column, while here we use the actual field if meta columns are disabled. TBH, maybe that should also be the case for the base class?
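
A rough sketch of that behaviour (illustrative only, not the actual method; the keyFieldName parameter stands in for however the real code resolves the configured key field): prefer the populated meta column, and fall back to the actual record key field when meta fields are disabled.

import org.apache.avro.Schema;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Writable;

public class RecordKeyFallbackSketch {
  // Hudi's record key metadata column; hard-coded here to keep the sketch self-contained.
  private static final String RECORD_KEY_META_FIELD = "_hoodie_record_key";

  public static String getRecordKey(ArrayWritable record, Schema schema, String keyFieldName) {
    String fromMeta = readField(record, schema, RECORD_KEY_META_FIELD);
    if (fromMeta != null && !fromMeta.isEmpty()) {
      return fromMeta;
    }
    // Meta columns are disabled (or empty), so read the actual key field instead.
    return readField(record, schema, keyFieldName);
  }

  private static String readField(ArrayWritable record, Schema schema, String fieldName) {
    Schema.Field field = schema.getField(fieldName);
    if (field == null) {
      return null;
    }
    Writable value = record.get()[field.pos()];
    return value == null ? null : value.toString();
  }
}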

@jonvex (Contributor Author) commented Feb 20, 2024:

> Overall looks good to me. @jonvex: What Hive versions are we targeting/testing?

@bvaradar I used the docker demo to test. I think that is using Hive 2. We would like this to replace the existing implementation, so the goal is to support everything that works when the file group reader is disabled.

@github-actions bot added the size:XL (PR with lines of changes > 1000) label on Feb 26, 2024.
@codope changed the title from "[WIP] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive" to "[HUDI-6787] Implement the HoodieFileGroupReader API for Hive" on May 31, 2024.
@hudi-bot commented Jun 9, 2024:

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua merged commit 0abc00d into apache:master on Jun 9, 2024 (46 checks passed).
@yihua changed the title from "[HUDI-6787] Implement the HoodieFileGroupReader API for Hive" to "[HUDI-6787] Integrate the new file group reader with Hive query engine" on Jun 23, 2024.
@yihua mentioned this pull request on Jun 23, 2024.
@yihua (Contributor) commented Jun 23, 2024:

@jonvex could you check in the code for testing the new file group reader on Hive 3 from #11398?

Labels: release-1.0.0-beta2, release-1.0.0, size:XL (PR with lines of changes > 1000)
9 participants