[HUDI-6787] Integrate the new file group reader with Hive query engine #10422
Conversation
Force-pushed from 99517e2 to 7d6e12d
Force-pushed from a84b7a8 to 82f87fa
Looks very promising.
@@ -116,6 +117,7 @@ public void setUp() {
    hadoopConf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
    baseJobConf = new JobConf(hadoopConf);
    baseJobConf.set(HoodieMemoryConfig.MAX_DFS_STREAM_BUFFER_SIZE.key(), String.valueOf(1024 * 1024));
    baseJobConf.set(HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key(), "false");
Why "false"?
This test directly creates the record reader instead of reading via the file format:
HoodieRealtimeRecordReader recordReader = new HoodieRealtimeRecordReader(split, jobConf, reader);
Looking back at this, I think I might be able to update this test to use the new fg reader.
Tried again and still couldn't get them working.
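For context, a minimal sketch of the two code paths being contrasted in this thread; the `inputFormat` variable and the use of `Reporter.NULL` are illustrative assumptions, not code from this PR:

```java
// The test builds the realtime reader directly:
HoodieRealtimeRecordReader recordReader = new HoodieRealtimeRecordReader(split, jobConf, reader);

// Production queries instead go through the input format, which can pick the new
// file group reader when HoodieReaderConfig.FILE_GROUP_READER_ENABLED is set:
RecordReader<NullWritable, ArrayWritable> rr =
    inputFormat.getRecordReader(split, jobConf, Reporter.NULL);
```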
Option<ClosableIterator<T>> skeletonFileIterator = requiredFields.getLeft().isEmpty() ? Option.empty() :
    Option.of(readerContext.getFileRecordIterator(baseFile.getHadoopPath(), 0, baseFile.getFileLen(),
        createSchemaFromFields(allFields.getLeft()), createSchemaFromFields(requiredFields.getLeft()), hadoopConf));
Option<Pair<ClosableIterator<T>, Schema>> dataFileIterator =
It's cool that we are able to add a new engine without many changes to this class.
if (!(split instanceof FileSplit) || !checkTableIsHudi(split, job)) {
  return super.getRecordReader(split, job, reporter);
}
if (supportAvroRead && HoodieColumnProjectionUtils.supportTimestamp(job)) {
Note to self: dig into these.
}

public static List<String> getPartitionFieldNames(JobConf jobConf) {
  String partitionFields = jobConf.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, "");
@xicm Can you help confirm this part?
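For readers following along, a hedged completion of the quoted fragment; the "/"-separator assumption matches how META_TABLE_PARTITION_COLUMNS is split elsewhere in this thread, and the usual java.util imports are assumed:

```java
public static List<String> getPartitionFieldNames(JobConf jobConf) {
  // Hive stores the partition columns as a "/"-separated list in the JobConf
  String partitionFields = jobConf.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, "");
  return partitionFields.length() > 0
      ? Arrays.stream(partitionFields.split("/")).collect(Collectors.toList())
      : new ArrayList<>();
}
```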
@Override
public String getRecordKey(Schema recordSchema, Option<BaseKeyGenerator> keyGeneratorOpt) {
  throw new UnsupportedOperationException("Not supported for HoodieHiveRecord");
Why is this not supported?
It wasn't needed for reading, so I didn't implement it.
Shall we implement getRecordKey? It might be useful later. I guess we only need to call getValue() for the record key field, don't we?
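A minimal sketch of that suggestion, assuming getValue() is an existing accessor on HoodieHiveRecord (the helper name and the key-field resolution are assumptions, not code from the PR):

```java
@Override
public String getRecordKey(Schema recordSchema, Option<BaseKeyGenerator> keyGeneratorOpt) {
  // With meta columns enabled, the key is stored in "_hoodie_record_key";
  // otherwise it would have to be resolved from the configured key field.
  Object key = getValue(recordSchema, HoodieRecord.RECORD_KEY_METADATA_FIELD);
  return key == null ? null : key.toString();
}
```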
@Override
public HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) {
  throw new UnsupportedOperationException("Not supported for HoodieHiveRecord");
Why is this not supported?
It wasn't needed for reading, so I didn't implement it.
public class HoodieHiveRecordMerger implements HoodieRecordMerger {
  @Override
  public Option<Pair<HoodieRecord, Schema>> merge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, TypedProperties props) throws IOException {
    ValidationUtils.checkArgument(older.getRecordType() == HoodieRecord.HoodieRecordType.HIVE);
I'm not sure why we need to override the merge logic.
I just copied what HoodieSparkRecordMerger does.
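For readers unfamiliar with mergers, a hedged sketch of the general shape of such logic (event-time ordering: keep the record with the greater ordering value); the actual PR code mirrors HoodieSparkRecordMerger and may differ in details:

```java
@Override
public Option<Pair<HoodieRecord, Schema>> merge(HoodieRecord older, Schema oldSchema,
                                                HoodieRecord newer, Schema newSchema,
                                                TypedProperties props) throws IOException {
  ValidationUtils.checkArgument(older.getRecordType() == HoodieRecord.HoodieRecordType.HIVE);
  ValidationUtils.checkArgument(newer.getRecordType() == HoodieRecord.HoodieRecordType.HIVE);
  // Compare ordering values (e.g. the precombine field) and keep the winner.
  Comparable newOrdering = newer.getOrderingValue(newSchema, props);
  Comparable oldOrdering = older.getOrderingValue(oldSchema, props);
  return newOrdering.compareTo(oldOrdering) >= 0
      ? Option.of(Pair.of(newer, newSchema))
      : Option.of(Pair.of(older, oldSchema));
}
```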
    TypeInfoUtils.getTypeInfosFromTypeString(columnTypeList.get(i).getQualifiedName()).get(0)));

StructTypeInfo rowTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(columnNameList, columnTypeList);
ArrayWritableObjectInspector objectInspector = new ArrayWritableObjectInspector(rowTypeInfo);
@xicm @xiarixiaoyao can you help confirm the correctness?
There may be compatibility issues between Hive 2 and Hive 3 for DATE and TIMESTAMP types.
> There may be compatibility issues between Hive 2 and Hive 3 for DATE and TIMESTAMP types.

I think Hive will handle this itself.
FYI, this is pretty much a copy of hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java (line 111 in e9389ff):

private void prepareHiveAvroSerializer() {
I tested with Hive 3; this works well.
@Override
public RecordReader<NullWritable, ArrayWritable> getRecordReader(final InputSplit split, final JobConf job,
                                                                 final Reporter reporter) throws IOException {

  if (HoodieFileGroupReaderRecordReader.useFilegroupReader(job)) {
    try {
We need to confirm that the values of "hive.io.file.readcolumn.names" and "hive.io.file.readcolumn.ids" in the JobConf contain the partition fields; if they don't, Hive 3 partition queries return null. See #7355.
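A sketch of the kind of fix being suggested; the call below mirrors the HoodieParquetInputFormat snippet quoted later in this thread:

```java
// Make sure partition fields end up in Hive's column projection
// ("hive.io.file.readcolumn.names" / "hive.io.file.readcolumn.ids")
// before handing the split to the record reader.
HoodieRealtimeInputFormatUtils.addProjectionField(job,
    job.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, "").split("/"));
```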
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileGroupReaderRecordReader.java (line 249 in 2c38ef7):

private static Schema createRequestedSchema(Schema tableSchema, JobConf jobConf) {
The problem is that Parquet files in Hive don't have partition fields, while Hudi Parquet files do.
    }
  }, split, job, reporter);
} else {
  return new HoodieFileGroupReaderRecordReader(super::getRecordReader, split, job, reporter);
I tested with Hive 3.1.2; partition queries return an empty result with this reader.
We can move hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java (lines 150 to 156 in 2c38ef7):

new SchemaEvolutionContext(split, job).doEvolutionForParquetFormat();
if (LOG.isDebugEnabled()) {
  LOG.debug("EMPLOYING DEFAULT RECORD READER - " + split);
}
HoodieRealtimeInputFormatUtils.addProjectionField(job, job.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, "").split("/"));
I added your suggestion. Could you please let me know if that fixes the issue? Thanks!
Fixed.
  partitionColumns = Arrays.stream(partitionColString.split(",")).collect(Collectors.toSet());
}
// if they are actually written to the file, then it is ok to read them from the file
tableSchema.getFields().forEach(f -> partitionColumns.remove(f.name().toLowerCase(Locale.ROOT)));
I'm confused; will partitionColumns always be empty?
partitionColumns won't be empty if any of the partition columns are not written to the file; tableSchema only has the columns that are written to the file. This is the case in the docker demo (https://hudi.apache.org/docs/docker_demo#step-3-sync-with-hive). If you look at the data in https://github.com/apache/hudi/tree/master/docker/demo/data, there is no field named "dt".
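To make the quoted fragment self-contained, a hedged reconstruction of the surrounding lines (the declarations are assumptions based on this explanation, not the exact PR code):

```java
// Partition columns as reported by Hive for this table.
Set<String> partitionColumns = new HashSet<>();
String partitionColString = jobConf.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, "");
if (!partitionColString.isEmpty()) {
  partitionColumns = Arrays.stream(partitionColString.split(",")).collect(Collectors.toSet());
}
// tableSchema only has columns physically written to the files, so anything
// left in partitionColumns afterwards (e.g. "dt" in the docker demo) must be
// synthesized from the partition path rather than read from the file.
tableSchema.getFields().forEach(f -> partitionColumns.remove(f.name().toLowerCase(Locale.ROOT)));
```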
@bvaradar Can you help review the Hive-related code?

Yes @danny0405. Will review this PR.

Thanks @bvaradar, we all appreciate it!
Overall looks good to me. @jonvex: What Hive versions are we targeting/testing?
}

@Override
public String getRecordKey(ArrayWritable record, Schema schema) {
Isn't this method already defined in the same way in the base class?
In the base class it just uses the meta column, while here we use the actual field if meta columns are disabled. TBH, maybe that should be the case for the base class too?
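A hedged sketch of the distinction being described; getValue and recordKeyField are assumed helper names, not necessarily those in the PR:

```java
@Override
public String getRecordKey(ArrayWritable record, Schema schema) {
  if (schema.getField(HoodieRecord.RECORD_KEY_METADATA_FIELD) != null) {
    // Meta columns enabled: behave like the base class and read "_hoodie_record_key".
    return getValue(record, schema, HoodieRecord.RECORD_KEY_METADATA_FIELD).toString();
  }
  // Meta columns disabled: fall back to the actual record key field.
  return getValue(record, schema, recordKeyField).toString();
}
```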
Force-pushed from d8ed4e3 to 18fbd92
Change Logs
Replace the existing Hive read logic with the file group reader.
HoodieFileGroupReader is the generic file group reader implementation intended to be used by all engines. I created HoodieFileGroupReaderRecordReader, which implements RecordReader. It uses HoodieFileGroupReader with HiveHoodieReaderContext to read file groups (COW, MOR, bootstrap) with the Hive/Hadoop engine.
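A heavily simplified structural sketch of how these pieces fit together; this is not the actual PR code, and the HoodieFileGroupReader package and method names (hasNext/next/close) are assumed from the generic reader API:

```java
import java.io.IOException;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hudi.common.table.read.HoodieFileGroupReader;

public class HoodieFileGroupReaderRecordReader implements RecordReader<NullWritable, ArrayWritable> {
  // Engine-agnostic core, constructed with a HiveHoodieReaderContext so that
  // records materialize as ArrayWritable.
  private final HoodieFileGroupReader<ArrayWritable> fileGroupReader;

  HoodieFileGroupReaderRecordReader(HoodieFileGroupReader<ArrayWritable> fileGroupReader) {
    this.fileGroupReader = fileGroupReader;
  }

  @Override
  public boolean next(NullWritable key, ArrayWritable value) throws IOException {
    if (!fileGroupReader.hasNext()) {
      return false;
    }
    // Copy the merged row (base file + log files) into Hive's reusable writable.
    value.set(fileGroupReader.next().get());
    return true;
  }

  @Override
  public NullWritable createKey() {
    return NullWritable.get();
  }

  @Override
  public ArrayWritable createValue() {
    return new ArrayWritable(Writable.class);
  }

  @Override
  public long getPos() {
    return 0; // position tracking omitted in this sketch
  }

  @Override
  public float getProgress() {
    return 0.0f; // progress reporting omitted in this sketch
  }

  @Override
  public void close() throws IOException {
    fileGroupReader.close();
  }
}
```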
Impact
The Hive integration will be more maintainable.
Risk level (write none, low, medium or high below)
High: need to do lots of testing.
Documentation Update
N/A
Contributor's checklist