HadoopInputFile to pass down FileStatus when opening file #2955
Conversation
in sync with ongoing hadoop pr, commit 3d7dc340c9a1 Change-Id: I868df6afb373d57179c9cb9d90164e71b0571faf
Change-Id: I7de43d8426b56800c540a520f1fb7fef21ae60ba
* Eases future upgrades of Hadoop dependencies.
* Updated uses of FileSystem.open() where the file read policy is known and/or exists()/getFileStatus() calls are executed immediately before.
* Use it in footer reading as well as file reading.

This looks like coverage of the core production use; ignoring CLI operations.
Change-Id: Id1c35619a04a500c7cccd131358b22eaa1e0f984
Got signature wrong. Change-Id: I2923fa0eb11b4cf779eb7b7fc79dcc7917d14db1
```java
/**
 * Read policy for parquet files: {@value}.
 */
public static final String PARQUET_READ_POLICIES = "parquet, columnar, vector, random";
```
Can you add a comment here on what effect this has? It's not immediately obvious why this would be better than a sequential read.
Will do. Key thing: it tells all prefetching/caching/range-limiting logic what you will have to do, so it avoids inefficiencies such as aborted reads against S3, wasted prefetch on ABFS, etc.
The parquet option is to say "do what you need in terms of footer prefetch/cache"; the Google GCS connector does this, but not so explicitly.
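For context, the openFile() builder this thread discusses looks roughly like the following. This is a hedged sketch, not code from the PR: it assumes Hadoop 3.3.0+ on the classpath, and everything except the Hadoop API calls and option key (the class name, `openParquet`) is illustrative.

```java
// Sketch only: requires Hadoop 3.3.0+ at compile and run time.
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileSketch {
  public static FSDataInputStream openParquet(FileSystem fs, Path path, FileStatus status)
      throws Exception {
    return fs.openFile(path)
        // declare the expected read pattern so stores can tune prefetch,
        // caching and range handling; unrecognised policies in the list
        // are skipped, so older releases fall back to the ones they know
        .opt("fs.option.openfile.read.policy", "parquet, columnar, vector, random")
        // passing the already-known FileStatus lets object stores skip a
        // redundant HEAD/getFileStatus call when opening the file
        .withFileStatus(status)
        .build()
        .get(); // build() returns a CompletableFuture<FSDataInputStream>
  }
}
```

The `.opt()` call is a hint rather than a contract, which is why a comma-separated fallback list is safe to pass everywhere.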
```
@@ -160,7 +160,11 @@
      <artifactId>zstd-jni</artifactId>
      <version>${zstd-jni.version}</version>
    </dependency>
    <dependency>
```
Need to revisit why this got in.
```java
try {
  commonFileStatus = fileSystem.getFileStatus(filePath);
} catch (FileNotFoundException e) {
  // file does not exist
```
Maybe add a DEBUG log here if we are not throwing the exception.
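A runnable, plain-JDK sketch of the suggested pattern (using `java.nio.file` and `java.util.logging` rather than the Hadoop types, and names of my own choosing): swallow the not-found case, but leave a debug-level trace instead of failing silently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ProbeStatus {
  private static final Logger LOG = Logger.getLogger(ProbeStatus.class.getName());

  /** Return file attributes, or null when the file is absent. */
  public static BasicFileAttributes statusOrNull(Path path) throws IOException {
    try {
      return Files.readAttributes(path, BasicFileAttributes.class);
    } catch (NoSuchFileException e) {
      // swallowing the exception silently hides the miss from anyone
      // debugging; a debug-level line keeps production logs quiet but
      // still leaves a trail
      LOG.log(Level.FINE, "File not found, continuing without status: " + path, e);
      return null;
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(statusOrNull(Path.of("/definitely/not/here")) == null);
  }
}
```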
```java
 *
 * @throws IOException failure to open the file.
 */
public static FSDataInputStream openFile(final FileSystem fileSystem, final Path path, final boolean randomIO)
```
The randomIO parameter could be confusing. It might be clearer if renamed to useRandomIO or isRandomIO.
Or we could take a full string list of read policies, which is what is really happening underneath. Hadoop 3.4.1 explicitly adds "parquet" as an input format to tell the FS to optimise for that (footer caching, assume random IO everywhere else...).
```java
 * false if the method is not loaded or the path lacks the capability.
 * @throws IllegalArgumentException invalid arguments
 */
public boolean pathCapabilities_hasPathCapability(Object fs, Path path, String capability) {
```
Method name can be shortened.
If you look at the design, we have pulled in a lot of interfaces and their methods, so the naming is designed to isolate them from each other. It's a bit clunky, but it is intended to show where operations come from.
```java
    Class<?> source, Class<? extends T> returnType, String name, Class<?>... parameterTypes) {
  final DynMethods.UnboundMethod method = loadInvocation(source, returnType, name, parameterTypes);
  checkState(method.isStatic(), "Method is not static %s", method);
```
Better to add the class name as well in the message?
Good point.
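A runnable, stdlib-only sketch of the suggested check, with the declaring class included in the failure message. The class and method names here (`StaticLookup`, `loadStatic`) are illustrative, not the PR's actual helpers:

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class StaticLookup {
  /** Load a method and verify it is static, naming class and method on failure. */
  public static Method loadStatic(Class<?> source, String name, Class<?>... params) {
    try {
      Method m = source.getMethod(name, params);
      if (!Modifier.isStatic(m.getModifiers())) {
        // include the declaring class, not just the method name, so the
        // failure is unambiguous when several classes are being probed
        throw new IllegalStateException(
            "Method is not static: " + source.getName() + "#" + name);
      }
      return m;
    } catch (NoSuchMethodException e) {
      throw new IllegalStateException(
          "No such method: " + source.getName() + "#" + name, e);
    }
  }

  public static void main(String[] args) {
    // Integer.parseInt(String) is static, so this succeeds
    System.out.println(loadStatic(Integer.class, "parseInt", String.class).getName());
  }
}
```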
```java
 */
public List<Map.Entry<Path, String>> bulkDelete_delete(FileSystem fs, Path base, Collection<Path> paths) {
  checkAvailable(bulkDeleteDeleteMethod);
  return bulkDeleteDeleteMethod.invoke(null, fs, base, paths);
```
Can this be renamed to bulkDeleteMethod?
Yes. In the actual new API we've got an interface_method split, but here it looks clunky at the actual invocation.
```java
 * This class is derived from {@code org.apache.hadoop.io.wrappedio.impl.DynamicWrappedIO}.
 * If a bug is found here, check to see if it has been fixed in hadoop trunk branch.
 * If not: please provide a patch for that project alongside one here.
 */
```
Generic nit: the variable and method names run long throughout the file.
```java
 * Note: that is the default behaviour of {@code FSDataInputStream#readFully(long, ByteBuffer)}.
 */
public void byteBufferPositionedReadable_readFully(InputStream in, long position, ByteBuffer buf) {
  checkAvailable(byteBufferPositionedReadableReadFullyMethod);
```
Can be combined into a single statement like the previous method.
```java
      long.class,
      ByteBuffer.class);
}
```
Also a suggestion: we could cache the Method objects after they are first loaded. This avoids repeated lookups via reflection.
nice idea!
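A runnable sketch of that caching idea, assuming a memoizing map keyed on (class, name, parameter types); the names (`MethodCache`, `lookup`) are illustrative, not from the PR:

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class MethodCache {
  // cache key: declaring class + method name + parameter types
  private static final ConcurrentHashMap<List<Object>, Method> CACHE =
      new ConcurrentHashMap<>();

  /** Resolve a Method once via reflection, then serve repeats from the cache. */
  public static Method lookup(Class<?> source, String name, Class<?>... params) {
    List<Object> key = List.of(source, name, Arrays.asList(params));
    return CACHE.computeIfAbsent(key, k -> {
      try {
        return source.getMethod(name, params);
      } catch (NoSuchMethodException e) {
        throw new IllegalStateException(
            "No such method: " + source.getName() + "#" + name, e);
      }
    });
  }

  public static void main(String[] args) {
    Method first = lookup(Math.class, "max", int.class, int.class);
    Method second = lookup(Math.class, "max", int.class, int.class);
    // the second call never touches reflection; same cached instance
    System.out.println(first == second);
  }
}
```

`computeIfAbsent` on a ConcurrentHashMap gives thread-safe, at-most-once resolution without explicit locking.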
Superseded by #3079 now that reflection is not needed.
Rationale for this change
What changes are included in this PR?
1. Uses reflection to load reflection-friendly bindings to the enhanced openFile() method of apache/hadoop#6686. Although openFile() has been present since Hadoop 3.3.0, reflection is required because parquet still builds against Hadoop 2.x.
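The core of that reflection strategy can be sketched with the JDK alone: probe for a method that may not exist on the classpath, and return null so callers can fall back to the legacy call. The names here (`OptionalMethod`, `probe`) are illustrative, and the absent-method name in `main` is purely hypothetical:

```java
import java.lang.reflect.Method;

public class OptionalMethod {
  /** Return the named method when present, or null so callers can fall back. */
  public static Method probe(Class<?> source, String name, Class<?>... params) {
    try {
      return source.getMethod(name, params);
    } catch (NoSuchMethodException e) {
      // older dependency on the classpath: caller uses the legacy code path
      return null;
    }
  }

  public static void main(String[] args) {
    // present since Java 11, so this resolves
    System.out.println(probe(String.class, "isBlank") != null);
    // absent: a made-up method name, used purely for illustration
    System.out.println(probe(String.class, "openFileLikeHadoop") == null);
  }
}
```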
Are these changes tested?
Existing tests have been modified.
apache/hadoop#6686
Are there any user-facing changes?
no
Closes #2915