HadoopInputFile to pass down FileStatus when opening file #2955
Conversation
in sync with ongoing hadoop pr, commit 3d7dc340c9a1 Change-Id: I868df6afb373d57179c9cb9d90164e71b0571faf
Change-Id: I7de43d8426b56800c540a520f1fb7fef21ae60ba
* Eases future upgrades of Hadoop dependencies.
* Updated uses of FileSystem.open() where the file read policy is known and/or exists()/getFileStatus() calls are executed immediately before.
* Use it in footer reading as well as file reading.

This looks like coverage of the core production use; ignoring CLI operations.
Change-Id: Id1c35619a04a500c7cccd131358b22eaa1e0f984
Got signature wrong. Change-Id: I2923fa0eb11b4cf779eb7b7fc79dcc7917d14db1
```java
/**
 * Read policy for parquet files: {@value}.
 */
public static final String PARQUET_READ_POLICIES = "parquet, columnar, vector, random";
```
Can you add a comment here on what effect this has? It's not immediately obvious why this would be better than a sequential read.
Will do. Key thing: it tells all prefetching/caching/range-limiting logic what you will have to do, so it avoids inefficiencies such as aborted reads against S3, wasted prefetch on ABFS, etc.
The parquet option is to say "do what you need in terms of footer prefetch/cache"; the Google GCS connector does this, but not so explicitly.
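For context, the openFile() builder this thread discusses looks roughly like the following. This is a hedged sketch, not code from the PR: it assumes Hadoop 3.3.0+ on the classpath, and everything except the Hadoop API calls and option key (the class name, `openParquet`) is illustrative.

```java
// Sketch only: requires Hadoop 3.3.0+ at compile and run time.
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileSketch {
  public static FSDataInputStream openParquet(FileSystem fs, Path path, FileStatus status)
      throws Exception {
    return fs.openFile(path)
        // declare the expected read pattern so stores can tune prefetch,
        // caching and range handling; unrecognised policies in the list
        // are skipped, so older releases fall back to the ones they know
        .opt("fs.option.openfile.read.policy", "parquet, columnar, vector, random")
        // passing the already-known FileStatus lets object stores skip a
        // redundant HEAD/getFileStatus call when opening the file
        .withFileStatus(status)
        .build()
        .get(); // build() returns a CompletableFuture<FSDataInputStream>
  }
}
```

The `.opt()` call is a hint rather than a contract, which is why a comma-separated fallback list is safe to pass everywhere.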
```
@@ -160,7 +160,11 @@
      <artifactId>zstd-jni</artifactId>
      <version>${zstd-jni.version}</version>
    </dependency>
    <dependency>
```
Need to revisit why this got in.
```java
try {
  commonFileStatus = fileSystem.getFileStatus(filePath);
} catch (FileNotFoundException e) {
  // file does not exist
```
Maybe add a DEBUG log here if we are not throwing the exception.
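A runnable, plain-JDK sketch of the suggested pattern (using `java.nio.file` and `java.util.logging` rather than the Hadoop types, and names of my own choosing): swallow the not-found case, but leave a debug-level trace instead of failing silently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ProbeStatus {
  private static final Logger LOG = Logger.getLogger(ProbeStatus.class.getName());

  /** Return file attributes, or null when the file is absent. */
  public static BasicFileAttributes statusOrNull(Path path) throws IOException {
    try {
      return Files.readAttributes(path, BasicFileAttributes.class);
    } catch (NoSuchFileException e) {
      // swallowing the exception silently hides the miss from anyone
      // debugging; a debug-level line keeps production logs quiet but
      // still leaves a trail
      LOG.log(Level.FINE, "File not found, continuing without status: " + path, e);
      return null;
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(statusOrNull(Path.of("/definitely/not/here")) == null);
  }
}
```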
```java
 *
 * @throws IOException failure to open the file.
 */
public static FSDataInputStream openFile(final FileSystem fileSystem, final Path path, final boolean randomIO)
```
The randomIO parameter could be confusing. It might be clearer if renamed to useRandomIO or isRandomIO.
Or we could take a full string list of read policies, which is what is really happening underneath. Hadoop 3.4.1 explicitly adds "parquet" as an input format to tell the FS to optimise for that (footer caching, assume random IO everywhere else...).
```java
 * false if the method is not loaded or the path lacks the capability.
 * @throws IllegalArgumentException invalid arguments
 */
public boolean pathCapabilities_hasPathCapability(Object fs, Path path, String capability) {
```
Method name can be shortened.
If you look at the design, we have pulled in a lot of interfaces and their methods, so the naming is designed to isolate them from each other. It's a bit clunky, but it is intended to show where operations come from.
```java
    Class<?> source, Class<? extends T> returnType, String name, Class<?>... parameterTypes) {
  final DynMethods.UnboundMethod method = loadInvocation(source, returnType, name, parameterTypes);
  checkState(method.isStatic(), "Method is not static %s", method);
```
Better to add the class name as well in the message?
Good point.
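A runnable, stdlib-only sketch of the suggested check, with the declaring class included in the failure message. The class and method names here (`StaticLookup`, `loadStatic`) are illustrative, not the PR's actual helpers:

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class StaticLookup {
  /** Load a method and verify it is static, naming class and method on failure. */
  public static Method loadStatic(Class<?> source, String name, Class<?>... params) {
    try {
      Method m = source.getMethod(name, params);
      if (!Modifier.isStatic(m.getModifiers())) {
        // include the declaring class, not just the method name, so the
        // failure is unambiguous when several classes are being probed
        throw new IllegalStateException(
            "Method is not static: " + source.getName() + "#" + name);
      }
      return m;
    } catch (NoSuchMethodException e) {
      throw new IllegalStateException(
          "No such method: " + source.getName() + "#" + name, e);
    }
  }

  public static void main(String[] args) {
    // Integer.parseInt(String) is static, so this succeeds
    System.out.println(loadStatic(Integer.class, "parseInt", String.class).getName());
  }
}
```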
```java
 */
public List<Map.Entry<Path, String>> bulkDelete_delete(FileSystem fs, Path base, Collection<Path> paths) {
  checkAvailable(bulkDeleteDeleteMethod);
  return bulkDeleteDeleteMethod.invoke(null, fs, base, paths);
```
Can this be renamed to bulkDeleteMethod?
Yes. In the actual new API we've got an interface_method split, but here it looks clunky at the actual invocation.
```java
 * This class is derived from {@code org.apache.hadoop.io.wrappedio.impl.DynamicWrappedIO}.
 * If a bug is found here, check to see if it has been fixed in hadoop trunk branch.
 * If not: please provide a patch for that project alongside one here.
 */
```
Generic nit: the variable and method names run long throughout the file.
```java
 * Note: that is the default behaviour of {@code FSDataInputStream#readFully(long, ByteBuffer)}.
 */
public void byteBufferPositionedReadable_readFully(InputStream in, long position, ByteBuffer buf) {
  checkAvailable(byteBufferPositionedReadableReadFullyMethod);
```
Can be combined into a single statement like the previous method.
```java
      long.class,
      ByteBuffer.class);
}
```
Also a suggestion: we could cache the Method objects after they are first loaded. This avoids repeated lookups via reflection.
nice idea!
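A runnable sketch of that caching idea, assuming a memoizing map keyed on (class, name, parameter types); the names (`MethodCache`, `lookup`) are illustrative, not from the PR:

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class MethodCache {
  // cache key: declaring class + method name + parameter types
  private static final ConcurrentHashMap<List<Object>, Method> CACHE =
      new ConcurrentHashMap<>();

  /** Resolve a Method once via reflection, then serve repeats from the cache. */
  public static Method lookup(Class<?> source, String name, Class<?>... params) {
    List<Object> key = List.of(source, name, Arrays.asList(params));
    return CACHE.computeIfAbsent(key, k -> {
      try {
        return source.getMethod(name, params);
      } catch (NoSuchMethodException e) {
        throw new IllegalStateException(
            "No such method: " + source.getName() + "#" + name, e);
      }
    });
  }

  public static void main(String[] args) {
    Method first = lookup(Math.class, "max", int.class, int.class);
    Method second = lookup(Math.class, "max", int.class, int.class);
    // the second call never touches reflection; same cached instance
    System.out.println(first == second);
  }
}
```

`computeIfAbsent` on a ConcurrentHashMap gives thread-safe, at-most-once resolution without explicit locking.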
Superseded by #3079 now that reflection is not needed.
Rationale for this change
What changes are included in this PR?
1. Uses reflection to load reflection-friendly bindings to the enhanced openFile() method of apache/hadoop#6686. Although openFile() has been present since Hadoop 3.3.0, reflection is required because parquet still builds against Hadoop 2.x.
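The core of that reflection strategy can be sketched with the JDK alone: probe for a method that may not exist on the classpath, and return null so callers can fall back to the legacy call. The names here (`OptionalMethod`, `probe`) are illustrative, and the absent-method name in `main` is purely hypothetical:

```java
import java.lang.reflect.Method;

public class OptionalMethod {
  /** Return the named method when present, or null so callers can fall back. */
  public static Method probe(Class<?> source, String name, Class<?>... params) {
    try {
      return source.getMethod(name, params);
    } catch (NoSuchMethodException e) {
      // older dependency on the classpath: caller uses the legacy code path
      return null;
    }
  }

  public static void main(String[] args) {
    // present since Java 11, so this resolves
    System.out.println(probe(String.class, "isBlank") != null);
    // absent: a made-up method name, used purely for illustration
    System.out.println(probe(String.class, "openFileLikeHadoop") == null);
  }
}
```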
Are these changes tested?
Existing tests have been modified.
apache/hadoop#6686
Are there any user-facing changes?
no
Closes #2915