Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Parquet] Footer parsing is broken on big-endian machines #44769

Open
QuLogic opened this issue Nov 18, 2024 · 1 comment
Open

[Parquet] Footer parsing is broken on big-endian machines #44769

QuLogic opened this issue Nov 18, 2024 · 1 comment

Comments

@QuLogic
Copy link
Contributor

QuLogic commented Nov 18, 2024

Describe the bug, including details regarding any error messages, version, and platform.

On a big-endian machine, several tests fail like this:

[----------] 2 tests from TestBooleanRLE
[ RUN      ] TestBooleanRLE.TestBooleanScanner
unknown file: Failure
C++ exception with description "Invalid: Parquet file size is 192 bytes, smaller than the size reported by footer's (1862270976bytes)" thrown in SetUp().

[  FAILED  ] TestBooleanRLE.TestBooleanScanner (0 ms)
[ RUN      ] TestBooleanRLE.TestBatchRead
unknown file: Failure
C++ exception with description "Invalid: Parquet file size is 192 bytes, smaller than the size reported by footer's (1862270976bytes)" thrown in SetUp().

[  FAILED  ] TestBooleanRLE.TestBatchRead (0 ms)
[----------] 2 tests from TestBooleanRLE (1 ms total)

This occurs for several tests, which I did not copy out, but they are all similar. This is pretty indicative of incorrect endianness, as byteswapping 1862270976 produces 111, which is smaller than the file size of 192 bytes.

Component(s)

Parquet

@QuLogic
Copy link
Contributor Author

QuLogic commented Nov 18, 2024

From the Parquest file format, it appears the file metadata length should always be little endian. With this patch:

diff --git a/cpp/src/parquet/file_reader.cc b/cpp/src/parquet/file_reader.cc
index 3e9eeea6c..7585afcc0 100644
--- a/cpp/src/parquet/file_reader.cc
+++ b/cpp/src/parquet/file_reader.cc
@@ -497,9 +497,10 @@ class SerializedFile : public ParquetFileReader::Contents {
           "is not a parquet file.");
     }
     // Both encrypted/unencrypted footers have the same footer length check.
-    uint32_t metadata_len = ::arrow::util::SafeLoadAs<uint32_t>(
-        reinterpret_cast<const uint8_t*>(footer_buffer->data()) + footer_read_size -
-        kFooterSize);
+    uint32_t metadata_len = ::arrow::bit_util::FromLittleEndian(
+        ::arrow::util::SafeLoadAs<uint32_t>(
+            reinterpret_cast<const uint8_t*>(footer_buffer->data()) + footer_read_size -
+            kFooterSize));
     if (metadata_len > source_size_ - kFooterSize) {
       throw ParquetInvalidOrCorruptedFileException(
           "Parquet file size is ", source_size_,
diff --git a/cpp/src/parquet/file_writer.cc b/cpp/src/parquet/file_writer.cc
index baa9e00da..695347d8c 100644
--- a/cpp/src/parquet/file_writer.cc
+++ b/cpp/src/parquet/file_writer.cc
@@ -539,6 +539,7 @@ void WriteFileMetaData(const FileMetaData& file_metadata, ArrowOutputStream* sin
   metadata_len = static_cast<uint32_t>(position) - metadata_len;
 
   // Write Footer
+  metadata_len = ::arrow::bit_util::ToLittleEndian(metadata_len);
   PARQUET_THROW_NOT_OK(sink->Write(reinterpret_cast<uint8_t*>(&metadata_len), 4));
   PARQUET_THROW_NOT_OK(sink->Write(kParquetMagic, 4));
 }
@@ -562,6 +563,7 @@ void WriteEncryptedFileMetadata(const FileMetaData& file_metadata,
     PARQUET_ASSIGN_OR_THROW(position, sink->Tell());
     metadata_len = static_cast<uint32_t>(position) - metadata_len;
 
+    metadata_len = ::arrow::bit_util::ToLittleEndian(metadata_len);
     PARQUET_THROW_NOT_OK(sink->Write(reinterpret_cast<uint8_t*>(&metadata_len), 4));
     PARQUET_THROW_NOT_OK(sink->Write(kParquetMagic, 4));
   }

I can fix most of these occurrences, except for tests for encryption, and I don't know where I've missed for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant