
HPCC-30091 Add Support for the Parquet File Format #17675

Merged
merged 1 commit into candidate-9.4.x from parquet-dev
Sep 28, 2023

Conversation

jackdelv
Contributor

@jackdelv jackdelv commented Aug 14, 2023

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@jackdelv jackdelv changed the base branch from candidate-9.2.x to master August 14, 2023 16:16
@jackdelv jackdelv force-pushed the parquet-dev branch 2 times, most recently from bb93c53 to 13b524a August 14, 2023 18:43
@ghalliday ghalliday self-requested a review August 23, 2023 15:24
Member

@ghalliday ghalliday left a comment

@jackdelv Broadly looks very good. A few comments on style, many of which can be ignored.
One question I haven't really covered (except when commenting on the critical section) is whether there are any multi-threading issues.
I haven't looked at the details of the arrow calls, mainly at potential leaks, and reading the code as is.

CMakeLists.txt Outdated
@@ -206,7 +207,7 @@ if(APPLE)
HPCC_ADD_SUBDIRECTORY(package)
endif(APPLE)

if (NOT CLIENTTOOLS_ONLY)
if (NOT (PLUGIN OR CLIENTTOOLS_ONLY))
Member

question: why is this change here?

Contributor Author

That change was made to try and solve the issue with installing plugins on top of the cluster. This change doesn't seem to solve that issue, so I have reverted it.

: nodeName(_nodeName), nodeType(_nodeType), structPtr(_struct), childCount(0), childrenProcessed(0) {}

// Copy constructor
PathTracker(const PathTracker &other)
Member

If this is identical to the default copy constructor it is better not to define it, and leave it to the compiler.

Contributor Author

Fixed.

unsigned int childrenProcessed;

PathTracker(const char *_nodeName, const arrow::Array *_struct, PathNodeType _nodeType)
: nodeName(_nodeName), nodeType(_nodeType), structPtr(_struct), childCount(0), childrenProcessed(0) {}
Member

style: It is slightly safer/clearer to use initializers on the members instead of in the constructor. This is especially true when you have multiple constructors.

    unsigned int childCount = 0;
    unsigned int childrenProcessed = 0;

Contributor Author

Added initializers on the members rather than the constructor.
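
For reference, this is roughly what the adjusted struct looks like with in-class initializers (a sketch using the member names from the diff above, not necessarily the exact plugin code). With the counters defaulted here, every constructor starts them at zero and the compiler-generated copy constructor is sufficient:

    struct PathTracker
    {
        const char *nodeName = nullptr;
        PathNodeType nodeType;
        const arrow::Array *structPtr = nullptr;
        unsigned int childCount = 0;
        unsigned int childrenProcessed = 0;

        PathTracker(const char *_nodeName, const arrow::Array *_struct, PathNodeType _nodeType)
            : nodeName(_nodeName), nodeType(_nodeType), structPtr(_struct) {}
    };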

return arrow::Status::OK();
}

int type = 0; // 0 = null, 1 = bool, 2 = Int, 3 = UInt, 4 = Float
Member

style: Clearer to use an enumeration

Contributor Author

Changed to enumeration.
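
The later commit "Create ParquetArrayType enum." suggests the shape of the fix; a hedged sketch (the actual enumerator names in the plugin may differ) is:

    // Replaces the bare int plus comment; comparisons such as
    // (*array_visitor)->type == 6 become self-describing.
    enum ParquetArrayType
    {
        NullType,   // was 0
        BoolType,   // was 1
        IntType,    // was 2
        UIntType,   // was 3
        RealType    // was 4, and so on for the remaining codes
    };

    ParquetArrayType type = NullType;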

{
const char *nodeName;
PathNodeType nodeType;
const arrow::Array *structPtr;
Member

note to self. Check lifetime of nodeName and structPtr, especially with the copy constructor.

rtlUtf8ToDataX(len, result, p.resultChars, p.stringResult);
return;
}
if ((*array_visitor)->type == 6)
Member

Example where using an enumeration would make it much easier to understand/check

Contributor Author

Fixed.

{
auto i = !m_pathStack.empty() && m_pathStack.back().nodeType == CPNTSet ? m_pathStack.back().childrenProcessed++ : currentRow;
auto view = (*array_visitor)->large_bin_arr->GetView(i);
rtlUtf8ToDataX(len, result, rtlUtf8Length(view.size(), view.data()), view.data());
Member

Not sure if you want this or
rtlDataToDataX(len, result, view.size(), view.data())

Contributor Author

That would probably work, but I wasn't able to find the function rtlDataToDataX in the platform code. I tried with rtlDataToData and it couldn't copy to the target because it is a null pointer, which I assume is the difference between non-X copy methods and X copy methods.

Member

You can use
rtlStrToDataX(len, result, view.size(), view.data())

  • there should be an rtlDataToDataX for completeness, but it would have an identical implementation
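
Putting the two comments together, the resulting call at this point is a plain byte copy of the binary view (sketch; the surrounding lines are as shown in the diff above):

    auto view = (*array_visitor)->large_bin_arr->GetView(i);
    // rtlUtf8ToDataX interprets the source as UTF-8 and needs a code-point count;
    // for raw binary data a straight byte copy is what is wanted here.
    rtlStrToDataX(len, result, view.size(), view.data());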

{
failx("Incorrect type for field %s.", field->name);
}
auto i = !m_pathStack.empty() && m_pathStack.back().nodeType == CPNTSet ? m_pathStack.back().childrenProcessed++ : currentRow;
Member

This code is repeated in various functions. It would be better as a shared function.

Contributor Author

Created currArrayIndex() to get the current index from either a set or row.
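
A sketch of what the shared helper boils down to (the exact signature and return type in the plugin may differ):

    // Returns the next child index when positioned inside a set,
    // otherwise the index of the current row.
    int64_t ParquetRowBuilder::currArrayIndex()
    {
        return !m_pathStack.empty() && m_pathStack.back().nodeType == CPNTSet
                   ? m_pathStack.back().childrenProcessed++
                   : currentRow;
    }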

*/
bool ParquetRowBuilder::processNextSet(const RtlFieldInfo *field)
{
return m_pathStack.back().childrenProcessed < m_pathStack.back().childCount;
Member

minor: This would be more encapsulated if implemented as a member function of PathTracker::finishedChildren() or similar.

Contributor Author

Created PathTracker::finishedChildren().
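
A sketch of the encapsulated check; only the name finishedChildren comes from the discussion above, and the orientation of the condition is an assumption:

    // Inside PathTracker (members as above):
    bool finishedChildren() const { return childrenProcessed >= childCount; }

    // Caller:
    bool ParquetRowBuilder::processNextSet(const RtlFieldInfo *field)
    {
        return !m_pathStack.back().finishedChildren();
    }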

*/
void ParquetRecordBinder::processUtf8(unsigned chars, const char *value, const RtlFieldInfo *field)
{
bindStringParam(chars, value, field, r_parquet);
Member

This needs to call a bindUtfParameter - which performs no translation

Contributor Author

Created bindUtf8Param().
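
A rough sketch of a pass-through binder, assuming the same rapidjson/addMember pattern used by the other bind helpers shown in this review (the real bindUtf8Param may differ in detail); rtlUtf8Size converts the code-point count into the byte count rapidjson needs:

    static void bindUtf8Param(unsigned chars, const char *value, const RtlFieldInfo *field, std::shared_ptr<ParquetHelper> r_parquet)
    {
        size32_t utf8size = rtlUtf8Size(chars, value);      // bytes, not code points
        rapidjson::Value key(field->name, jsonAlloc);
        rapidjson::Value val(value, utf8size, jsonAlloc);   // no character-set translation
        addMember(r_parquet, key, val);
    }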

@ghalliday
Member

@jackdelv some of the smoke tests are also failing.

@jackdelv
Contributor Author

@ghalliday Ok, back to you Gavin. I made some changes to the way the files are opened by the plugin to ensure that n workers can open m files. I also made various style changes to make function names clearer and more descriptive of their behaviour.

@ghalliday ghalliday self-requested a review September 25, 2023 15:24
@jackdelv jackdelv changed the base branch from master to candidate-9.4.x September 25, 2023 15:28
@ghalliday
Member

@GordonSmith please can you check the vcpkg related changes (cmake files and vcpkg*)

Member

@ghalliday ghalliday left a comment

@jackdelv I think this is very close.
@dcamper I think I would be prepared to merge as-is as long as there are jiras to revisit some of the issues. A couple are significant (e.g. incorrect results), but I think would only happen in unusual situations.

One change should be to rationalise the naming conventions. For instance, use leading lower case when a function is not overriding a virtual someone else has defined, and use a consistent convention for member variables: this code uses abc, abc_, m_abc and r_abc in different places. Similarly, some functions and variables use underscores instead of camel case.

find_package(Arrow CONFIG REQUIRED)
find_package(Parquet CONFIG REQUIRED)
find_package(ArrowDataset CONFIG REQUIRED)
#find_library(ARROWDATASET arrow_dataset ${CMAKE_BINARY_DIR}/vcpkg_installed/x64-linux-dynamic/lib REQUIRED)
Member

Should these be removed now, and the commented line at the end?

Contributor Author

Yes, I have removed them.

/// \param rows vector of rows
/// \param path field names to enter
/// \param array_levels number of arrays to enter
DocValuesIterator(const std::vector<rapidjson::Document> &rows,
Member

path should be a
std::vector<std::string> && path if you are using std::move

Contributor Author

Fixed

/// \param path field names to enter
/// \param array_levels number of arrays to enter
DocValuesIterator(const std::vector<rapidjson::Document> &rows,
std::vector<std::string> path, int64_t array_levels)
Member

Parameters and member variables with the same name. I am surprised that works. Our normal approach is to prefix the parameters with _ e.g., _rows, _path

Contributor Author

Fixed parameter names.
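
Combining this with the earlier comment about path, the constructor ends up roughly as follows (a sketch of the final shape; see also the follow-up below about not using && for rows):

    DocValuesIterator(const std::vector<rapidjson::Document> &_rows,
                      std::vector<std::string> &&_path, int64_t _array_levels)
        : rows(_rows), path(std::move(_path)), array_levels(_array_levels) {}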


~DocValuesIterator() = default;

const rapidjson::Value *NextArrayOrRow(const rapidjson::Value *value, size_t *path_i,
Member

style: In general we start function names with lower case letters. Since this is fairly independent code I am fine with it staying as it is (and less disruptive), but any code more tightly integrated into the platform should conform.

Contributor Author

Opened JIRA for addressing style issues: HPCC-30381

// Can either start at document or at last array level
if (array_stack.size() > 0)
{
auto pos = array_stack.back();
Member

Should this be auto &?

Contributor Author

Fixed
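
For clarity, the difference being pointed out (sketch):

    auto pos = array_stack.back();    // before: copies the element
    auto &pos = array_stack.back();   // after: binds to the element in place, no copy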

return count;
}

void addMember(std::shared_ptr<ParquetHelper> r_parquet, rapidjson::Value &key, rapidjson::Value &value)
Member

These functions are better static - they are called more efficiently and don't get added to the global name space (so the dll/so loads more quickly)

Contributor Author

Changed addMember to static.

size32_t utf8chars;
rtlDataAttr data;
rtlStrToUtf8X(utf8chars, data.refstr(), len, value);
size32_t utf8len = rtlUtf8Size(utf8chars, data.getstr());
Member

Is this correct that it is going via utf8? I thought it mapped to a large binary. I think this will cause data corruption/invalid results.

Contributor Author

Removed UTF-8 conversion.

rtlUnicodeToUtf8X(utf8chars, utf8, chars, value);

rapidjson::Value key = rapidjson::Value(field->name, jsonAlloc);
rapidjson::Value val = rapidjson::Value(utf8, jsonAlloc);
Member

The utf8 string is not zero terminated - so this will not work.
Most efficient would be to use a std::string_view if rapidjson supports it.

Contributor Author

Changed to pass in length to the value constructor.

rtlDataAttr data;
rtlStrToDataX(bytes, data.refdata(), len, value);

bindDataParam(bytes, data.getstr(), field, r_parquet);
Member

Can pass len and value directly - no need for the rtlStrToDataX()

Contributor Author

Fixed

else if (stricmp(optName, "BatchSize") == 0)
batchSize = atoi(val);
else
failx("Unknown option %s", optName.str());
Member

It may be better to warn on an unknown option. Failing catches mistyped options, but it also means if new options are added to ecl code the queries will fail on old builds. Happy to leave as-is though.
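
For illustration, the warn-instead-of-fail variant would look something like this (assuming a platform logging macro such as OWARNLOG is usable in this context; the shipped code keeps failx as above):

    else if (stricmp(optName, "BatchSize") == 0)
        batchSize = atoi(val);
    else
        OWARNLOG("Parquet plugin: ignoring unknown option %s", optName.str());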

@ghalliday
Member

@jackdelv This should be changed from a draft PR (in the future change it when you think it is ready), and the commits will need squashing.

@jackdelv jackdelv marked this pull request as ready for review September 27, 2023 16:34
@jackdelv
Contributor Author

@ghalliday I have addressed the major concerns and opened a JIRA to fix the more cosmetic issues.

@jackdelv jackdelv requested a review from ghalliday September 27, 2023 18:48
Member

@ghalliday ghalliday left a comment

@jackdelv a few fairly small comments.

@@ -1461,7 +1461,7 @@ void bindUnicodeParam(unsigned chars, const UChar *value, const RtlFieldInfo *fi
rtlUnicodeToUtf8X(utf8chars, utf8, chars, value);

rapidjson::Value key = rapidjson::Value(field->name, jsonAlloc);
rapidjson::Value val = rapidjson::Value(utf8, jsonAlloc);
rapidjson::Value val = rapidjson::Value(utf8, utf8chars, jsonAlloc);
Member

Is this the number of code-points or the size? If it is the size, you need
rtlUtf8Size(utf8chars, utf8)

rapidjson::Value key;
key.SetString(field->name, jsonAlloc);
rapidjson::Value val;
val.SetString(data.getstr(), utf8len, jsonAlloc);
val.SetString(value, len, jsonAlloc);
Member

Still need

size32_t utf8size = rtlUtf8Size(len, value);

Better to name it size rather than len to avoid confusion over whether it is a number of codepoints or a size.
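
In other words (a sketch using the names from the diff above), the code-point count has to be converted to a byte count before handing the string to rapidjson:

    size32_t utf8size = rtlUtf8Size(len, value);   // len counts code points
    rapidjson::Value val;
    val.SetString(value, utf8size, jsonAlloc);     // rapidjson expects a byte length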

DocValuesIterator(const std::vector<rapidjson::Document> &&rows,
std::vector<std::string> path, int64_t array_levels)
: rows(rows), path(std::move(path)), array_levels(array_levels) {}
DocValuesIterator(const std::vector<rapidjson::Document> &&_rows,
Member

I'm not sure you want a && for rows (or a call to std::move when passing it)

@jackdelv jackdelv requested a review from ghalliday September 28, 2023 17:35
Add basic EmbedFunctionContext in cpp files.

Add Arrow Includes.

Fix Header conflicts before rebase.

Fix Header conflicts before rebase.

Changes after rebase.

Added example file to show reading from and writing to a parquet file.

Add function calls to ecl file.

Added some basic options for embedding the parquet functions.
I am not sure if these will stay the same.

Added install calls so the shared library gets installed.

Fixed type.

Added function instantiations for header functions.

Added some useful function definitions.

Added definitions for some of the ParquetEmbedFunctionContext and ParquetRecordBinder functions.

Fix typo

Added ParquetHelper class for holding the user inputs and opening the stream reader and writer.

Added dataset to argument of write function to test writing ECL datasets.

Add function to ParquetDatasetBinder to get datatypes from RtlFieldInfo object and build the schema.

Fix error in getFieldTypes method.

Fix addField method. Schema now gets built from the parameter dataset.

Can write to parquet files.

Set up ParquetRowStream to start reading parquet and output it to an ECL record.

Minor changes.

Can read a parquet table from a parquet file.

Add RowBuilder methods for building result rows.

Add RowBatchBuilder for converting parquet tables to rows. There seems to be a namespace conflict with rapidjson::Value.

Add #DEFINE RAPIDJSON_HAS_STDSTRING 1.
This allows for the use of some helpful GenericValue functions.

Implemented Parquet::RowStream::nextRow which gets each result row from the parquet iterator and converts it to json using rapidjson.

Fixed error where reading from the file it reads one extra row and fails.

Remove Credentials.

Fix issue where utf8 encoded parameters were not properly converted to strings.

Fix issues dealing with utf8 strings and reading and writing them.

Add code for getting the schema for nested types.

Add additional feature from apache arrow.

Add code for reading partitioned files.

Comment out Partition code

Edit ECL files

Linker not properly linking dataset shared library.

Manually link Arrow Dataset shared library.

Added part of the code for writing to partitioned files.

Replace StreamWriter API with rapidjson api.
Add classes for converting rapidjson::Documents to arrow::RecordBatches for writing.

Moved allocator for rapidjson documents to top level in the parquetembed namespace.

Minor changes. Schema needs to be correctly passed through.

Fix issue with writing to vector of rapidjson documents.
Add visit method for HalfFloat datatype.

Changed FileWriter creation to write in the parquet format.

Moved FileWriter from ExecuteAll to openWriteFile. This is so it only gets called once.
Changed write() to return the FileWriter instead of the FileOutputStream.
Writing in batches now works! But further testing on edge cases and vector reuse is needed!

Removed old code. Currently in working state though not tested very much.

Read Files in RowGroups one at a time then iterate through each row.

Added file for testing reading and writing large parquet files.

Minor: Changes to simple example.

rapidjson MemoryPoolAllocator is now cleared after writing each RowGroup.

Tidy up a bit...

Created file for testing nested data.

Added support for nested objects.

Refactored code and moved function implementations to cpp file.

Added comment.

Added comment.

Added new comment.

Documentation.

Rebase 8.12.x complete and tested.

Initial commit

Fixed and Tested rebase.

Minor.

Fixup build and install

Signed-off-by: Dan S. Camper <[email protected]>

Change type mapping. Unsigned and Signed are now mapped explicitly to 64 and 32 bit versions of each.

Add different byte sizes for testing type mapping.

Refactor JsonValueConverter.
Add additional methods for UInt64/32 and Int32 for explicit type mapping.

Change interface for calling plugin.

Added information for which thor worker we are running on.
Divides the row groups (poorly) by the number of workers.

Minor: Updated test files.

Change which API is used for opening files.

WIP

Minor changes to opening parquet file

HPCC-28689 Refactor jhtree class structure to support additional node compression methods

Sort out concerns about usage and semantics of isLeaf(). Also
a few trivial changes suggested by earlier code review

Signed-off-by: Richard Chapman <[email protected]>

Changes after rebase.

Not sure what this does.

Update FileReader usage to comply with version 12.0.0 standards.

Add overlay for pulling down forked arrow directory.

Add function to close file that resets unique_ptr when the plugin is done streaming rows to the engine.

Add Arrow to vcpkg.json.in

Add correct hash for current arrow commit.
add PARQUETEMBED to plugins.cmake.

Update arrow fork SHA512.

Remove trailing whitespace

Signed-off-by: Dan S. Camper <[email protected]>

Explicit default destructors; make one ParquetHelper pointer private

Signed-off-by: Dan S. Camper <[email protected]>

ParquetHelper: Use single MemoryPool; explicitly release pool memory on destruction

Signed-off-by: Dan S. Camper <[email protected]>

Add vcpkg overlay for building with Arrow 12.0.0 release version.

Add function for delegating row groups to any number of thor workers.

Fix error in dividing up rows between workers and add some additional comments.

Change writing parquet to use the same arrow::MemoryPool*

Add code for reading from arrow dataset (partitioned files).
Fix error in binding string params to rows when writing to parquet.
Add better way of setting single vs multiple reads/writes.

Implemented logic for scanning chunks of an arrow dataset.

Format and cleanup source code.

Add Unimplemented Tag to Scalar parameters for EmbedFunctionContext Bind functions.

Change check in ParquetHelper::next() to correct value.
Use PARQUET_ASSIGN_OR_THROW whenever an arrow::Result object is returned.

Create example for creating and reading a partitioned dataset.

Writing Datasets using HivePartitioning works.
Currently the partitioning schema is hardcoded to the language field in the github dataset.
Streaming larger than memory datasets to a partitioned dataset is not working.

Write each batch out to a different file to avoid collisions of file names.

Clean up file name

Change partitioning behaviour to delete the directory contents before creating a new partition.

Implement thor functionality for writing parquet files.
Directory gets created and emptied if it already existed on write.
Increased default batch size for reading and writing to 10000.
Added critical section to all uses of scanner file operations.

Add WritePartition Macro call

Make a better example.

Start README for parquet plugin.

Added support for Data and Unicode datatypes when writing to parquet.
removed unnecessary variable.

Change rapidjson allocator to thread_local storage specification to allow for multi threaded writing.

Added arrow conversion methods for arrow binary type.

Changed default RowGroup size to 20000

Added example ecl for reading and writing BLOB data

Be more careful passing around length of the strings.
Add additional conversion to incoming DATA types to UTF8.

Update blob test

Use getProp rather than queryProp to preserve the length of a string result being returned.

Reimplemented ParquetRowBuilder to build rows from rapidjson::Value objects rather than an IPropertyTree.
Fixed Implementation of set datatype.

Update test files.

update create partition example.

Update Example files.

Fix reading rapidjson::Array and add Clear calls to the rapidjson allocator.

Remove calls to GetAllocator and instead pass jsonAlloc in.
Add function for queryingRows from the dataset.

Add function for querying rows from dataset. Gets a RecordBatchReader and a RecordBatchReaderIterator to iterate through the stream of record batches from the dataset.
There is currently a large limitation in arrow where iterators for datasets do not support random reads: they must iterate from the beginning to the starting point, they cannot end early, and on destruction they are iterated to the end of the stream.

Removed rapidjson conversion from RowBuilder implementation.
RowBuilder now builds the fields directly from the arrow::Table that is read from the arrow::FileReader object.
Implemented a ParquetVisitor class for getting the correct datatype from each scalar.
Tested on all test ecl files.

format

Fix issue with Real datatype not being returned properly.

Fixed some performance issues with Arrow FileReader interface.

Cache chunks rather than read them in every time we build a field.

Update test/example files.

Update file reading structure to expect each thor worker to have its own parquet file.

Fix compiler warnings.

Clean up...

Clean up...

Clean up...

Clean up ...

Format source code.

Refactor

Fix Utf8 conversion when returning String Results.

Change open file function call.
update default row and batch sizes.

Bump vcpkg version of Arrow to 13.0.0

Fix dependency installs.

Fix decimal type.

Remove PathTracker copy constructor.

Change initialization of PathTracker members.

Create ParquetArrayType enum.

Move ParquetDatasetBinder methods to cpp file. Static jsonAlloc can now be moved out of the header file.

minor change for clarity

Remove default initializations from constructors.

Fix partition condition for user input.

Fix decimal datatype.
Fix nested structures for all ECL types.

Utf8 type no longer gets translated to string.

Add utf8 characters to example file.

Encapsulate children processing check.

Create function currArrayIndex() from common code across source file.

Change write() to queryWriter() to be more descriptive.
Change return type to pointer to the FileWriter object.

Return references rather than pointer where object cannot be a nullptr.

Use consistent types, especially in comparisons.

Remove countFields because it is a duplicate.

Remove floating point operation from divide_row_groups.

Thor nodes that don't receive any rows will not open a file.

Remove Critical section when writing.
Each node writes to a unique file.

Add override qualifier to virtual functions in ParquetRowBuilder and ParquetRowStream.

Add default initializers and clean up member variables.

Allow openReadFile to open any files matching the filename chosen by the user.
The files will all be opened and the row counts will be recorded. This will allow for even division of work, and will keep order intact.

Style: Make function names clearer and improve clarity.

Revert plugin collision change.

Fix non null terminated unicode parameters.

Fix null characters in string types.

Data datatype no longer gets converted to utf-8.

Remove extra rtlStrToDataX call in processData.

Add static qualifier to addMember function.

Remove commented lines.

Fix references in Next and DocValuesIterator

Fix constructor argument names.

Remove && for rows in DocValuesIterator constructor.

Use UTF-8 size instead of code-points.
@ghalliday ghalliday merged commit a9d408d into hpcc-systems:candidate-9.4.x Sep 28, 2023
25 of 26 checks passed