HPCC-30091 Add Support for the Parquet File Format #17675
Conversation
https://track.hpccsystems.com/browse/HPCC-30091
Force-pushed from bb93c53 to 13b524a
@jackdelv Broadly looks very good. A few comments on style, many of which can be ignored.
One question I haven't really covered (except when commenting on the critical section) is whether there are any multi-threading issues.
I haven't looked at the details of the Arrow calls; I have mainly looked for potential leaks and read the code as it stands.
CMakeLists.txt
Outdated
@@ -206,7 +207,7 @@ if(APPLE)
 HPCC_ADD_SUBDIRECTORY(package)
 endif(APPLE)

-if (NOT CLIENTTOOLS_ONLY)
+if (NOT (PLUGIN OR CLIENTTOOLS_ONLY))
question: why is this change here?
That change was made to try to solve the issue with installing plugins on top of the cluster. Since it doesn't actually solve that issue, I have reverted it.
plugins/parquet/parquetembed.hpp
Outdated
: nodeName(_nodeName), nodeType(_nodeType), structPtr(_struct), childCount(0), childrenProcessed(0) {}

// Copy constructor
PathTracker(const PathTracker &other)
If this is identical to the default copy constructor it is better not to define it, and leave it to the compiler.
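As a sketch of the point (members abridged from the surrounding diff), omitting the user-defined copy constructor leaves the compiler-generated member-wise copy, which stays correct as members are added:

struct PathTracker
{
    const char *nodeName = nullptr;
    unsigned int childCount = 0;
    unsigned int childrenProcessed = 0;
    // No user-declared copy constructor: the compiler's member-wise copy suffices.
    // Equivalent explicit form, if documentation is wanted:
    // PathTracker(const PathTracker &) = default;
};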
Fixed.
plugins/parquet/parquetembed.hpp
Outdated
unsigned int childrenProcessed;

PathTracker(const char *_nodeName, const arrow::Array *_struct, PathNodeType _nodeType)
    : nodeName(_nodeName), nodeType(_nodeType), structPtr(_struct), childCount(0), childrenProcessed(0) {}
style: It is slightly safer/clearer to use initializers on the members instead of in the constructor. Especially true when you have multiple constructors.
unsigned int childCount = 0;
unsigned int childrenProcessed = 0;
Added initializers on the members rather than in the constructor.
plugins/parquet/parquetembed.hpp
Outdated
return arrow::Status::OK();
}

int type = 0; // 0 = null, 1 = bool, 2 = Int, 3 = UInt, 4 = Float
style: Clearer to use an enumeration
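For illustration, something along these lines (the PR later adds a ParquetArrayType enum for this; the enumerator names here are assumptions):

enum ParquetArrayType
{
    NullType,   // was 0
    BoolType,   // was 1
    IntType,    // was 2
    UIntType,   // was 3
    FloatType   // was 4
};

ParquetArrayType type = NullType; // replaces: int type = 0;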
Changed to enumeration.
{
    const char *nodeName;
    PathNodeType nodeType;
    const arrow::Array *structPtr;
note to self. Check lifetime of nodeName and structPtr, especially with the copy constructor.
plugins/parquet/parquetembed.cpp
Outdated
rtlUtf8ToDataX(len, result, p.resultChars, p.stringResult);
return;
}
if ((*array_visitor)->type == 6)
Example where using an enumeration would make it much easier to understand/check.
Fixed.
plugins/parquet/parquetembed.cpp
Outdated
{
    auto i = !m_pathStack.empty() && m_pathStack.back().nodeType == CPNTSet ? m_pathStack.back().childrenProcessed++ : currentRow;
    auto view = (*array_visitor)->large_bin_arr->GetView(i);
    rtlUtf8ToDataX(len, result, rtlUtf8Length(view.size(), view.data()), view.data());
Not sure if you want this or
rtlDataToDataX(len, result, view.size(), view.data())
That would probably work, but I wasn't able to find rtlDataToDataX in the platform code. I tried rtlDataToData, but it couldn't copy to the target because the target is a null pointer, which I assume is the difference between the non-X and X copy methods.
You can use
rtlStrToDataX(len, result, view.size(), view.data())
There should be an rtlDataToDataX for completeness, but it would have an identical implementation.
plugins/parquet/parquetembed.cpp
Outdated
{
    failx("Incorrect type for field %s.", field->name);
}
auto i = !m_pathStack.empty() && m_pathStack.back().nodeType == CPNTSet ? m_pathStack.back().childrenProcessed++ : currentRow;
This code is repeated in various functions. It would be better as a shared function.
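A sketch of the shared helper, lifted from the repeated expression (the reply below names it currArrayIndex(); the exact signature is an assumption):

__int64 ParquetRowBuilder::currArrayIndex()
{
    // Inside a set, consume the next child index; otherwise use the current row.
    return !m_pathStack.empty() && m_pathStack.back().nodeType == CPNTSet
               ? m_pathStack.back().childrenProcessed++
               : currentRow;
}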
Created currArrayIndex() to get the current index from either a set or row.
plugins/parquet/parquetembed.cpp
Outdated
*/
bool ParquetRowBuilder::processNextSet(const RtlFieldInfo *field)
{
    return m_pathStack.back().childrenProcessed < m_pathStack.back().childCount;
minor: This would be more encapsulated if implemented as a member function, e.g. PathTracker::finishedChildren() or similar.
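One plausible shape for that member function (whether the predicate is phrased as finished or as remaining is an assumption):

bool PathTracker::finishedChildren() const
{
    return childrenProcessed >= childCount;
}

// processNextSet would then become:
// return !m_pathStack.back().finishedChildren();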
Created PathTracker::finishedChildren().
plugins/parquet/parquetembed.cpp
Outdated
*/
void ParquetRecordBinder::processUtf8(unsigned chars, const char *value, const RtlFieldInfo *field)
{
    bindStringParam(chars, value, field, r_parquet);
This needs to call a bindUtfParameter, which performs no translation.
Created bindUtf8Param().
@jackdelv some of the smoke tests are also failing.
@ghalliday Ok, back to you Gavin. I made some changes to the way files are opened by the plugin to ensure that n workers can open m files. I also made various style changes to make function names clearer and more descriptive of their behaviour.
@GordonSmith please can you check the vcpkg-related changes (cmake files and vcpkg*)
@jackdelv I think this is very close.
@dcamper I think I would be prepared to merge as-is as long as there are JIRAs to revisit some of the issues. A couple are significant (e.g. incorrect results), but I think they would only happen in unusual situations.
One change should be to rationalise the naming conventions: use leading lower case when not overriding a virtual someone else has defined; use a consistent convention for member variables (this code uses abc, abc_, m_abc and r_abc in different places); and be consistent about camel case versus underscores in function and variable names.
plugins/parquet/CMakeLists.txt
Outdated
find_package(Arrow CONFIG REQUIRED)
find_package(Parquet CONFIG REQUIRED)
find_package(ArrowDataset CONFIG REQUIRED)
#find_library(ARROWDATASET arrow_dataset ${CMAKE_BINARY_DIR}/vcpkg_installed/x64-linux-dynamic/lib REQUIRED)
Should these be removed now, and the commented line at the end?
Yes, I have removed them.
plugins/parquet/parquetembed.hpp
Outdated
/// \param rows vector of rows
/// \param path field names to enter
/// \param array_levels number of arrays to enter
DocValuesIterator(const std::vector<rapidjson::Document> &rows,
path should be a
std::vector<std::string> && path
if you are using std::move
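A minimal illustration of the point, with hypothetical names: std::move in the initializer only avoids a copy when the parameter itself can be moved from, i.e. it is taken by value or by rvalue reference.

struct Example
{
    std::vector<std::string> path;
    // A by-value parameter lets callers copy or move in; std::move then
    // transfers the buffer into the member without a second copy.
    explicit Example(std::vector<std::string> _path) : path(std::move(_path)) {}
};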
Fixed
plugins/parquet/parquetembed.hpp
Outdated
/// \param path field names to enter
/// \param array_levels number of arrays to enter
DocValuesIterator(const std::vector<rapidjson::Document> &rows,
    std::vector<std::string> path, int64_t array_levels)
Parameters and member variables have the same name; I am surprised that works. Our normal approach is to prefix the parameters with an underscore, e.g. _rows, _path.
Fixed parameter names.
~DocValuesIterator() = default;

const rapidjson::Value *NextArrayOrRow(const rapidjson::Value *value, size_t *path_i,
style: In general we start function names with lower case letters. Since this is fairly independent code I am fine with it staying as it is (and less disruptive), but any code more tightly integrated into the platform should conform.
Opened JIRA for addressing style issues: HPCC-30381
plugins/parquet/parquetembed.hpp
Outdated
// Can either start at document or at last array level
if (array_stack.size() > 0)
{
    auto pos = array_stack.back();
Should this be auto &?
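The difference in brief:

auto copy = array_stack.back();  // copies the top element
auto &ref = array_stack.back();  // refers to the element in place; no copy, and updates are seen by the stack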
Fixed
plugins/parquet/parquetembed.cpp
Outdated
return count;
}

void addMember(std::shared_ptr<ParquetHelper> r_parquet, rapidjson::Value &key, rapidjson::Value &value)
These functions are better made static: they are called more efficiently and don't get added to the global namespace (so the dll/so loads more quickly).
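For instance, using the signature from the diff:

// Internal linkage: not exported from the shared object, so the call can be
// resolved directly and the .so's dynamic symbol table stays smaller.
static void addMember(std::shared_ptr<ParquetHelper> r_parquet, rapidjson::Value &key, rapidjson::Value &value)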
Changed addMember to static.
plugins/parquet/parquetembed.cpp
Outdated
size32_t utf8chars;
rtlDataAttr data;
rtlStrToUtf8X(utf8chars, data.refstr(), len, value);
size32_t utf8len = rtlUtf8Size(utf8chars, data.getstr());
Is it correct that this goes via utf8? I thought it mapped to a large binary. I think this will cause data corruption/invalid results.
Removed UTF-8 conversion.
plugins/parquet/parquetembed.cpp
Outdated
rtlUnicodeToUtf8X(utf8chars, utf8, chars, value);

rapidjson::Value key = rapidjson::Value(field->name, jsonAlloc);
rapidjson::Value val = rapidjson::Value(utf8, jsonAlloc);
The utf8 string is not zero-terminated, so this will not work.
Most efficient would be to use a std::string_view, if rapidjson supports it.
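A sketch of the length-aware alternative, using the diff's names: rapidjson's value constructor accepts an explicit length, so the non-terminated buffer is safe provided the length is the byte size rather than the code-point count.

rapidjson::Value val = rapidjson::Value(utf8, rtlUtf8Size(utf8chars, utf8), jsonAlloc);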
Changed to pass in length to the value constructor.
plugins/parquet/parquetembed.cpp
Outdated
rtlDataAttr data;
rtlStrToDataX(bytes, data.refdata(), len, value);

bindDataParam(bytes, data.getstr(), field, r_parquet);
Can pass len and value directly; no need for the rtlStrToDataX().
Fixed
else if (stricmp(optName, "BatchSize") == 0)
    batchSize = atoi(val);
else
    failx("Unknown option %s", optName.str());
It may be better to warn on an unknown option. Failing catches mistyped options, but it also means that if new options are added to the ECL code, queries will fail on old builds. Happy to leave as-is though.
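A sketch of the lenient alternative, assuming the platform's DBGLOG logging macro is available to the plugin:

else if (stricmp(optName, "BatchSize") == 0)
    batchSize = atoi(val);
else
    DBGLOG("Parquet plugin: ignoring unknown option %s", optName.str()); // warn instead of failing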
@jackdelv This should be changed from a draft PR (in the future, change it when you think it is ready), and the commits will need squashing.
@ghalliday I have addressed the most significant of your concerns and opened a JIRA to fix the more cosmetic issues.
@jackdelv a few fairly small comments.
plugins/parquet/parquetembed.cpp
Outdated
@@ -1461,7 +1461,7 @@ void bindUnicodeParam(unsigned chars, const UChar *value, const RtlFieldInfo *fi
 rtlUnicodeToUtf8X(utf8chars, utf8, chars, value);

 rapidjson::Value key = rapidjson::Value(field->name, jsonAlloc);
-rapidjson::Value val = rapidjson::Value(utf8, jsonAlloc);
+rapidjson::Value val = rapidjson::Value(utf8, utf8chars, jsonAlloc);
Is this the number of code points or the size? If the size, you need
rtlUtf8Size(utf8chars, utf8)
plugins/parquet/parquetembed.cpp
Outdated
rapidjson::Value key;
key.SetString(field->name, jsonAlloc);
rapidjson::Value val;
-val.SetString(data.getstr(), utf8len, jsonAlloc);
+val.SetString(value, len, jsonAlloc);
Still need
size32_t utf8size = rtlUtf8Size(len, value);
Better to name it size rather than len, to avoid confusion over whether it is a number of code points or a size.
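Putting the two comments together, the corrected binding would look something like:

size32_t utf8size = rtlUtf8Size(len, value); // byte size of the len code points
val.SetString(value, utf8size, jsonAlloc);   // SetString expects a byte length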
plugins/parquet/parquetembed.hpp
Outdated
-DocValuesIterator(const std::vector<rapidjson::Document> &&rows,
-    std::vector<std::string> path, int64_t array_levels)
-    : rows(rows), path(std::move(path)), array_levels(array_levels) {}
+DocValuesIterator(const std::vector<rapidjson::Document> &&_rows,
I'm not sure you want a && for rows (or a call to std::move when passing it)
Initial commit
Add basic EmbedFunctionContext in cpp files.
Add Arrow includes; fix header conflicts before rebase.
Added example file to show reading from and writing to a parquet file; add function calls to ecl file.
Added some basic options for embedding the parquet functions. I am not sure if these will stay the same.
Added install calls so the shared library gets installed.
Added function instantiations for header functions and some useful function definitions.
Added definitions for some of the ParquetEmbedFunctionContext and ParquetRecordBinder functions.
Added ParquetHelper class for holding the user inputs and opening the stream reader and writer.
Added dataset to argument of write function to test writing ECL datasets.
Add function to ParquetDatasetBinder to get datatypes from RtlFieldInfo object and build the schema.
Fix errors in getFieldTypes and addField methods. Schema now gets built from the parameter dataset; can write to parquet files.
Set up ParquetRowStream to start reading parquet and output it to an ECL record; can read a parquet table from a parquet file.
Add RowBuilder methods for building result rows and RowBatchBuilder for converting parquet tables to rows.
Add #define RAPIDJSON_HAS_STDSTRING 1, which allows use of some helpful GenericValue functions.
Implemented Parquet::RowStream::nextRow, which gets each result row from the parquet iterator and converts it to JSON using rapidjson.
Fixed error where one extra row was read from the file, causing a failure.
Remove credentials.
Fix issues where utf8-encoded parameters were not properly converted to strings, and with reading and writing utf8 strings.
Add code for getting the schema for nested types.
Add code for reading partitioned files.
Linker not properly linking dataset shared library; manually link Arrow Dataset shared library.
Added part of the code for writing to partitioned files.
Replace StreamWriter API with rapidjson API; add classes for converting rapidjson::Documents to arrow::RecordBatches for writing.
Moved allocator for rapidjson documents to top level in the parquetembed namespace.
Fix issue with writing to vector of rapidjson documents.
Add visit method for HalfFloat datatype.
Changed FileWriter creation to write in the parquet format; moved FileWriter from ExecuteAll to openWriteFile so it is only called once.
Changed write() to return the FileWriter instead of the FileOutputStream.
Writing in batches now works, but further testing on edge cases and vector reuse is needed.
Read files in RowGroups one at a time, then iterate through each row.
Added files for testing reading and writing large parquet files and nested data.
rapidjson MemoryPoolAllocator is now cleared after writing each RowGroup.
Added support for nested objects.
Refactored code and moved function implementations to cpp file.
Documentation.
Rebase onto 8.12.x complete and tested.
Fixup build and install. Signed-off-by: Dan S. Camper
Change type mapping: unsigned and signed are now mapped explicitly to 64- and 32-bit versions of each.
Refactor JsonValueConverter; add additional methods for UInt64/32 and Int32 for explicit type mapping.
Change interface for calling plugin; added information for which thor worker we are running on, and divide the row groups (poorly) by the number of workers.
Change which API is used for opening files.
Update FileReader usage to comply with version 12.0.0 standards.
Add overlay for pulling down forked arrow directory; update arrow fork SHA512.
Add function to close file that resets the unique_ptr when the plugin is done streaming rows to the engine.
Add Arrow to vcpkg.json.in; add PARQUETEMBED to plugins.cmake.
Explicit default destructors; make one ParquetHelper pointer private.
ParquetHelper: use a single MemoryPool; explicitly release pool memory on destruction.
Add vcpkg overlay for building with Arrow 12.0.0 release version.
Add function for delegating row groups to any number of thor workers; fix error in dividing up rows between workers.
Change writing parquet to use the same arrow::MemoryPool*.
Add code for reading from an arrow dataset (partitioned files).
Fix error in binding string params to rows when writing to parquet.
Add better way of setting single vs multiple reads/writes.
Implemented logic for scanning chunks of an arrow dataset.
Add Unimplemented tag to scalar parameters for EmbedFunctionContext bind functions.
Use PARQUET_ASSIGN_OR_THROW whenever an arrow::Result object is returned.
Create example for creating and reading a partitioned dataset.
Writing datasets using HivePartitioning works; write each batch out to a different file to avoid file-name collisions.
Change partitioning behaviour to delete the directory contents before creating a new partition.
Implement thor functionality for writing parquet files.
Increased default batch size for reading and writing to 10000.
Added critical section to all uses of scanner file operations.
Add WritePartition macro call.
Start README for parquet plugin.
Added support for Data and Unicode datatypes when writing to parquet.
Change rapidjson allocator to thread_local storage specification to allow multi-threaded writing.
Added arrow conversion methods for the arrow binary type.
Changed default RowGroup size to 20000.
Added example ECL for reading and writing BLOB data.
Be more careful passing around the length of strings.
Use getProp rather than queryProp to preserve the length of a string result being returned.
Reimplemented ParquetRowBuilder to build rows from rapidjson::Value objects rather than an IPropertyTree.
Fixed implementation of the set datatype.
Fix reading rapidjson::Array and add Clear calls to the rapidjson allocator; remove calls to GetAllocator and instead pass jsonAlloc in.
Add function for querying rows from the dataset: gets a RecordBatchReader and RecordBatchReaderIterator to iterate through the stream of record batches.
Removed rapidjson conversion from the RowBuilder implementation; RowBuilder now builds fields directly from the arrow::Table read from the arrow::FileReader.
Implemented a ParquetVisitor class for getting the correct datatype from each scalar.
Fix issue with the Real datatype not being returned properly.
Fixed some performance issues with the Arrow FileReader interface: cache chunks rather than reading them in every time we build a field.
Update file reading structure to expect each thor worker to have its own parquet file.
Fix compiler warnings; clean up and format source code.
Fix Utf8 conversion when returning string results.
Update default row and batch sizes.
Bump vcpkg version of Arrow to 13.0.0; fix dependency installs.
Fix decimal type.
Remove PathTracker copy constructor; change initialization of PathTracker members.
Create ParquetArrayType enum.
Move ParquetDatasetBinder methods to cpp file; static jsonAlloc can now be moved out of the header file.
Remove default initializations from constructors.
Fix partition condition for user input.
Fix nested structures for all ECL types; Utf8 type no longer gets translated to string; add utf8 characters to example file.
Encapsulate children-processing check; create function currArrayIndex() from common code across the source file.
Change write() to queryWriter() and return a pointer to the FileWriter object; return references rather than pointers where the object cannot be a nullptr.
Use consistent types, especially in comparisons; remove countFields because it is a duplicate.
Remove floating-point operation from divide_row_groups; thor nodes that don't receive any rows will not open a file.
Remove critical section when writing; each node writes to a unique file.
Add override qualifier to virtual functions in ParquetRowBuilder and ParquetRowStream.
Add default initializers and clean up member variables.
Allow openReadFile to open any files matching the filename chosen by the user; all matching files are opened and their row counts recorded, allowing even division of work while keeping order intact.
Style: make function names clearer and more descriptive of their behaviour.
Revert plugin collision change.
Fix non-null-terminated unicode parameters; fix null characters in string types.
Data datatype no longer gets converted to utf-8; remove extra rtlStrToDataX call in processData.
Add static qualifier to addMember function; remove commented lines.
Fix references in Next and DocValuesIterator; fix constructor argument names.
Remove && for rows in DocValuesIterator constructor.
Use UTF-8 size instead of code points.
Force-pushed from da5f891 to 0850494
Merged commit a9d408d into hpcc-systems:candidate-9.4.x