
HPCC-30091 Add Support for the Parquet File Format #17675

Merged
merged 1 commit into candidate-9.4.x from parquet-dev
Sep 28, 2023

Conversation

jackdelv
Contributor

@jackdelv jackdelv commented Aug 14, 2023

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@jackdelv jackdelv changed the base branch from candidate-9.2.x to master August 14, 2023 16:16
@jackdelv jackdelv force-pushed the parquet-dev branch 2 times, most recently from bb93c53 to 13b524a August 14, 2023 18:43
@ghalliday ghalliday self-requested a review August 23, 2023 15:24
Member

@ghalliday ghalliday left a comment

@jackdelv Broadly looks very good. A few comments on style, many of which can be ignored.
One question I haven't really covered (except when commenting on the critical section) is whether there are any multi-threading issues.
I haven't looked at the details of the arrow calls, mainly at potential leaks, and reading the code as is.

CMakeLists.txt Outdated
@@ -206,7 +207,7 @@ if(APPLE)
HPCC_ADD_SUBDIRECTORY(package)
endif(APPLE)

if (NOT CLIENTTOOLS_ONLY)
if (NOT (PLUGIN OR CLIENTTOOLS_ONLY))
Member

question: why is this change here?

Contributor Author

That change was made to try and solve the issue with installing plugins on top of the cluster. This change doesn't seem to solve that issue, so I have reverted it.

: nodeName(_nodeName), nodeType(_nodeType), structPtr(_struct), childCount(0), childrenProcessed(0) {}

// Copy constructor
PathTracker(const PathTracker &other)
Member

If this is identical to the default copy constructor it is better not to define it, and leave it to the compiler.

Contributor Author

Fixed.

unsigned int childrenProcessed;

PathTracker(const char *_nodeName, const arrow::Array *_struct, PathNodeType _nodeType)
: nodeName(_nodeName), nodeType(_nodeType), structPtr(_struct), childCount(0), childrenProcessed(0) {}
Member

style: It is slightly safer/clearer to use initializers on the members instead of in the constructor. This is especially true when you have multiple constructors.

    unsigned int childCount = 0;
    unsigned int childrenProcessed = 0;

Contributor Author

Added initializers on the members rather than the constructor.
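
For reference, this is roughly what the adjusted struct looks like with in-class initializers (a sketch using the member names from the diff above, not necessarily the exact plugin code). With the counters defaulted here, every constructor starts them at zero and the compiler-generated copy constructor is sufficient:

    struct PathTracker
    {
        const char *nodeName = nullptr;
        PathNodeType nodeType;
        const arrow::Array *structPtr = nullptr;
        unsigned int childCount = 0;
        unsigned int childrenProcessed = 0;

        PathTracker(const char *_nodeName, const arrow::Array *_struct, PathNodeType _nodeType)
            : nodeName(_nodeName), nodeType(_nodeType), structPtr(_struct) {}
    };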

return arrow::Status::OK();
}

int type = 0; // 0 = null, 1 = bool, 2 = Int, 3 = UInt, 4 = Float
Member

style: Clearer to use an enumeration

Contributor Author

Changed to enumeration.
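
The later commit "Create ParquetArrayType enum." suggests the shape of the fix; a hedged sketch (the actual enumerator names in the plugin may differ) is:

    // Replaces the bare int plus comment; comparisons such as
    // (*array_visitor)->type == 6 become self-describing.
    enum ParquetArrayType
    {
        NullType,   // was 0
        BoolType,   // was 1
        IntType,    // was 2
        UIntType,   // was 3
        RealType    // was 4, and so on for the remaining codes
    };

    ParquetArrayType type = NullType;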

{
const char *nodeName;
PathNodeType nodeType;
const arrow::Array *structPtr;
Member

note to self. Check lifetime of nodeName and structPtr, especially with the copy constructor.

rtlUtf8ToDataX(len, result, p.resultChars, p.stringResult);
return;
}
if ((*array_visitor)->type == 6)
Member

Example where using an enumeration would make it much easier to understand/check

Contributor Author

Fixed.

{
auto i = !m_pathStack.empty() && m_pathStack.back().nodeType == CPNTSet ? m_pathStack.back().childrenProcessed++ : currentRow;
auto view = (*array_visitor)->large_bin_arr->GetView(i);
rtlUtf8ToDataX(len, result, rtlUtf8Length(view.size(), view.data()), view.data());
Member

Not sure if you want this or
rtlDataToDataX(len, result, view.size(), view.data())

Contributor Author

That would probably work, but I wasn't able to find the function rtlDataToDataX in the platform code. I tried with rtlDataToData and it couldn't copy to the target because it is a null pointer, which I assume is the difference between non-X copy methods and X copy methods.

Member

You can use
rtlStrToDataX(len, result, view.size(), view.data())

  • there should be an rtlDataToDataX for completeness, but it would have an identical implementation
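
Putting the two comments together, the resulting call at this point is a plain byte copy of the binary view (sketch; the surrounding lines are as shown in the diff above):

    auto view = (*array_visitor)->large_bin_arr->GetView(i);
    // rtlUtf8ToDataX interprets the source as UTF-8 and needs a code-point count;
    // for raw binary data a straight byte copy is what is wanted here.
    rtlStrToDataX(len, result, view.size(), view.data());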

{
failx("Incorrect type for field %s.", field->name);
}
auto i = !m_pathStack.empty() && m_pathStack.back().nodeType == CPNTSet ? m_pathStack.back().childrenProcessed++ : currentRow;
Member

This code is repeated in various functions. It would be better as a shared function.

Contributor Author

Created currArrayIndex() to get the current index from either a set or row.
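
A sketch of what the shared helper boils down to (the exact signature and return type in the plugin may differ):

    // Returns the next child index when positioned inside a set,
    // otherwise the index of the current row.
    int64_t ParquetRowBuilder::currArrayIndex()
    {
        return !m_pathStack.empty() && m_pathStack.back().nodeType == CPNTSet
                   ? m_pathStack.back().childrenProcessed++
                   : currentRow;
    }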

*/
bool ParquetRowBuilder::processNextSet(const RtlFieldInfo *field)
{
return m_pathStack.back().childrenProcessed < m_pathStack.back().childCount;
Member

minor: This would be more encapsulated if implemented as a member function of PathTracker::finishedChildren() or similar.

Contributor Author

Created PathTracker::finishedChildren().
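
A sketch of the encapsulated check; only the name finishedChildren comes from the discussion above, and the orientation of the condition is an assumption:

    // Inside PathTracker (members as above):
    bool finishedChildren() const { return childrenProcessed >= childCount; }

    // Caller:
    bool ParquetRowBuilder::processNextSet(const RtlFieldInfo *field)
    {
        return !m_pathStack.back().finishedChildren();
    }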

*/
void ParquetRecordBinder::processUtf8(unsigned chars, const char *value, const RtlFieldInfo *field)
{
bindStringParam(chars, value, field, r_parquet);
Member

This needs to call a bindUtfParameter - which performs no translation

Contributor Author

Created bindUtf8Param().
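
A rough sketch of a pass-through binder, assuming the same rapidjson/addMember pattern used by the other bind helpers shown in this review (the real bindUtf8Param may differ in detail); rtlUtf8Size converts the code-point count into the byte count rapidjson needs:

    static void bindUtf8Param(unsigned chars, const char *value, const RtlFieldInfo *field, std::shared_ptr<ParquetHelper> r_parquet)
    {
        size32_t utf8size = rtlUtf8Size(chars, value);      // bytes, not code points
        rapidjson::Value key(field->name, jsonAlloc);
        rapidjson::Value val(value, utf8size, jsonAlloc);   // no character-set translation
        addMember(r_parquet, key, val);
    }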

@ghalliday
Member

@jackdelv some of the smoke tests are also failing.

@jackdelv
Contributor Author

@ghalliday Ok, back to you Gavin. I made some changes to the way the files are opened by the plugin to ensure that n workers can open m files. I also made various style changes to make function names clearer and more descriptive of their behaviour.

@ghalliday ghalliday self-requested a review September 25, 2023 15:24
@jackdelv jackdelv changed the base branch from master to candidate-9.4.x September 25, 2023 15:28
@ghalliday
Member

@GordonSmith please can you check the vcpkg related changes (cmake files and vcpkg*)

Member

@ghalliday ghalliday left a comment

@jackdelv I think this is very close.
@dcamper I think I would be prepared to merge as-is as long as there are jiras to revisit some of the issues. A couple are significant (e.g. incorrect results), but I think would only happen in unusual situations.

One change should be to rationalise the naming conventions. For instance, use leading lower case when a function is not overriding a virtual someone else has defined, and use a consistent convention for member variables: this code uses abc, abc_, m_abc and r_abc in different places. Similarly, some functions and variables use underscores instead of camel case.

find_package(Arrow CONFIG REQUIRED)
find_package(Parquet CONFIG REQUIRED)
find_package(ArrowDataset CONFIG REQUIRED)
#find_library(ARROWDATASET arrow_dataset ${CMAKE_BINARY_DIR}/vcpkg_installed/x64-linux-dynamic/lib REQUIRED)
Member

Should these be removed now, and the commented line at the end?

Contributor Author

Yes, I have removed them.

/// \param rows vector of rows
/// \param path field names to enter
/// \param array_levels number of arrays to enter
DocValuesIterator(const std::vector<rapidjson::Document> &rows,
Member

path should be a
std::vector<std::string> && path if you are using std::move

Contributor Author

Fixed

/// \param path field names to enter
/// \param array_levels number of arrays to enter
DocValuesIterator(const std::vector<rapidjson::Document> &rows,
std::vector<std::string> path, int64_t array_levels)
Member

Parameters and member variables with the same name. I am surprised that works. Our normal approach is to prefix the parameters with _ e.g., _rows, _path

Contributor Author

Fixed parameter names.
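
Combining this with the earlier comment about path, the constructor ends up roughly as follows (a sketch of the final shape; see also the follow-up below about not using && for rows):

    DocValuesIterator(const std::vector<rapidjson::Document> &_rows,
                      std::vector<std::string> &&_path, int64_t _array_levels)
        : rows(_rows), path(std::move(_path)), array_levels(_array_levels) {}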


~DocValuesIterator() = default;

const rapidjson::Value *NextArrayOrRow(const rapidjson::Value *value, size_t *path_i,
Member

style: In general we start function names with lower case letters. Since this is fairly independent code I am fine with it staying as it is (and less disruptive), but any code more tightly integrated into the platform should conform.

Contributor Author

Opened JIRA for addressing style issues: HPCC-30381

// Can either start at document or at last array level
if (array_stack.size() > 0)
{
auto pos = array_stack.back();
Member

Should this be auto &?

Contributor Author

Fixed
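
For clarity, the difference being pointed out (sketch):

    auto pos = array_stack.back();    // before: copies the element
    auto &pos = array_stack.back();   // after: binds to the element in place, no copy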

return count;
}

void addMember(std::shared_ptr<ParquetHelper> r_parquet, rapidjson::Value &key, rapidjson::Value &value)
Member

These functions are better static - they are called more efficiently and don't get added to the global name space (so the dll/so loads more quickly)

Contributor Author

Changed addMember to static.

size32_t utf8chars;
rtlDataAttr data;
rtlStrToUtf8X(utf8chars, data.refstr(), len, value);
size32_t utf8len = rtlUtf8Size(utf8chars, data.getstr());
Member

Is this correct that it is going via utf8? I thought it mapped to a large binary. I think this will cause data corruption/invalid results.

Contributor Author

Removed UTF-8 conversion.

rtlUnicodeToUtf8X(utf8chars, utf8, chars, value);

rapidjson::Value key = rapidjson::Value(field->name, jsonAlloc);
rapidjson::Value val = rapidjson::Value(utf8, jsonAlloc);
Member

The utf8 string is not zero terminated - so this will not work.
Most efficient would be to use a std::string_view if rapidjson supports it.

Contributor Author

Changed to pass in length to the value constructor.

rtlDataAttr data;
rtlStrToDataX(bytes, data.refdata(), len, value);

bindDataParam(bytes, data.getstr(), field, r_parquet);
Member

Can pass len and value directly - no need for the rtlStrToDataX()

Contributor Author

Fixed

else if (stricmp(optName, "BatchSize") == 0)
batchSize = atoi(val);
else
failx("Unknown option %s", optName.str());
Member

It may be better to warn on an unknown option. Failing catches mistyped options, but it also means if new options are added to ecl code the queries will fail on old builds. Happy to leave as-is though.
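
For illustration, the warn-instead-of-fail variant would look something like this (assuming a platform logging macro such as OWARNLOG is usable in this context; the shipped code keeps failx as above):

    else if (stricmp(optName, "BatchSize") == 0)
        batchSize = atoi(val);
    else
        OWARNLOG("Parquet plugin: ignoring unknown option %s", optName.str());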

@ghalliday
Member

@jackdelv This should be changed from a draft PR (in the future change it when you think it is ready), and the commits will need squashing.

@jackdelv jackdelv marked this pull request as ready for review September 27, 2023 16:34
@jackdelv
Contributor Author

@ghalliday I have addressed the major concerns and opened a JIRA to fix the more cosmetic issues.

@jackdelv jackdelv requested a review from ghalliday September 27, 2023 18:48
Member

@ghalliday ghalliday left a comment

@jackdelv a few fairly small comments.

@@ -1461,7 +1461,7 @@ void bindUnicodeParam(unsigned chars, const UChar *value, const RtlFieldInfo *fi
rtlUnicodeToUtf8X(utf8chars, utf8, chars, value);

rapidjson::Value key = rapidjson::Value(field->name, jsonAlloc);
rapidjson::Value val = rapidjson::Value(utf8, jsonAlloc);
rapidjson::Value val = rapidjson::Value(utf8, utf8chars, jsonAlloc);
Member

Is this the number of code-points or the size? If it is the size, you need
rtlUtf8Size(utf8chars, utf8)

rapidjson::Value key;
key.SetString(field->name, jsonAlloc);
rapidjson::Value val;
val.SetString(data.getstr(), utf8len, jsonAlloc);
val.SetString(value, len, jsonAlloc);
Member

Still need

size32_t utf8size = rtlUtf8Size(len, value);

Better to name it size rather than len to avoid confusion over whether it is a number of codepoints or a size.
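
In other words (a sketch using the names from the diff above), the code-point count has to be converted to a byte count before handing the string to rapidjson:

    size32_t utf8size = rtlUtf8Size(len, value);   // len counts code points
    rapidjson::Value val;
    val.SetString(value, utf8size, jsonAlloc);     // rapidjson expects a byte length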

DocValuesIterator(const std::vector<rapidjson::Document> &&rows,
std::vector<std::string> path, int64_t array_levels)
: rows(rows), path(std::move(path)), array_levels(array_levels) {}
DocValuesIterator(const std::vector<rapidjson::Document> &&_rows,
Member

I'm not sure you want a && for rows (or a call to std::move when passing it)

@jackdelv jackdelv requested a review from ghalliday September 28, 2023 17:35
Add basic EmbedFunctionContext in cpp files.

Add Arrow Includes.

Fix Header conflicts before rebase.

Fix Header conflicts before rebase.

Changes after rebase.

Added example file to show reading from and writing to a parquet file.

Add function calls to ecl file.

Added some basic options for embedding the parquet functions.
I am not sure if these will stay the same.

Added install calls so the shared library gets installed.

Fixed type.

Added function instantiations for header functions.

Added some useful function definitions.

Added definitions for some of the ParquetEmbedFunctionContext and ParquetRecordBinder functions.

Fix typo

Added ParquetHelper class for holding the user inputs and opening the stream reader and writer.

Added dataset to argument of write function to test writing ECL datasets.

Add function to ParquetDatasetBinder to get datatypes from RtlFieldInfo object and build the schema.

Fix error in getFieldTypes method.

Fix addField method. Schema now gets built from the parameter dataset.

Can write to parquet files.

Set up ParquetRowStream to start reading parquet and output it to an ECL record.

Minor changes.

Can read a parquet table from a parquet file.

Add RowBuilder methods for building result rows.

Add RowBatchBuilder for converting parquet tables to rows. There seems to be a namespace conflict with rapidjson::Value.

Add #DEFINE RAPIDJSON_HAS_STDSTRING 1.
This allows for the use of some helpful GenericValue functions.

Implemented Parquet::RowStream::nextRow which gets each result row from the parquet iterator and converts it to json using rapidjson.

Fixed error where reading from the file it reads one extra row and fails.

Remove Credentials.

Fix issue where utf8 encoded parameters were not properly converted to strings.

Fix issues dealing with utf8 strings and reading and writing them.

Add code for getting the schema for nested types.

Add additional feature from apache arrow.

Add code for reading partitioned files.

Comment out Partition code

Edit ECL files

Linker not properly linking dataset shared library.

Manually link Arrow Dataset shared library.

Added part of the code for writing to partitioned files.

Replace StreamWriter API with rapidjson api.
Add classes for converting rapidjson::Documents to arrow::RecordBatches for writing.

Moved allocator for rapidjson documents to top level in the parquetembed namespace.

Minor changes. Schema needs to be correctly passed through.

Fix issue with writing to vector of rapidjson documents.
Add visit method for HalfFloat datatype.

Changed FileWriter creation to write in the parquet format.

Moved FileWriter from ExecuteAll to openWriteFile. This is so it only gets called once.
Changed write() to return the FileWriter instead of the FileOutputStream.
Writing in batches now works! But further testing on edge cases and vector reuse is needed!

Removed old code. Currently in working state though not tested very much.

Read Files in RowGroups one at a time then iterate through each row.

Added file for testing reading and writing large parquet files.

Minor: Changes to simple example.

rapidjson MemoryPoolAllocator is now cleared after writing each RowGroup.

Tidy up a bit...

Created file for testing nested data.

Added support for nested objects.

Refactored code and moved function implementations to cpp file.

Added comment.

Added comment.

Added new comment.

Documentation.

Rebase 8.12.x complete and tested.

Initial commit

Fixed and Tested rebase.

Minor.

Fixup build and install

Signed-off-by: Dan S. Camper <[email protected]>

Change type mapping. Unsigned and Signed are now mapped explicitly to 64 and 32 bit versions of each.

Add different byte sizes for testing type mapping.

Refactor JsonValueConverter.
Add additional methods for UInt64/32 and Int32 for explicit type mapping.

Change interface for calling plugin.

Added information for which thor worker we are running on.
Divides the row groups (poorly) by the number of workers.

Minor: Updated test files.

Change which API is used for opening files.

WIP

Minor changes to opening parquet file

HPCC-28689 Refactor jhtree class structure to support additional node compression methods

Sort out concerns about usage and semantics of isLeaf(). Also
a few trivial changes suggested by earlier code review

Signed-off-by: Richard Chapman <[email protected]>

Changes after rebase.

Not sure what this does.

Update FileReader usage to comply with version 12.0.0 standards.

Add overlay for pulling down forked arrow directory.

Add function to close file that resets unique_ptr when the plugin is done streaming rows to the engine.

Add Arrow to vcpkg.json.in

Add correct hash for current arrow commit.
add PARQUETEMBED to plugins.cmake.

Update arrow fork SHA512.

Remove trailing whitespace

Signed-off-by: Dan S. Camper <[email protected]>

Explicit default destructors; make one ParquetHelper pointer private

Signed-off-by: Dan S. Camper <[email protected]>

ParquetHelper: Use single MemoryPool; explicitly release pool memory on destruction

Signed-off-by: Dan S. Camper <[email protected]>

Add vcpkg overlay for building with Arrow 12.0.0 release version.

Add function for delegating row groups to any number of thor workers.

Fix error in dividing up rows between workers and add some additional comments.

Change writing parquet to use the same arrow::MemoryPool*

Add code for reading from arrow dataset (partitioned files).
Fix error in binding string params to rows when writing to parquet.
Add better way of setting single vs multiple reads/writes.

Implemented logic for scanning chunks of an arrow dataset.

Format and cleanup source code.

Add Unimplemented Tag to Scalar parameters for EmbedFunctionContext Bind functions.

Change check in ParquetHelper::next() to correct value.
Use PARQUET_ASSIGN_OR_THROW whenever an arrow::Result object is returned.

Create example for creating and reading a partitioned dataset.

Writing Datasets using HivePartitioning works.
Currently the partitioning schema is hardcoded to the language field in the github dataset.
Streaming larger than memory datasets to a partitioned dataset is not working.

Write each batch out to a different file to avoid collisions of file names.

Clean up file name

Change partitioning behaviour to delete the directory contents before creating a new partition.

Implement thor functionality for writing parquet files.
Directory gets created and emptied if it already existed on write.
Increased default batch size for reading and writing to 10000.
Added critical section to all uses of scanner file operations.

Add WritePartition Macro call

Make a better example.

Start README for parquet plugin.

Added support for Data and Unicode datatypes when writing to parquet.
removed unnecessary variable.

Change rapidjson allocator to thread_local storage specification to allow for multi threaded writing.

Added arrow conversion methods for arrow binary type.

Changed default RowGroup size to 20000

Added example ecl for reading and writing BLOB data

Be more careful passing around length of the strings.
Add additional conversion to incoming DATA types to UTF8.

Update blob test

Use getProp rather than queryProp to preserve the length of a string result being returned.

Reimplemented ParquetRowBuilder to build rows from rapidjson::Value objects rather than an IPropertyTree.
Fixed Implementation of set datatype.

Update test files.

update create partition example.

Update Example files.

Fix reading rapidjson::Array and add Clear calls to the rapidjson allocator.

Remove calls to GetAllocator and instead pass jsonAlloc in.
Add function for queryingRows from the dataset.

Add function for querying rows from dataset. Gets a RecordBatchReader and a RecordBatchReaderIterator to iterate through the stream of record batches from the dataset.
There is currently a large limitation in arrow where iterators for datasets do not support random reads: they must iterate from the beginning to the starting point, they cannot end early, and on destruction they are iterated to the end of the stream.

Removed rapidjson conversion from RowBuilder implementation.
RowBuilder now builds the fields directly from the arrow::Table that is read from the arrow::FileReader object.
Implemented a ParquetVisitor class for getting the correct datatype from each scalar.
Tested on all test ecl files.

format

Fix issue with Real datatype not being returned properly.

Fixed some performance issues with Arrow FileReader interface.

Cache chunks rather than read them in every time we build a field.

Update test/example files.

Update file reading structure to expect each thor worker to have its own parquet file.

Fix compiler warnings.

Clean up...

Clean up...

Clean up...

Clean up ...

Format source code.

Refactor

Fix Utf8 conversion when returning String Results.

Change open file function call.
update default row and batch sizes.

Bump vcpkg version of Arrow to 13.0.0

Fix dependency installs.

Fix decimal type.

Remove PathTracker copy constructor.

Change initialization of PathTracker members.

Create ParquetArrayType enum.

Move ParquetDatasetBinder methods to cpp file. Static jsonAlloc can now be moved out of the header file.

minor change for clarity

Remove default initializations from constructors.

Fix partition condition for user input.

Fix decimal datatype.
Fix nested structures for all ECL types.

Utf8 type no longer gets translated to string.

Add utf8 characters to example file.

Encapsulate children processing check.

Create function currArrayIndex() from common code across source file.

Change write() to queryWriter() to be more descriptive.
Change return type to pointer to the FileWriter object.

Return references rather than pointer where object cannot be a nullptr.

Use consistent types, especially in comparisons.

Remove countFields because it is a duplicate.

Remove floating point operation from divide_row_groups.

Thor nodes that don't receive any rows will not open a file.

Remove Critical section when writing.
Each node writes to a unique file.

Add override qualifier to virtual functions in ParquetRowBuilder and ParquetRowStream.

Add default initializers and clean up member variables.

Allow openReadFile to open any files matching the filename chosen by the user.
The files will all be opened and the row counts will be recorded. This will allow for even division of work, and will keep order intact.

Style: Make function names clearer and improve clarity.

Revert plugin collision change.

Fix non null terminated unicode parameters.

Fix null characters in string types.

Data datatype no longer gets converted to utf-8.

Remove extra rtlStrToDataX call in processData.

Add static qualifier to addMember function.

Remove commented lines.

Fix references in Next and DocValuesIterator

Fix constructor argument names.

Remove && for rows in DocValuesIterator constructor.

Use UTF-8 size instead of code-points.
@ghalliday ghalliday merged commit a9d408d into hpcc-systems:candidate-9.4.x Sep 28, 2023
25 of 26 checks passed