-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC-33155 Revisit Parquet Test Suite #19405
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Jack Del Vecchio <[email protected]>
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-33155 Jirabot Action Result: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one question about the contents of a comment block.
* ├── district=1 | ||
* │ ├── firstname=Alice | ||
* │ │ └── city=New%20York | ||
* │ │ └── part_0_of_table_0_from_worker_0.parquet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These part... names are files? If so, they are all the same for every value combination. Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments/questions.
DATASET([{0, 'Corrupt Parquet File', ''}], RECORDDEF), | ||
CORRUPT_PARQUET); | ||
|
||
OUTPUT(CORRUPT_RESULT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the output this workunit will not do anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -31,6 +36,3 @@ EMPTY_PARQUET := DATASET([], RECORDDEF); | |||
ParquetIO.Write(EMPTY_PARQUET, filePath, TRUE); | |||
|
|||
read_data := ParquetIO.Read(RECORDDEF, filePath); | |||
|
|||
OUTPUT(read_data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this deleted it will not read the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
- createDirectoryIterator is used to list all matching Parquet files and the order is not guarenteed - Fix parquetEmpty.ecl and parquetCorrupt.ecl by adding the OUTPUT statements back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved.
@@ -430,7 +441,7 @@ arrow::Result<std::shared_ptr<arrow::Table>> ParquetReader::queryRows() | |||
{ | |||
// Start by getting the number of rows in the first group and checking if it includes this workers startRow | |||
__int64 offset = (*rbatchItr)->get()->num_rows(); | |||
while (offset < startRow) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a bug fix - is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think the changes to the openReadFile would be considered a bug fix as well. Should they be pulled out into separate PRs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two questions.
@@ -381,7 +392,7 @@ arrow::Status ParquetReader::processReadFile() | |||
rbatchItr = arrow::RecordBatchReader::RecordBatchReaderIterator(rbatchReader.get()); | |||
PARQUET_ASSIGN_OR_THROW(auto datasetRows, scanner->CountRows()); | |||
// Divide the work among any number of workers | |||
divide_row_groups(activityCtx, datasetRows, totalRowCount, startRowGroup); | |||
divide_row_groups(activityCtx, datasetRows, totalRowCount, startRow); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change deliberate. If so, should line 409 also change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was deliberate. startRow is used for arrow partitioned datasets because the API returns rows rather than groups of rows. This was an error causing multi node thor workers to always start reading from the first row.
@jackdelv how important are the bug fixes? Should they be targeting an earlier build than 9.10.x? |
@jackdelv I raised a JIRA, the HPCC-33278 related to Test Suite. |
Change the plugin to throw an error if a file is not found when opening a file for reading
Change output filename of partitioned file parts. part_00_0.parquet -> part_0_of_table_0_from_worker_0.parquet
These files aren't meant to be read directly, but I think the new format makes more sense when trying to debug.
Update tests so that they are all being run and are able to use external files in testing/regress/download
Remove parquetString test and key file and move testing to parquetTypes.ecl
Update test cases in parquetTypes.ecl
Type of change:
Checklist:
Smoketest:
Testing: