Add a case to test ORC writing/reading with lots of nulls #8825

Merged 1 commit into NVIDIA:branch-23.08 on Jul 28, 2023

Collaborator

@res-life res-life commented Jul 27, 2023

closes #8731

This PR tests ORC writing and reading with a large number of nulls.

  • Add a case to test ORC writing/reading with lots of nulls.
  • Add a `large_data_test` mark in pytest. Tests with this mark are disabled by default.
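
For readers unfamiliar with opt-in pytest marks, here is a minimal sketch of the usual pattern for a mark that is off by default. It is not the code from this PR: the hook bodies, the help text, and the skip reason are assumptions. Only the `large_data_test` mark name comes from the description above, and the command-line flag is assumed to share that name.

```python
# conftest.py (hypothetical sketch, not the file changed in this PR)
import pytest

LARGE_DATA_MARK = "large_data_test"


def pytest_addoption(parser):
    # Assumed flag name; tests with the mark stay disabled unless it is passed.
    parser.addoption(
        "--large_data_test",
        action="store_true",
        default=False,
        help="run tests marked with large_data_test",
    )


def pytest_configure(config):
    # Register the mark so pytest does not warn about an unknown marker.
    config.addinivalue_line(
        "markers", "large_data_test: test needs a large data set; off by default"
    )


def pytest_collection_modifyitems(config, items):
    # When the flag is given, leave every collected test untouched.
    if config.getoption("--large_data_test"):
        return
    skip_large = pytest.mark.skip(reason="needs --large_data_test to run")
    for item in items:
        # item.keywords contains the names of all marks applied to the test.
        if LARGE_DATA_MARK in item.keywords:
            item.add_marker(skip_large)
```

With a `conftest.py` like this, a test decorated with `@pytest.mark.large_data_test` is skipped in a plain `pytest` run and executes only when the flag is passed.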

Signed-off-by: Chong Gao [email protected]

@res-life
Collaborator Author

build

@res-life
Collaborator Author

This case passes, but it causes a new issue: #8826.
I suggest merging this PR and investigating the new issue separately.

@@ -398,6 +398,10 @@ properly without it. These tests assume Delta Lake is not configured and are dis
If Spark has been configured to support Delta Lake then these tests can be enabled by adding the
`--delta_lake` option to the command.

### Enabling large data tests
Collaborator

Do we want to switch this over to scale testing instead? #8811
I was just thinking that, as a follow-on, this fits much better there than it does here. If so, I am happy to file a follow-on issue to move the test there, rather than try to file an issue to enable these tests in a nightly build.

Member

Agreed. This adds the test, but I don't see any changes to CI scripts or other things that would actually run this test in practice.

Collaborator Author

Yes, it's not running yet. As discussed with Tim and Gary before, I will create another CI job to run this kind of test using:

 -m large_data_test --large_data_test

It's fine to move this to the scale tests. The follow-up issue is: #8849

Comment on lines +923 to +927
sqls = ["SELECT * FROM my_large_table",
"SELECT * FROM my_large_table WHERE c2 = 5",
"SELECT COUNT(*) FROM my_large_table WHERE c3 IS NOT NULL",
"SELECT * FROM my_large_table WHERE c4 IS NULL",
"SELECT * FROM my_large_table WHERE c5 IS NULL",
Collaborator

👍

Collaborator

@mythrocks mythrocks left a comment

LGTM. Thank you! This covers the cases I was hoping for.

Per @revans2's advice, we can move this under the scale tests as a follow-on.

@res-life res-life merged commit 55b75f4 into NVIDIA:branch-23.08 Jul 28, 2023
28 checks passed
@sameerz sameerz added the test Only impacts tests label Jul 31, 2023
Successfully merging this pull request may close these issues.

ORC reads at scale with all null values: Like OrcQuerySuite.scala#L173, but with large number of rows.