[R] Initial datasets content #159

Merged: 24 commits into apache:main on Nov 8, 2022

Conversation

thisisnic (Member)

This PR adds a chapter on working with Datasets, though it is still in draft form right now.

stephhazlitt (Contributor) commented May 18, 2022

@thisisnic following up on our chat re: the Datasets chapter. I think one content approach might be to follow the cheatsheet chunks, refactoring the current Reading & Writing Data chapter to Reading & Writing Individual Data Files, and moving the small amount of Dataset content from there into a stand-alone chapter on Reading & Writing Multi-file Datasets (which could include addressing #172, #152, #120). I then envision an additional chapter that compares and contrasts Tables and Datasets (#92). What do you think?

thisisnic (Member, Author)

@stephhazlitt Apologies, I only just saw your comment from May now that there has been some activity here, but that sounds good to me.

stephhazlitt (Contributor)

Thanks @thisisnic! After (finally) coming back to this, I decided my suggested approach fragmented the content too much. I have been working on your original PR, so stay tuned :)

thisisnic (Member, Author)

> Thanks @thisisnic! After (finally) coming back to this, I decided my suggested approach fragmented the content too much. I have been working on your original PR, so stay tuned :)

OK, that's fine too, look forward to seeing it! :D

thisisnic (Member, Author) left a comment

Thanks for updating this PR @stephhazlitt, this is looking great; just a few changes to suggest here!

r/content/datasets.Rmd (outdated)
further advantages when using Arrow, as Arrow will read in only the
partitioned files needed for any given analysis.

It's possible to read in partitioned data in Parquet, Feather (aka Arrow), and CSV (or
thisisnic (Member, Author)

I wonder if, given the discussions on the mailing list lately, we should refer to it as "Arrow (formerly known as Feather)" or similar? I'm not sure what the latest is with those discussions, though, and how it impacts us in R.

stephhazlitt (Contributor) commented Oct 20, 2022

I think this is the way to go given the thread, good catch. I think we should open a separate ticket and review + update the feather/arrow/arrow-ipc naming in the R package and the corresponding documentation.

https://arrow.apache.org/docs/r/reference/write_feather.html
https://arrow.apache.org/cookbook/r/reading-and-writing-data.html#write-an-ipcfeather-v2-file
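
For illustration, a minimal sketch of the naming overlap being discussed, assuming a recent arrow release where the IPC-named alias is available; the file path is just an example.

```r
library(arrow)

# Both calls write the same on-disk format: an Arrow IPC file,
# historically known as "Feather V2".
write_feather(airquality, "airquality.arrow")

# Recent arrow versions also expose an IPC-named alias for the same operation.
write_ipc_file(airquality, "airquality.arrow")
```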

It's possible to read in partitioned data in Parquet, Feather (aka Arrow), and CSV (or
other text-delimited) formats. If you are choosing a partitioned or multifile format, we
recommend Parquet or Feather, both of which can have improved performance
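
As an aside on that point, a minimal sketch of reading partitioned data with the Dataset API, assuming a Parquet dataset partitioned by `homeworld` already exists under a hypothetical "starwars_data" directory:

```r
library(arrow)
library(dplyr)

# Open the multi-file dataset lazily; no data files are read at this point.
ds <- open_dataset("starwars_data")  # format = "parquet" is the default

# Only the files under homeworld=Tatooine/ need to be scanned for this query.
ds %>%
  filter(homeworld == "Tatooine") %>%
  collect()
```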
thisisnic (Member, Author)

Is there something we can link to which compares these formats and helps people pick between them? If the answer is no, do we want to create a ticket somewhere to suggest that someone write something on this topic?

expect_true(file.exists("starwars_data"))
expect_length(list.files("starwars_data"), 1)

thisisnic (Member, Author)

Do we want to delete these directories we've created afterwards?

stephhazlitt (Contributor)

Done!
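
For reference, a minimal cleanup sketch along those lines, assuming the recipe writes to a "starwars_data" directory as in the excerpt above:

```r
# Remove the dataset directory created by the recipe so repeated
# renders of the chapter start from a clean state.
unlink("starwars_data", recursive = TRUE)
```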

Comment on lines 95 to 97
Note that in the example above, when there was an `NA` value in the `homeworld`
column, these values are written to the `homeworld=__HIVE_DEFAULT_PARTITION__`
directory.
thisisnic (Member, Author)

This is an excellent detail to include.
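
A minimal sketch of the behaviour described above, assuming dplyr's starwars data frame and an illustrative "starwars_data" path:

```r
library(arrow)
library(dplyr)

# Write a partitioned Parquet dataset; rows where `homeworld` is NA are
# written to the homeworld=__HIVE_DEFAULT_PARTITION__ directory.
starwars %>%
  select(name, species, homeworld) %>%
  write_dataset("starwars_data", partitioning = "homeworld")

# The special directory appears alongside the other homeworld= partitions.
grep("HIVE_DEFAULT_PARTITION", list.files("starwars_data"), value = TRUE)
```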

### Solution

```{r, write_dataset_csv}
# Need to update this example as we can't write list columns to CSV :(
```
thisisnic (Member, Author)

How about we either make a subset of the starwars dataset at the start of this chapter/section to use later, one which doesn't include the list columns, or just acknowledge the list column issue in the discussion section?

stephhazlitt (Contributor)

I refactored to stick with the airquality dataset, to be more consistent with the rest of the read/write material in the R cookbook.
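
For illustration, a minimal sketch of that direction, assuming base R's airquality data frame (which has no list columns) and an illustrative "airquality_data" path:

```r
library(arrow)

# airquality contains only atomic columns, so it can be written as a
# partitioned CSV dataset without hitting the list-column limitation.
write_dataset(airquality, "airquality_data", format = "csv", partitioning = "Month")

# One subdirectory per month, e.g. Month=5, Month=6, ...
list.files("airquality_data")
```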

r/content/reading_and_writing_data.Rmd (outdated)
stephhazlitt (Contributor)

@thisisnic Thanks for the review and great suggestions. I have incorporated them and/or opened tickets as placeholders for further work.

I was inspired by the single file API and Dataset API approach in @fmichonneau's blog post, and have tried to subtly weave in this framing by having two separate read+write chapters. The datasets.Rmd was already mostly read+write recipes, so I changed the title and pulled over some content from reading_and_writing_data.Rmd. Let me know if you think this approach is promising.

I wonder about getting what is here clean enough to merge, and then tackling improvements and adding more content in subsequent (and smaller) PRs?

thisisnic marked this pull request as ready for review on November 8, 2022 at 12:47
thisisnic (Member, Author) left a comment

Thanks @stephhazlitt for taking over the task of getting this PR moving again! I'll merge this shortly! :D

thisisnic merged commit 7df8c28 into apache:main on Nov 8, 2022