Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add sparse data design document #79

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Conversation

EmilHvitfeldt
Copy link
Member

This is the first of two planned posts about sparse data in tidymodels.

this first post talks about how sparse data is being used across the different tidymodels packages

the next post will be a worked example

Copy link

netlify bot commented Oct 2, 2024

Deploy Preview for tidymodels-org ready!

Name Link
🔨 Latest commit 651cd1c
🔍 Latest deploy log https://app.netlify.com/sites/tidymodels-org/deploys/670d668106548900089e84ef
😎 Deploy Preview https://deploy-preview-79--tidymodels-org.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@simonpcouch
Copy link
Collaborator

Copy link
Collaborator

@simonpcouch simonpcouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a first pass for grammar!

learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved
learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved
```r
library(sparsevctrs)
data_tbl <- coerce_to_sparse_tibble(data_mat)
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming there's a conversation somewhere in rsample about this, but—can we happy path this? If it's really just a matter of coercing, can we do so internally?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about it in a team meeting. The current plan was to ask people to manually convert sparse matrices to sparse tibbles, for use in {rsample} functions.

We could let {rsample} functions take sparse matrices or matrices and turn the data to tibbles, but IMO it feels like we are doing too much with {rsample} functions. But I'm also not holding on to this belief to tightly

learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved
learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved
learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved
learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved
learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved
learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved
learn/develop/sparse-data/index.html.md Show resolved Hide resolved
Copy link
Collaborator

@simonpcouch simonpcouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huzzah!🕺

```r
c(100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)
```
The sparse representation of this vector only requires 5 values. 1 value for the length (25), 2 values for the locations of the non-zero values (1, 22), and 2 values for the non-zero values (100, 1). This idea can also be extended to matrices as is done in the Matrix package.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need an additional value for the default, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left out that detail here. It is technically not needed, but something I have added specifically in {sparsevctrs} to allow for some cool interactions.

learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved
learn/develop/sparse-data/index.qmd Show resolved Hide resolved
learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved

```r
library(sparsevctrs)
data_tbl <- coerce_to_sparse_tibble(data_mat)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be a comment in sparsevctrs but a better name would be coerce_matrix_to_sparse_tibble() or to make it an S3 method (the latter seems like a better option to me).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea (about s3 method). Added issue here: r-lib/sparsevctrs#78

Copy link
Member

@hfrick hfrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read through the pair of articles and would like to suggest some restructuring motivated by the question "Who is the intended audience?".

I percieve articles on tidymodels.org to be primarily intented for tidymodels users, with the articles under the developer tools tag intentionally shifting on the user-developer spectrum.

I think all of the content in those two articles is super valuable to have, I would just reshuffle it:
What is now "Model tuning using a sparse matrix" could become "Sparse data in tidymodels" and gain the intro paragraph from the current design doc on what sparse data is, before jumping into the example. The info on what preserves sparsity and what doesn't, currently in the design doc, could be threaded into the application article as we move through resampling, preprocessing, model-fitting, and highlighting the benefits when doing it all many times while tuning. I would also enjoy it if the application was leaning a little bit more towards case study and talked a bit more about the specific problem. Is there a story to be told with this dataset?

The design notes, with details of the parsnip encodings such as allow_sparse_x, probably work better as a "for us, just public" document. Would the planning repo be a good place for that? Anything of the magnitude of "needs a wiggle around model.matrix()" should definitely be documented for future us, ideally with a few more notes. If we make the doc explicitly directed at us, those notes can assume advanced knowledge of tidymodels internals and therefore focus on the yet-to-solve problems without having to work on tidymodels context.

And lastly a question about sparsitiy in recipes: Does that mean that a sparse vector in a sparsetibble remains sparse while going through recipes if, and only if, it is untouched by the steps? I.e., if any processing happens on the sparse predictor, it is currently materialized again because the steps themselves don't know yet how to deal with sparsitiy?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants