New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

add sparse data design document #79

Open

EmilHvitfeldt wants to merge 6 commits into main from sparse-data-design

Member

EmilHvitfeldt commented Oct 2, 2024

This is the first of two planned posts about sparse data in tidymodels.

this first post talks about how sparse data is being used across the different tidymodels packages

the next post will be a worked example


          add sparse data design document

1ec4767

netlify bot commented Oct 2, 2024 •

edited

Loading

✅ Deploy Preview for tidymodels-org ready!

Name	Link
🔨 Latest commit	`651cd1c`
🔍 Latest deploy log	https://app.netlify.com/sites/tidymodels-org/deploys/670d668106548900089e84ef
😎 Deploy Preview	https://deploy-preview-79--tidymodels-org.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

EmilHvitfeldt mentioned this pull request

Sparse data modeling using a Matrix #80

Open

EmilHvitfeldt requested review from hfrick, simonpcouch and topepo

October 4, 2024 20:56

Collaborator

simonpcouch commented Oct 7, 2024

Deploy preview: https://deploy-preview-79--tidymodels-org.netlify.app/learn/develop/sparse-data/

simonpcouch reviewed

View reviewed changes

Collaborator

simonpcouch left a comment

Just a first pass for grammar!

learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved

learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved

learn/develop/sparse-data/index.qmd

+              ```r
+              library(sparsevctrs)
+              data_tbl <- coerce_to_sparse_tibble(data_mat)
+              ```

Collaborator

simonpcouch Oct 7, 2024

Assuming there's a conversation somewhere in rsample about this, but—can we happy path this? If it's really just a matter of coercing, can we do so internally?

Member Author

EmilHvitfeldt Oct 14, 2024

We talked about it in a team meeting. The current plan was to ask people to manually convert sparse matrices to sparse tibbles, for use in {rsample} functions.

We could let {rsample} functions take sparse matrices or matrices and turn the data to tibbles, but IMO it feels like we are doing too much with {rsample} functions. But I'm also not holding on to this belief to tightly

learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved

learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved

learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved

learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved

learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved

learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved

learn/develop/sparse-data/index.html.md Show resolved Hide resolved

EmilHvitfeldt and others added 3 commits

October 7, 2024 09:43


          some more spell checking

e19d997


          Apply suggestions from code review

98f583b

Co-authored-by: Simon P. Couch <[email protected]>


          rerender

acf5ee1

simonpcouch approved these changes

View reviewed changes

Collaborator

simonpcouch left a comment

Huzzah!🕺

learn/develop/sparse-data/index.qmd

+              ```r
+              c(100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)
+              ```
+              The sparse representation of this vector only requires 5 values. 1 value for the length (25), 2 values for the locations of the non-zero values (1, 22), and 2 values for the non-zero values (100, 1). This idea can also be extended to matrices as is done in the Matrix package.

Collaborator

simonpcouch Oct 7, 2024

We need an additional value for the default, right?

Member Author

EmilHvitfeldt Oct 14, 2024

I left out that detail here. It is technically not needed, but something I have added specifically in {sparsevctrs} to allow for some cool interactions.

learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved

learn/develop/sparse-data/index.qmd Show resolved Hide resolved

learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved

topepo reviewed

View reviewed changes

learn/develop/sparse-data/index.qmd Outdated Show resolved Hide resolved

topepo reviewed

View reviewed changes

learn/develop/sparse-data/index.qmd

+              ```r
+              library(sparsevctrs)
+              data_tbl <- coerce_to_sparse_tibble(data_mat)

Member

topepo Oct 14, 2024

Maybe this should be a comment in sparsevctrs but a better name would be coerce_matrix_to_sparse_tibble() or to make it an S3 method (the latter seems like a better option to me).

Member Author

EmilHvitfeldt Oct 14, 2024

good idea (about s3 method). Added issue here: r-lib/sparsevctrs#78

EmilHvitfeldt and others added 2 commits

October 14, 2024 11:13


          Apply suggestions from code review

8b4498d

Co-authored-by: Max Kuhn <[email protected]>
Co-authored-by: Simon P. Couch <[email protected]>


          rerender with more suggestions

651cd1c

hfrick reviewed

View reviewed changes

Member

hfrick left a comment

I read through the pair of articles and would like to suggest some restructuring motivated by the question "Who is the intended audience?".

I percieve articles on tidymodels.org to be primarily intented for tidymodels users, with the articles under the developer tools tag intentionally shifting on the user-developer spectrum.

I think all of the content in those two articles is super valuable to have, I would just reshuffle it:
What is now "Model tuning using a sparse matrix" could become "Sparse data in tidymodels" and gain the intro paragraph from the current design doc on what sparse data is, before jumping into the example. The info on what preserves sparsity and what doesn't, currently in the design doc, could be threaded into the application article as we move through resampling, preprocessing, model-fitting, and highlighting the benefits when doing it all many times while tuning. I would also enjoy it if the application was leaning a little bit more towards case study and talked a bit more about the specific problem. Is there a story to be told with this dataset?

The design notes, with details of the parsnip encodings such as allow_sparse_x, probably work better as a "for us, just public" document. Would the planning repo be a good place for that? Anything of the magnitude of "needs a wiggle around model.matrix()" should definitely be documented for future us, ideally with a few more notes. If we make the doc explicitly directed at us, those notes can assume advanced knowledge of tidymodels internals and therefore focus on the yet-to-solve problems without having to work on tidymodels context.

And lastly a question about sparsitiy in recipes: Does that mean that a sparse vector in a sparsetibble remains sparse while going through recipes if, and only if, it is untouched by the steps? I.e., if any processing happens on the sparse predictor, it is currently materialized again because the steps themselves don't know yet how to deal with sparsitiy?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet