Sparse data modeling using a Matrix #80

Open · wants to merge 5 commits into base: main
Changes from 3 commits
15 changes: 15 additions & 0 deletions _freeze/learn/work/sparse-matrix/index/execute-results/html.json
@@ -0,0 +1,15 @@
{
"hash": "6a463adcf3e02cbff0e48abd31c19465",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Model tuning using a sparse matrix\"\ncategories:\n - tuning\n - classification\n - sparse data\ntype: learn-subsection\nweight: 1\ndescription: | \n Fitting a model using tidymodels with a sparse matrix as the data.\ntoc: true\ntoc-depth: 2\ninclude-after-body: ../../../resources.html\n---\n\n\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: sparsevctrs and tidymodels.\n\nThis article demonstrates how we can use a sparse matrix in tidymodels.\n\n## Example data\n\nThe data we will be using in this article is a larger sample of the [small_fine_foods](https://modeldata.tidymodels.org/reference/small_fine_foods.html) data set from the [modeldata](https://modeldata.tidymodels.org) package. Data was downloaded from <https://snap.stanford.edu/data/web-FineFoods.html>, sliced down to 100,000 rows, tokenized, and saved as a sparse matrix. Data has been saved as [reviews.rds](reviews.rds) and the code to generate this data set is found at [generate-data.R](generate-data.R). This file takes up around 1MB compressed, and around 12MB once loaded into R.\n\nThis data set is encoded as a sparse matrix from the Matrix package. We are using this data for this article because if we were to turn it into a dense matrix it would take up 3GB which is a considerable size.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nreviews <- readr::read_rds(\"reviews.rds\")\nreviews |> head()\n#> 6 x 24818 sparse Matrix of class \"dgCMatrix\"\n#> [[ suppressing 34 column names 'SCORE', 'a', 'all' ... ]]\n#> \n#> 1 1 2 1 3 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 ......\n#> 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 . . . . . . ......\n#> 3 . 4 . 6 . . . . . . . . . . . 1 5 3 . . . . . . . 2 . . . . . . . . ......\n#> 4 . . . 1 . . . . . . . . 1 1 1 4 1 1 . . . . . . . . . . . . . . . . ......\n#> 5 1 4 . . . . . . . . . . . . . . 1 . . . . . . . . 1 . . . . . . . . ......\n#> 6 . 3 1 2 . . . . . . . . . . . 2 1 1 . . . . . . 4 1 . . . . . . . . ......\n#> \n#> .....suppressing 24784 columns in show(); maybe adjust options(max.print=, width=)\n#> ..............................\n```\n:::\n\n\n\n\n## Modeling\n\nWe start by loading tidymodels and the sparsevctrs package. The sparsevctrs package includes some helper functions that will allow us to more easily work with sparse matrices in tidymodels.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(sparsevctrs)\n```\n:::\n\n\n\n\nWhile sparse matrices now work parsnip, recipes, and workflows directly. If we turn it into a tibble we can use rsample's sampling functions as well. Calling `as_tibble()` would be uncomfortable as it would take up 3GB. We can however use the `coerce_to_sparse_tibble()` from the sparsevctrs package. This will create a tibble with sparse columns. 
We call that a **sparse tibble**.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nreviews_tbl <- coerce_to_sparse_tibble(reviews)\nreviews_tbl\n#> # A tibble: 15,000 × 24,818\n#> SCORE a all and appreciates be better bought canned dog finicky\n#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 1 2 1 3 1 1 2 1 1 1 1\n#> 2 0 0 0 0 0 0 0 0 0 0 0\n#> 3 0 4 0 6 0 0 0 0 0 0 0\n#> 4 0 0 0 1 0 0 0 0 0 0 0\n#> 5 1 4 0 0 0 0 0 0 0 0 0\n#> 6 0 3 1 2 0 0 0 0 0 0 0\n#> 7 1 1 0 3 0 0 0 0 0 0 0\n#> 8 1 0 0 1 0 0 0 0 0 0 0\n#> 9 1 0 0 1 0 0 0 0 0 0 0\n#> 10 1 1 0 0 0 0 0 0 0 2 0\n#> # ℹ 14,990 more rows\n#> # ℹ 24,807 more variables: food <dbl>, found <dbl>, good <dbl>, have <dbl>,\n#> # i <dbl>, is <dbl>, it <dbl>, labrador <dbl>, like <dbl>, looks <dbl>,\n#> # meat <dbl>, more <dbl>, most <dbl>, my <dbl>, of <dbl>, processed <dbl>,\n#> # product <dbl>, products <dbl>, quality <dbl>, several <dbl>, she <dbl>,\n#> # smells <dbl>, stew <dbl>, than <dbl>, the <dbl>, them <dbl>, this <dbl>,\n#> # to <dbl>, vitality <dbl>, actually <dbl>, an <dbl>, arrived <dbl>, …\n```\n:::\n\n\n\n\nDespite this tibble contains 15,000 rows and a little under 25,000 columns it only takes up marginally more space than the sparse matrix.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlobstr::obj_size(reviews)\n#> 12.75 MB\nlobstr::obj_size(reviews_tbl)\n#> 18.27 MB\n```\n:::\n\n\n\n\nThe outcome `SCORE` is currently encoded as a double, but we want it to be a factor for it to work well with tidymodels.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nreviews_tbl <- reviews_tbl |>\n mutate(SCORE = factor(SCORE, levels = c(1, 0), labels = c(\"great\", \"other\")))\n```\n:::\n\n\n\n\nSince `reviews_tbl` is now a tibble, we can use `initial_split()` as we usually do.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(1234)\n\nreview_split <- initial_split(reviews_tbl)\nreview_train <- training(review_split)\nreview_test <- testing(review_split)\n\nreview_folds <- vfold_cv(review_train)\n```\n:::\n\n\n\n\nNext, we will specify our workflow. Since we are showcasing how sparse data works in tidymodels, we will stick to a simple lasso regression model. These models tend to work well with sparse predictors. `penalty` has been set to be tuned.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nrec_spec <- recipe(SCORE ~ ., data = review_train)\n\nlm_spec <- logistic_reg(penalty = tune()) |>\n set_engine(\"glmnet\")\n\nwf_spec <- workflow(rec_spec, lm_spec)\n```\n:::\n\n\n\n\nWith everything in order, we can now fit the different models with `tune_grid()`.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntune_res <- tune_grid(wf_spec, review_folds)\n```\n:::\n\n\n\n\nThis should run in a reasonable amount of time. 
Once that is done, then we can look at the performance for different values of regularization, to make sure that the optimal value was within the range we searched.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(tune_res)\n```\n\n::: {.cell-output-display}\n![](figs/autoplot-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\n\n\nIt appears that it did, so we finalized the workflows and fit the final model on the training data set.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwf_final <- finalize_workflow(\n wf_spec, \n select_best(tune_res, metric = \"roc_auc\")\n )\n\nwf_fit <- fit(wf_final, review_train)\n```\n:::\n\n\n\n\nWith this fitted model, we can now predict with a sparse tibble.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npredict(wf_fit, review_test)\n#> # A tibble: 3,750 × 1\n#> .pred_class\n#> <fct> \n#> 1 other \n#> 2 great \n#> 3 great \n#> 4 great \n#> 5 great \n#> 6 great \n#> 7 great \n#> 8 great \n#> 9 great \n#> 10 great \n#> # ℹ 3,740 more rows\n```\n:::\n\n\n\n\n## Session information {#session-info}\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.4.0 (2024-04-24)\n#> os macOS 15.0\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/Los_Angeles\n#> date 2024-10-07\n#> pandoc 2.17.1.1 @ /opt/homebrew/bin/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.6 2024-05-17 [1] CRAN (R 4.4.0)\n#> dials * 1.3.0.9000 2024-09-23 [1] local\n#> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)\n#> ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)\n#> infer * 1.0.7 2024-03-25 [1] CRAN (R 4.4.0)\n#> parsnip * 1.2.1.9002 2024-10-02 [1] local\n#> purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)\n#> recipes * 1.1.0.9000 2024-10-04 [1] local\n#> rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)\n#> rsample * 1.2.1.9000 2024-09-18 [1] Github (tidymodels/rsample@77fc1fe)\n#> sparsevctrs * 0.1.0.9002 2024-09-30 [1] Github (r-lib/sparsevctrs@b29b723)\n#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)\n#> tidymodels * 1.2.0 2024-03-25 [1] CRAN (R 4.4.0)\n#> tune * 1.2.1 2024-04-18 [1] CRAN (R 4.4.0)\n#> workflows * 1.1.4.9000 2024-09-24 [1] local\n#> yardstick * 1.3.1 2024-03-21 [1] CRAN (R 4.4.0)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
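The frozen article above centers on `coerce_to_sparse_tibble()`. Below is a minimal, self-contained sketch of that coercion, assuming only the Matrix, sparsevctrs, and lobstr packages; since the article's `reviews.rds` is not part of this PR, a small random `dgCMatrix` stands in for it, so names and sizes are illustrative only.

```r
library(Matrix)       # provides rsparsematrix() and the dgCMatrix class
library(sparsevctrs)  # provides coerce_to_sparse_tibble()

# A small random sparse matrix standing in for the article's `reviews` object;
# column names are required so the resulting tibble has valid column names.
set.seed(1234)
m <- rsparsematrix(nrow = 100, ncol = 50, density = 0.05)
colnames(m) <- paste0("word", seq_len(ncol(m)))

# Coerce to a tibble whose columns are sparse vectors (a "sparse tibble");
# the underlying data stays sparse rather than being densified.
tbl <- coerce_to_sparse_tibble(m)

# Compare in-memory sizes; the tibble should be only marginally larger.
lobstr::obj_size(m)
lobstr::obj_size(tbl)
```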
2 changes: 2 additions & 0 deletions installs.R
@@ -29,6 +29,7 @@ packages <- c(
"kernlab",
"klaR",
"leaflet",
"lobstr",
"mda",
"mlbench",
"modeldata",
@@ -56,6 +57,7 @@ packages <- c(
"sessioninfo",
"readmission",
"skimr",
"sparsevctrs",
"spatialsample",
"stacks",
"stopwords",
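These two additions assume that installs.R feeds the `packages` vector into an installation step elsewhere in the script (not shown in this diff); a typical pattern for consuming such a vector would be:

```r
# Hypothetical consumer of the `packages` vector defined in installs.R;
# the script's actual installation logic is outside this diff.
missing <- setdiff(packages, rownames(installed.packages()))
if (length(missing) > 0) {
  install.packages(missing)
}
```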
1 change: 1 addition & 0 deletions learn/index-listing.json
@@ -24,6 +24,7 @@
"/learn/statistics/infer/index.html",
"/learn/work/bayes-opt/index.html",
"/learn/statistics/k-means/index.html",
"/learn/work/sparse-matrix/index.html",
"/learn/work/tune-svm/index.html",
"/learn/models/time-series/index.html",
"/learn/models/pls/index.html",
1 change: 1 addition & 0 deletions learn/index.html.md
@@ -18,5 +18,6 @@ listing:




After you know [what you need to get started](/start/) with tidymodels, you can learn more and go further. Find articles here to help you solve specific problems using the tidymodels framework.
