Sparse data modeling using a Matrix #80

Open · wants to merge 5 commits into base: main
Changes from 3 commits
15 changes: 15 additions & 0 deletions _freeze/learn/work/sparse-matrix/index/execute-results/html.json
@@ -0,0 +1,15 @@
{
"hash": "6a463adcf3e02cbff0e48abd31c19465",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Model tuning using a sparse matrix\"\ncategories:\n - tuning\n - classification\n - sparse data\ntype: learn-subsection\nweight: 1\ndescription: | \n Fitting a model using tidymodels with a sparse matrix as the data.\ntoc: true\ntoc-depth: 2\ninclude-after-body: ../../../resources.html\n---\n\n\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: sparsevctrs and tidymodels.\n\nThis article demonstrates how we can use a sparse matrix in tidymodels.\n\n## Example data\n\nThe data we will be using in this article is a larger sample of the [small_fine_foods](https://modeldata.tidymodels.org/reference/small_fine_foods.html) data set from the [modeldata](https://modeldata.tidymodels.org) package. Data was downloaded from <https://snap.stanford.edu/data/web-FineFoods.html>, sliced down to 100,000 rows, tokenized, and saved as a sparse matrix. Data has been saved as [reviews.rds](reviews.rds) and the code to generate this data set is found at [generate-data.R](generate-data.R). This file takes up around 1MB compressed, and around 12MB once loaded into R.\n\nThis data set is encoded as a sparse matrix from the Matrix package. We are using this data for this article because if we were to turn it into a dense matrix it would take up 3GB which is a considerable size.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nreviews <- readr::read_rds(\"reviews.rds\")\nreviews |> head()\n#> 6 x 24818 sparse Matrix of class \"dgCMatrix\"\n#> [[ suppressing 34 column names 'SCORE', 'a', 'all' ... ]]\n#> \n#> 1 1 2 1 3 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 ......\n#> 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 . . . . . . ......\n#> 3 . 4 . 6 . . . . . . . . . . . 1 5 3 . . . . . . . 2 . . . . . . . . ......\n#> 4 . . . 1 . . . . . . . . 1 1 1 4 1 1 . . . . . . . . . . . . . . . . ......\n#> 5 1 4 . . . . . . . . . . . . . . 1 . . . . . . . . 1 . . . . . . . . ......\n#> 6 . 3 1 2 . . . . . . . . . . . 2 1 1 . . . . . . 4 1 . . . . . . . . ......\n#> \n#> .....suppressing 24784 columns in show(); maybe adjust options(max.print=, width=)\n#> ..............................\n```\n:::\n\n\n\n\n## Modeling\n\nWe start by loading tidymodels and the sparsevctrs package. The sparsevctrs package includes some helper functions that will allow us to more easily work with sparse matrices in tidymodels.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(sparsevctrs)\n```\n:::\n\n\n\n\nWhile sparse matrices now work parsnip, recipes, and workflows directly. If we turn it into a tibble we can use rsample's sampling functions as well. Calling `as_tibble()` would be uncomfortable as it would take up 3GB. We can however use the `coerce_to_sparse_tibble()` from the sparsevctrs package. This will create a tibble with sparse columns. 
We call that a **sparse tibble**.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nreviews_tbl <- coerce_to_sparse_tibble(reviews)\nreviews_tbl\n#> # A tibble: 15,000 × 24,818\n#> SCORE a all and appreciates be better bought canned dog finicky\n#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 1 2 1 3 1 1 2 1 1 1 1\n#> 2 0 0 0 0 0 0 0 0 0 0 0\n#> 3 0 4 0 6 0 0 0 0 0 0 0\n#> 4 0 0 0 1 0 0 0 0 0 0 0\n#> 5 1 4 0 0 0 0 0 0 0 0 0\n#> 6 0 3 1 2 0 0 0 0 0 0 0\n#> 7 1 1 0 3 0 0 0 0 0 0 0\n#> 8 1 0 0 1 0 0 0 0 0 0 0\n#> 9 1 0 0 1 0 0 0 0 0 0 0\n#> 10 1 1 0 0 0 0 0 0 0 2 0\n#> # ℹ 14,990 more rows\n#> # ℹ 24,807 more variables: food <dbl>, found <dbl>, good <dbl>, have <dbl>,\n#> # i <dbl>, is <dbl>, it <dbl>, labrador <dbl>, like <dbl>, looks <dbl>,\n#> # meat <dbl>, more <dbl>, most <dbl>, my <dbl>, of <dbl>, processed <dbl>,\n#> # product <dbl>, products <dbl>, quality <dbl>, several <dbl>, she <dbl>,\n#> # smells <dbl>, stew <dbl>, than <dbl>, the <dbl>, them <dbl>, this <dbl>,\n#> # to <dbl>, vitality <dbl>, actually <dbl>, an <dbl>, arrived <dbl>, …\n```\n:::\n\n\n\n\nDespite this tibble contains 15,000 rows and a little under 25,000 columns it only takes up marginally more space than the sparse matrix.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlobstr::obj_size(reviews)\n#> 12.75 MB\nlobstr::obj_size(reviews_tbl)\n#> 18.27 MB\n```\n:::\n\n\n\n\nThe outcome `SCORE` is currently encoded as a double, but we want it to be a factor for it to work well with tidymodels.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nreviews_tbl <- reviews_tbl |>\n mutate(SCORE = factor(SCORE, levels = c(1, 0), labels = c(\"great\", \"other\")))\n```\n:::\n\n\n\n\nSince `reviews_tbl` is now a tibble, we can use `initial_split()` as we usually do.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(1234)\n\nreview_split <- initial_split(reviews_tbl)\nreview_train <- training(review_split)\nreview_test <- testing(review_split)\n\nreview_folds <- vfold_cv(review_train)\n```\n:::\n\n\n\n\nNext, we will specify our workflow. Since we are showcasing how sparse data works in tidymodels, we will stick to a simple lasso regression model. These models tend to work well with sparse predictors. `penalty` has been set to be tuned.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nrec_spec <- recipe(SCORE ~ ., data = review_train)\n\nlm_spec <- logistic_reg(penalty = tune()) |>\n set_engine(\"glmnet\")\n\nwf_spec <- workflow(rec_spec, lm_spec)\n```\n:::\n\n\n\n\nWith everything in order, we can now fit the different models with `tune_grid()`.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntune_res <- tune_grid(wf_spec, review_folds)\n```\n:::\n\n\n\n\nThis should run in a reasonable amount of time. 
Once that is done, then we can look at the performance for different values of regularization, to make sure that the optimal value was within the range we searched.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(tune_res)\n```\n\n::: {.cell-output-display}\n![](figs/autoplot-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\n\n\nIt appears that it did, so we finalized the workflows and fit the final model on the training data set.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwf_final <- finalize_workflow(\n wf_spec, \n select_best(tune_res, metric = \"roc_auc\")\n )\n\nwf_fit <- fit(wf_final, review_train)\n```\n:::\n\n\n\n\nWith this fitted model, we can now predict with a sparse tibble.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npredict(wf_fit, review_test)\n#> # A tibble: 3,750 × 1\n#> .pred_class\n#> <fct> \n#> 1 other \n#> 2 great \n#> 3 great \n#> 4 great \n#> 5 great \n#> 6 great \n#> 7 great \n#> 8 great \n#> 9 great \n#> 10 great \n#> # ℹ 3,740 more rows\n```\n:::\n\n\n\n\n## Session information {#session-info}\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.4.0 (2024-04-24)\n#> os macOS 15.0\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/Los_Angeles\n#> date 2024-10-07\n#> pandoc 2.17.1.1 @ /opt/homebrew/bin/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.6 2024-05-17 [1] CRAN (R 4.4.0)\n#> dials * 1.3.0.9000 2024-09-23 [1] local\n#> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)\n#> ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)\n#> infer * 1.0.7 2024-03-25 [1] CRAN (R 4.4.0)\n#> parsnip * 1.2.1.9002 2024-10-02 [1] local\n#> purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)\n#> recipes * 1.1.0.9000 2024-10-04 [1] local\n#> rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)\n#> rsample * 1.2.1.9000 2024-09-18 [1] Github (tidymodels/rsample@77fc1fe)\n#> sparsevctrs * 0.1.0.9002 2024-09-30 [1] Github (r-lib/sparsevctrs@b29b723)\n#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)\n#> tidymodels * 1.2.0 2024-03-25 [1] CRAN (R 4.4.0)\n#> tune * 1.2.1 2024-04-18 [1] CRAN (R 4.4.0)\n#> workflows * 1.1.4.9000 2024-09-24 [1] local\n#> yardstick * 1.3.1 2024-03-21 [1] CRAN (R 4.4.0)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
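The frozen article above centers on `coerce_to_sparse_tibble()`. Below is a minimal, self-contained sketch of that coercion, assuming only the Matrix, sparsevctrs, and lobstr packages; since the article's `reviews.rds` is not part of this PR, a small random `dgCMatrix` stands in for it, so names and sizes are illustrative only.

```r
library(Matrix)       # provides rsparsematrix() and the dgCMatrix class
library(sparsevctrs)  # provides coerce_to_sparse_tibble()

# A small random sparse matrix standing in for the article's `reviews` object;
# column names are required so the resulting tibble has valid column names.
set.seed(1234)
m <- rsparsematrix(nrow = 100, ncol = 50, density = 0.05)
colnames(m) <- paste0("word", seq_len(ncol(m)))

# Coerce to a tibble whose columns are sparse vectors (a "sparse tibble");
# the underlying data stays sparse rather than being densified.
tbl <- coerce_to_sparse_tibble(m)

# Compare in-memory sizes; the tibble should be only marginally larger.
lobstr::obj_size(m)
lobstr::obj_size(tbl)
```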
2 changes: 2 additions & 0 deletions installs.R
@@ -29,6 +29,7 @@ packages <- c(
"kernlab",
"klaR",
"leaflet",
"lobstr",
"mda",
"mlbench",
"modeldata",
@@ -56,6 +57,7 @@ packages <- c(
"sessioninfo",
"readmission",
"skimr",
"sparsevctrs",
"spatialsample",
"stacks",
"stopwords",
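These two additions assume that installs.R feeds the `packages` vector into an installation step elsewhere in the script (not shown in this diff); a typical pattern for consuming such a vector would be:

```r
# Hypothetical consumer of the `packages` vector defined in installs.R;
# the script's actual installation logic is outside this diff.
missing <- setdiff(packages, rownames(installed.packages()))
if (length(missing) > 0) {
  install.packages(missing)
}
```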
1 change: 1 addition & 0 deletions learn/index-listing.json
@@ -24,6 +24,7 @@
"/learn/statistics/infer/index.html",
"/learn/work/bayes-opt/index.html",
"/learn/statistics/k-means/index.html",
"/learn/work/sparse-matrix/index.html",
"/learn/work/tune-svm/index.html",
"/learn/models/time-series/index.html",
"/learn/models/pls/index.html",
1 change: 1 addition & 0 deletions learn/index.html.md
@@ -18,5 +18,6 @@ listing:




After you know [what you need to get started](/start/) with tidymodels, you can learn more and go further. Find articles here to help you solve specific problems using the tidymodels framework.
