From 1761261e432ec923f8750e929d986d398bb60d31 Mon Sep 17 00:00:00 2001
From: Daniel Rizk <124117406+drizk1@users.noreply.github.com>
Date: Sat, 7 Sep 2024 06:48:34 -0400
Subject: [PATCH] Add TidierData to frameworks docs page (#3447)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* add tidierdata to frameworks

* adds TidierData to docs toml

* change from begin end block

* add @kdpsingh edits

* Apply suggestions from code review

---------

Co-authored-by: Bogumił Kamiński
---
 docs/Project.toml                   |   1 +
 docs/src/man/querying_frameworks.md | 139 ++++++++++++++++++++++++++++
 2 files changed, 140 insertions(+)

diff --git a/docs/Project.toml b/docs/Project.toml
index f6a9f940e..d821a4f08 100755
--- a/docs/Project.toml
+++ b/docs/Project.toml
@@ -9,6 +9,7 @@ Missings = "e1d29d7a-bbdc-5cf2-9ac0-f12de2c33e28"
 Query = "1a8c2f83-1ff3-5112-b086-8aa67b057ba1"
 Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
 Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
+TidierData = "fe2206b3-d496-4ee9-a338-6a095c4ece80"
 
 [compat]
 Documenter = "1"
diff --git a/docs/src/man/querying_frameworks.md b/docs/src/man/querying_frameworks.md
index abda7ec6f..dad7471b2 100644
--- a/docs/src/man/querying_frameworks.md
+++ b/docs/src/man/querying_frameworks.md
@@ -8,6 +8,145 @@ DataFramesMeta.jl, DataFrameMacros.jl and Query.jl. They implement a functionali
 These frameworks are designed both to make it easier for new users to start working with data frames in Julia
 and to allow advanced users to write more compact code.
 
+## TidierData.jl
+[TidierData.jl](https://tidierorg.github.io/TidierData.jl/latest/), part of
+the [Tidier](https://tidierorg.github.io/Tidier.jl/dev/) ecosystem, is a macro-based
+data analysis interface that wraps DataFrames.jl. The instructions below are for version
+0.16.0 of TidierData.jl.
+
+First, install the TidierData.jl package:
+
+```julia
+using Pkg
+Pkg.add("TidierData")
+```
+
+TidierData.jl enables clean, readable, and fast code for all major data transformation
+functions including
+[aggregating](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/summarize/),
+[pivoting](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/pivots/),
+[nesting](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/nesting/),
+and [joining](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/joins/)
+data frames. TidierData re-exports `DataFrame` from DataFrames.jl, `@chain` from Chain.jl, and
+Statistics.jl to streamline data operations.
+
+TidierData.jl is heavily inspired by the `dplyr` and `tidyr` R packages (part of the R
+`tidyverse`), which it aims to implement using pure Julia by wrapping DataFrames.jl. While
+TidierData.jl borrows conventions from the `tidyverse`, it is important to note that the
+`tidyverse` itself is often not considered idiomatic R code. TidierData.jl brings
+data analysis conventions from `tidyverse` into Julia to have the best of both worlds:
+tidy syntax and the speed and flexibility of the Julia language.
+
+TidierData.jl has two major differences from other macro-based packages. First, TidierData.jl
+uses tidy expressions. An example of a tidy expression is `a = mean(b)`, where `b` refers
+to an existing column in the data frame, and `a` refers to either a new or existing column.
+Referring to variables outside of the data frame requires prefixing variables with `!!`.
+For example, `a = mean(!!b)` refers to a variable `b` outside the data frame. Second,
+TidierData.jl aims to make broadcasting mostly invisible through
+[auto-vectorization](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/autovec/).
+TidierData.jl currently uses a lookup table to decide which functions not to
+vectorize; all other functions are automatically vectorized. This allows
+concise expressions: `@mutate(df, a = a - mean(a))` transforms the `a` column
+by subtracting the column mean from each value. Behind the scenes, the right-hand
+expression is converted to `a .- mean(a)` because `mean()` is in the lookup table as a
+function that should not be vectorized. Take a look at the
+[auto-vectorization](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/autovec/) documentation for details.
+
+One major benefit of combining tidy expressions with auto-vectorization is that
+TidierData.jl code (which uses DataFrames.jl as its backend) can work directly on
+databases using [TidierDB.jl](https://github.com/TidierOrg/TidierDB.jl),
+which converts tidy expressions into SQL, supporting DuckDB and several other backends.
+
+```jldoctest tidierdata
+julia> using TidierData
+
+julia> df = DataFrame(
+           name = ["John", "Sally", "Roger"],
+           age = [54.0, 34.0, 79.0],
+           children = [0, 2, 4]
+       )
+3×3 DataFrame
+ Row │ name    age      children
+     │ String  Float64  Int64
+─────┼───────────────────────────
+   1 │ John       54.0         0
+   2 │ Sally      34.0         2
+   3 │ Roger      79.0         4
+
+julia> @chain df begin
+           @filter(children != 2)
+           @select(name, num_children = children)
+       end
+2×2 DataFrame
+ Row │ name    num_children
+     │ String  Int64
+─────┼──────────────────────
+   1 │ John               0
+   2 │ Roger              4
+```
+
+Below are examples showcasing `@group_by` with `@summarize` or `@mutate`, analogous to the split-apply-combine pattern.
+
+```jldoctest tidierdata
+julia> df = DataFrame(
+           groups = repeat('a':'e', inner = 2),
+           b_col = 1:10,
+           c_col = 11:20,
+           d_col = 111:120
+       )
+10×4 DataFrame
+ Row │ groups  b_col  c_col  d_col
+     │ Char    Int64  Int64  Int64
+─────┼─────────────────────────────
+   1 │ a           1     11    111
+   2 │ a           2     12    112
+   3 │ b           3     13    113
+   4 │ b           4     14    114
+   5 │ c           5     15    115
+   6 │ c           6     16    116
+   7 │ d           7     17    117
+   8 │ d           8     18    118
+   9 │ e           9     19    119
+  10 │ e          10     20    120
+
+julia> @chain df begin
+           @filter(b_col > 2)
+           @group_by(groups)
+           @summarise(median_b = median(b_col),
+                      across((b_col:d_col), mean))
+       end
+4×5 DataFrame
+ Row │ groups  median_b  b_col_mean  c_col_mean  d_col_mean
+     │ Char    Float64   Float64     Float64     Float64
+─────┼──────────────────────────────────────────────────────
+   1 │ b            3.5         3.5        13.5       113.5
+   2 │ c            5.5         5.5        15.5       115.5
+   3 │ d            7.5         7.5        17.5       117.5
+   4 │ e            9.5         9.5        19.5       119.5
+
+julia> @chain df begin
+           @filter(b_col > 4 && c_col <= 18)
+           @group_by(groups)
+           @mutate(
+               new_col = b_col + maximum(d_col),
+               new_col2 = c_col - maximum(d_col),
+               new_col3 = case_when(c_col >= 18 => "high",
+                                    c_col > 15 => "medium",
+                                    true => "low"))
+           @select(starts_with("new"))
+           @ungroup # required because `@mutate` does not ungroup
+       end
+4×4 DataFrame
+ Row │ groups  new_col  new_col2  new_col3
+     │ Char    Int64    Int64     String
+─────┼─────────────────────────────────────
+   1 │ c           121      -101  low
+   2 │ c           122      -100  medium
+   3 │ d           125      -101  medium
+   4 │ d           126      -100  high
+```
+
+For more examples, please visit the [TidierData.jl](https://tidierorg.github.io/TidierData.jl/latest/) documentation.
+
 ## DataFramesMeta.jl
 
 The [DataFramesMeta.jl](https://github.com/JuliaStats/DataFramesMeta.jl) package