Pivot post
gregates committed Jul 31, 2024
1 parent 9cd26cb commit 0807cb9
Showing 8 changed files with 168 additions and 0 deletions.
168 changes: 168 additions & 0 deletions content/pivot-shorter-fatter-groupby.md
@@ -0,0 +1,168 @@
+++
title = "Pivot is just a shorter, fatter group by"
date = 2024-07-31

[extra.og]
image = "baby_names_pivot_config.webp"
+++

I first heard the term "pivot table" in, I think, 2012. At the time, I had no idea what a pivot
table was. I made some desultory efforts to understand what they are and how one uses them, but I
found Excel's interface inscrutable. Excel's documentation and Google were no help, either. Even
today all you get is a bunch of "pivot tables are useful for aggregating data to make decisions" and
suggestions to play around with them in Excel until you figure it out. I quickly decided it wasn't
worth understanding.

Eleven years later I took up the task of implementing pivot tables for [Row
Zero](https://rowzero.io). It was only then that I was finally forced to understand them. Here, I
take up the task of attempting to explain pivot tables in a way that would have been helpful to me a
decade ago.

The headline: pivot is just a shorter, fatter group by. But what's a "group by", and what would it
mean for such a thing to be fat or short? The rest of this article will attempt to clarify by first
discussing two other operations that pivot resembles: uniques and group by. If we understand those,
it's only a short distance from there to the place where the title of this article will make sense.

All of these operations are operations on tabular data. That means data which is arranged in
**columns** and **rows**. And these operations also output a table. Table in, different table out.
In Excel, which is where pivot tables come from, a table can be represented as a range of data, like
A1:D2000. But the operations aren't specific to spreadsheets and can be done on any data that's
arranged as a table.

I'm going to be using the Social Security Administration's baby name dataset for my examples. This
data is a table with 4 columns and 2,116,840 rows. If you want to see the data yourself, you can
make a copy of my [Row Zero workbook](https://rowzero.io/workbook/644C5FB13CA060884AC9172C/3?copy)
by clicking that link. Or you can just refer to the screenshot below to see a sample of the data.

![several rows of data in a table with columns Year, Name, Sex, and Count](/baby_names_table.png)

Let's call Year, Name, Sex, and Count "fields" of our data (because we're going to use the word
"column" for something else).

## Column Uniques
A common operation on a column of data is to remove duplicates to produce a shorter column of values
that are all unique, with no repetition. For example, in our baby name data, the Sex field only has
two values: it's always either M or F. If we were to compute the unique values for that field, we'd
get just two rows.

You can also compute uniques for multiple fields of a table. This produces one row per unique
**combination** of values from the different fields. For example, uniques for Year & Sex in our
baby names table would produce two rows per year, like so:

![a table with two columns, Year and Sex, where each year has two rows, one with "M" in Sex and one
with "F"](/baby_names_unique_year_sex.png)

Each row in the output table corresponds to one or more rows in the input table. This operation only
gives you the actually occurring combinations, not all possible combinations of values from the
fields. Say, for example, you added a row to the table that was `1880,Edna,U,5`. Since only this
row has the value U for Sex, you'd get a row for `1880,U` in the output table, but not one for
`1980,U`.

In pseudo-code, we can imagine that we're calling a method on the table and specifying which fields
we want to compute uniques for. Something like: `table.uniques(["Year", "Sex"])`.

Computing uniques for Year, Name, Sex, and Count in our baby names data would just give us back the same
table, since there are no duplicate rows in the input table.
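
As a rough sketch of the operation in Python (the `uniques` function and the sample rows are hypothetical, and the counts are made up rather than real SSA figures), computing uniques amounts to remembering each combination of values the first time it appears:

```python
# Tiny made-up sample in the shape of the baby names table.
# The counts are illustrative, not real SSA figures.
data = [
    {"Year": 1880, "Name": "Mary", "Sex": "F", "Count": 10},
    {"Year": 1880, "Name": "Anna", "Sex": "F", "Count": 4},
    {"Year": 1880, "Name": "John", "Sex": "M", "Count": 9},
    {"Year": 1880, "Name": "Edna", "Sex": "U", "Count": 5},
    {"Year": 1980, "Name": "Mary", "Sex": "F", "Count": 3},
    {"Year": 1980, "Name": "John", "Sex": "M", "Count": 7},
]


def uniques(table, fields):
    """Return one tuple per combination of `fields` that occurs, in first-seen order."""
    seen = {}  # dicts preserve insertion order, so this doubles as an ordered set
    for row in table:
        key = tuple(row[f] for f in fields)
        seen.setdefault(key, None)
    return list(seen)


year_sex = uniques(data, ["Year", "Sex"])
print(year_sex)
# [(1880, 'F'), (1880, 'M'), (1880, 'U'), (1980, 'F'), (1980, 'M')]
```

Note that `(1880, 'U')` shows up but `(1980, 'U')` does not, because only combinations that occur in the input are produced. Asking for all four fields returns as many rows as the input, since this sample has no duplicate rows.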

## Group By

Our column uniques operation on our table required specifying a list of fields to compute uniques
for. Now imagine that we extend the operation so that, in addition to these parameters, you have to
tell us what to do with the **other** fields in the table.

Let's define some options. One thing you could do is **omit** a field from the output table. That's
implicitly what we did in the previous section. Our pseudo-code function call would now look
like `table.uniques(["Year", "Sex"], { "Name": omit, "Count": omit })`.

But we could imagine other ways of including a column. Remember that each row in our output table
corresponds to one or more rows of the input table. So we could say something like, "Add up the
value of the Count field for all corresponding rows of the input table". This is a "group by" operation. We call it that because we're aggregating the values of the Count
column *by* Year and Sex. It's a way of summarizing the data in our input table.

For example, if we wanted to sum the counts for each name by year and sex, we would do `table.groupby(["Year", "Sex"], { "Name": omit, "Count": sum })`. This will tell us how many babies of each sex there are for each year. The output would look like:

![a table with three columns, Year, Sex, and Sum of Count, where each year has two rows, one with "M" in Sex and one
with "F"](/baby_names_sum_of_count_by_year_and_sex.png)

This table has the same number of rows as the example in the previous section, it just adds an
additional column summarizing the Count field for each row.

You can see how we could easily extend this with different aggregation options on the fields:
average, count, median, etc. in addition to sum. You could also allow specifying the same field
multiple times with *different* aggregations, if you wanted both Sum of Count and Average Count in
the output table, for example.
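
Here's a minimal Python sketch of a group by with more than one aggregation on the same field. The `groupby` function, its `(field, function)` argument shape, and the sample rows are all assumptions made for illustration (the counts are not real SSA figures). Any field that isn't mentioned is simply omitted.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample rows in the shape of the baby names table.
data = [
    {"Year": 1880, "Name": "Mary", "Sex": "F", "Count": 10},
    {"Year": 1880, "Name": "Anna", "Sex": "F", "Count": 4},
    {"Year": 1880, "Name": "John", "Sex": "M", "Count": 9},
    {"Year": 1980, "Name": "Mary", "Sex": "F", "Count": 3},
    {"Year": 1980, "Name": "John", "Sex": "M", "Count": 7},
]


def groupby(table, by, aggregations):
    """Group `table` by the `by` fields and aggregate other fields.

    `aggregations` maps an output column name to a (field, function) pair.
    """
    # Collect the input rows that correspond to each unique combination.
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[f] for f in by)].append(row)

    # Emit one output row per combination, with one column per aggregation.
    out = []
    for key, members in groups.items():
        result = dict(zip(by, key))
        for out_name, (field, fn) in aggregations.items():
            result[out_name] = fn([m[field] for m in members])
        out.append(result)
    return out


summary = groupby(
    data,
    ["Year", "Sex"],
    {"Sum of Count": ("Count", sum), "Average Count": ("Count", mean)},
)
for row in summary:
    print(row)
```

Calling it with an empty `aggregations` dict gives back just the unique Year and Sex combinations, which is the sense in which uniques is a special case of group by.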

## Pivot

So now we're in a position to define the pivot operation. Just as column uniques can be thought of
as a special case of the groupby operation, where all other columns are omitted, groupby can be
thought of as a special case of the pivot operation.

A pivot differs from a groupby just in that each of the fields we're grouping by must be put into
one of two buckets: "rows", or "columns". The special case that's equivalent to groupby is when
they're all "rows". So we could rewrite our groupby in pseudo-code as `table.pivot(rows: ["Year",
"Sex"], columns: [], { "Name": omit, "Count": sum })`.

The obvious question is what happens if we move one of our rows to columns? Let's look at an example
where we move "Sex" to columns. Here's what you get:

![a table with three columns, Year, F, and M, where each year has a single row, and the sum of count
for each sex has been put in the corresponding
column](/baby_names_sum_of_count_by_year_and_sex_pivoted.png)

We no longer have a Sex column on our output table. Instead, we have one column per unique value
of the Sex field. We also no longer have a Sum of Count column. Instead, the "M" and "F" columns are
implicitly showing the Sum of Count for the corresponding Sex, for each row. The table also has
fewer rows — in fact it's precisely half the size, because we now have just one row per year
instead of two.

The thing to see is that the data in the output table hasn't changed. It's just been
rearranged. The Sum of Count for `1955,F` is still there, it's just now the value of the F column
for the row 1955. And the `1955,M` Sum of Count is also still there, also on the 1955 row (in the M
column, naturally).

In this particular case, our pivot table is the same width as the groupby from the last section, but
that's only because the Sex field has just two unique values — if it had more, our pivot
output would be wider. And it's shorter, because the output table only needs one row per year
instead of one row per year/sex combo. So this pivot is just a shorter, fatter version of the
groupby we already did.
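
To see the rearrangement concretely, here's a small Python sketch that takes the output of a group by and pivots one field into columns. The `pivot` function and the sums are hypothetical, chosen only to show the shape change.

```python
# Output of a group by on Year and Sex (made-up sums, not real SSA figures).
grouped = [
    {"Year": 1955, "Sex": "F", "Sum of Count": 100},
    {"Year": 1955, "Sex": "M", "Sum of Count": 110},
    {"Year": 1956, "Sex": "F", "Sum of Count": 120},
    {"Year": 1956, "Sex": "M", "Sum of Count": 130},
]


def pivot(grouped_rows, row_field, column_field, value_field):
    """Turn each value of `column_field` into a column header.

    Produces one output row per distinct value of `row_field`.
    """
    out = {}
    for r in grouped_rows:
        # Find (or start) the single output row for this row_field value.
        out_row = out.setdefault(r[row_field], {row_field: r[row_field]})
        # The aggregated value moves under the header named after the column value.
        out_row[r[column_field]] = r[value_field]
    return list(out.values())


pivoted = pivot(grouped, "Year", "Sex", "Sum of Count")
print(pivoted)
# [{'Year': 1955, 'F': 100, 'M': 110}, {'Year': 1956, 'F': 120, 'M': 130}]
```

Four rows went in and two came out, but every number from the group by is still present. This sketch assumes every year has a value for every sex; a combination that never occurs would simply be missing from that row.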

It gets a little more complicated when you have more than one aggregation, or more than one field in
columns, but the basic idea is the same. We just do a groupby, and then rearrange the table to make
it shorter and fatter. No new information is contained in the pivot. It's just presented differently
and is (maybe) easier to read.
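
One way to picture the more general case is the sketch below, which accepts a list of fields for rows and a list of fields for columns. It's an illustration under my own assumptions, not how any particular spreadsheet implements pivot: the function name, the use of tuples as column headers, and the `None` fill for combinations that never occur are all choices made for the example, and the data is made up.

```python
from collections import defaultdict

# Made-up sample rows in the shape of the baby names table.
data = [
    {"Year": 1880, "Name": "Mary", "Sex": "F", "Count": 10},
    {"Year": 1880, "Name": "John", "Sex": "M", "Count": 9},
    {"Year": 1980, "Name": "Mary", "Sex": "F", "Count": 3},
    {"Year": 1980, "Name": "John", "Sex": "M", "Count": 7},
    {"Year": 1980, "Name": "John", "Sex": "F", "Count": 1},
]


def pivot(table, rows, columns, field, fn):
    """Group by rows + columns, then spread the column keys out sideways."""
    # Step 1: an ordinary group by on every field in rows and columns.
    groups = defaultdict(list)
    for r in table:
        row_key = tuple(r[f] for f in rows)
        col_key = tuple(r[f] for f in columns)
        groups[(row_key, col_key)].append(r[field])

    # Step 2: rearrange. Each distinct column key becomes a column header,
    # and each distinct row key becomes exactly one output row.
    col_keys = sorted({ck for _, ck in groups})
    out = {}
    for (row_key, col_key), values in groups.items():
        out_row = out.setdefault(row_key, {ck: None for ck in col_keys})
        out_row[col_key] = fn(values)
    return col_keys, out


col_keys, wide = pivot(data, ["Year"], ["Sex", "Name"], "Count", sum)
print(col_keys)        # [('F', 'John'), ('F', 'Mary'), ('M', 'John')]
print(wide[(1880,)])   # {('F', 'John'): None, ('F', 'Mary'): 10, ('M', 'John'): 9}
```

Passing an empty list for `columns` leaves a single, header-less value column, which is exactly the group by case: `pivot(data, ["Year", "Sex"], [], "Count", sum)` yields one row per Year and Sex combination.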

So if you understand groupby, you also understand pivot. In fact, all of the screenshots in this
article are screenshots of pivot tables. Since groupby is just a special case of pivot, and uniques
is just a special case of groupby, all of these operations can be expressed as pivots. It's the same fundamental summarization of
the input data. For me, that was the crucial insight that finally made pivot tables understandable.

## Why Pivot?

When I've tried this explanation out on people in the past, a common response has been, "What's the
point of pivot, then?" If it's the same summary as groupby, why pivot?

One obvious answer is that you just prefer the table layout that the pivot produces, or find it more
legible. A less obvious answer is that it's *very* useful for constructing graphs. If you select the
whole pivot table from the last section and click the "Insert chart" button in [Row
Zero](https://rowzero.io), you instantly get a chart showing count of baby boys vs. baby girls over
time.

![a line chart showing the change over time of the number of baby boys and baby girls in the SSA
data](/baby_names_pivot_chart.png)

Each column in the pivot becomes its own graphable series of data. Pivot tables are especially
powerful when combined with charts in a spreadsheet.
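
The reason the pivoted layout suits charting can be sketched in a few lines of Python. This is not how Row Zero builds its charts; `to_series` is a hypothetical helper and the numbers are made up. It just shows that a pivoted table splits cleanly into one x axis plus one list of y values per column.

```python
# A pivoted table with one row per year (made-up numbers, not real SSA figures).
pivoted = [
    {"Year": 1955, "F": 100, "M": 110},
    {"Year": 1956, "F": 120, "M": 130},
    {"Year": 1957, "F": 125, "M": 128},
]


def to_series(pivot_rows, x_field):
    """Split a pivoted table into x values plus one y series per other column."""
    x = [r[x_field] for r in pivot_rows]
    names = [k for k in pivot_rows[0] if k != x_field]
    series = {name: [r[name] for r in pivot_rows] for name in names}
    return x, series


years, series = to_series(pivoted, "Year")
print(years)   # [1955, 1956, 1957]
print(series)  # {'F': [100, 120, 125], 'M': [110, 130, 128]}
```

With the un-pivoted group by output, the F and M values would be interleaved in a single Sum of Count column and would have to be separated first.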

## Conclusion
When you're staring at the inscrutable pivot table configuration pane in Excel, or the hopefully
slightly less inscrutable version in [Row Zero](https://rowzero.io), remember that the "Rows"
and "Columns" sections are just different ways of specifying which fields you want to group by. And
"Values" is where you specify your aggregations for the other fields. ("Filters" just let you
filter the input data *before* applying the pivot transformation.) For me, the simplest way to
construct a pivot is to imagine the groupby I want to perform, and construct that first, using
"Rows" only and no "Columns". Then I maybe move a Row or two to Columns, to get the final shape I want.

And that's pivot explained as simply as I know how.

Binary file added static/baby_names_pivot_chart.webp
Binary file not shown.
Binary file added static/baby_names_pivot_config.webp
Binary file not shown.
Binary file added static/baby_names_sum_of_count_by_year.webp
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file added static/baby_names_table.webp
Binary file not shown.
Binary file added static/baby_names_unique_year_sex.webp
Binary file not shown.
