Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add proprow and rownumber #2556

Closed
wants to merge 6 commits into from
Closed

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Nov 21, 2020

This should be merged after #2554 is merged and then NEWS.md should be updated.

Other than that the PR should be ready.

@bkamins bkamins requested a review from pdeffebach November 21, 2020 12:59
@bkamins bkamins added this to the 1.0 milestone Nov 21, 2020
@bkamins
Copy link
Member Author

bkamins commented Nov 21, 2020

@nalimilan - as usual please comment on the naming style 😄.

@bkamins bkamins changed the title add proprow add proprow and rownumber Nov 22, 2020
@bkamins
Copy link
Member Author

bkamins commented Nov 22, 2020

I came to the conclusion that instead of adding RowNumber selector as @pdeffebach suggested earlier it is betted to define the rownumber function for AbstractDataFrame that is just axes(df, 1), and then add a special syntax to it for combine et. al.

@pdeffebach - would this meet your use cases? (it is consistent with SQL row_number)

@bkamins
Copy link
Member Author

bkamins commented Nov 22, 2020

Also the good thing (I hope you agree) is that we have rownumber already defined for DataFrameRow in a consistent way.

@bkamins
Copy link
Member Author

bkamins commented Nov 22, 2020

As suggested by @nalimilan instead of proprow we could introduce: prop ∘ nrow with exactly the same syntax instead.

Then we could allow :src => prop ∘ fun => :dst where fun would allow any aggregation function fun.

For this to work in DataAPI.jl we probably should introduce prop(vec::AbstactVector) = vec ./ sum(vec).

The only problem is that FreqTables.jl defines function prop(tbl::AbstractArray{<:Number,N}; margins=nothing) where N which is conflicting.

@nalimilan - can you comment here please what you think before I post it on #data in Slack?

EDIT: we could move the whole definition of function prop(tbl::AbstractArray{<:Number,N}; margins=nothing) where N to DataAPI.jl maybe?

EDIT2: actually in DataFrames.jl we will not use the default implementation of prop anyway so maybe it is OK to just have function prop end defined there.

@nalimilan
Copy link
Member

Yes, the definition of prop in FreqTables is quite simple so I guess it could live in DataAPI. It could even make sense to have it in Statistics but I'm not sure it would be accepted.

@pdeffebach
Copy link
Contributor

Then we could allow :src => prop ∘ fun => :dst where fun would allow any aggregation function fun.

For this to work in DataAPI.jl we probably should introduce prop(vec::AbstactVector) = vec ./ sum(vec).

I don't think this proposal has a very clear mental model. The reason to special case nrow is that it doesn't follow the normal rules of combine, right? Normally, in :x => f => :y, f only has access to the subarray for of :x corresponding to a given group.

with prop ∘ nrow it's not clear what the order of operations is. It seems like you are saying we do combine and then take the whole vector, not per group, and apply prop to it. That isn't consistent with other operations in combine.

rownumber as a function is good and It makes sense to add.

@bkamins
Copy link
Member Author

bkamins commented Nov 22, 2020

That is why prop ∘ nrow was just an afterthought and rownumber was an initial idea.

with prop ∘ nrow it's not clear what the order of operations is.

This is a good point as in general :x => prop ∘ sum => :y is a well defined operation now so changing its meaning would be breaking. We could go for e.g. Prop(sum) (with Prop being a special type) though. The point that @nalimilan raised is that people often do:

julia> df = DataFrame(id=rand('a':'d', 1000), val=rand(1000));

julia> select(combine(groupby(df, :id), :val => sum => :sum), :id, :sum => (x -> x ./ sum(x)) => :prop)
4×2 DataFrame
 Row │ id    prop
     │ Char  Float64
─────┼────────────────
   1 │ b     0.243178
   2 │ d     0.246003
   3 │ a     0.251521
   4 │ c     0.259298

and it requires two steps (not the end of the world but still).

Actually a crazy idea would be to allow:

:src => by_group_processing_function => whole_result_posprocessing => :dst

then one would write:

:src => sum => prop => :prop

@nalimilan
Copy link
Member

Ah, right, prop ∘ sum would be an abuse as there's no prop(sum(x)) with that meaning. So either Prop(sum) or :src => sum => prop => :prop would be more appropriate. If we wanted to, the first could be made as general as the second one using e.g. :src => Postprocess(sum, prop) => :prop.

I guess the choice between these should depend on whether we expect users to want to apply a variety of post-processing operations, or whether Prop (and maybe Percent) are enough. Two other important considerations are 1) how complex the implementation would be, and 2) what the DataFramesMeta syntax would look like.

@bkamins
Copy link
Member Author

bkamins commented Nov 22, 2020

  1. how complex the implementation would be

I would say that I have it implemented already (would need to test the corner cases, but in general I have worked it out hot to do it safely and efficiently). The major question is if post-processing should be applied before column expansion (this is what I do now) or after it. But I think that before. This is how I implemented proprow now.

However, my hesitation is that if proprow is 95% of use cases, then adding a complex mechanism might be an overkill as you can cover these 5% of cases by doing a call to e.g. combine twice (and with @pipe it would be quite easy to read). But if we wanted to go for it I would be ok with Postprocess. The only restriction we would have to make is that we expect a scalar returned from the first function and likewise from the second function I think. As combining the Postprocess with the mechanism of handling vectors/tables will probably be very tricky and hard to get 100% right in all cases.

  1. what the DataFramesMeta syntax would look like.

Here probably @pdeffebach is the best one to answer.

@pdeffebach
Copy link
Contributor

My reading of this PR is that in the long term, I would like singleton structs that have dispatch. Something like

transform(df, [:x, RowNumber()] => ((t, s) -> t .+ s) => :y)

I worry that adding rownumber now might complicated an implemenation of the above in the 1.X, 2.X timeline.

w.r.t nrow and proprow in DataFramesMeta, there are two scenarios

  1. When @transform allows non- y = f(:x) expressions, the user just does @tranfform(gd, nrow => :y)
  2. When we check that functions are just expressed as Symbols in the syntax tree, then @transform(gd, y = nrow), we can special case nrow so it gets lowered to nrow => :y.

Same with proprow

@bkamins
Copy link
Member Author

bkamins commented Dec 8, 2020

Given everyone has slept over this issue a bit what is our current thinking?

There are four things that are in general requested:

  1. function for calculating row proportions (proprow) in the current proposal - I think it is a common enough pattern that it is worth to add it;
  2. a way to produce row-number per group (here @pdeffebach suggests using a custom selector)
  3. a way to produce group number as a constant column (here using a custom selector would be also a proposal I assume - right)
  4. a way to invoke post-processing step (a generalization of proprow) - here the question is if it would not overcomplicate the mini-language (that is already quite complex)

My thinking is that I would narrow down this PR to point 1. (adding proprow) and drop points 2. do 4. to be decided for later (if we really need them) as all these points can be relatively easily achieved using other means.

Please comment what you think. Thank you!

@pdeffebach
Copy link
Contributor

Agree, proprow would be a nice convenience that might make out lives easier a bit, other options are too heavy.

@nalimilan
Copy link
Member

What about Prop(nrow)? It seems that Prop(sum) could also be relatively common too. That would avoid adding two functions, and we could also support e.g. Prop(nrow, percent=true) for convenience in the future, as it's a very common need which generally requires adding a full additional transform step just to multiply by 100.

@bkamins
Copy link
Member Author

bkamins commented Dec 10, 2020

It could actually be prop(nrow). later we could add other options as they follow a different syntax (prop(nrow) would not take a source column). Then we only should prop from FreqTables.jl to DataAPI.jl.

@nalimilan
Copy link
Member

Yes, I'm not sure which is best. We currently have ByRow(f); also things like AsTable(...) and Cols(...) use CamelCase and could be considered as similar. OTOH using prop(f) would be more consistent with FreqTables; but it also requires more coordination, and even if the function is the same the meaning would be quite different.

@bkamins
Copy link
Member Author

bkamins commented Dec 11, 2020

OK Prop is also OK and easier to manage.

@kobusherbst
Copy link

I would find rownumber helpful to create a ranking within a group

@bkamins
Copy link
Member Author

bkamins commented Jan 3, 2021

I would find rownumber helpful to create a ranking within a group

Thank you for commenting. Actually, after thinking about it I would not add rownumber, but rather add a special path for eachindex like we have for nrow.

So passing as arg:

  1. just eachindex would create a column with :row name (here we could discuss what default name would be best, but eachindex is not a very nice name I think)
  2. and eachindex => :colname would create column with name :colname

@nalimilan - what do you think of this? eachindex is a natural function name to use here I think.

@bkamins
Copy link
Member Author

bkamins commented Jan 3, 2021

Also for

a way to produce group number as a constant colum

we could use groupindices function in a similar way.

@bkamins
Copy link
Member Author

bkamins commented Jan 4, 2021

To summary I would add in this PR the following given the feedback:

  • passing eachindex and eachindex => :colname would be allowed
  • passing groupindices and groupindices => :colname would be allowed
  • passing Prop(nrow), Prop(nrow) => :colname would be allowed
  • passing src => Prop(fun), src => Prop(fun) => :colname would be allowed, which would calculate the proportion of what fun returns when applied to groups (here we would use the slow path initially and would check that fun returns Union{Number. Missing} - should we allow Missing)

The only reservation I have is the following - do we need to expand the API, or it is enough to add tutorials showing how to do these things. In sequence.

eachindex can be just written as 1 => eachindex => :colname (so essentially what we would do is allow to drop 1 from the syntax and correctly handle the 0-column case which is not very useful anyway I think)

groupindices is least intuitive to get as now you have do to something like:

gdf = groupby(df, cols)
parent(gdf).groupindicescolumn = groupindices(gdf)

and now you have a column in your gdf with group number for each group. The only problem is that df would not be allowed ot be SubDataFrame in this case, so the scenario is a bit limited.

The generality of Prop(fun) is appealing, but is it really worth to complicate the API by adding it? Maybe it would be enough to add proprow as in the original version of this PR to cover the most common case? I am not sure what is best, as I fear of over-complicating the API to the point where it would take too long to learn (it is already complicated). Maybe we even should dump proprow (as it is easy to get what it does in post-processing step using transform!)?

So in summary - do we feel that it is worth to expand the API with the new options?

@pdeffebach
Copy link
Contributor

Maybe it's worth figuring out how a user might add functions that operate like this, on the fly? Having an API for fun => dest where the user can "register" fun might make things simpler in the long run.

@bkamins
Copy link
Member Author

bkamins commented Jan 4, 2021

Having an API for fun => dest

We might extend the syntax to allow fun => dest (currently this is not supported) assuming that fun takes a data frame as an input (this can be done any time as it is non-breaking). Still the special treatment of "bare" fun would have to be hardcoded.

@nalimilan
Copy link
Member

Using eachindex sounds good. I guess that even if we decide to support vectors with custom indices at some point we will require all columns to use the same indices. But having the column use a different name from the function is weird and inconsistent... We could also support keys, but the name isn't super logical either.

I'd say it's also OK to special-case groupindices.

Regarding Prop(nrow), I think it's preferable to proprow, which would also be a special case anyway. But we don't necessarily have to decide whether to add this immediately.

@bkamins
Copy link
Member Author

bkamins commented Jan 8, 2021

even if we decide to support vectors with custom indices at some point we will require all columns to use the same indices

I was also thinking about this. Realistically I think it is highly unlikely we will start allowing such columns in the short term.

We could also support keys, but the name isn't super logical either.

we could use axes. The only problem is that it would unwrap the Tuple, but I think it is OK.

But we don't necessarily have to decide whether to add this immediately.

Do you see any of these important to be added before 1.0 release?

@nalimilan
Copy link
Member

we could use axes. The only problem is that it would unwrap the Tuple, but I think it is OK.

But is axes really better than eachindex or keys as a column name? To me it's even weirder.

Do you see any of these important to be added before 1.0 release?

I'd say 1.0 is only about API stability. No features are really essential to have in that release. Some are nice to have for communication purposes (e.g. multithreading is shiny), but features discussed here can wait IMO.

@bkamins bkamins modified the milestones: 1.0, 1.x Jan 8, 2021
@bkamins
Copy link
Member Author

bkamins commented Jan 8, 2021

OK. I am moving a milestone on this.

@piever
Copy link

piever commented Aug 6, 2021

Sorry if I'm injecting noise into the discussion, but I was wondering, could it be more general to add a OnIndices (name is optional) wrapper, such that

combine(groupby(df, cols), OnIndices(f) => dest) # same for `transform` and `select`

applies f to the list of indices of each group and writes it in the column dest? This way, the various number of rows, proportion of rows, row indices can all be computed as OnIndices(length), OnIndices(idxs -> length(idxs) / nrow(df)), OnIndices(eachindex).

Or is the idea that there are so few relevant functions one may wish to compute with group indices that it's possible to special case all of them?

@bkamins
Copy link
Member Author

bkamins commented Aug 6, 2021

Thank you for commenting.

Or is the idea that there are so few relevant functions one may wish to compute with group indices that it's possible to special case all of them?

This was the idea. The objective is to make the most common functions easy to access for newcomers.
But exactly because it is not clear what is best this issue was on hold. Let us see what other people think.

I have this issue on my radar and eventually it will be resolved.

@bkamins
Copy link
Member Author

bkamins commented Jan 31, 2022

I would need to rewrite this PR from scratch so I would close it, but before I do it let us discuss what extra features in the mini-language we need? in particular do we need proprow and rownumber.

@nalimilan
Copy link
Member

rownumber is probably useful, having to use 1 => eachindex => :rownumber isn't ideal for such a basic operation, and special-casing it like nrow sounds natural. I also still tend to think that Prop(row) and Prop(sum) would be nice to have, but clearly we can leave without them as dplyr still doesn't offer any convenience functions for this.

@bkamins
Copy link
Member Author

bkamins commented Feb 7, 2022

I also still tend to think that Prop(row) and Prop(sum) would be nice to have, but clearly we can leave without them as dplyr still doesn't offer any convenience functions for this.

The difference vs. dplyr is that our groupby can drop some groups therefore it is non-trivial to do Prop(fun).
Also then the question is how Prop(fun) in general should be defined. I assume that we want to restrict Prop(fun) to functions that take only one column and return a Number (do we want to handle Missing as return value additionally somehow?). And then define Prop(fun) as dividing the result of fun per group by sum of results per group. Or rather divide it by the result of fun applied to a vector being vcat of all group vectors. For length and sum this is the same. But it would not be the same e.g. for mean etc..

In general - defining Prop properly is non trivial. Maybe for start we should restrict ourselves to length, length∘sikipmissing, sum, and sum∘skipmissing and only allow these four functions to be wrapped in Prop?

@bkamins bkamins modified the milestones: 1.x, 1.4 Feb 7, 2022
@nalimilan
Copy link
Member

Starting with a restricted version would be OK, but I'm not sure that's really needed. I'd say the definition of Prop ("proportion") would be to divide all per-group results by their sum. Dividing them by the result of fun applied to all values would be a different operation than computing the proportion so that would be a different name. Also, the number of input columns and the return type shouldn't matter, right? We can just take whatever is returned a compute the sum.

I'm not sure about skipping missing values in the per-group results. The only case where that can happen is if you do Prop(sum) and would like to have missing for groups containing missing values, but a non-missing value for other groups. I don't know whether that can be useful. I guess if somebody requests it we could allow Prop(sum)∘skipmissing as a syntax for it? Or just tell them to use a separate combine step. Anyway we don't have to decide now.

@bkamins
Copy link
Member Author

bkamins commented Feb 7, 2022

I'd say the definition of Prop ("proportion") would be to divide all per-group results by their sum.

This is not that easy unfortunately. Consider the following:

julia> df = DataFrame(id=[1,1,2,2], v=1:4)
4×2 DataFrame
 Row │ id     v
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     2      3
   4 │     2      4

julia> combine(groupby(df, :id), :v => identity)
4×2 DataFrame
 Row │ id     v_identity
     │ Int64  Int64
─────┼───────────────────
   1 │     1           1
   2 │     1           2
   3 │     2           3
   4 │     2           4

would you expect combine(groupby(df, :id), :v => Prop(identity)) to work and then the v_identity column would be [1/4, 1/3, 3/4, 2/3]?

@bkamins
Copy link
Member Author

bkamins commented Feb 11, 2022

Having thought about it I propose to add the following syntaxes:

  • passing eachindex and eachindex => :colname
  • passing groupindices and groupindices => :colname
  • passing proprow and proprow => :colname

The rationale for this is that they all follow the same rule as nrow (i.e. do not require passing source column name).
We would then have four syntaxes that are special cased with not requiring passing source column name.

I would not add Prop(fun) for now as it requires passing source column name, so it is different syntax.

In the future we can re-consider adding general Prop(fun) wrapper if users request it as a separate feature.

@nalimilan + @pdeffebach + @jkrumbiegel -> can you please comment if you accept it? I will implement this PR next week.

@bkamins
Copy link
Member Author

bkamins commented Feb 11, 2022

In #3001 I propose the implementation, so I am closing this PR.

@bkamins bkamins closed this Feb 11, 2022
@bkamins bkamins deleted the add_proprow branch February 11, 2022 21:45
@nalimilan
Copy link
Member

FWIW, I just discovered that ggplot2 has added an after_stat function wrapper which is similar to Postprocess we discussed above. To compute proportions of counts in each group one does after_stat(prop) (previously this was special-cased as ..prop..).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants