Add groupindices as special source argument in minilanguage #2683

Closed
jkrumbiegel opened this issue Mar 27, 2021 · 12 comments

@jkrumbiegel
Contributor

I often want to know a group's index in transform statements, especially when I have sorted while grouping and need to continue doing something with this order. There is currently no easy way to access the group indices.

data.table has the special variable .GRP for this purpose. There's also .NGRP which returns the number of groups, which could also be useful.

I'm thinking that one could elevate groupindices to the same status as nrow so that one can write this:

transform(groupdf, groupindices => do_something_with_the_indices)
@bkamins
Member

bkamins commented Mar 27, 2021

groupcols => do_something_with_the_indices would do, but is not very useful I think. (I assume you meant groupcols not groupindices)

I would assume you would want groupcols .=> do_something_with_the_indices and this is something we can add in the future.
For now, just use groupcols(gdf), which is not much longer I think.

@bkamins bkamins added this to the 1.x milestone Mar 27, 2021
@bkamins
Member

bkamins commented Mar 27, 2021

Regarding .GRP, the question is how you would want it to be a source of columns in groupindices => do_something_with_the_indices. Could you write what result you are trying to achieve, so that we can discuss the best way to achieve it?

@bkamins
Member

bkamins commented Mar 27, 2021

OK, now I get what you wanted. Adding this has been discussed in the past here: #2556 (comment).

@pdeffebach
Contributor

I'm actually googling for how to do this right now in Stata and coming up short (so far). So definitely in favor of this feature.

@bkamins
Member

bkamins commented Mar 29, 2021

To be clear, we currently have this feature. Just do:

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> gdf = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (1 row): a = 1
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1
⋮
Last Group (1 row): a = 3
 Row │ a     
     │ Int64 
─────┼───────
   1 │     3

julia> gdf2 = gdf[[3, 1]] # reorder and drop groups to show all works OK
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = 3
 Row │ a     
     │ Int64 
─────┼───────
   1 │     3
⋮
Last Group (1 row): a = 1
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1

julia> df.groups = groupindices(gdf2)
3-element Vector{Union{Missing, Int64}}:
 2
  missing
 1

julia> df
3×2 DataFrame
 Row │ a      groups  
     │ Int64  Int64?  
─────┼────────────────
   1 │     1        2
   2 │     2  missing 
   3 │     3        1

and now you can use this column freely in all further computations.

The point here is to avoid allocation of this extra column if I understand the request correctly.

@pdeffebach
Contributor

No, I don't think that's the request.

I think the request is to special case the function, exactly like nrow in [] => nrow.

But I could be mistaken. My current use case is I have many variables and want to make a unique group number for each group. Then I can just work with that group number instead of keeping track of all the variables I need to group by all the time. So the point is to allocate a new column for the group indices.

My understanding is that this was also @jkrumbiegel's request.

@bkamins
Member

bkamins commented Mar 29, 2021

> So the point is to allocate a new column for the group indices.

But I have just shown above how to do it now.

> I think the request is to special case the function

Yes, but it is only useful if you do not want to add this column to a data frame. If you want to add it then just do something like:

df.groups = groupindices(gdf2)
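
A minimal sketch of that workaround for the case of several grouping columns and a sorted grouping. The data and the column name group_id are made up for illustration; only groupby, its sort keyword, and groupindices come from DataFrames.jl. Because groupindices returns one entry per row of the parent data frame, in the parent's row order, the assignment lines up even when the group order differs from the row order.

```julia
using DataFrames

# Hypothetical data: two key columns, rows deliberately not in sorted order.
df = DataFrame(a = [2, 1, 2, 1], b = ["x", "y", "x", "x"], v = 1:4)

# sort = true orders the groups by key value: (1, "x"), (1, "y"), (2, "x").
gdf = groupby(df, [:a, :b]; sort = true)

# One group index per parent row, in the parent's row order.
df.group_id = groupindices(gdf)
```

After this, group_id can stand in for the pair (a, b) in later steps.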

@pdeffebach
Contributor

> I'm thinking that one could elevate groupindices to the same status as nrow so that one can write this:
>
> transform(groupdf, groupindices => do_something_with_the_indices)

Yeah, I think OP is just asking for a convenience function to make it easier in piping.

@jkrumbiegel
Contributor Author

jkrumbiegel commented Mar 30, 2021

The example above uses only three rows and three groups, so the fact that one can assign the output of groupindices to a new column there is kind of artificial. Also, it would not work easily if the groups were sorted in a specific order that differs from the row order of the data frame.

I do want this available in the context of piping and the transform framework exactly so I don't have to break out of that mental paradigm, which I can already do almost everything I need with. It's the natural place for the group index information to be available, right when I decide how to process the groups. It could be something as simple as trying to concatenate a string for each group member that includes the group index, or using the index as some multiplicative factor in an aggregation computation. Or I want to index to an external variable that has n_groups entries.

@pdeffebach
Contributor

pdeffebach commented Mar 30, 2021

I think one point of confusion is that there are two different behaviors we could be asking for.

  1. Making it easy to construct a group_id variable in transform. i.e.
transform(gd, [] => groupindices => :group_id)

This is exactly how nrow works now and would be easy to do. This doesn't let you work with the group indices, it just makes it easier to create them.

  2. Working with the group indices
transform(gd, groupindices => do_stuff => :new_var)

We don't currently have this behavior for anything. So it would be a big change in the mini-language to add (I've discussed adding more complicated inputs as source here).

I wouldn't mind seeing option 1 get added.

@bkamins
Member

bkamins commented Mar 30, 2021

I assumed we were discussing option 2 in this issue. Indeed we could consider also adding option 1, which would be rather expressed as:

transform(gd, groupindices => :group_id)
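
A rough sketch of how this form reads in a pipeline, assuming DataFrames.jl 1.4 or later (the release this issue was milestoned for). The data, the labels vector, and the column names are hypothetical. The second step shows one way to work with the index once it is a column, here as a lookup into an external vector with one entry per group.

```julia
using DataFrames

# Hypothetical data; sort = true fixes the group order to a = 1, then a = 2.
df = DataFrame(a = [2, 1, 2, 1], v = 1:4)
gdf = groupby(df, :a; sort = true)

# Assumption: DataFrames.jl >= 1.4, where groupindices is accepted
# in the transformation mini-language, like nrow.
with_id = transform(gdf, groupindices => :group_id)

# Second step: use the index to look up an external per-group value.
labels = ["first", "second"]  # hypothetical, one entry per group
result = transform(with_id, :group_id => ByRow(i -> labels[i]) => :label)
```

This stays within transform, at the cost of materializing the group_id column before using it.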

@bkamins bkamins modified the milestones: 1.x, 1.4 Feb 11, 2022
@bkamins
Member

bkamins commented Feb 17, 2022

Fixed with #3001

@bkamins bkamins closed this as completed Feb 17, 2022