Add `groupindices` as special source argument in minilanguage #2683

jkrumbiegel · 2021-03-27T18:20:44Z

I often want to know a group's index in transform statements, especially when I have sorted while grouping and need to continue doing something with this order. There is currently no easy way to access the group indices.

data.table has the special variable .GRP for this purpose. There's also .NGRP which returns the number of groups, which could also be useful.

I'm thinking that one could elevate groupindices to the same status as nrow so that one can write this:

transform(groupdf, groupindices => do_something_with_the_indices)


bkamins · 2021-03-27T18:41:30Z

groupcols => do_something_with_the_indices would do, but is not very useful I think. (I assume you meant groupcols not groupindices)

I would assume you would want groupcols .=> do_something_with_the_indices and this is something we can add in the future.
For now just use groupcols(gdf), which is not much longer I think.

bkamins · 2021-03-27T18:45:10Z

Regarding .GRP, the question is how you would want it to be a source of columns in groupindices => do_something_with_the_indices. Could you maybe write what result you are trying to achieve so that we can discuss the best way to achieve it?

bkamins · 2021-03-27T18:49:03Z

OK, now I get what you wanted. Adding this has been discussed in the past here: #2556 (comment).

pdeffebach · 2021-03-29T20:54:42Z

I'm actually googling for how to do this right now in Stata and coming up short (so far). So definitely in favor of this feature.

bkamins · 2021-03-29T20:58:04Z

To be clear. We currently have this feature. Just do:

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> gdf = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (1 row): a = 1
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1
⋮
Last Group (1 row): a = 3
 Row │ a     
     │ Int64 
─────┼───────
   1 │     3

julia> gdf2 = gdf[[3, 1]] # reorder and drop groups to show all works OK
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = 3
 Row │ a     
     │ Int64 
─────┼───────
   1 │     3
⋮
Last Group (1 row): a = 1
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1

julia> df.groups = groupindices(gdf2)
3-element Vector{Union{Missing, Int64}}:
 2
  missing
 1

julia> df
3×2 DataFrame
 Row │ a      groups  
     │ Int64  Int64?  
─────┼────────────────
   1 │     1        2
   2 │     2  missing 
   3 │     3        1

and now you can use this column freely in all further computations.

If I understand the request correctly, its point is to avoid allocating this extra column.

pdeffebach · 2021-03-29T21:01:00Z

No, I don't think that's the request.

I think the request is to special case the function, exactly like nrow in [] => nrow.

But I could be mistaken. My current use case is I have many variables and want to make a unique group number for each group. Then I can just work with that group number instead of keeping track of all the variables I need to group by all the time. So the point is to allocate a new column for the group indices.

My understanding is that this was also @jkrumbiegel 's request.

bkamins · 2021-03-29T22:00:44Z

> So the point is to allocate a new column for the group indices.

But I have just shown above how to do it now.

> I think the request is to special case the function

Yes, but it is only useful if you do not want to add this column to a data frame. If you want to add it then just do something like:

df.groups = groupindices(gdf2)

pdeffebach · 2021-03-30T02:43:21Z

> I'm thinking that one could elevate groupindices to the same status as nrow so that one can write this:
>
> transform(groupdf, groupindices => do_something_with_the_indices)

Yeah, I think OP is just asking for a convenience function to make it easier in piping.

jkrumbiegel · 2021-03-30T07:28:30Z

The example above uses only three rows and three groups, so the fact that one can assign the output of groupindices to a new column there is kind of artificial. Also, it would not work easily if the groups were sorted in a specific order that differs from the row order of the data frame.

I do want this available in the context of piping and the transform framework exactly so I don't have to break out of that mental paradigm, which I can already do almost everything I need with. It's the natural place for the group index information to be available, right when I decide how to process the groups. It could be something as simple as trying to concatenate a string for each group member that includes the group index, or using the index as some multiplicative factor in an aggregation computation. Or I want to index to an external variable that has n_groups entries.

pdeffebach · 2021-03-30T15:44:15Z

I think one point of confusion is that there are two different behaviors we could be asking for.

1. Making it easy to construct a group_id variable in transform, i.e.

transform(gd, [] => groupindices => :group_id)

This is exactly how nrow works now and would be easy to do. This doesn't let you work with the group indices, it just makes it easier to create them.

2. Working with the group indices:

transform(gd, groupindices => do_stuff => :new_var)

We don't currently have this behavior for anything. So it would be a big change in the mini-language to add (I've discussed adding more complicated inputs as source here).

I wouldn't mind seeing option 1 get added.

bkamins · 2021-03-30T15:50:38Z

I assumed we were discussing option 2 in this issue. Indeed we could consider also adding option 1, which would be rather expressed as:

transform(gd, groupindices => :group_id)

bkamins · 2022-02-17T22:10:09Z

Fixed with #3001
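
For readers arriving later, here is a minimal sketch of option 1 from the discussion above. It assumes DataFrames.jl 1.4 or later, where `groupindices => :group_id` is accepted in `transform`; the data frame, the column names, and the `weights` vector are made up for illustration. The second step shows one possible way to use the index afterwards (looking up an external vector with one entry per group), which is an ordinary column transformation rather than a special source argument.

```julia
using DataFrames

# Illustrative data; groups are sorted by key, so a = 1, 2, 3 get indices 1, 2, 3.
df = DataFrame(a = [2, 1, 2, 3], x = 1:4)
gdf = groupby(df, :a, sort = true)

# Store each row's group index in a new column without leaving the transform pipeline.
res = transform(gdf, groupindices => :group_id)

# The same indices are also available outside the minilanguage as a plain vector.
idx = groupindices(gdf)

# Hypothetical external vector with one entry per group, indexed by group index.
weights = [10.0, 20.0, 30.0]
res2 = transform(res, :group_id => ByRow(i -> weights[i]) => :w)
```

Because the index column is created inside `transform`, it follows the group order of `gdf` even when that order differs from the row order of `df`.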

bkamins added the feature label Mar 27, 2021

bkamins added this to the 1.x milestone Mar 27, 2021

bkamins modified the milestones: 1.x, 1.4 Feb 11, 2022

bkamins closed this as completed Feb 17, 2022
