-
-
Notifications
You must be signed in to change notification settings - Fork 287
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[aggr-] allow ranking rows by key column #2417
base: develop
Are you sure you want to change the base?
Conversation
The standard way is to convert the column into a known type, and then anything that can't be converted (errors and nulls) become |
Yes, that seems like it should work. Should the rank aggregator pick the known type, and if so, which one? Or is it the user who should convert the column? |
Since it's not obvious which type to pick, the user can convert the column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love what this is adding, and I think with a few tweaks it would be even more powerful!
There are two kinds of ranking operations people may want.
What is a good name for these two aggregators? |
8078fb6
to
c6c608e
Compare
Okay, I implemented a command that adds a column and applies an aggregator to rows after grouping them by key columns. It's To get this to work with list aggregators, I made a new class I also tried making a I'm a bit unsure about the new behavior of the
|
There is one detail about the grouping in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move most of this into features/addcol_sheetrank.py
? It's a fair amount of code and I'd like to make it self-contained as much as possible.
It is needed now that addcol-aggregate can apply stdev to groups, which may include lists of size 1.
Okay, I moved the I had to move the One notable consequence of this move is that the new visidata/visidata/features/rank.py Line 41 in ec28446
|
for aggr in aggrs: | ||
rows = aggregate_groups(sheet, col, sheet.rows, aggr) | ||
if isinstance(aggr, ListAggregator): | ||
t = aggr.listtype or col.type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a separate listtype
? Seems like we could just use the same aggr.type
in both cases, and remove this isinstance
(which is usually a code smell for me).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way list aggregators work now, there is a need for two distinct types, type
and listtype
. type
is for the result of the aggregator. For example, this is used by memo-aggregate
. That's why type
is anytype
for ListAggregators
. This type would be used whenever we want to hold the entire result (a list) in a cell.
But we also need a separate type for the elements of the list. This is for when the aggregator result goes in a column, like for addcol-aggregate
, where each cell holds not the result itself, but an element of the list result.
If I try to get rid of one or the other types, I run into problems. For RankAggregator
, if I get rid of the listtype=int
switch to type=int
, I get an error in the statusbar for z+
rank
:
'''
text_rank=int() argument must be a string, a bytes-like object or a real number, not 'list'
'''
But if instead I make RankAggregator
use type=anytype
, the column added by addcol-aggregate
rank
does not get the type int
.
The need for two types is awkward. And I see your point about isinstance
being a code smell. (That is a helpful heuristic, and I'll use it in the future.) It's accurately pointing out strain in the design: most aggregators produce a single value, list aggregators produce a list.
Maybe rank
should not be an aggregator. It's unlikely people want a list object holding the ranks. Most people want a column holding the ranks. What if we replace addcol-aggregate
+rank
with an equivalent command addcol-grouprank
(in addition to the existing addcol-sheetrank
)? And we would reserve addcol-aggregate
for finding group values like sum
, mean
, median
, as you suggested earlier. What do you think?
(I would also consider changing the name addcol-aggregate
. Maybe to addcol-group-aggregate
.)
Okay, this one is quite old and quite big at this point! I gave it another pass through, and my remaining questions have to do with |
@midichef I think your PR successfully meets your goals. This is such thoughtfully considered work. Once you have your fnal changes in, I am going to add an update to the guide in this PR, with documentation. I have one question for my own clarity: what is rank's intended behaviour when there are multiple key columns? Say keycol keycol_2 Item |
When there are multiple key columns, But if the key columns were not unique, the results would be more interesting. Here the groups are
|
On review of this feature, I found that after several months, I found it difficult to make sense of the command names I introduced. The three operations added in this PR as it stands now are:
Operations 1 and 2 do not have suggestive enough names. If I couldn't keep them straight in my mind, nobody else will either. Any suggestions on better names? |
The rank aggregator is not helpful when applied to a key column. Every row is given rank 1. A warning is shown, to hint to the user that they may not understand rank operations.
Okay I submitted two minor changes, 20ba354 (which corrects an accidental reversion of a line from a previous commit) and 9ed9f4b (which adds a few errors/warnings as guardrails, since I expect these operations to be difficult for users to understand at first). |
@midichef Yes, this is why I was going to update the docs as part of the merging of this PR. I think it's really hard to understand what these commands are solely from their name and one-line description (which isn't your fault, these are inherently complicated concepts). But, I will mull over if I have better names at hand....Maybe a good question for the VisiData discord. |
This PR adds a
rank
aggregator that returns a list, and a commandaddcol-rank
, which adds a new column with the rank of each row. Ranks are calculated by comparing key columns.It also fixes a bug in
memo-aggregate
where long output takes an extremely long time to show up in the statusbar.For example:
seq 1222333 |vd -
, thenz+
list
. After the list is calculated, visidata will get stuck for many seconds showingprocessing…
, because it's very slow to runformat()
on a long sequence.I think it's worth having an aggregator for rank, and the need for a simpler solution than the current method has come up before. On the other hand, I know part of Visidata philosophy is that it's not a spreadsheet. How do people feel about having a rank aggregator?
Also, in its current form, the rank aggregator will give errors when comparing key columns with different types across 2 rows:
What's the standard way to handle sorting mixed types for Visidata?