Reject duplicate results when handling efficiencies #65
Conversation
Previously, the validity of the efficiency column was being checked by the matplotlib and pgfplots backends instead of by the dispatch code. Centralizing the checks will ensure they stay in sync. Signed-off-by: John Pennycook <[email protected]>
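For illustration, a minimal sketch of what such a centralized check might look like; the helper name, column name, and the [0, 1] validity rule below are assumptions, not the library's actual code:

```python
import pandas as pd

def _validate_efficiency(df: pd.DataFrame, column: str) -> None:
    # Hypothetical dispatch-level check; the helper name and the [0, 1]
    # validity rule are illustrative assumptions.
    if column not in df.columns:
        raise ValueError(f"DataFrame has no '{column}' column")
    if not df[column].dropna().between(0.0, 1.0).all():
        raise ValueError(f"Values in '{column}' must lie in [0, 1]")
```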
Our previous attempt at solving this problem (see intel#22) had unexpected knock-on effects, since removing data from a user-supplied DataFrame might impact certain properties of the data (e.g., the order in which applications, platforms, and/or problems appear). Rather than complicate our implementation with workarounds that might not address every possible use-case, we can simply detect and reject problematic data. This change slightly complicates the process of working with large data, but ensures that users are always in control over which data is plotted. Signed-off-by: John Pennycook <[email protected]>
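A minimal sketch of the detect-and-reject approach, assuming results are keyed by (application, platform, problem); the helper name is illustrative:

```python
import pandas as pd

def _reject_duplicates(df: pd.DataFrame) -> None:
    # Hypothetical helper; the key columns are assumed from the
    # (application, platform, problem) pairs discussed in this PR.
    key = ["application", "platform", "problem"]
    duplicated = df.duplicated(subset=key)
    if duplicated.any():
        raise ValueError(
            "Data contains duplicate (application, platform, problem) "
            f"results at rows {df.index[duplicated].tolist()}"
        )
```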
Since we're changing the way that duplicates are handled by PP, the newest version fails the original test. Signed-off-by: John Pennycook <[email protected]>
By default, groupby sorts the DataFrame. This leads to weird reordering effects when pp is used in conjunction with cascade plots and navcharts. Signed-off-by: John Pennycook <[email protected]>
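For reference, this is standard pandas behaviour: `groupby` sorts group keys unless told otherwise, so passing `sort=False` preserves first-appearance order. A standalone example:

```python
import pandas as pd

df = pd.DataFrame({"platform": ["gpu", "cpu", "gpu", "cpu"],
                   "runtime": [1.0, 2.0, 3.0, 4.0]})

# Default: group keys are sorted, reordering platforms as cpu, gpu.
print(df.groupby("platform")["runtime"].mean().index.tolist())
# ['cpu', 'gpu']

# sort=False keeps the order in which keys first appear: gpu, cpu.
print(df.groupby("platform", sort=False)["runtime"].mean().index.tolist())
# ['gpu', 'cpu']
```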
The previous "expected" test result had actually been chosen based on the empirical behavior of the library. If we expect the output DataFrame to remain unsorted, we should test for that. Signed-off-by: John Pennycook <[email protected]>
This all looks reasonable to me, and I think it's good that it also consolidates some duplicated backend code into a single place.
I actually like the idea of a "keep=best" type argument as well. I would be in favour of exploring that in a separate issue, with error, best, latest, etc. as options?
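To sketch the idea (entirely hypothetical; neither this helper nor these option names exist in the library):

```python
import pandas as pd

def deduplicate(df, key, fom="fom", keep="error"):
    # Hypothetical semantics for a keep= argument; all names here are
    # illustrative assumptions.
    if keep == "error":
        if df.duplicated(subset=key).any():
            raise ValueError("Data contains duplicate results")
        return df
    if keep == "best":  # lowest figure of merit wins
        return df.loc[df.groupby(key, sort=False)[fom].idxmin()]
    if keep == "latest":  # last occurrence wins
        return df.drop_duplicates(subset=key, keep="last")
    raise ValueError(f"Unrecognized keep value: {keep!r}")
```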
LGTM - not sure if you wanted to explain/document why the `ValueError` is raised when pairs are not unique.
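Something along these lines might address that (a hypothetical docstring fragment, not the library's current documentation):

```python
def pp(df):
    """Compute performance portability.

    Raises
    ------
    ValueError
        If any (application, platform, problem) triple appears more
        than once, because duplicate results would make the computed
        efficiencies ambiguous.
    """
```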
Signed-off-by: John Pennycook <[email protected]>
Removing data from a user-supplied DataFrame might impact certain properties of the data (e.g., the order in which applications, platforms, and/or problems appear).
Rather than complicate our implementation with workarounds that might not address every possible use-case, we can simply detect and reject problematic data.
Related issues
This effectively reverts #22. It's an alternative solution to the one proposed in #63.
Proposed changes
- Prevents `pp` from sorting implicitly during its `groupby` operation.

The upshot of the changes here is intended to be:

- Data containing duplicate (application, platform, problem) results is rejected with a `ValueError`.

In my own offline testing of complex P3 workflows, I've found that I need to insert an additional line to prepare data the way I typically want it to be plotted:
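The exact line isn't preserved above; a plausible sketch of such a preparation step, assuming a figure-of-merit column named `fom` and keeping only the best result per (application, platform, problem) triple:

```python
# Illustrative only; the column names are assumptions.
df = df.loc[
    df.groupby(["application", "platform", "problem"], sort=False)["fom"].idxmin()
]
```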
I don't think this is too bad, and it only shows up in complicated cases. If we wanted to simplify this workflow, we could consider introducing something like:
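The snippet that originally followed isn't preserved; one hypothetical shape for it, echoing the `keep="best"` idea from the review above:

```python
# Entirely hypothetical; neither this helper nor its keep= argument
# exists in the library today.
df = deduplicate(df, key=["application", "platform", "problem"], keep="best")
```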
...but I'd want to explore that separately, to make sure that we design and test it properly.