
Reject duplicate results when handling efficiencies #65

Merged
6 commits merged into intel:main from reject-duplicate-results on Aug 29, 2024

Conversation

Pennycook (Contributor)

Removing data from a user-supplied DataFrame might impact certain properties of the data (e.g., the order in which applications, platforms, and/or problems appear).

Rather than complicate our implementation with workarounds that might not address every possible use-case, we can simply detect and reject problematic data.

Related issues

This effectively reverts #22. It's an alternative solution to the one proposed in #63.

Proposed changes

  • Reject duplicate (application, platform) pairs when calculating PP and plotting cascades (see the sketch after this list).
  • Prevent pp from sorting implicitly during its groupby operation.
  • Update tests to match new expected behavior.
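
As a rough sketch of the first two bullets (illustrative only: the helper name and error message below are hypothetical, not necessarily the code in this PR), the duplicate check and the unsorted groupby could look something like this in pandas:

import pandas as pd

def _require_unique_results(df: pd.DataFrame) -> None:
    # Reject DataFrames that contain more than one result for the same
    # (application, platform) pair.
    duplicated = df.duplicated(subset=["application", "platform"], keep=False)
    if duplicated.any():
        raise ValueError(
            "Duplicate (application, platform) pairs found; "
            "drop or aggregate duplicates before calling this function."
        )

# Passing sort=False keeps group keys in the order they first appear in the
# user-supplied DataFrame, rather than sorting them implicitly:
# grouped = df.groupby("application", sort=False)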

The intended upshot of these changes is:

  • For simple datasets with no duplicate results, there is no change in behavior.
  • For complex datasets with duplicate results (which may occur after projection), we throw a ValueError.
  • All of our calculation and plotting functions should now respect the order of the data given to them, so the user regains the ability to control the order of applications and platforms in the legend simply by sorting their data.

In my own offline testing of complex P3 workflows, I've found that I need to insert an additional line to prepare data the way I typically want it to be plotted:

import p3.metrics
import p3.plot

eff_df = p3.metrics.application_efficiency(projected_df, foms="higher")
eff_df = eff_df.sort_values("app eff").drop_duplicates(["application", "platform"], keep="last").sort_index()
cascade = p3.plot.cascade(eff_df)

I don't think this is too bad, and it only shows up in complicated cases. If we wanted to simplify this workflow, we could consider introducing something like:

eff_df = p3.metrics.application_efficiency(projected_df, foms="higher", keep="best") # or keep="all", keep="latest"
cascade = p3.plot.cascade(eff_df)

...but I'd want to explore that separately, to make sure that we design and test it properly.

Previously, the validity of the efficiency column was being checked by
the matplotlib and pgfplots backends instead of by the dispatch code.

Centralizing the checks will ensure they stay in sync.

Signed-off-by: John Pennycook <[email protected]>
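
To make the centralization concrete, here is a minimal sketch of the idea (the exact check, the "arch eff" column name, and the backend helper names are assumptions, not the actual p3 code): the dispatch function validates the efficiency column once, before handing off to either backend.

def cascade(df, backend="matplotlib", **kwargs):
    # Validate once here, in the dispatch code, rather than separately in
    # the matplotlib and pgfplots backends.
    efficiency_columns = [c for c in ("app eff", "arch eff") if c in df.columns]
    if len(efficiency_columns) != 1:
        raise ValueError("Expected exactly one efficiency column.")
    if backend == "matplotlib":
        return _cascade_matplotlib(df, efficiency_columns[0], **kwargs)
    if backend == "pgfplots":
        return _cascade_pgfplots(df, efficiency_columns[0], **kwargs)
    raise ValueError(f"Unknown backend: {backend!r}")
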
Our previous attempt at solving this problem (see intel#22) had unexpected
knock-on effects, since removing data from a user-supplied DataFrame
might impact certain properties of the data (e.g., the order in which
applications, platforms, and/or problems appear).

Rather than complicate our implementation with workarounds that might
not address every possible use-case, we can simply detect and reject
problematic data.

This change slightly complicates the process of working with large data,
but ensures that users are always in control of which data is plotted.

Signed-off-by: John Pennycook <[email protected]>
Since we're changing the way that duplicates are handled by PP, the
newest version fails the original test.

Signed-off-by: John Pennycook <[email protected]>
By default, groupby sorts the DataFrame. This leads to weird reordering
effects when pp is used in conjunction with cascade plots and navcharts.

Signed-off-by: John Pennycook <[email protected]>
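
For context on this commit, a minimal pandas sketch (not the library's code) of the behaviour in question: groupby sorts the group keys by default, so output order can differ from input order unless sort=False is passed.

import pandas as pd

df = pd.DataFrame({"platform": ["gpu", "cpu", "fpga"], "value": [1, 2, 3]})

# Default behaviour: group keys come back sorted alphabetically.
print(df.groupby("platform").sum().index.tolist())              # ['cpu', 'fpga', 'gpu']

# sort=False preserves the order of first appearance in the input.
print(df.groupby("platform", sort=False).sum().index.tolist())  # ['gpu', 'cpu', 'fpga']
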
The previous "expected" test result had actually been chosen based on
the empirical behavior of the library. If we expect the output DataFrame
to remain unsorted, we should test for that.

Signed-off-by: John Pennycook <[email protected]>
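
A sketch of what testing for the unsorted behaviour might look like (the fixture name and the exact call to pp are illustrative, not the test added in this PR):

import p3.metrics

def test_pp_preserves_input_order(unsorted_efficiency_df):
    # The output should follow the platform order of the input DataFrame,
    # rather than being sorted alphabetically.
    result = p3.metrics.pp(unsorted_efficiency_df)
    assert list(result["platform"]) == list(unsorted_efficiency_df["platform"].unique())
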
@Pennycook added the "bug" and "enhancement" labels on Aug 22, 2024
@swright87 (Collaborator) left a comment:

This all looks reasonable to me, and I think it's good that it also consolidates some duplicated backend code into a single place.

I actually like the idea of a "keep=best"-type argument as well. I would be in favour of exploring that in a separate issue, with error, best, latest, etc. as options.

@laserkelvin (Contributor) left a comment:

LGTM - not sure if you wanted to explain/document why the ValueError is raised when pairs are not unique

p3/metrics/_pp.py (review comments resolved; outdated)
p3/plot/_cascade.py (review comments resolved; outdated)
@Pennycook merged commit ef826b2 into intel:main on Aug 29, 2024
7 checks passed
@Pennycook deleted the reject-duplicate-results branch on August 29, 2024 at 18:32