Replace use of pandas in pyrealm.demography #292

Merged (8 commits) on Sep 23, 2024


davidorme (Collaborator) commented Sep 21, 2024

Description

In #277, I used pandas.DataFrame within Flora to provide an array-like view onto the plant functional type data. That then led to using it for Community.cohort_data (#282) and also typing the inputs to the T Model functions as pandas.Series. That turns out to be awkward for a number of core use cases:

This PR starts to replace the usage with a pure numpy alternative. There is code in the currently open PR #288 that will also need updating, but this seems like the right way to go now.
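
As a rough illustration of what a pure numpy alternative might look like, the sketch below holds one equal-length array per trait and indexes them together to expand per-PFT values to per-cohort values. The `TraitTable` class, its `take` method, and the trait and PFT names are hypothetical and are not taken from the pyrealm code.

```python
from dataclasses import dataclass

import numpy as np
from numpy.typing import NDArray


@dataclass(frozen=True)
class TraitTable:
    """Hypothetical container: one equal-length array per trait."""

    name: NDArray[np.str_]
    h_max: NDArray[np.float64]

    def take(self, names: list[str]) -> "TraitTable":
        """Return the rows matching ``names``, in the order requested."""
        # Map each PFT name to its row index, then index every array alike.
        lookup = {str(nm): i for i, nm in enumerate(self.name)}
        idx = np.array([lookup[nm] for nm in names], dtype=int)
        return TraitTable(name=self.name[idx], h_max=self.h_max[idx])


# Illustrative data only: two made-up plant functional types.
traits = TraitTable(
    name=np.array(["broadleaf", "conifer"]),
    h_max=np.array([30.0, 40.0]),
)

# Expand per-PFT traits to one value per cohort.
cohort_traits = traits.take(["conifer", "conifer", "broadleaf"])
print(cohort_traits.h_max)
```

Every field here is a plain array, so the values can be passed straight into array functions without unwrapping a Series.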

The code:

  • Replaces the use of pandas structures throughout.
  • Adds a prototype validation function and unit test for that function to validate array inputs to T model functions.
  • I have not yet added that validation to the functions - wanted to wait for feedback before moving on.
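
A validator of this kind could be as simple as checking that the inputs broadcast to a common shape. This is a minimal sketch, not the function added in this PR: the name `check_broadcastable` and its behaviour are assumptions.

```python
import numpy as np
from numpy.typing import NDArray


def check_broadcastable(*arrays: NDArray) -> tuple[int, ...]:
    """Hypothetical validator: return the common broadcast shape or raise.

    Raises:
        ValueError: if the input shapes cannot be broadcast together.
    """
    shapes = [np.shape(arr) for arr in arrays]
    try:
        return np.broadcast_shapes(*shapes)
    except ValueError as excep:
        raise ValueError(
            f"Inputs cannot be broadcast to a common shape: {shapes}"
        ) from excep


# Compatible inputs: a row of 3 values against a (2, 3) array.
common = check_broadcastable(np.ones(3), np.ones((2, 3)))
print(common)

# Incompatible inputs: lengths 3 and 4 do not broadcast.
try:
    check_broadcastable(np.ones(3), np.ones(4))
    failed = False
except ValueError:
    failed = True
print(failed)
```

Calling a check like this at the top of each T model function would give users one consistent error message rather than a failure deep inside the arithmetic.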

Fixes #291 (issue)

Type of change

  • New feature (non-breaking change which adds functionality)
  • Optimization (back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)

Key checklist

  • Make sure you've run the pre-commit checks: $ pre-commit run -a
  • All tests pass: $ poetry run pytest

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added that prove fix is effective or that feature works

davidorme linked an issue Sep 21, 2024 that may be closed by this pull request

codecov-commenter commented Sep 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.22%. Comparing base (1f315ba) to head (97ad0ed).
Report is 88 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #292      +/-   ##
===========================================
- Coverage    95.29%   95.22%   -0.07%     
===========================================
  Files           28       32       +4     
  Lines         1720     2115     +395     
===========================================
+ Hits          1639     2014     +375     
- Misses          81      101      +20     
Flag Coverage Δ
95.22% <100.00%> (?)


j-emberton (Collaborator) left a comment

The implementation looks fine to me overall.

One issue to fix:
There's a straggler issue in that the community.py module docstring still makes reference to using pandas as the internal data vehicle.

Internally, the cohort data in the Community class is represented as a pandas dataframe, which makes it possible to update cohort attributes in parallel across all cohorts but also provide a clean interface for adding and removing cohorts to a Community.

And one query:
I see the type hints for the new NDArrays are now float32. Is there a specific reason for this? Will Python type promotion not mean that this doesn't survive contact with any float64 data?
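
The promotion behaviour in question can be checked directly. This small sketch (array names are illustrative) shows that NumPy arithmetic between float32 and float64 arrays yields float64, while float32 arrays combined with each other or with a Python float literal stay float32.

```python
import numpy as np

a32 = np.ones(3, dtype=np.float32)
b64 = np.ones(3, dtype=np.float64)

# Mixing array dtypes promotes to the wider type.
mixed = a32 + b64

# Same-dtype arithmetic keeps float32.
same = a32 + a32

# A bare Python float does not widen a float32 array.
scaled = a32 * 2.0

print(mixed.dtype, same.dtype, scaled.dtype)
```

Note that the type hint itself has no effect here: annotations are not enforced at runtime, so the result dtype depends only on the arrays actually passed in.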

davidorme (Collaborator, Author)

> One issue to fix: There's a straggler issue in that the community.py module docstring still makes reference to using pandas as the internal data vehicle.

Fixed

> And one query: I see the type hints for the new NDArrays are now float32. Is there a specific reason for this? Will Python type promotion not mean that this doesn't survive contact with any float64 data?

That's a good point - the typing of arrays has got cleaner and I think the earlier code was written when this was not such a transparent thing to do. I don't think we need the precision of float64, so we use less RAM by doing this, but equally I don't know if that has speed implications (good or bad) on 64-bit architectures.

So - not sure. I guess I'd like to leave this PR as is and tackle the array typing more widely as another issue.
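
For reference, the RAM and precision trade-off mentioned above can be quantified with a short sketch. The array size is arbitrary and the variable names are illustrative; this says nothing about speed, which would need benchmarking on the target hardware.

```python
import numpy as np

n = 1_000_000
x32 = np.zeros(n, dtype=np.float32)
x64 = np.zeros(n, dtype=np.float64)

# Memory used by the array data: 4 vs 8 bytes per element.
bytes32 = x32.nbytes
bytes64 = x64.nbytes

# Approximate number of reliable decimal digits for each type.
digits32 = np.finfo(np.float32).precision
digits64 = np.finfo(np.float64).precision

print(bytes32, bytes64, digits32, digits64)
```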

j-emberton (Collaborator) left a comment

Thanks for sorting the docstring. Happy for the float32 type hints to be left in place for the time being. My gut feel is that any data read in from CSV/TOML etc. will default to float64, and would need to be explicitly converted to float32.
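
That gut feel matches NumPy's defaults, as this sketch suggests. The CSV content and column names are made up for illustration; `np.loadtxt` and `np.array` both produce float64 from text or Python floats unless a dtype is given.

```python
import io

import numpy as np

# Illustrative CSV text standing in for a data file.
csv_text = "h_max,ca_ratio\n25.33,390.43\n15.0,200.0\n"

# Default dtype when parsing numeric text is float64.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# Python floats (as a TOML parser would return) also become float64.
from_floats = np.array([25.33, 15.0])

# Getting float32 requires an explicit conversion.
data32 = data.astype(np.float32)

print(data.dtype, from_floats.dtype, data32.dtype)
```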

@davidorme davidorme merged commit 8793252 into develop Sep 23, 2024
12 checks passed
@davidorme davidorme deleted the 291-look-at-replacing-pandas-in-pyrealmdemography branch September 23, 2024 13:30
@davidorme davidorme mentioned this pull request Sep 23, 2024
7 tasks
Linked issue: Look at replacing pandas in pyrealm.demography