
51 reduce memory burden of pipeline #95

Merged
merged 24 commits into dev on Sep 18, 2024

Conversation

geoffwoollard (Collaborator)

#51

I incorporated a test validating the numerical identity of the output, so I am confident there are no errors on that side.

Previously, batch normalization and masking were applied to all the gt maps up front. Since only one map is loaded at a time now, this is done on the fly inside the local distance computation.
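
A minimal sketch of what that on-the-fly step could look like (hypothetical helper name; the actual function in the repo may differ):

```python
import torch

def normalize_and_mask(volume: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Mask a single gt map and normalize it to zero mean / unit std
    # right before it enters the local distance computation.
    v = volume * mask
    return (v - v.mean()) / v.std()
```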

@geoffwoollard linked an issue Sep 11, 2024 that may be closed by this pull request
@geoffwoollard self-assigned this Sep 11, 2024
@geoffwoollard added the enhancement (New feature or request) label Sep 11, 2024
@geoffwoollard changed the base branch from main to dev September 11, 2024 00:59
@DSilva27 (Collaborator) left a comment


I would change how the low-memory handling works; doing things in a for loop seems really unnecessary. Here is what I would do:

@DSilva27 (Collaborator) commented Sep 16, 2024

The metrics should not care whether the input is in low-memory mode or not; they should only care about volumes1, volumes2, and some arguments if necessary. The whole low-memory machinery should come in at the pipeline level. Here is what I would do:

  1. Implement functions that normalize volumes the way you want; they should take either a stack of volumes or a single volume. Don't do these things inside an if-statement.

  2. Load a chunk of the gt volumes, run all the preprocessing you need, and then compute the submatrix of the full distance matrix for that chunk of volumes. This should use the distance matrices you already have implemented. Populate the full distance matrix on the fly. This way you can use a vmap to compute each submatrix and then a simple stack operation to get the full matrix (see the sketch below).

I would also create a low_memory_pipeline and a regular-memory pipeline. All those if-statements make the code very messy, and I think it would be easy to miss something being wrong.
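
A minimal sketch of the chunked approach, assuming torch.vmap from PyTorch 2.x; load_gt_chunk is a hypothetical loader standing in for the existing dataset/preprocessing code, and the plain L2 distance stands in for the metrics already implemented:

```python
import torch

def chunked_distance_matrix(load_gt_chunk, n_gt, chunk_size, submission_vols):
    # Build the full (n_gt x n_submission) distance matrix one chunk of gt
    # volumes at a time, so only `chunk_size` gt maps are ever in memory.
    per_volume_dist = torch.vmap(
        lambda v: torch.linalg.norm(v - submission_vols, dim=-1)
    )
    blocks = []
    for start in range(0, n_gt, chunk_size):
        # load_gt_chunk(start, stop) returns preprocessed (masked, normalized)
        # gt volumes for [start, stop) as a (chunk, n_voxels) tensor.
        gt_chunk = load_gt_chunk(start, min(start + chunk_size, n_gt))
        blocks.append(per_volume_dist(gt_chunk))  # (chunk, n_submission) submatrix
    return torch.cat(blocks, dim=0)  # concatenate submatrices into the full matrix
```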

@geoffwoollard (Collaborator, Author)

Yes, good point! I forgot that __getitem__ takes care of indexing into sub-batches.

I did the masking and normalization in the sub-batch. This avoids reading each map into memory twice (once for normalization/masking, and again right before the m2m distance call).

There is still some if-low-memory-mode control flow, to reduce code duplication.
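
A rough sketch of that pattern (hypothetical class and argument names; the actual dataset code may be organized differently):

```python
import torch
from torch.utils.data import Dataset

class GTSubBatchDataset(Dataset):
    # Each item is one sub-batch of gt maps, masked and normalized as it is
    # read, so every map passes through memory only once.
    def __init__(self, maps, mask, batch_size):
        self.maps = maps              # array-like (n_maps, n_voxels), e.g. a memory-mapped .npy
        self.mask = mask              # (n_voxels,) mask tensor
        self.batch_size = batch_size

    def __len__(self):
        return (len(self.maps) + self.batch_size - 1) // self.batch_size

    def __getitem__(self, idx):
        start = idx * self.batch_size
        stop = min(start + self.batch_size, len(self.maps))
        batch = torch.tensor(self.maps[start:stop], dtype=torch.float32)  # copy only this sub-batch into RAM
        batch = batch * self.mask                                         # masking on the fly
        mean = batch.mean(dim=1, keepdim=True)
        std = batch.std(dim=1, keepdim=True)
        return (batch - mean) / std                                       # normalization on the fly
```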

@geoffwoollard (Collaborator, Author)

Also, the gt maps should now be a .npy file, not a .pt file.

I hope that torch.from_numpy(np.load(fname.npy)) is not much slower than torch.load(fname.pt) for the big 160 GB file.
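
One option that might be worth checking (an assumption on my part, nothing measured here): np.load accepts mmap_mode, which maps the file instead of reading all ~160 GB up front, so only the slices that are actually indexed get paged into RAM:

```python
import numpy as np
import torch

# Memory-map the gt maps rather than loading the whole file at once.
# (Hypothetical filename; whether this ends up faster or slower than
# torch.load on the 160 GB file would need to be measured.)
gt_maps = np.load("gt_maps.npy", mmap_mode="r")

# Materialize one chunk as a torch tensor only when it is needed.
chunk = torch.from_numpy(np.ascontiguousarray(gt_maps[0:100]))
```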

@DSilva27 merged commit faf7e29 into dev Sep 18, 2024
6 checks passed