reduce memory burden of pipeline #51

Open
geoffwoollard opened this issue Jul 13, 2024 · 5 comments · Fixed by #95
Comments

@geoffwoollard
Collaborator

Especially large memory burden from the gt flat files (160 GB). Perhaps include a smaller version with some averaging...

Would also be good to benchmark memory usage.
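
For the benchmarking, a minimal option (sketch only, using psutil rather than anything already in the repo) is to sample the process's resident set size around each step:

import psutil

def rss_gb():
    # resident set size of the current process, in GB
    return psutil.Process().memory_info().rss / 1e9

before = rss_gb()
# ... run the step under test here, e.g. reading the gt flat file ...
after = rss_gb()
print(f'RSS before: {before:.2f} GB, after: {after:.2f} GB, delta: {after - before:.2f} GB')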

@geoffwoollard geoffwoollard self-assigned this Jul 13, 2024
@DSilva27 DSilva27 self-assigned this Jul 15, 2024
@DSilva27
Collaborator

Figured out a way to fix this issue by combining numpy's mmap_mode="r" lazy file loading with torch DataLoaders.
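
Something like the following sketch (the class and file names here are made up, not the actual implementation): a Dataset memory-maps the .npy flat file and copies out only the requested map in __getitem__, so the full array is never held in RAM.

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class MmapVolumeDataset(Dataset):
    # serves one map at a time from a memory-mapped .npy flat file
    def __init__(self, npy_path):
        self.volumes = np.load(npy_path, mmap_mode='r')  # nothing is read into RAM yet

    def __len__(self):
        return self.volumes.shape[0]

    def __getitem__(self, idx):
        # np.array() copies only this slice off disk, so the tensor holds a single map
        return torch.from_numpy(np.array(self.volumes[idx]))

loader = DataLoader(MmapVolumeDataset('gt_maps.npy'), batch_size=1)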

@geoffwoollard
Collaborator Author

The big gt flat file is used in the map-to-map step. Let's see if that unit test can pass...

For the distances implemented so far, the computation is parallelizable across maps, so technically only one map needs to be accessible at a time. The program holds all the files in memory only to speed up the computation.
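
As an illustration (a sketch only; the L2 distance and file names are stand-ins for the implemented metrics and actual paths), the distances could be accumulated by streaming one gt map at a time from the memory-mapped file:

import numpy as np

gt = np.load('gt_maps.npy', mmap_mode='r')  # flat file of shape (n_maps, ...)
submitted = np.load('submitted_map.npy')    # a single submitted map, small enough to keep in RAM

distances = np.empty(gt.shape[0])
for i in range(gt.shape[0]):
    gt_map = np.array(gt[i])  # only this map is read off disk
    distances[i] = np.linalg.norm(gt_map - submitted)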

@geoffwoollard
Collaborator Author

The gt volume flat file also takes a long time to read in (15-30 min), which is inconvenient for development: if a bug shows up downstream, it takes a long time before you see it.

@geoffwoollard
Collaborator Author

I don't see the point in using memmap for the aligned submission .pt files. The volumes are stored under the 'volumes' key; when that key is accessed, all submitted maps (all indices) are loaded into memory, and a single map can't be indexed out.

import numpy as np

# same data as the submission .pt file, but saved as an .npz archive
lazy = np.load(fname_same_format_as_pt_but_save_as_npz, mmap_mode='r')  # or 'r+'
all_volumes = lazy['volumes']  # accessing the key loads all submitted maps (all indices) at once
idx = 0  # some index, e.g. 0, 1, 2, ..., n-1
one_volume = lazy['volumes'][idx]  # the full array is still read before slicing

The volumes would have to be saved as a flat .npy file instead, which could then be memory-mapped and indexed into one map at a time.
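
A possible one-time conversion (just a sketch; the file names are made up, and 'volumes' follows the key mentioned above):

import numpy as np
import torch

data = torch.load('submission_aligned.pt')                   # one-time full load of the .pt
np.save('submission_volumes.npy', data['volumes'].numpy())   # write the volumes as a flat array

vols = np.load('submission_volumes.npy', mmap_mode='r')      # lazy from here on
one_map = np.array(vols[0])                                  # reads only this index off disk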

@DSilva27 made the point that they are not that big (~7 GB?), so there is not much point.
