reduce memory burden of pipeline #51

Open
geoffwoollard opened this issue Jul 13, 2024 · 5 comments · Fixed by #95
Comments

@geoffwoollard
Collaborator

Especially large memory burden from the gt flat files (160 GB). Perhaps include a smaller version with some averaging...

Would also be good to benchmark memory usage.
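
For the benchmarking, a minimal option (sketch only, using psutil rather than anything already in the repo) is to sample the process's resident set size around each step:

import psutil

def rss_gb():
    # resident set size of the current process, in GB
    return psutil.Process().memory_info().rss / 1e9

before = rss_gb()
# ... run the step under test here, e.g. reading the gt flat file ...
after = rss_gb()
print(f'RSS before: {before:.2f} GB, after: {after:.2f} GB, delta: {after - before:.2f} GB')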

@geoffwoollard geoffwoollard self-assigned this Jul 13, 2024
@DSilva27 DSilva27 self-assigned this Jul 15, 2024
@DSilva27
Collaborator

Figured out a way to fix this issue by combining numpy's mmap_mode="r" lazy file loading with torch DataLoaders.
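
Something like the following sketch (the class and file names here are made up, not the actual implementation): a Dataset memory-maps the .npy flat file and copies out only the requested map in __getitem__, so the full array is never held in RAM.

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class MmapVolumeDataset(Dataset):
    # serves one map at a time from a memory-mapped .npy flat file
    def __init__(self, npy_path):
        self.volumes = np.load(npy_path, mmap_mode='r')  # nothing is read into RAM yet

    def __len__(self):
        return self.volumes.shape[0]

    def __getitem__(self, idx):
        # np.array() copies only this slice off disk, so the tensor holds a single map
        return torch.from_numpy(np.array(self.volumes[idx]))

loader = DataLoader(MmapVolumeDataset('gt_maps.npy'), batch_size=1)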

@geoffwoollard
Collaborator Author

The big gt flat file is used in the map-to-map step. Let's see if that unit test can pass...

For the distances implemented so far, the computation is parallelizable across maps, so technically only one map needs to be accessible at a time. The program holds all the files in memory only to speed up the computation.
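
As an illustration (a sketch only; the L2 distance and file names are stand-ins for the implemented metrics and actual paths), the distances could be accumulated by streaming one gt map at a time from the memory-mapped file:

import numpy as np

gt = np.load('gt_maps.npy', mmap_mode='r')  # flat file of shape (n_maps, ...)
submitted = np.load('submitted_map.npy')    # a single submitted map, small enough to keep in RAM

distances = np.empty(gt.shape[0])
for i in range(gt.shape[0]):
    gt_map = np.array(gt[i])  # only this map is read off disk
    distances[i] = np.linalg.norm(gt_map - submitted)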

@geoffwoollard
Collaborator Author

The gt volume flat file also takes a long time to read in (15-30 min), which is inconvenient for development: if a bug shows up downstream, it takes a long time before you see it.

@geoffwoollard
Collaborator Author

I don't see the point in using memmap for the aligned submission .pt files. The volumes are stored under the 'volumes' key; when that key is accessed, all submitted maps (all indices) are loaded into memory, and a single map can't be indexed out.

import numpy as np

# same data as the submission .pt file, but saved as an .npz archive
lazy = np.load(fname_same_format_as_pt_but_save_as_npz, mmap_mode='r')  # or 'r+'
all_volumes = lazy['volumes']  # accessing the key loads all submitted maps (all indices) at once
idx = 0  # some index, e.g. 0, 1, 2, ..., n-1
one_volume = lazy['volumes'][idx]  # the full array is still read before slicing

The volumes would have to be saved as a flat .npy file instead, which could then be memory-mapped and indexed into one map at a time.
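
A possible one-time conversion (just a sketch; the file names are made up, and 'volumes' follows the key mentioned above):

import numpy as np
import torch

data = torch.load('submission_aligned.pt')                   # one-time full load of the .pt
np.save('submission_volumes.npy', data['volumes'].numpy())   # write the volumes as a flat array

vols = np.load('submission_volumes.npy', mmap_mode='r')      # lazy from here on
one_map = np.array(vols[0])                                  # reads only this index off disk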

@DSilva27 made the point that they are not that big (~7 GB?), so there is not much point.
