
load_ctc_data makes two copies of the loaded array #165

Open
1 of 3 tasks
DragaDoncila opened this issue Sep 19, 2024 · 0 comments
Labels
bug Something isn't working

Comments


Description

Our current _load_tiffs function temporarily holds two full copies of the data when loading from disk: the list of individual frames and the stacked array. The list copy is only garbage collected after the function returns. For medium to large datasets this can cause a memory error even though the dataset itself fits into memory just fine. I've confirmed this behaviour with a Python profiler.

import glob

import numpy as np
from tifffile import imread
from tqdm import tqdm


def _load_tiffs(data_dir):
    """Load a directory of individual frames into a stack.

    Args:
        data_dir (Path): Path to directory of tiff files

    Raises:
        FileNotFoundError: No tif files found in data_dir

    Returns:
        np.array: 4D array with dims TYXC
    """
    files = np.sort(glob.glob(f"{data_dir}/*.tif*"))
    if len(files) == 0:
        raise FileNotFoundError(f"No tif files were found in {data_dir}")

    ims = []
    for f in tqdm(files, "Loading TIFFs"):
        ims.append(imread(f))
    # ims now holds a full-sized copy of the data

    mov = np.stack(ims)
    # both ims and mov hold full-sized copies of the data before we return

    return mov
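To see the effect concretely, here is a minimal, self-contained sketch (using synthetic in-memory frames instead of real TIFFs; the function names are made up for illustration) that compares peak allocation of the two loading strategies with tracemalloc:

```python
import tracemalloc

import numpy as np


def stack_via_list(n_frames, frame_shape):
    # Mimics the current _load_tiffs: accumulate frames in a list, then stack.
    ims = [np.zeros(frame_shape, dtype=np.uint16) for _ in range(n_frames)]
    # At this point the list and the stacked copy coexist in memory.
    return np.stack(ims)


def stack_preallocated(n_frames, frame_shape):
    # Proposed fix: allocate the full array once and assign frames in place.
    stack = np.zeros((n_frames, *frame_shape), dtype=np.uint16)
    for i in range(n_frames):
        stack[i] = np.zeros(frame_shape, dtype=np.uint16)
    return stack


def peak_mb(fn):
    # Measure peak traced memory (in MB) while fn runs.
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1e6


list_peak = peak_mb(lambda: stack_via_list(50, (256, 256)))
prealloc_peak = peak_mb(lambda: stack_preallocated(50, (256, 256)))
# The list-based version should peak at roughly twice the preallocated one.
```

With 50 frames of 256×256 uint16 (~6.5 MB total), the list-based version peaks near double the data size, while the preallocated version peaks at the data size plus a single frame.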

We should update this code to peek at the first frame to determine its shape and dtype, then preallocate a single numpy array and assign frames into it. Roughly as below:

def _load_tiffs(data_dir):
    files = np.sort(glob.glob(f"{data_dir}/*.tif*"))
    if len(files) == 0:
        raise FileNotFoundError(f"No tif files were found in {data_dir}")

    # Peek at the first frame to get the shape and dtype, then preallocate
    first_im = imread(files[0])
    shape = (len(files), *first_im.shape)
    dtype = first_im.dtype
    stack = np.zeros(shape=shape, dtype=dtype)
    stack[0] = first_im

    # Read the remaining frames directly into the preallocated array
    for i, f in enumerate(tqdm(files[1:], "Loading TIFFs")):
        imread(f, out=stack[i + 1])

    return stack
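The same preallocation pattern can be sketched generically, independent of tifffile. In the sketch below, `load_stack` and `read_frame` are hypothetical names, and the per-file reader is passed in as a callable so the example stays self-contained and testable:

```python
import numpy as np


def load_stack(files, read_frame):
    """Load a sequence of files into a preallocated stack.

    Args:
        files: ordered sequence of file identifiers
        read_frame: callable mapping a file identifier to a numpy array;
            all frames must share one shape and dtype

    Returns:
        np.ndarray with shape (len(files), *frame_shape)
    """
    if len(files) == 0:
        raise FileNotFoundError("No files to load")

    # Peek at the first frame to size the output, then fill in place.
    first = read_frame(files[0])
    stack = np.empty((len(files), *first.shape), dtype=first.dtype)
    stack[0] = first
    for i, f in enumerate(files[1:], start=1):
        stack[i] = read_frame(f)
    return stack


# Usage with a fake reader that encodes the "filename" into the frame:
fake_read = lambda name: np.full((4, 5), int(name), dtype=np.uint8)
out = load_stack(["1", "2", "3"], fake_read)
```

Only one frame exists outside the output array at any time, so peak memory stays close to the size of the final stack.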

Minimal example to reproduce the bug

The best way to reproduce this is to load a dataset that is more than half your available RAM. I've usually noticed it when running pipelines over multiple datasets, but as mentioned above, I have confirmed it with a Python profiler (I can try to reproduce the profile at some stage if we want, but I don't think I have a copy anymore...).

Severity

  • Unusable
  • Annoying, but still functional
  • Very minor