Data loader #81

Open
pmelchior opened this issue Oct 14, 2024 · 0 comments

We need to get data into scarlet, and the problem is that it's often lying around in many different files. That's hard to manage and a massive hit on distributed file systems. So, we should package up everything we need to run scarlet into one or a small number of files and write a data loader to provide it to scarlet. I'm thinking of the following contents:

  1. Observation, i.e. fluxes
  2. Weight maps (inverse variance of flux)
  3. PSF model
  4. WCS and band and observation/instrument definitions

This brings up an interesting question: What's the size of these? We could say "full image", but that's a weird concept in the age of survey-sized coadds and ill-defined in multi-observation cases. So, it's probably better if we think of the fundamental unit as the scene, i.e. one or multiple sources (components) that are supposed to be modeled together. That does mean we need code to run beforehand to chop up larger images into scene-sized bites. LSST will do something like that and have PSF models for these small "cells", but other data sources may not. At this point we might as well utilize the detection code that must have run to define these cells, so I'm going to add (a sketch of the full record follows the list):

  5. Detection coordinates and (possibly) source type
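
To make the record concrete, here is a minimal sketch of what one scene-level data product could hold. All field names here are placeholders I'm assuming for illustration, not an existing scarlet2 interface:

from dataclasses import dataclass
import numpy as np

# Hypothetical container for one scene-level record; field names are
# placeholders for illustration, not settled design.
@dataclass
class SceneData:
    images: np.ndarray      # (bands, ny, nx) fluxes (item 1)
    weights: np.ndarray     # (bands, ny, nx) inverse variance (item 2)
    psf: np.ndarray         # (bands, py, px) PSF model (item 3)
    wcs: object             # e.g. an astropy.wcs.WCS (item 4)
    channels: list          # band/instrument labels (item 4)
    detections: np.ndarray  # (nsrc, 2) centers, plus optional type (item 5)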

What I have in mind is a UI like this:

import scarlet2
from scarlet2 import Frame, Observation, Scene, Source

loader = scarlet2.make_dataloader(path)

for data in loader:
    obs = Observation.from_loader(data)
    model_frame = Frame.from_observations(obs)
    detections = data.detections
    with Scene(model_frame) as scene:
        for detection in detections:
            center = detection.center
            # initialize source components and create sources
            source = Source(...)
    # define parameters and fit
    scene.fit(obs)
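
For concreteness, here is a minimal sketch of how make_dataloader could iterate over such a file, assuming an HDF5 layout with one group per scene and reusing the SceneData sketch above. The layout, the read_scene helper, and all dataset names are assumptions, not settled design:

import h5py

def read_scene(grp):
    # Assemble one SceneData record (see sketch above) from a scene group.
    return SceneData(
        images=grp["images"][:],
        weights=grp["weights"][:],
        psf=grp["psf"][:],
        wcs=None,  # e.g. rebuild from a FITS header string stored in grp.attrs
        channels=list(grp.attrs["channels"]),
        detections=grp["detections"][:],
    )

def make_dataloader(path):
    # Generator: yields one scene record at a time, never loading the whole file.
    with h5py.File(path, "r") as f:
        for name in sorted(f.keys()):
            yield read_scene(f[name])

Because each scene lives in its own group, reading one record touches only that group, which is exactly the row-level access pattern discussed below.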

In the multi-observation case, it would look like this:

loader = scarlet2.make_dataloader(path_lsst_with_euclid)
for data in loader:
    obs1, obs2 = Observation.from_loader(data)
    model_frame = Frame.from_observations(obs1, obs2)
    detections = data.detections
    ...
    scene.fit(obs1, obs2)

Note a few things here: there are multiple sets of observations but only one loader. This ensures that the observations are synchronized: it's the same scene in obs1 and obs2. There is also only one set of detections, namely joint detections.
So, there's a fair bit of code that needs to run before we can create a data loader like this. The main point is that we can split the creation of the scene-level data products from the fitting that consumes them. The results of these prior steps should be stored in a file format like HDF5 or parquet for fast access. Alternatively, we can stick to existing ML loader formats and use an interface like jax-dataloader. One thing to note is that we require fast row-level access, while most ML dataloaders are optimized for fast column-level access to create batches of the same structure.
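
As a rough illustration of that split, the preprocessing step could package each scene into its own HDF5 group, so that a single scene can later be read without scanning the rest of the file. The layout mirrors the loader sketch above and is equally hypothetical:

import h5py

def write_scenes(path, scenes):
    # `scenes` is assumed to yield (images, weights, psf, channels, detections)
    # tuples from the upstream cell/detection code.
    with h5py.File(path, "w") as f:
        for i, (images, weights, psf, channels, detections) in enumerate(scenes):
            grp = f.create_group(f"scene_{i:06d}")
            grp.create_dataset("images", data=images)
            grp.create_dataset("weights", data=weights)
            grp.create_dataset("psf", data=psf)
            grp.create_dataset("detections", data=detections)
            grp.attrs["channels"] = channels  # band labels stored as an attribute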

It's conceivable that we have preprocessed data with some form of scene-level synchronization. Then one could make this call in a more modular fashion:

loader = scarlet2.make_dataloader(path_lsst, path_euclid)

and possibly do some operations (like joint detection) on the fly. That would be my preference, but it may be impractically slow.
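
If both inputs were preprocessed into the same scene-keyed layout, the synchronization itself could be as simple as iterating over the scene IDs present in both files, reusing the hypothetical read_scene helper from the earlier sketch; any on-the-fly work like joint detection would slot in before the yield:

import h5py

def make_joint_dataloader(path1, path2):
    # Yield paired records for scenes that exist in both files (hypothetical).
    with h5py.File(path1, "r") as f1, h5py.File(path2, "r") as f2:
        for name in sorted(set(f1.keys()) & set(f2.keys())):
            yield read_scene(f1[name]), read_scene(f2[name])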
