Data loader #81

Open
pmelchior opened this issue Oct 14, 2024 · 0 comments

We need to get data into scarlet, and the problem is that it's often lying around in many different files. That's hard to manage and a massive hit on distributed file systems. So, we should package up everything we need to run scarlet into one or a small number of files and write a data loader to provide it to scarlet. I'm thinking of the following contents:

  1. Observation, i.e. fluxes
  2. Weight maps (inverse variance of flux)
  3. PSF model
  4. WCS and band and observation/instrument definitions

This brings up an interesting question: What's the size of these? We could say "full image", but that's a weird concept in the age of survey-sized coadds and ill-defined in multi-observation cases. So, it's probably better if we think of the fundamental unit as the scene, i.e. one or multiple sources (components) that are supposed to be modeled together. That does mean we need code to run beforehand to chop up larger images into scene-sized bites. LSST will do something like that and have PSF models for these small "cells", but other data sources may not. At this point we might as well utilize the detection code that must have run to define these cells, so I'm going to add (a sketch of the full record follows the list):

  5. Detection coordinates and (possibly) source type
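
To make the record concrete, here is a minimal sketch of what one scene-level data product could hold. All field names here are placeholders I'm assuming for illustration, not an existing scarlet2 interface:

from dataclasses import dataclass
import numpy as np

# Hypothetical container for one scene-level record; field names are
# placeholders for illustration, not settled design.
@dataclass
class SceneData:
    images: np.ndarray      # (bands, ny, nx) fluxes (item 1)
    weights: np.ndarray     # (bands, ny, nx) inverse variance (item 2)
    psf: np.ndarray         # (bands, py, px) PSF model (item 3)
    wcs: object             # e.g. an astropy.wcs.WCS (item 4)
    channels: list          # band/instrument labels (item 4)
    detections: np.ndarray  # (nsrc, 2) centers, plus optional type (item 5)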

What I have in mind is a UI like this:

import scarlet2
from scarlet2 import Frame, Observation, Scene, Source

loader = scarlet2.make_dataloader(path)

for data in loader:
    obs = Observation.from_loader(data)
    model_frame = Frame.from_observations(obs)
    detections = data.detections
    with Scene(model_frame) as scene:
        for detection in detections:
            center = detection.center
            # initialize source components and create sources
            source = Source(...)
    # define parameters and fit
    scene.fit(obs)
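
For concreteness, here is a minimal sketch of how make_dataloader could iterate over such a file, assuming an HDF5 layout with one group per scene and reusing the SceneData sketch above. The layout, the read_scene helper, and all dataset names are assumptions, not settled design:

import h5py

def read_scene(grp):
    # Assemble one SceneData record (see sketch above) from a scene group.
    return SceneData(
        images=grp["images"][:],
        weights=grp["weights"][:],
        psf=grp["psf"][:],
        wcs=None,  # e.g. rebuild from a FITS header string stored in grp.attrs
        channels=list(grp.attrs["channels"]),
        detections=grp["detections"][:],
    )

def make_dataloader(path):
    # Generator: yields one scene record at a time, never loading the whole file.
    with h5py.File(path, "r") as f:
        for name in sorted(f.keys()):
            yield read_scene(f[name])

Because each scene lives in its own group, reading one record touches only that group, which is exactly the row-level access pattern discussed below.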

In the multi-observation case, it would look like this:

loader = scarlet2.make_dataloader(path_lsst_with_euclid)
for data in loader:
    obs1, obs2 = Observation.from_loader(data)
    model_frame = Frame.from_observations(obs1, obs2)
    detections = data.detections
    ...
    scene.fit(obs1, obs2)

Note a few things here: there are multiple sets of observations but only one loader. This ensures that the observations are synchronized: it's the same scene in obs1 and obs2. There is also only one set of detections, namely joint detections.
So, there's a fair bit of code that needs to run before we can create a data loader like this. The main point is that we can split the creation of the scene-level data products from the fitting that consumes them. The results of these prior steps should be stored in a file format like HDF5 or parquet for fast access. Alternatively, we can stick to existing ML loader formats and use an interface like jax-dataloader. One thing to note is that we require fast row-level access, while most ML dataloaders are optimized for fast column-level access to create batches of the same structure.
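
As a rough illustration of that split, the preprocessing step could package each scene into its own HDF5 group, so that a single scene can later be read without scanning the rest of the file. The layout mirrors the loader sketch above and is equally hypothetical:

import h5py

def write_scenes(path, scenes):
    # `scenes` is assumed to yield (images, weights, psf, channels, detections)
    # tuples from the upstream cell/detection code.
    with h5py.File(path, "w") as f:
        for i, (images, weights, psf, channels, detections) in enumerate(scenes):
            grp = f.create_group(f"scene_{i:06d}")
            grp.create_dataset("images", data=images)
            grp.create_dataset("weights", data=weights)
            grp.create_dataset("psf", data=psf)
            grp.create_dataset("detections", data=detections)
            grp.attrs["channels"] = channels  # band labels stored as an attribute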

It's conceivable that we have preprocessed data with some form of scene-level synchronization. Then one could make this call in a more modular fashion:

loader = scarlet2.make_dataloader(path_lsst, path_euclid)

and possibly do some operations (like joint detection) on the fly. That would be my preference, but it may be impractically slow.
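
If both inputs were preprocessed into the same scene-keyed layout, the synchronization itself could be as simple as iterating over the scene IDs present in both files, reusing the hypothetical read_scene helper from the earlier sketch; any on-the-fly work like joint detection would slot in before the yield:

import h5py

def make_joint_dataloader(path1, path2):
    # Yield paired records for scenes that exist in both files (hypothetical).
    with h5py.File(path1, "r") as f1, h5py.File(path2, "r") as f2:
        for name in sorted(set(f1.keys()) & set(f2.keys())):
            yield read_scene(f1[name]), read_scene(f2[name])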
