save pipeline structure with parameters for reproduction #92

dschick · 2024-01-05T23:15:28Z

Hi Sciline team,

First of all, many thanks for this great package. We were thinking of designing a similar system for pipeline data processing based on xarray data containers and luckily found your work before writing a line of code.

In some of our data analysis tasks, we have some rather expensive producers (e.g. phase retrieval methods for X-ray holography), and in addition to the actual result of a pipeline, we would also like to save how we got there.

Obviously, we could save the list of producers and parameters on our own, but how about a dedicated method of the pipeline class similar to visualize(), which could return not only the structure of the graph but also the actual parameter values?

In addition to the names of the producers, one could also think of saving their source code and/or a hash of it for full reproducibility.
This feature could be extremely helpful during beamtimes, when code is often changed during online analysis.
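
As a rough illustration of the hashing idea, here is a minimal sketch that is not part of Sciline. `provider_fingerprint` is a hypothetical helper; it hashes a function's source text and falls back to the compiled bytecode when the source cannot be retrieved.

```python
import hashlib
import inspect


def provider_fingerprint(func):
    """Return the qualified name of a function and a SHA-256 hash of its code.

    Hypothetical helper for illustration only; Sciline does not provide it.
    """
    try:
        # Preferred: hash the source text exactly as written.
        payload = inspect.getsource(func).encode("utf-8")
        kind = "source"
    except OSError:
        # Source is unavailable (e.g. code run from a string); use the
        # bytecode instead. This is coarser because constants are not included.
        payload = func.__code__.co_code
        kind = "bytecode"
    return {
        "name": f"{func.__module__}.{func.__qualname__}",
        "hashed": kind,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
```

Such a hash only covers the function body itself, not any helper functions it calls.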

Best

Daniel

jl-wynen · 2024-01-08T09:07:30Z

This was on our list of requirements when designing Sciline but it had a low priority lately. So thanks for the reminder!

As you said, we will likely have to write the graph and parameter values. There are some open questions, though:

What graph format do we use? Are dot files (which visualize effectively uses) enough? This may link to the discussion in Detect unused parameters #43.
Do we write the parameters into a separate file? If so, what format? And keep in mind that this needs to work for all parameter types, scipp, builtin, xarray, etc.

> In addition to the names of the producers, one could also think of saving their source code and/or a hash of it for full reproducibility.

This is an interesting idea. But it would be an incomplete solution because providers typically call additional functions and we couldn't reasonably write their source code or hash, too.

In our case, we expect to have a script or Jupyter notebook that defines the graph and possibly some specialised providers, as well as one or more packages that define most providers. My assumption was that we at least write the precise version of all relevant packages (or a full pip freeze or conda list) and, at least for code controlled by us, the full script or notebook that defines the pipeline. Those files can then be archived in SciCat together with the processed data.
But I admit that this requires some work from the pipeline author and only really works when we can associate all files with each other with a catalogue like SciCat.
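
Recording the package versions could look roughly like the following sketch, which uses only the standard library. `environment_snapshot` is a hypothetical name, and the output layout is an assumption rather than anything Sciline or SciCat prescribes.

```python
import json
import platform
from importlib import metadata


def environment_snapshot():
    """Collect the Python version and the versions of all installed packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


# The snapshot is plain JSON and can be archived next to the processed data.
snapshot_json = json.dumps(environment_snapshot(), indent=2)
```

This is similar in spirit to a pip freeze, but it does not capture conda-only packages or the script or notebook that defines the pipeline.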

dschick · 2024-01-08T15:23:36Z

It's great that you have a similar interest here :)
And obviously, your thoughts are already more advanced than mine.

I'll be happy to discuss this further at any point.
Feel free to close this issue for now.

Best

Daniel

jl-wynen · 2024-01-08T15:26:53Z

I'll keep it open as a reminder.

I'd be happy to hear your insights into how you and your users want to handle provenance and what requirements you have!

SimonHeybrock · 2024-01-09T06:44:37Z

My idea so far was to store the graph in a Sciline-independent manner. "Producers" and "Parameters" are strictly speaking an implementation detail of Sciline, so one would not want to rely on this for long-term archiving of data, FAIR data, ...

The computational graph is hopefully more meaningful (when combined with input parameters). So we should look into how this can be stored in a generic manner. I don't know if studying, e.g., how Snakemake handles this can provide some guidance.

SimonHeybrock · 2024-02-07T14:21:28Z

Conclusion for now:

1. Understanding how to store parameters (They may be large? Should we store hashes? ...) will require more thought.
2. As a first step, let us implement a simple way of serializing/storing the graph; this should not be blocked by item 1.
3. Secondly, think about/implement a "good enough for now" solution for parameters; maybe large parameters are uncommon and we can simply ignore that problem for now.

jl-wynen · 2024-02-07T14:34:13Z

To get an overview of some formats used in practice, take a look at https://networkx.org/documentation/stable/reference/readwrite/index.html

From this list, I'd prefer json or possibly adjacency list / multiline adjacency list. The former in particular, because it makes it easy to also store parameter values without inventing a new format.
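
To make the point about parameter values concrete, here is a minimal sketch in the spirit of a node-link JSON layout (a list of nodes plus a list of links). It uses only the standard library. `graph_to_json`, the node names, and the values are made up for illustration, and this is not the format implemented in Sciline.

```python
import json


def graph_to_json(edges, parameters):
    """Serialize a directed graph and its parameter values to a JSON string.

    edges: iterable of (source, target) pairs.
    parameters: dict mapping node name to a JSON-compatible value.
    """
    edges = list(edges)
    names = sorted({n for edge in edges for n in edge} | set(parameters))
    nodes = []
    for name in names:
        node = {"id": name}
        if name in parameters:
            # Parameter values are stored directly on the node.
            node["value"] = parameters[name]
        nodes.append(node)
    doc = {
        "directed": True,
        "multigraph": False,
        "nodes": nodes,
        "links": [{"source": s, "target": t} for s, t in edges],
    }
    return json.dumps(doc, indent=2)


# Hypothetical example graph: two parameters feed into a result.
example = graph_to_json(
    edges=[("Filename", "RawData"), ("RawData", "Result"), ("Threshold", "Result")],
    parameters={"Filename": "example.h5", "Threshold": 0.5},
)
```

Keeping the values on the nodes avoids a second file, at the cost of only supporting values that JSON can represent.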

jl-wynen · 2024-02-07T14:34:54Z

> Understanding how to store parameters (They may be large? Should we store hashes? ...) will require more thought.

Can you explain why parameters might be large? I thought they would only be single/few numbers or strings. All large data would be read from a file.

SimonHeybrock · 2024-02-08T06:43:05Z

> > Understanding how to store parameters (They may be large? Should we store hashes? ...) will require more thought.
>
> Can you explain why parameters might be large? I thought they would only be single/few numbers or strings. All large data would be read from a file.

A parameter can be anything. For example you can process an intermediate result, set it as a parameter, and create a new task graph. Parameters can thus be anything, can have arbitrary size, and they might not be serializable at all.
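
One possible "good enough" treatment of such values is sketched below, under stated assumptions: small JSON-compatible values are stored inline, larger picklable values are replaced by a hash, and everything else is recorded by type only. `describe_parameter` and the `max_bytes` threshold are hypothetical and not part of Sciline.

```python
import hashlib
import json
import pickle


def describe_parameter(value, max_bytes=1024):
    """Return a JSON-compatible record describing a parameter value."""
    type_name = f"{type(value).__module__}.{type(value).__qualname__}"
    try:
        encoded = json.dumps(value)
    except (TypeError, ValueError):
        encoded = None  # not representable in JSON
    if encoded is not None and len(encoded.encode("utf-8")) <= max_bytes:
        return {"type": type_name, "value": value}
    try:
        # Pickle bytes are not guaranteed to be stable across Python or
        # library versions, so this hash is only a rough identity check.
        digest = hashlib.sha256(pickle.dumps(value)).hexdigest()
    except Exception:
        return {"type": type_name, "value": None, "note": "not serializable"}
    return {"type": type_name, "sha256": digest}
```

For example, `describe_parameter(3.5)` keeps the value inline, while a lambda ends up marked as not serializable.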

jl-wynen · 2024-02-08T08:29:17Z

Are there any objections to using the json format described by networkx? If not, I'll implement that.

SimonHeybrock · 2024-02-08T08:34:17Z

JSON sounds good!

jl-wynen · 2024-03-14T08:10:08Z

First part done in #124. Now we need to figure out how to handle parameters.

jl-wynen mentioned this issue Jan 8, 2024

order of producers are neglected #93

Closed

jl-wynen mentioned this issue Jan 22, 2024

Support for graph operations #107

Closed

jl-wynen self-assigned this Feb 8, 2024

jl-wynen mentioned this issue Feb 13, 2024

JSON serializer for task graphs #124

Merged

jl-wynen mentioned this issue Mar 14, 2024

Deserialize and run task graphs #147

Open

jl-wynen removed their assignment Mar 14, 2024
