save pipeline structure with parameters for reproduction #92

Open
dschick opened this issue Jan 5, 2024 · 11 comments

@dschick

dschick commented Jan 5, 2024

Hi Sciline team,

First of all, many thanks for this great package. We were just thinking about designing a similar system for pipeline-based data processing on top of xarray data containers, and luckily found your work before writing a line of code.

Some of our data analysis tasks involve rather expensive producers (e.g. phase retrieval methods for X-ray holography), and in addition to the actual result of a pipeline, we would also like to save how we got there.

Obviously, we could save the list of producers and parameters on our own, but how about a dedicated method of the pipeline class similar to visualize(), which could return not only the structure of the graph but also the actual parameter values?

In addition to the names of the producers, one could also think of saving their source code and/or a hash of it for full reproducibility.
This feature could be extremely helpful during beamtimes, when code is often changed during online analysis.

Best

Daniel

@jl-wynen
Member

jl-wynen commented Jan 8, 2024

This was on our list of requirements when designing Sciline but it had a low priority lately. So thanks for the reminder!

As you said, we will likely have to write the graph and parameter values. There are some open questions, though:

  • What graph format do we use? Are dot files (which visualize effectively uses) enough? This may link to the discussion in Detect unused parameters #43.
  • Do we write the parameters into a separate file? If so, what format? And keep in mind that this needs to work for all parameter types, scipp, builtin, xarray, etc.

> In addition to the names of the producers, one could also think of saving their source code and/or a hash of it for full reproducibility.

This is an interesting idea. But it would be an incomplete solution because providers typically call additional functions and we couldn't reasonably write their source code or hash, too.
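
As a rough illustration of the hashing idea and its limits, here is a minimal sketch using only the Python standard library. The helper `provider_fingerprint` and the example provider `scale` are hypothetical names, not part of Sciline. The digest covers only the body of the function itself, so a change in any function it calls goes unnoticed, which is the gap described above.

```python
import hashlib
import inspect


def provider_fingerprint(func):
    """Return a SHA-256 hex digest identifying the body of ``func``.

    Hypothetical helper, not a Sciline API. Only works for plain
    Python functions, and ignores everything that ``func`` calls.
    """
    try:
        # Preferred: hash the source text as written by the author.
        payload = inspect.getsource(func).encode("utf-8")
    except (OSError, TypeError):
        # Source unavailable (e.g. defined interactively): fall back to
        # the bytecode, which is only comparable within one Python version.
        payload = func.__code__.co_code
    return hashlib.sha256(payload).hexdigest()


def scale(value: float, factor: float) -> float:  # hypothetical provider
    return value * factor


# A record that could be stored next to the result of a pipeline run.
record = {
    "provider": f"{scale.__module__}.{scale.__qualname__}",
    "sha256": provider_fingerprint(scale),
}
```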

In our case, we expect to have a script or Jupyter notebook that defines the graph and possibly some specialised providers, as well as one or more packages that define most providers. My assumption was that we at least write the precise version of all relevant packages (or a full pip freeze or conda list) and, at least for code controlled by us, the full script or notebook that defines the pipeline. Those files can then be archived in SciCat together with the processed data.
But I admit that this requires some work from the pipeline author and only really works when we can associate all files with each other with a catalogue like SciCat.
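
For the package-version part, something along the following lines might be enough. This is only a sketch using the standard library's `importlib.metadata`, roughly comparable to a `pip freeze`; the function name `environment_snapshot` and the layout of the resulting dictionary are assumptions made for illustration.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot():
    """Collect interpreter and installed-package versions (hypothetical helper)."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip installations with broken metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": dict(sorted(packages.items())),
    }


# JSON text that could be archived alongside the processed data.
snapshot_json = json.dumps(environment_snapshot(), indent=2)
```

Note that this only sees packages visible to the running interpreter and does not capture conda-only (non-Python) dependencies.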

@dschick
Author

dschick commented Jan 8, 2024

It's great that you have a similar interest here :)
And obviously, your thoughts are already more advanced than mine.

I'll be happy to discuss this further at any point.
Feel free to close this issue for now.

Best

Daniel

@jl-wynen
Member

jl-wynen commented Jan 8, 2024

I'll keep it open as a reminder.

I'd be happy to hear your insights into how you and your users want to handle provenance and what requirements you have!

@SimonHeybrock
Member

My idea so far was to store the graph in a Sciline-independent manner. "Producers" and "Parameters" are strictly speaking an implementation detail of Sciline, so one would not want to rely on this for long-term archiving of data, FAIR data, ...

The computational graph is hopefully more meaningful (when combined with input parameters). So we should look into how this can be stored in a generic manner. I don't know if studying, e.g., how Snakemake handles this can provide some guidance.

@SimonHeybrock
Member

SimonHeybrock commented Feb 7, 2024

Conclusion for now:

  1. Understanding how to store parameters (They may be large? Should we store hashes? ...) will require more thought.
  2. As a first step, let us implement a simple way of serializing/storing the graph; this should not be blocked by item 1.
  3. Secondly, think about/implement a "good enough for now" solution for parameters. Maybe large parameters are uncommon and we can simply ignore that problem for now.

@jl-wynen
Member

jl-wynen commented Feb 7, 2024

To get an overview of some formats used in practice, take a look at https://networkx.org/documentation/stable/reference/readwrite/index.html

From this list, I'd prefer json or possibly adjacency list / multiline adjacency list. The former in particular, because it makes it easy to also store parameter values without inventing a new format.
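
To make the JSON option concrete, here is a small stdlib-only sketch of a node-link style document, modelled loosely on the layout that networkx's JSON support produces. The graph, the parameter values, and the extra `kind` and `value` node attributes are made up for illustration, and the exact key names networkx uses may differ between versions. It shows how parameter values can ride along as node attributes without a second file format.

```python
import json

# Hypothetical task graph: each key is computed from the names it lists.
dependencies = {
    "Filename": [],
    "RawData": ["Filename"],
    "Threshold": [],
    "Result": ["RawData", "Threshold"],
}
parameters = {"Filename": "run_001.h5", "Threshold": 0.5}  # example values


def to_node_link(dependencies, parameters):
    """Build a node-link style dictionary with parameters as node attributes."""
    nodes = []
    for name in dependencies:
        node = {"id": name, "kind": "parameter" if name in parameters else "provider"}
        if name in parameters:
            node["value"] = parameters[name]
        nodes.append(node)
    # One link per dependency, pointing from the input to the thing computed.
    links = [
        {"source": dep, "target": name}
        for name, deps in dependencies.items()
        for dep in deps
    ]
    return {"directed": True, "multigraph": False, "graph": {}, "nodes": nodes, "links": links}


document = json.dumps(to_node_link(dependencies, parameters), indent=2)
restored = json.loads(document)  # round trip back into plain Python objects
```

This only works as long as every stored value is itself JSON-serializable.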

@jl-wynen
Member

jl-wynen commented Feb 7, 2024

> 1. Understanding how to store parameters (They may be large? Should we store hashes? ...) will require more thought.

Can you explain why parameters might be large? I thought they would only be single/few numbers or strings. All large data would be read from a file.

@SimonHeybrock
Member

> > 1. Understanding how to store parameters (They may be large? Should we store hashes? ...) will require more thought.

> Can you explain why parameters might be large? I thought they would only be single/few numbers or strings. All large data would be read from a file.

A parameter can be anything. For example you can process an intermediate result, set it as a parameter, and create a new task graph. Parameters can thus be anything, can have arbitrary size, and they might not be serializable at all.
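
One possible "good enough for now" approach, sketched here under stated assumptions: store small JSON-compatible values inline, store only a hash for large ones, and record just the type when a value cannot be serialized at all. The function `describe_parameter` and the size limit `MAX_INLINE_CHARS` are hypothetical and not part of Sciline.

```python
import hashlib
import json
import pickle

MAX_INLINE_CHARS = 200  # arbitrary cut-off chosen for this sketch


def describe_parameter(value):
    """Return a JSON-compatible record for a parameter of any type."""
    type_name = f"{type(value).__module__}.{type(value).__qualname__}"
    try:
        encoded = json.dumps(value, sort_keys=True)
        if len(encoded) <= MAX_INLINE_CHARS:
            # Small and JSON-compatible: store the value itself.
            return {"type": type_name, "value": json.loads(encoded)}
        digest_input = encoded.encode("utf-8")
    except (TypeError, ValueError):
        try:
            # Not JSON-compatible: hash the pickle instead. Pickle bytes are
            # not guaranteed to be stable across Python or library versions.
            digest_input = pickle.dumps(value)
        except Exception:
            # Not serializable at all: only the type can be recorded.
            return {"type": type_name, "value": None, "note": "not serializable"}
    # Large or non-JSON value: store only a digest.
    return {"type": type_name, "sha256": hashlib.sha256(digest_input).hexdigest()}


small = describe_parameter(0.5)                 # stored inline
large = describe_parameter(list(range(10000)))  # stored as a hash
opaque = describe_parameter(lambda x: x)        # cannot be pickled
```

A hash does not allow the value to be restored, but it does let a later run check whether it was given the same input.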

@jl-wynen
Member

jl-wynen commented Feb 8, 2024

Are there any objections to using the json format described by networkx? If not, I'll implement that.

@SimonHeybrock
Member

JSON sounds good!

@jl-wynen
Member

First part done in #124. Now we need to figure out how to handle parameters.
