[Feature request]: introduce a process action + associated configuration #454

Open
pearsonca opened this issue Jan 8, 2025 · 5 comments
Labels
cli: Relating to command line interfaces. enhancement: Request for improvement or addition of new feature(s). medium priority: Medium priority. post-processing: Concern the post-processing.

Comments

@pearsonca
Contributor

Label

enhancement, meta/workflow, post-processing

Priority Label

medium priority

Is your feature request related to a problem? Please describe.

Users currently perform post-processing steps (e.g. rendering notebooks, running transformation scripts) manually, outside of the pipeline. To make these steps more efficient for power users and more accessible for lay users, flepimop should support post-processing within the pipeline.

Is your feature request related to a new application, scenario round, pathogen? Please describe.

No response

Describe the solution you'd like

Processing steps are likely to be highly specific to any given analysis, so this is not a request to implement lots of generalized code for post processing.

Rather, flepimop should support a workflow like:

$ flepimop simulate someconfig.yml # do the simulation + produce some results
$ flepimop process someconfig.yml # execute some post processing analysis, e.g. render a notebook

with a someconfig.yml section like

process:
  jupyter:
    file: somenotebook.ipynb
  rmarkdown:
    file: someothernotebook.rmd
    args: etc

which reads as "for the process action, use the 'jupyter' module (which knows to look for a file + other arguments) to render X, and the rmarkdown module (again, which knows to look for a file + other arguments) to render Y"
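As a rough illustration of that reading, here is a minimal sketch of turning the `process` section into an ordered list of steps. The `ProcessStep` class and `parse_process_section` function are hypothetical names, not part of flepimop, and the config is written as an already-loaded dictionary rather than parsed from YAML.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ProcessStep:
    """One entry of the `process` section (hypothetical structure)."""

    module: str  # e.g. "jupyter" or "rmarkdown"
    file: str  # target file the module acts on
    options: dict[str, Any] = field(default_factory=dict)  # remaining keys


def parse_process_section(config: dict[str, Any]) -> list[ProcessStep]:
    """Turn the `process` mapping into steps, keeping the order in the file."""
    steps = []
    for module, entry in config.get("process", {}).items():
        entry = dict(entry)  # copy so the caller's config is left untouched
        file = entry.pop("file")  # assumption: every module needs a file
        steps.append(ProcessStep(module=module, file=file, options=entry))
    return steps


# The example section above, as it might look once loaded into Python objects.
config = {
    "process": {
        "jupyter": {"file": "somenotebook.ipynb"},
        "rmarkdown": {"file": "someothernotebook.rmd", "args": "etc"},
    }
}

for step in parse_process_section(config):
    print(step.module, step.file, step.options)
```

Keeping everything other than `file` in a free-form `options` mapping leaves each module free to decide which additional arguments it understands.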

In general, process modules should support executing some standard action (e.g. jupyter notebook rendering) on a target file (e.g. a notebook) providing the configuration information in a standard way (likely as the path to the file itself) + additional arguments as specified in the configuration file.

For typical use cases (e.g. notebooks, R or python post-processing scripts), we should include in the library some examples / templates that show argument parsing so users don't have to reinvent that every time they make a new notebook.
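A template of this kind might look something like the following sketch for a Python post-processing script. It assumes the configuration path arrives as the first positional argument; the `--output-dir` option is purely illustrative and not an existing flepimop convention.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Argument parser that a post-processing script could copy as-is."""
    parser = argparse.ArgumentParser(
        description="Template post-processing script (illustrative)."
    )
    # Assumption: the process action passes the config path positionally.
    parser.add_argument("config", type=Path, help="path to the configuration file")
    # Hypothetical extra option, standing in for module-specific arguments.
    parser.add_argument(
        "--output-dir",
        type=Path,
        default=Path("."),
        help="where to write post-processing results",
    )
    return parser


def main(argv=None) -> int:
    """Parse the arguments and report them; real analysis code would go here."""
    args = build_parser().parse_args(argv)
    print(f"post-processing with config={args.config} output_dir={args.output_dir}")
    return 0


# Example invocation with an explicit argument list instead of sys.argv.
main(["someconfig.yml", "--output-dir", "figures"])
```

In a real script, the last line would typically be replaced by a call to `main()` under an `if __name__ == "__main__":` guard so that the arguments come from the command line.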

I think the obvious initial "modules" are:

  • execute a particular bash command (series of bash commands?)
  • run a bash script
  • run an R script (series of?)
  • run a python script (series of?)
  • render Rmarkdown
  • render ipynb
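One way to picture these modules is as a small registry that maps a module name to the command line it would run. The sketch below is only an assumption about how that could look: apart from `exec`, `jupyter`, and `rmarkdown`, the module names are invented, and the HTML output format for notebooks is an arbitrary choice.

```python
import json
import shlex
import sys

# Hypothetical registry: module name -> function building an argv list
# from the target (a file or command string) and extra arguments.
COMMAND_BUILDERS = {
    # run a bash command given as a single string
    "exec": lambda target, extra: ["bash", "-c", target],
    # run a bash script
    "bash": lambda target, extra: ["bash", target, *extra],
    # run an R script
    "rscript": lambda target, extra: ["Rscript", target, *extra],
    # run a python script with the current interpreter
    "python": lambda target, extra: [sys.executable, target, *extra],
    # render Rmarkdown; json.dumps yields a double-quoted, escaped string
    "rmarkdown": lambda target, extra: [
        "Rscript", "-e", f"rmarkdown::render({json.dumps(target)})",
    ],
    # execute and render a Jupyter notebook to HTML
    "jupyter": lambda target, extra: [
        "jupyter", "nbconvert", "--to", "html", "--execute", target,
    ],
}


def build_command(module, target, extra=()):
    """Return the argv list for a module, or raise on an unknown module."""
    try:
        builder = COMMAND_BUILDERS[module]
    except KeyError:
        raise ValueError(f"unknown process module: {module!r}") from None
    return builder(target, list(extra))


print(shlex.join(build_command("jupyter", "somenotebook.ipynb")))
print(shlex.join(build_command("rmarkdown", "someothernotebook.rmd")))
```

Returning argument lists rather than shell strings means the commands can be handed to `subprocess.run` without additional quoting, and printed via `shlex.join` when only a report is wanted.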

A dry run of process should report what will be executed. It should probably also report which other steps in the configuration appear incomplete (for example, if simulate has not been run, there are no output results). I don't think we want the configuration to specify which processing steps depend on which other steps (and we definitely should not try to introspect that), but we should alert users that something else in this configuration doesn't appear to have happened, so if processing depends on it, it's not going to work.
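The dry-run behaviour could be sketched roughly as follows. The function name, the report wording, and the idea of checking one expected output path per earlier action are all assumptions made for illustration; the `model_output` path in the example is likewise just a placeholder.

```python
from __future__ import annotations

import shlex
from pathlib import Path


def dry_run_report(commands: list[list[str]], expected_outputs: dict[str, Path]) -> str:
    """List the commands that would run and warn about apparently missing outputs.

    `expected_outputs` maps an earlier action (e.g. "simulate") to a path that
    should exist once that action has completed (a hypothetical convention).
    """
    lines = ["Would execute:"]
    lines += [f"  {shlex.join(command)}" for command in commands]
    missing = {
        action: path
        for action, path in expected_outputs.items()
        if not Path(path).exists()
    }
    if missing:
        lines.append("Warning: these earlier actions do not appear to have run:")
        lines += [f"  {action} (no output at {path})" for action, path in missing.items()]
    return "\n".join(lines)


print(
    dry_run_report(
        [["jupyter", "nbconvert", "--to", "html", "--execute", "somenotebook.ipynb"]],
        {"simulate": Path("model_output")},
    )
)
```

The warning is advisory only: nothing here tries to work out whether a given processing step actually depends on the missing output.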

This issue depends upon having completed #451

@pearsonca
Contributor Author

For process, users will likely want to specify multiple steps, but will not necessarily always want to run them all.

I imagine the default being: run all specified modules, in specified order.

Might also want to support a steps or stages key (basically, the "scenarios" equivalent), which allows an order specification. Something like:

process:
  steps: [jupyter, rmarkdown, exec]

which reads as run all of the jupyter steps, then rmarkdown, then exec. If we want to have modules support multiple internal steps (e.g. multiple notebooks to render), then we could do something like [jupyter::1, rmarkdown, jupyter::2] to express jupyter step 1 first, then all the rmarkdown step(s), then jupyter step 2.
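To make the `module::n` idea concrete, here is a small sketch of resolving such selectors into an execution order. It assumes steps are numbered from 1 and that each module holds a list of targets; both the function names and that data layout are hypothetical.

```python
from __future__ import annotations


def parse_selector(selector: str) -> tuple[str, int | None]:
    """Split "jupyter::2" into ("jupyter", 2); a bare "jupyter" gives ("jupyter", None)."""
    module, separator, index = selector.partition("::")
    if not separator:
        return module, None
    return module, int(index)


def resolve_steps(selectors: list[str], available: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Expand selectors into (module, target) pairs in the requested order."""
    resolved = []
    for selector in selectors:
        module, index = parse_selector(selector)
        targets = available[module]
        if index is None:
            # A bare module name means all of its steps, in their listed order.
            resolved.extend((module, target) for target in targets)
        elif 1 <= index <= len(targets):
            resolved.append((module, targets[index - 1]))  # 1-based index
        else:
            raise IndexError(f"{module} has no step {index}")
    return resolved


# Hypothetical layout: each module lists the files it should handle.
available = {
    "jupyter": ["first.ipynb", "second.ipynb"],
    "rmarkdown": ["report.rmd"],
}
for module, target in resolve_steps(["jupyter::1", "rmarkdown", "jupyter::2"], available):
    print(module, target)
```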

We'd also need to support dynamically overriding the steps from the command line:

$ flepimop process someconfig.yml steps=jupyter::2 # just run the second jupyter module step
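A command-line override like this could be handled along the lines of the sketch below. It assumes overrides arrive as `key=value` strings, that they apply to the `process` section, and that several values may be given as a comma-separated list; none of that is settled in this issue.

```python
from __future__ import annotations

from typing import Any


def parse_override(argument: str) -> tuple[str, list[str]]:
    """Split "steps=jupyter::2" into ("steps", ["jupyter::2"])."""
    key, separator, value = argument.partition("=")
    if not separator or not key:
        raise ValueError(f"expected key=value, got {argument!r}")
    # Assumption: multiple values are comma-separated.
    return key, [item.strip() for item in value.split(",") if item.strip()]


def apply_overrides(config: dict[str, Any], arguments: list[str]) -> dict[str, Any]:
    """Return a copy of the config with the process-section overrides applied."""
    merged = {**config, "process": dict(config.get("process", {}))}
    for argument in arguments:
        key, values = parse_override(argument)
        merged["process"][key] = values
    return merged


config = {"process": {"steps": ["jupyter", "rmarkdown", "exec"]}}
print(apply_overrides(config, ["steps=jupyter::2"])["process"]["steps"])
```

Returning a new dictionary rather than mutating the loaded configuration keeps the on-disk settings and the one-off command-line choice clearly separate.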

@pearsonca
Contributor Author

pearsonca commented Jan 8, 2025

question to potential users @saraloo @MacdonaldJoshuaCaleb @alsnhll: do we want to also / instead support syntax like

$ flepimop simulate someconfig.yml --render=somenotebook.rmd

which reads as "simulate the model in someconfig.yml and then render the notebook somenotebook.rmd"?

I see that as a likely typical use case, and it should be relatively easy to implement for very low-flexibility options (that is, just render rmd or ipynb with the configuration file as an argument and no other customizability).
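For the low-flexibility case, the "sugar" could amount to translating the `--render` value into an equivalent `process` section. The sketch below assumes the module is picked from the file extension alone; the mapping and the function name are illustrative only.

```python
from __future__ import annotations

from pathlib import PurePath
from typing import Any

# Assumed mapping from notebook extension to the module that renders it.
RENDER_MODULES = {".ipynb": "jupyter", ".rmd": "rmarkdown"}


def render_to_process(render_target: str) -> dict[str, Any]:
    """Translate a --render value into the equivalent `process` section."""
    suffix = PurePath(render_target).suffix.lower()  # so ".Rmd" also matches
    try:
        module = RENDER_MODULES[suffix]
    except KeyError:
        raise ValueError(f"do not know how to render {render_target!r}") from None
    return {"process": {module: {"file": render_target}}}


print(render_to_process("somenotebook.rmd"))
```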

We can likely do that as "sugar" syntax to replace the distinct simulate (or whatever) / process steps. But we're also likely to support in the future:

$ flepimop simulate someconfig.yml | flepimop process

i.e. simulate this configuration and then pipe the workflow output (a configuration file), to a processing action

@TimothyWillard TimothyWillard added enhancement Request for improvement or addition of new feature(s). post-processing Concern the post-processing. medium priority Medium priority. labels Jan 10, 2025
@TimothyWillard TimothyWillard added the cli Relating to command line interfaces label Jan 10, 2025
@pearsonca
Contributor Author

Noted in the 13 Jan developer meeting: there is likely also some appetite for running process steps at the start of a pipeline (e.g. to estimate some feature in a non-gempyor model, and then use that estimate as a parameter).

@anjalika-nande
Collaborator

> question to potential users @saraloo @MacdonaldJoshuaCaleb @alsnhll: do we want to also / instead support syntax like
>
> $ flepimop simulate someconfig.yml --render=somenotebook.rmd
>
> which reads as "simulate the model in someconfig.yml and then render the notebook somenotebook.rmd"?
>
> I see that as a likely typical use case, and it should be relatively easy to implement for very low-flexibility options (that is, just render rmd or ipynb with the configuration file as an argument and no other customizability).
>
> We can likely do that as "sugar" syntax to replace the distinct simulate (or whatever) / process steps. But we're also likely to support in the future:
>
> $ flepimop simulate someconfig.yml | flepimop process
>
> i.e. simulate this configuration and then pipe the workflow output (a configuration file), to a processing action

I think including this syntax in addition to the above will be useful

@saraloo
Contributor

saraloo commented Jan 13, 2025

Thanks for thinking through all this. I agree, the render syntax from the CLI would be useful, i.e.

$ flepimop simulate someconfig.yml --render=somenotebook.rmd

I can imagine this being super useful for testing purposes, when notebooks and configs are going through iterative changes and whatnot.

I don't immediately see a need for the sequential and module support. It might be useful in the long term, but I don't think it's super necessary right now.


4 participants