Parametrizing Solids with Inputs vs Config vs Resource #3064

szeleeteo · 2020-10-09T10:34:47Z

Hi,

While the Basics of Solids tutorial draws a fairly clear distinction between inputs and config, I am still unclear when it comes to actual implementation, particularly when the parameter passed into the first solid is more complex than a primitive type (e.g. a dict).

I've noticed that in a few examples (this and this), config seems to be the preferred way of passing in the initial input. I guess it would also have been fine to pass those as inputs.

Secondly, in a long pipeline made up of many solids in series, there are cases where the input parameters have to be emitted again as part of each solid's output, because some of their attributes are required downstream. Does that mean it would be better to attach my input to the solids as a resource instead?

As an example, let's say I have a list of 10 different APIs with different schemas to be ingested on an hourly basis.

1. Create a single pipeline to run the ingestion of these 10 APIs on an hourly basis
2. Materialize the ingested data as the original raw data in JSON files
3. Upload the raw JSON files to S3
4. Run some transformation on the raw data
5. Upload the transformed data to S3, again as JSON files
6. Post the transformed data to the DB, assuming all the previous stages passed

Some of the API metadata, such as name or id, needs to be used in a few solids downstream.

The main point of this pipeline is that downstream tasks might sometimes fail, but as long as the raw JSON is in S3 in the first place, I can always backfill later.

So my question is, how can I best describe my input for this kind of pipeline - as an input, config or resource?

Thank you!

Answered by sryza

sryza · 2020-10-09T19:17:05Z

@szeleeteo this is a good question. For your situation, it sounds to me like config or resource makes the most sense.

Loading via "input" corresponds to defining a dagster_type_loader. This is useful when you have multiple different solid definitions that do different things, but operate on inputs with the same logical type, and you want a common way of loading those inputs. If I'm understanding correctly, this is not your situation, because you have at most a single solid definition for operating on each input source.

Resources are useful when you want to be able to configure a set of solids all at once. If that's the case for you, and it sounds like it might be, then a resource might make sense.
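
As a rough illustration of "configure a set of solids all at once", the sketch below models the idea in plain Python rather than with Dagster's own decorators, so it runs without Dagster installed. `SourceInfo`, `StepContext`, `run_once`, and the two step functions are hypothetical names chosen for this example, not Dagster APIs; in a real pipeline the shared object would presumably be declared as a resource and read from each solid's context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceInfo:
    """Read-only API metadata shared by every step (hypothetical stand-in for a resource)."""
    name: str
    source_id: int


@dataclass(frozen=True)
class StepContext:
    """Hypothetical stand-in for the context object each step receives."""
    source: SourceInfo


def raw_key(context: StepContext) -> str:
    # First step: decide where the raw JSON goes, using the shared metadata.
    return f"raw/{context.source.name}/{context.source.source_id}.json"


def transformed_key(context: StepContext) -> str:
    # Later step: reads the same metadata without receiving it as an input.
    return f"transformed/{context.source.name}/{context.source.source_id}.json"


def run_once(source_config: dict) -> list:
    # The metadata is configured in one place and every step sees it.
    context = StepContext(source=SourceInfo(**source_config))
    return [raw_key(context), transformed_key(context)]


keys = run_once({"name": "weather", "source_id": 7})
```

The metadata here is frozen on purpose: this pattern fits values that every step reads but none of them changes.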

Does that answer your question? Happy to go into more detail if helpful.

0 replies

szeleeteo · 2020-10-10T03:25:48Z

Author

Thanks @sryza for the clarification on input. It makes more sense now that you've mentioned dagster_type_loader and the context where input is more useful.

So the gist is, "input" is to solid as "resource" is to pipeline.

To clarify:

1. By using "input" as the parameter (a fixed list of N input data sources), I was able to create a single pipeline with a fan-out/fan-in structure. The approach worked, except that I had to pass the original input param downstream repeatedly, as described earlier. Furthermore, the original list of data sources has to be available before runtime.
2. By using "resources", it seems that my pipeline's ModeDefinition requires N modes, each mode attached to one of the N resources. So during scheduled runtime, instead of a single pipeline, I will have N serial pipelines created and running in parallel? Correct me if my understanding is wrong here.
3. Does that also mean that the pipeline's ModeDefinition can be populated dynamically during runtime with an arbitrary number of data sources that can be pulled from somewhere else?
4. One of the initial challenges I've had (I also see this being asked over and over in the Slack) is the ability to add solids to execution graphs dynamically, which I understand is currently not possible. If the answer to Question 3 is yes, can I say that using resources to create pipelines dynamically is a viable solution to this problem? It seems that we would end up with more pipelines during runtime, or should I not look at that as a problem?

Thank you

3 replies

szeleeteo Oct 11, 2020
Author

To further illustrate my point, here's the DAG I've built using the config/input method.

get_all_sources, the root solid, gets a list of source dicts as its input param via config loaded at startup. (Ideally there could be a solid that loads this list dynamically during runtime, but I understand this is currently not possible.)
Each source dict is passed to the downstream solids via output->input.
At each solid, existing attributes of the source are read and new attributes are appended as well.
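
The flow described above can be sketched roughly as follows. This is plain Python standing in for solids, not the linked pipeline code: only `get_all_sources` is a name from the post, while `ingest`, `transform`, `run`, and the `raw_path` and `transformed_path` attributes are assumptions made for illustration.

```python
def get_all_sources(config: dict) -> list:
    # Root step: the list of source dicts comes from config loaded at startup.
    return [dict(source) for source in config["sources"]]


def ingest(source: dict) -> dict:
    # Reads an existing attribute and returns a new dict with one appended.
    return {**source, "raw_path": f"raw/{source['name']}.json"}


def transform(source: dict) -> dict:
    # Downstream step: still needs 'name', so the whole dict travels along.
    return {**source, "transformed_path": f"transformed/{source['name']}.json"}


def run(config: dict) -> list:
    # Fan out over the sources, then collect the results (fan in).
    return [transform(ingest(source)) for source in get_all_sources(config)]


results = run({"sources": [{"name": "api_a", "id": 1}, {"name": "api_b", "id": 2}]})
```

Each step returns a new dict instead of mutating its argument, so the original attributes and the newly appended ones both reach the next step through the output.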

The main code for the pipeline can be found here

Would modeling the source dict as a resource be more suitable in this case?

schrockn Oct 11, 2020
Maintainer

Given that the source is being mutated and passed down the pipeline, it doesn't make sense to model it as a resource. Resources are generally reconstructed at every step of the pipeline, so mutations to a resource will not be saved unless you persist the results in a database or another external system.
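
A minimal plain-Python model of that behavior, assuming each step receives a freshly built resource, is sketched below. `make_source_resource`, `run_step`, and the step functions are hypothetical names for this illustration and do not correspond to Dagster APIs.

```python
def make_source_resource(config: dict) -> dict:
    # Hypothetical resource factory: builds a fresh object from config each time.
    return dict(config)


def run_step(step, config: dict):
    # Models a step running with its own newly constructed resource.
    resource = make_source_resource(config)
    return step(resource)


def ingest(resource: dict) -> str:
    # Mutates its own copy of the resource; no other step sees this change.
    resource["raw_path"] = f"raw/{resource['name']}.json"
    return resource["raw_path"]


def transform(resource: dict) -> bool:
    # Reports whether the earlier mutation is visible in this step.
    return "raw_path" in resource


config = {"name": "api_a"}
raw_path = run_step(ingest, config)
mutation_visible = run_step(transform, config)
```

In this model the value returned by `ingest` survives, while the attribute it wrote onto the resource does not, which is why data that changes between steps is better carried through outputs.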

Re: dynamic orchestration graphs. We do not currently support this and there is no definitive timetable for doing so. It's a large change that affects assumptions throughout the system. However, we fully expect to support it at some point in the "medium term", but I can't promise anything more specific than that.

sryza Oct 12, 2020

> By using "resources", it seems that my pipeline's ModeDefinition require N modes, each mode attached with one of the N resources. So during scheduled runtime, instead of a single pipeline, I will have N number of serial pipelines running in parallel? Correct me if my understanding is wrong here.

When you launch a pipeline, you select a single mode for it to launch in.
