Add support for choosing between multiprocess and inprocess executors via cli flag #2895
Conversation
When --max-parallelism=1, use in-process executor.
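As an editorial aside, a minimal sketch of what that selection could look like, assuming the jobs run under Dagster's default multi_or_in_process_executor; the executor_config helper and its dagster_workers argument are illustrative, not taken from this PR:

def executor_config(dagster_workers: int) -> dict:
    """Build Dagster run config that picks an executor from a worker count."""
    if dagster_workers == 1:
        # A single worker: run every op in the current process.
        return {"execution": {"config": {"in_process": {}}}}
    # Otherwise fan out across processes, capped at the requested worker count.
    return {
        "execution": {
            "config": {"multiprocess": {"max_concurrent": dagster_workers}}
        }
    }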
It might be simpler to just always use the multi-process executor, to avoid having to handle the different asset caching behavior between in-process and multi-process execution that came up in #2470, and to ensure that the tests always use an isolated $DAGSTER_HOME that keeps the user's files safe?

Other than that, just minor help message stuff.
return {
    "execution": {
        "config": {
            "in_process": {},
The switch away from using the in-process executor will also affect the behavior in #2470, I think. Do we need to make sure that the tests have their own isolated $DAGSTER_HOME so that they don't clobber the user's environment when they run? Also, if we're running with a real $DAGSTER_HOME and writing the pickled dataframes out to disk rather than holding everything in memory, can we re-enable the Jupyter Notebook tests?
Note that this behavior controls the execution when the CLI tools are executed directly, and should not really change how tests are invoked. I'm not sure I fully understand the problem here, but my guess is that the tests are currently not fully isolated and leak some materializations into a shared $DAGSTER_HOME. In that case, I agree that we should probably spin up a temporary directory when running tests whenever we are not running with --live-dbs. With that flag, no ETL should really be run anyways, right?
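For what it's worth, a hedged sketch of the kind of pytest fixture that could provide that isolation; the fixture name is hypothetical and nothing here is taken from this PR:

import pytest


@pytest.fixture
def isolated_dagster_home(tmp_path, monkeypatch):
    """Point DAGSTER_HOME at a throwaway directory for the duration of a test."""
    dagster_home = tmp_path / "dagster_home"
    dagster_home.mkdir()
    # An empty dagster.yaml is harmless; Dagster falls back to its defaults.
    (dagster_home / "dagster.yaml").touch()
    monkeypatch.setenv("DAGSTER_HOME", str(dagster_home))
    return dagster_home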
It seems to me that the two concerns (how we run the CLIs and how we run the tests) should be orthogonal?
It seemed to me like they overlap a bit, because in this PR we're using the CLI to run the ETL for the tests rather than using the test fixture.

With the in-process executor, the default IO Manager doesn't write anything to disk in $DAGSTER_HOME; it keeps the assets in memory (assets that are written to the database still get written, wherever the PUDL DB path says that DB is). And currently the tests inherit the user's $DAGSTER_HOME, so if anything in the tests attempts to read from that directory, it's reading from the user's preexisting outputs rather than outputs generated by the tests. That means a test that relies on an intermediate output written to disk can pass when the user runs the tests locally (since they've got a $DAGSTER_HOME full of pickled dataframes) and then fail in the GitHub CI, where no intermediate dataframes have ever been written to disk, since the ETL has only ever been run in-process there.

So my guess is that when the tests that use the CLI to build the DB are run locally, they will overwrite the user's pickled dataframes, probably replacing etl_full outputs with etl_fast outputs.
I think the current plan was to use the CLI to run the ETL for integration tests only in the context of the CI infrastructure, and there it should be using the multiprocess executor with no limits. This change should not affect the default behavior, but if people want to run single-threaded, they can (e.g. for perf analysis).

Do we also want to make use of this separate ETL invocation for locally run integration tests? If so, we probably need to do some additional work.
Codecov Report

@@           Coverage Diff           @@
##             dev   #2895     +/-   ##
=======================================
- Coverage   88.7%   88.6%    -0.1%
=======================================
  Files         89      89
  Lines      11023   11040     +17
=======================================
+ Hits        9785    9790      +5
- Misses      1238    1250     +12

☔ View full report in Codecov by Sentry.
This change allows picking the appropriate Dagster executor based on how many workers are used: with 1 worker, use the in-process executor; with N max workers, use the multiprocess executor with its limit set to N.

The original flag --max-concurrent is renamed to --dagster-workers to make it more explicit. There are many kinds of parallelism/concurrency at play, so I thought the new name would be more appropriate. Added this flag to ferc_to_sqlite as well to make both tools uniform.
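To illustrate the flag wiring (a sketch only, not this PR's actual code; the click usage, option name, default, and entry point shown here are assumptions), the renamed option could feed the executor choice like this:

import click


@click.command()
@click.option(
    "--dagster-workers",
    type=int,
    default=0,
    help="Max number of Dagster worker processes. 1 selects the in-process "
    "executor; 0 places no explicit limit on the multiprocess executor.",
)
def main(dagster_workers: int) -> None:
    """Sketch of mapping --dagster-workers onto Dagster run config."""
    if dagster_workers == 1:
        execution = {"in_process": {}}
    elif dagster_workers == 0:
        execution = {"multiprocess": {}}  # let Dagster pick its own default limit
    else:
        execution = {"multiprocess": {"max_concurrent": dagster_workers}}
    run_config = {"execution": {"config": execution}}
    # Hand run_config off to the ETL job here, e.g. via
    # JobDefinition.execute_in_process(run_config=run_config).
    click.echo(run_config)


if __name__ == "__main__":
    main()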