Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for choosing between multiprocess and inprocess executors via cli flag #2895

Merged
merged 16 commits into from
Dec 1, 2023

Conversation

rousik
Copy link
Collaborator

@rousik rousik commented Sep 26, 2023

This change allows picking appropriate dagster executor based on how many workers are used. With 1 worker, use inprocess executor, with N max workers, set the limit.

The original flag --max-concurrent is renamed to --dagster-workers to make it more explicit. There are many kinds of parallelism/concurrency at play so I thought the new name would be more appropriate.

Added this flag to ferc_to_sqlite as well to make both tools uniform.

Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be simpler to just always use the multi-process executor to avoid having to handle the different asset caching behavior between in-process and multi-process execution that came up in #2470, and ensure that the tests always use an isolated $DAGSTER_HOME that keeps the user's files safe?

Other than that just minor help message stuff.

src/pudl/ferc_to_sqlite/cli.py Outdated Show resolved Hide resolved
return {
"execution": {
"config": {
"in_process": {},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The switch away from using the in process executor will also affect the behavior in #2470 I think.

Do we need to make sure that the tests have their own isolated $DAGSTER_HOME so that they don't clobber the user's environment when they run? Also if we're running with a real $DAGSTER_HOME and writing the pickled dataframes out to disk rather than holding everything in memory, can we re-enable the Jupyter Notebook tests?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this behavior controls the execution when CLI tools are executed directly, and should not really change how tests are invoked. I'm not sure I fully understand the problem here, but my guess is that tests are currently not fully isolated and leak some materialization into shared $DAGSTER_HOME in which case, I agree that we should probably spin up temporary directory when running tests whenever we are not running with --live-dbs. With that flag, no ETL should really be run anyways, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that the two concerns (how we run CLIs and how we run tests) should be orthogonal concerns?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed to me like they overlap a bit because in this PR we're using the CLI to run the ETL for the tests rather than using the test fixture.

With the in-process executor the default IO Manager doesn't write anything to disk in $DAGSTER_HOME, it keeps the assets in memory (the assets that are written to the database still get written to the database, wherever the PUDL DB path says that DB is). And currently the tests inherit the user's $DAGSTER_HOME so if anything in the tests attempts to read from that directory, it's reading from the user's preexisting outputs rather than outputs that were generated by the tests, so a test that relies on an intermediate output that was written to disk can work when the user runs the tests locally (since they've got a $DAGSTER_HOME full of pickled dataframes) and then fail in the GitHub CI, where no intermediate dataframes have ever been written to disk, since the ETL has only ever been run in-process there.

So my guess is that what will happen now when the tests that use the CLI to build the DB are run locally is that they will overwrite the user's pickled dataframes, probably replacing etl_full outputs with etl_fast outputs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the current plan was to use CLI for ETL for integration tests only in the context of CI infrastructure and there it should be using multiprocess with no limits. This change should not affect the default behavior, but if people want to run single threaded, they can (e.g. for perf analysis).

Do we also want to make use of separating ETL for the purpose of locally run integration tests? If so, we probably need to do some additional work.

@codecov
Copy link

codecov bot commented Sep 26, 2023

Codecov Report

Attention: 14 lines in your changes are missing coverage. Please review.

Comparison is base (f456126) 88.7% compared to head (cc81e2b) 88.6%.
Report is 1 commits behind head on dev.

Files Patch % Lines
src/pudl/ferc_to_sqlite/cli.py 33.3% 6 Missing ⚠️
src/pudl/helpers.py 16.6% 5 Missing ⚠️
src/pudl/cli/etl.py 25.0% 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2895     +/-   ##
=======================================
- Coverage   88.7%   88.6%   -0.1%     
=======================================
  Files         89      89             
  Lines      11023   11040     +17     
=======================================
+ Hits        9785    9790      +5     
- Misses      1238    1250     +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rousik rousik requested a review from zaneselvans October 4, 2023 15:21
@rousik rousik merged commit 7f1fb1c into dev Dec 1, 2023
7 of 9 checks passed
@zaneselvans zaneselvans deleted the inprocess branch January 2, 2024 22:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants