Use lockfile to specify a reproducible python environment #2896

Closed
zaneselvans opened this issue Sep 27, 2023 · 1 comment · Fixed by #2901 or #2968
Assignees
Labels
dependencies Pull requests that update a dependency file

Member

zaneselvans commented Sep 27, 2023

We've had a spate of random breakage due to downstream dependencies so maybe it is finally time to figure out lockfiles. See #2140 for some prior discussion.

Since we are treating PUDL like an application and not a library, we just need one set of dependencies & their versions which is guaranteed to work, not an expansive range of every dependency. See #1669 for some rationale and discussion.
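
To make the application-versus-library distinction concrete, here is a minimal sketch in Python. It assumes requirements are written as PEP 508-style strings; the helper names `is_pinned` and `unpinned` and the sample versions are invented for illustration and are not PUDL's actual dependencies. A library would typically declare ranges, while an application lockfile names exactly one version per package.

```python
import re

# An exact pin looks like "name==1.2.3" (optionally "name[extra]==1.2.3")
# with no wildcard. Ranges and bare names leave the resolver free to choose.
_PIN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?==[^*,;\s]+$")


def is_pinned(requirement: str) -> bool:
    """Return True if the requirement names exactly one version."""
    return bool(_PIN.match(requirement.strip()))


def unpinned(requirements: list[str]) -> list[str]:
    """Return the requirements that do not pin an exact version."""
    return [r for r in requirements if not is_pinned(r)]


# Hypothetical example data.
library_style = ["pandas>=2.0,<3", "sqlalchemy>=2", "pyarrow"]
application_style = ["pandas==2.1.1", "sqlalchemy==2.0.21", "pyarrow==13.0.0"]

print(unpinned(library_style))      # every entry is still a range
print(unpinned(application_style))  # nothing left to resolve
```

A check like this only looks at direct requirements. A real lockfile also pins every transitive dependency, which is the part a resolver has to produce.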

There are several possibilities with various pros and cons. Earlier this year @lwasser wrote this post about putting together the @pyOpenSci Python Package Guide which might be helpful.

Considerations:

  • Do we want catalystcoop.pudl to continue being distributed as a package, or will it only be installable from the git repository?
  • Does it matter how the downloaded packages are cached, e.g. for building Docker containers or on GitHub runners?
  • How easily can we automatically update our dependency versions, regenerate the lockfile, and see if the tests still pass? Will it be compatible with Dependabot?
  • How will we manage non-Python package dependencies like pandoc, nodejs, sqlite, libsnappy? Conda and Docker seem to be the most common solutions here.
  • How will we manage the version & installation of Python itself?
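
For the "regenerate the lockfile and see if the tests still pass" question, one thing an automated update job might want is a summary of which pins moved. The following is a rough Python sketch under the assumption that each lockfile has already been reduced to a `{package: version}` mapping; the function name `diff_pins` and the versions shown are hypothetical, and no particular lockfile format is being parsed here.

```python
def diff_pins(old: dict[str, str], new: dict[str, str]) -> dict[str, tuple]:
    """Map each package whose pin changed to (old_version, new_version).

    None stands in for "not present", so additions and removals show up too.
    """
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


# Hypothetical pins before and after regenerating a lockfile.
old_lock = {"pandas": "2.1.0", "numpy": "1.26.0", "recordlinkage": "0.15"}
new_lock = {"pandas": "2.1.1", "numpy": "1.26.0", "pyarrow": "13.0.0"}

for name, (before, after) in diff_pins(old_lock, new_lock).items():
    print(f"{name}: {before} -> {after}")
```

A summary like this could go into the description of an automated pull request, so that a failing test run can be traced back to the specific version bumps that came with it.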

conda-lock

Pros

  • Conda can install and manage versions of non-python dependencies.
  • Conda can manage and install various versions of Python itself.
  • Allows locking of platform-specific binary distributions including hashes.
  • Conda ensures shared compiled libraries have compatible versions too, which is nice for the geospatial stack.
  • The mamba solver is extremely fast and reliable.
  • We're already using conda/mamba to manage our environments.
  • Dependabot is happy to update project dependencies & optional dependencies in pyproject.toml
  • conda-lock and dependabot both work with Poetry, configured in pyproject.toml
  • You can specify that particular packages that aren't available on conda-forge be installed from PyPI.
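
The point about locking binary distributions "including hashes" can be illustrated with a small Python sketch. This is not how conda-lock itself is implemented; it only shows the general idea that a lockfile records a digest for each artifact and an installer refuses anything that doesn't match. The function names and the payload are made up for the example.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_sha256: str) -> bool:
    """Return True only if the artifact matches the digest from the lockfile."""
    return sha256_of(data) == expected_sha256.lower()


# Stand-in for a downloaded package archive (hypothetical content).
artifact = b"pretend this is a package archive"
locked_digest = sha256_of(artifact)  # what would be stored at lock time

print(verify(artifact, locked_digest))             # unchanged artifact
print(verify(artifact + b"tamper", locked_digest))  # modified artifact
```

Because the digest is tied to one specific build for one specific platform, this is also why hash-locked environments need a separate entry per platform.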

Cons

  • Dependabot doesn't cover environment.yml files, so we'd like to specify all dependencies in pyproject.toml.
  • conda-lock doesn't appear to pick up on package[extras] specified in pyproject.toml.
  • Involves 2 layers of packaging (PyPI + conda-forge) which are often done by different people, and this can lead to stale or abandoned packages. Taking on responsibility for the conda-forge packaging isn't so bad, but it can be a little bit of a hassle. Currently sphinx-apidoc and recordlinkage are out of date and prevent us from creating a valid conda-lock file. I've requested to be a maintainer of recordlinkage.
  • Installing all packages via conda, and then running Tox inside that environment and having it re-install everything in another virtual environment does feel kind of duplicative, but I guess we're effectively doing that now -- it's just that we're installing all of the packages via pip and suffering from downstream dependency issues.

Poetry

Pros

  • Works with dependabot.

Cons

  • Can't manage non-Python dependencies.

PDM

pixi

Pros

  • Relies on existing conda ecosystem for platform-specific binaries.
  • Can therefore manage non-python dependencies.
  • Might have a better UI than mamba + conda-lock?
  • Directory (?) based environments rather than needing to activate / deactivate.
  • Produces multi-platform lockfiles.
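
As a way of picturing what a multi-platform lockfile buys you, here is a small Python sketch. It does not reflect pixi's actual lockfile format, which isn't described in this thread. It assumes a lock has been reduced to a mapping from conda platform name (`linux-64`, `osx-arm64`, and so on) to pinned versions; the helper names and the pins are hypothetical.

```python
import platform

# (platform.system(), platform.machine()) -> conda platform name.
# Only a handful of common combinations are covered in this sketch.
_SUBDIRS = {
    ("Linux", "x86_64"): "linux-64",
    ("Linux", "aarch64"): "linux-aarch64",
    ("Darwin", "x86_64"): "osx-64",
    ("Darwin", "arm64"): "osx-arm64",
    ("Windows", "AMD64"): "win-64",
}


def conda_subdir(system=None, machine=None) -> str:
    """Return the conda platform name, defaulting to the current machine."""
    key = (system or platform.system(), machine or platform.machine())
    if key not in _SUBDIRS:
        raise ValueError(f"no platform mapping for {key}")
    return _SUBDIRS[key]


def pins_for(lock: dict[str, dict[str, str]], subdir: str) -> dict[str, str]:
    """Select the pinned versions for one platform from a multi-platform lock."""
    if subdir not in lock:
        raise KeyError(f"lockfile has no entry for {subdir}")
    return lock[subdir]


# Hypothetical lock contents: same package, per-platform entries.
lock = {
    "linux-64": {"sqlite": "3.43.0", "pandoc": "3.1.3"},
    "osx-arm64": {"sqlite": "3.43.0", "pandoc": "3.1.3"},
}

print(pins_for(lock, conda_subdir("Darwin", "arm64")))
```

One lockfile committed to the repository can then serve developers on macOS and CI runners on Linux, with each picking out its own entry.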

Cons

  • Extremely new and in flux.
  • Not integrated with any other tooling yet.
@zaneselvans added the "dependencies" label Sep 27, 2023
@zaneselvans changed the title from "Use dependency lockfile to specify a reproducible python environment" to "Use lockfile to specify a reproducible python environment" Sep 27, 2023
@NickleDave

Hi @zaneselvans, hope it's ok if I chime in on this issue since @lwasser posted it in the pyOS Slack.

If you want a workflow tool like poetry or PDM that will give you lockfiles for conda environments, you might want to know about pixi from https://prefix.dev/

pixi is a fast software package manager built on top of the existing conda ecosystem. Spins up development environments quickly on Windows, macOS and Linux.
Automatic lockfiles produce reproducible environments across operating systems (without Docker!).
pixi supports Python, R, C/C++, Rust, Ruby, and many other languages.

If you were working in pure Python then keeping everything in your pyproject.toml would be great, if possible--see this post on the state of lockfiles for pure Python from Brett Cannon--but it sounds like you've got a "modern data stack" 😭 😏 and you also want a tool that lets you manage it.

Hope that helps, love the work you all are doing

@zaneselvans zaneselvans linked a pull request Sep 27, 2023 that will close this issue
8 tasks
@zaneselvans zaneselvans removed a link to a pull request Sep 28, 2023
8 tasks
@zaneselvans zaneselvans linked a pull request Sep 28, 2023 that will close this issue
8 tasks
@e-belfer e-belfer moved this from New to In progress in Catalyst Megaproject Oct 19, 2023
@zaneselvans zaneselvans linked a pull request Oct 21, 2023 that will close this issue
8 tasks
@zaneselvans zaneselvans moved this from In progress to In review in Catalyst Megaproject Nov 14, 2023
@zaneselvans zaneselvans moved this from In review to Done in Catalyst Megaproject Nov 15, 2023
Projects
Archived in project
2 participants