UM2N Docker build #59

Merged
merged 16 commits into from
Nov 23, 2024

Conversation

jwallwork23
Member

Closes #58.

@jwallwork23 added the testing (Extensions and improvements to the testing infrastructure) and install (Installation instructions/process needs updating) labels Nov 2, 2024
@jwallwork23 self-assigned this Nov 2, 2024
@jwallwork23
Member Author

Whenever I try to install pytorch3d it crashes my laptop, so I built a reduced version of the Docker image, pushed it to ghcr, enabled mesh-adaptation-docs to update it and have now triggered the update. Hopefully it works!

@ddundo
Member

ddundo commented Nov 2, 2024

I cancelled my workflow on #52 until you get this done since I see the error is that there is no space left. I guess it's too much to have 4 of them running at the same time :)

@jwallwork23
Member Author

Hmm, I got a "no space left on device" error: https://github.com/mesh-adaptation/mesh-adaptation-docs/actions/runs/11642012121/job/32421127119?pr=59. I'll try without pytorch3d.

@jwallwork23
Member Author

> I cancelled my workflow on #52 until you get this done since I see the error is that there is no space left. I guess it's too much to have 4 of them running at the same time :)

What, do the runners share memory? I thought they were separate. Do you have a link to docs on this?

@jwallwork23
Member Author

Free runners have 14GB of storage and the UM2N image I created is already 10.4GB: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for--private-repositories

This makes cutting down the image sizes a priority: #13.
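
To make those numbers concrete, here is a minimal Python sketch of the headroom calculation. The 14GB and 10.4GB figures are the ones quoted above; the function name and the 2GB reserve for the checkout, caches and test output are illustrative assumptions, not measured values.

```python
import shutil


def image_headroom_gb(disk_gb: float, image_gb: float, reserve_gb: float = 2.0) -> float:
    """Return the space left after pulling the image and holding back a reserve.

    reserve_gb is a hypothetical allowance for the checkout, caches and test
    output; it is not a figure from the GitHub documentation.
    """
    return disk_gb - image_gb - reserve_gb


# Figures quoted above: a 14GB runner and a 10.4GB UM2N image.
headroom = image_headroom_gb(disk_gb=14.0, image_gb=10.4)
print(f"Headroom on a free runner: {headroom:.1f}GB")

# The same kind of check against whatever disk this script happens to run on.
free_gb = shutil.disk_usage("/").free / 1e9
print(f"Free space here: {free_gb:.1f}GB")
```

Under that assumed reserve, under 2GB is left over, which is consistent with image size being the thing to fix.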

@ddundo
Member

ddundo commented Nov 2, 2024

> > I cancelled my workflow on #52 until you get this done since I see the error is that there is no space left. I guess it's too much to have 4 of them running at the same time :)

> What, do the runners share memory? I thought they were separate. Do you have a link to docs on this?

No idea really! I just saw that in my last workflow the "Initialise container" step took over 15 minutes, when it's usually about a minute. So I assumed it's because you had these other 3 workflows running :)

Edit: yeah it's super slow still @jwallwork23. It seems to have got slow after the firedrake-um2n container was pushed to ghcr. Not sure if connected. Would you mind deleting that container on ghcr if you don't need it, just to test if it gets faster?

I am also confused what triggered this workflow to run https://github.com/mesh-adaptation/mesh-adaptation-docs/actions/runs/11642077765 and push a new image to ghcr. It's only supposed to push it if triggered on the main branch... do you know?

@jwallwork23
Member Author

> Edit: yeah it's super slow still @jwallwork23. It seems to have got slow after the firedrake-um2n container was pushed to ghcr. Not sure if connected. Would you mind deleting that container on ghcr if you don't need it, just to test if it gets faster?

Okay, I deleted it for now.

> I am also confused what triggered this workflow to run https://github.com/mesh-adaptation/mesh-adaptation-docs/actions/runs/11642077765 and push a new image to ghcr. It's only supposed to push it if triggered on the main branch... do you know?

No, sorry.

@ddundo
Member

ddundo commented Nov 2, 2024

Thanks, but that didn't help. Pulling locally is also very slow. Maybe it's a ghcr issue at the moment.

Hmm... I think we then need to add the branches filter in docker_firedrake-parmmg.yml, i.e.:

  workflow_run:
    workflows: ['Build bespoke PETSc Docker container']
    types: [completed]
    branches: [main]

Would you mind adding this in this PR before you build again please? It seems like the workflow_run trigger defaults to the main branch no matter where the triggering workflow was run.

Edit: added this in #61

@jwallwork23
Member Author

I realised that by dropping torchvision and torchaudio in 19a17cf we can cut down the size of the Docker image considerably.

@jwallwork23 marked this pull request as ready for review November 17, 2024 17:41
@jwallwork23 requested a review from ddundo November 17, 2024 17:41
@ddundo
Member

ddundo commented Nov 17, 2024

@jwallwork23 I think the size is so large because `pip3 install torch torchvision torchaudio` installs a lot of CUDA stuff that we can't make use of anyway. Could you try `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu` instead, as per the torch installation instructions for CPU only? https://pytorch.org/get-started/locally/

Edit: just saw the comment above... sorry. Might still be worth doing `pip3 install torch --index-url https://download.pytorch.org/whl/cpu`?

@jwallwork23
Member Author

> Might still be worth doing `pip3 install torch --index-url https://download.pytorch.org/whl/cpu`?

Good point, addressed in ae5dd0e.
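
To confirm from inside the image which kind of wheel ended up installed, a check along these lines could be run. This is a minimal sketch rather than anything from the PR: the function name is hypothetical, and it relies only on `torch.version.cuda`, which is `None` for wheels built without CUDA.

```python
import importlib
import importlib.util


def describe_torch_build() -> str:
    """Summarise whether the installed torch wheel was built with CUDA support."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    torch = importlib.import_module("torch")
    # torch.version.cuda is None for builds without CUDA and a version string otherwise.
    cuda_version = torch.version.cuda
    if cuda_version is None:
        return f"torch {torch.__version__} (built without CUDA)"
    return f"torch {torch.__version__} (built against CUDA {cuda_version})"


print(describe_torch_build())
```

The import is deferred so that the script still reports something sensible in an environment where torch is missing.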

Member

@ddundo left a comment

This looks good! But when I pull the image and try to import UM2N, I get

ModuleNotFoundError                       Traceback (most recent call last)
Cell In[2], line 1
----> 1 import UM2N

File ~/firedrake/lib/python3.12/site-packages/UM2N/__init__.py:7
      3 os.environ["OMP_NUM_THREADS"] = "1"
      5 from pkg_resources import DistributionNotFound, get_distribution  # noqa
----> 7 from .processor import *  # noqa
      8 from .generator import *  # noqa
      9 from .model import *  # noqa

ModuleNotFoundError: No module named 'UM2N.processor'

This is what caused https://github.com/mesh-adaptation/mesh-adaptation-docs/actions/runs/11881234411/job/33105325669 to fail as well.

Edit: I was confused about why test_import.py passes when I get this error. See mesh-adaptation/UM2N#57.
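
One way to check an installed package for missing submodules is to import each one by its dotted name and collect the failures. The following is a minimal sketch and not the project's test_import.py; the function name is hypothetical, and the demonstration uses the standard library so that it runs anywhere.

```python
import importlib


def failed_imports(package: str, submodules: list[str]) -> list[str]:
    """Return the dotted names that cannot be imported."""
    failures = []
    for name in [package] + [f"{package}.{sub}" for sub in submodules]:
        try:
            importlib.import_module(name)
        except ImportError:  # ModuleNotFoundError is a subclass of ImportError
            failures.append(name)
    return failures


# "no_such_module" is a deliberately missing, made-up name.
print(failed_imports("json", ["decoder", "no_such_module"]))
# prints ['json.no_such_module']
```

For the image discussed here, the call would presumably be `failed_imports("UM2N", ["processor", "generator", "model"])`, using the submodule names from the traceback above.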

ddundo added a commit that referenced this pull request Nov 18, 2024
Closes #60.

Edit: I expanded this PR to fix these related things at once:
* Allow manual triggering of workflows
* Added concurrency
* Fix for what was mentioned in #59 (comment)
@jwallwork23
Member Author

Okay, I think the combination of mesh-adaptation/movement#133, 3bf32d1 and 1817d15 does the trick. If I run the Docker image and activate the Firedrake venv inside it then I can run the tests successfully.

@jwallwork23 requested a review from ddundo November 23, 2024 11:31
@ddundo
Member

ddundo commented Nov 23, 2024

Thanks @jwallwork23! Could you please push the UM2N container to ghcr so I can test it?

@jwallwork23
Member Author

> Thanks @jwallwork23! Could you please push the UM2N container to ghcr so I can test it?

Sure, uploaded just now. (It took a while.)

Member

@ddundo left a comment

This looks good now! Thanks again :)

@jwallwork23 merged commit bc2ad18 into main Nov 23, 2024
3 checks passed
@jwallwork23 deleted the 58_um2n_docker branch November 23, 2024 17:23
jwallwork23 added a commit to mesh-adaptation/UM2N that referenced this pull request Nov 23, 2024
Closes #57.

Linked PR:
mesh-adaptation/docs#59.

This PR sets up UM2N's CI to use the newly added bespoke Docker image.
It also makes use of the reusable test workflow from the docs repo.
Labels
install (Installation instructions/process needs updating), testing (Extensions and improvements to the testing infrastructure)
Development

Successfully merging this pull request may close these issues.

Create Docker container for UM2N environment
3 participants