Ops Maintenance 2024 #53

Open
lcjohnso opened this issue Oct 28, 2024 · 0 comments


lcjohnso commented Oct 28, 2024

Several ops-related to-do items are outstanding for the BaJoR-facing zoobot Azure Batch account.

  1. Update the PyTorch image in the zoobot container registry: we want to upgrade to Zoobot 2.0 and use its recommended combination of PyTorch and CUDA (torch == 2.1.0+cu121; see the repo README). Updating the image will also resolve the following pool-level warning from Azure: "This pool's image is nearing its end-of-life date of Tuesday, April 22, 2025 at 19:00:00. After this date, it will not appear as an option when creating new pools. API calls to create or scale pools using the image may continue to function for up to 60 days afterwards."

  2. Select and update the VM type: we originally selected Standard_NC6s_v3 as the cheapest single-GPU option (we didn't use the CPUs), but we now need to resolve the following warning: "Support for the NCv3-series virtual machine family will be retired for Azure Batch pools on 30 September 2025 -- You're receiving this notice because you're currently using NCv3-series virtual machine sizes with Azure Batch pools. We'll retire support for NCv3-series virtual machines on 30 September 2025. This includes Standard_NC24rs_v3, Standard_NC6s_v3, Standard_NC12s_v3, and Standard_NC24s_v3. Between now and 30 September 2025, you'll need to either migrate your Batch pools to a newer virtual machine series in the same NC product line, or migrate to a different Batch-supported virtual machine size suitable for your workload."

  3. Upgrade to blobfuse2: given the sporadic MountConfigurationError failures, consider upgrading the blob-mounting library we use to its current version. See repo and docs.

    • Note: the blob storage directory mounts (e.g., models and predictions for a prediction job) are configured as part of the pool configuration. See this pool configuration notebook for details.
    • Unlike past issues with the job preparation tasks (configured as part of the job), this error stems from the mount commands set in the pool-level configuration.
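
To make items 2 and 3 easier to track, here is a minimal sketch of an audit over a pool specification. It is not taken from the pool configuration notebook. It assumes the pool is available as a plain dict whose keys follow the field names I believe the Batch REST pool resource uses (`vmSize`, `mountConfiguration`, `azureBlobFileSystemConfiguration`); adjust these if our exported spec differs. The `audit_pool` helper, the `example-pool` id, and the `exampleaccount` storage account are hypothetical names for illustration. The retiring VM sizes and the retirement date come from the Azure notice quoted in item 2.

```python
from datetime import date

# VM sizes named in the Azure retirement notice quoted in item 2.
RETIRING_VM_SIZES = {
    "standard_nc6s_v3",
    "standard_nc12s_v3",
    "standard_nc24s_v3",
    "standard_nc24rs_v3",
}
NCV3_RETIREMENT = date(2025, 9, 30)  # date from the same notice


def audit_pool(pool: dict, today: date) -> list[str]:
    """Return human-readable findings for a pool specification dict.

    Hypothetical helper: the dict layout is an assumption, not the
    output of any particular SDK call.
    """
    findings = []

    # Compare case-insensitively, since size names may be reported in lowercase.
    vm_size = pool.get("vmSize", "")
    if vm_size.lower() in RETIRING_VM_SIZES:
        days_left = (NCV3_RETIREMENT - today).days
        findings.append(
            f"vmSize {vm_size} is in the retiring NCv3 family "
            f"({days_left} days until {NCV3_RETIREMENT.isoformat()})"
        )

    # Blob mounts are read from the pool spec, not from the job spec.
    for mount in pool.get("mountConfiguration", []):
        blob = mount.get("azureBlobFileSystemConfiguration")
        if blob is not None:
            findings.append(
                f"blob container {blob['containerName']} is mounted at "
                f"{blob['relativeMountPath']} via pool-level configuration"
            )
    return findings


# Hypothetical pool specification, for illustration only.
example_pool = {
    "id": "example-pool",
    "vmSize": "Standard_NC6s_v3",
    "mountConfiguration": [
        {"azureBlobFileSystemConfiguration": {
            "accountName": "exampleaccount",
            "containerName": "models",
            "relativeMountPath": "models",
        }},
        {"azureBlobFileSystemConfiguration": {
            "accountName": "exampleaccount",
            "containerName": "predictions",
            "relativeMountPath": "predictions",
        }},
    ],
}

for finding in audit_pool(example_pool, today=date(2024, 10, 28)):
    print(finding)
```

Passing `today` explicitly keeps the check deterministic; in practice `date.today()` could be passed instead. A pool whose `vmSize` is outside the NCv3 family and that has no blob mounts yields an empty list.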