Creating pangeo-eosc infrastructure based on elastic Kubernetes Virtual Cluster using IM-Dashboard #22
Comments
Here is what I have in mind as a to-do list. Any thoughts, @guillaumeeb & @j34ni?
|
👍 for me! I'll let @j34ni answer the first two questions. For point 4, not sure we really need a maximum 🙂. |
@tinaok : |
@guillaumeeb I do not remember that. Maybe it was at the same time as other things that failed, and we went back a few steps to get something working? |
Yeah probably. I guess we just need to make some room on our VMs and try to redeploy an elastic version on Kubernetes to host our Pangeo platform. |
Ok, following up from #21 I think we never tried the elastic option before since we wanted to focus on other higher priority issues. Have you tried the elastic option now? If so, could you please provide feedback? Happy to help with this. |
Hi @guillaumeeb, thank you very much. Unfortunately, I have had no time to try it out (and will have no time at all this week either). Your help will be super appreciated. |
While creating a Pangeo deployment with elastic Kubernetes on IM Dashboard, following https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md, I've got some questions/remarks (noting them here as much for other people as for myself):
So after step 2 from EGI.md, I created my daskhub.yaml file as below:
dask-gateway:
  enabled: true
  gateway:
    auth:
      jupyterhub:
        apiToken: <token1>
      type: jupyterhub
    extraConfig:
      dasklimits: |
        c.ClusterConfig.cluster_max_cores = 6
        c.ClusterConfig.cluster_max_memory = "24 G"
        c.ClusterConfig.cluster_max_workers = 4
        c.ClusterConfig.idle_timeout = 1800
      optionHandler: |
        from dask_gateway_server.options import Options, Integer, Float, String
        def options_handler(options):
            if ":" not in options.image:
                raise ValueError("When specifying an image you must also provide a tag")
            return {
                "worker_cores": options.worker_cores,
                "worker_memory": int(options.worker_memory * 2 ** 30),
                "image": options.image,
            }
        c.Backend.cluster_options = Options(
            Integer("worker_cores", default=1, min=1, max=4, label="Worker Cores"),
            Float("worker_memory", default=2, min=2, max=8, label="Worker Memory (GiB)"),
            String("image", default="pangeo/ml-notebook:2022.09.21", label="Image"),
            handler=options_handler,
        )
    backend:
      worker:
        cores:
          limit: 4
        memory:
          limit: 8G
        threads: 2
dask-kubernetes:
  enabled: false
jupyterhub:
  hub:
    config:
      GenericOAuthenticator:
        client_id: <client>
        client_secret: <secret>
        oauth_callback_url: https://pangeo-elastic.vm.fedcloud.eu/hub/oauth_callback
        authorize_url: https://aai-dev.egi.eu/auth/realms/egi/protocol/openid-connect/auth
        token_url: https://aai-dev.egi.eu/auth/realms/egi/protocol/openid-connect/token
        userdata_url: https://aai-dev.egi.eu/auth/realms/egi/protocol/openid-connect/userinfo
        login_service: EGI Check-In
        scope:
          - openid
          - email
          - profile
          - eduperson_entitlement
        username_key: preferred_username
        userdata_params:
          state: state
        allowed_groups:
          - urn:mace:egi.eu:group:vo.pangeo.eu:role=member#aai.egi.eu
        claim_groups_key: eduperson_entitlement
      JupyterHub:
        authenticator_class: generic-oauth
    services:
      dask-gateway:
        apiToken: <token1>
  proxy:
    secretToken: <token2>
    service:
      type: ClusterIP
  singleuser:
    cpu:
      guarantee: 1
      limit: 2
    defaultUrl: /lab
    extraEnv:
      DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE: '{JUPYTER_IMAGE_SPEC}'
    image:
      name: pangeo/ml-notebook
      tag: 2022.09.21
    memory:
      guarantee: 2G
      limit: 4G
    startTimeout: 600
    storage:
      capacity: 2Gi
      type: dynamic
rbac:
  enabled: true
and just issued the helm command:
sudo helm upgrade daskhub daskhub --repo=https://helm.dask.org --install --wait --cleanup-on-fail --create-namespace --namespace daskhub --version 2022.8.2 --values daskhub.yaml
Followed by reconfiguring ingress. The helm command and kubectl for ingress worked with no error. Same as the other day when trying to access IM Dashboard. I'll stop there for tonight. If someone can test the deployment at https://pangeo-foss4g.vm.fedcloud.eu and see if they can login? Edit: correct link is https://pangeo-elastic.vm.fedcloud.eu/ |
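For readers who want to see what the `optionHandler` above does to user-supplied values, here is a minimal stand-alone sketch. It assumes plain Python arguments instead of the `options` object that dask-gateway passes in, and the name `resolve_options` is hypothetical (it is not part of dask-gateway); only the tag check and the GiB-to-bytes conversion are taken from the configuration above.

```python
GIB = 2 ** 30  # bytes per GiB, same factor as in the optionHandler above


def resolve_options(worker_cores, worker_memory_gib, image):
    """Hypothetical stand-alone version of options_handler.

    Rejects images without an explicit tag and converts the requested
    worker memory from GiB to an integer number of bytes.
    """
    if ":" not in image:
        raise ValueError("When specifying an image you must also provide a tag")
    return {
        "worker_cores": worker_cores,
        "worker_memory": int(worker_memory_gib * GIB),
        "image": image,
    }


# Default values from the Options(...) declaration above.
print(resolve_options(1, 2, "pangeo/ml-notebook:2022.09.21"))
```

An untagged image such as `pangeo/ml-notebook` raises `ValueError` here, which matches the check in the handler.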
Thank you @guillaumeeb !!
In the example it was pangeo.vm.fedcloud.eu but I guess we can have it like pangeo-eosc.vm.fedcloud.eu ?
👍 I checked IP address of the cluster you made from swift dashboard and I could have the grafana login portal
If I remember right, the button in the IM Dashboard showing the cluster configuration status "configured" could be clicked, and there was information about cloudadm. |
Crap, I didn't indicate the right link. The correct one is https://pangeo-elastic.vm.fedcloud.eu. This is only a test deployment for now. There is no need to add /jupyterhub/. Jupyterhub is available at /. |
@sebastian-luna-valero do you have any idea why the auth might have failed? I'll have another look at it tonight. |
Hi, could you please double-check that the Redirect URI in https://aai-dev.egi.eu/federation has the same value as |
I think I get the problem: I simply didn't go through the registration of a new Service in https://aai-dev.egi.eu/federation, as I thought I could reuse the old OpenID credentials. Could we use the same credentials by adding another Redirect URI to the form you're showing? How can we get access to the management of the already existing service? |
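The Redirect URI check discussed here can be sketched in a few lines. This is only an illustration: the helper name `redirect_uri_mismatches` is hypothetical, and the "registered" value below is an assumed example built from a hostname mentioned earlier in the thread, not the actual value stored in the federation registry.

```python
from urllib.parse import urlsplit


def redirect_uri_mismatches(configured, registered):
    """Return the names of the URL components that differ between the
    oauth_callback_url in daskhub.yaml and the registered Redirect URI."""
    a, b = urlsplit(configured), urlsplit(registered)
    components = {
        "scheme": (a.scheme, b.scheme),
        "host": (a.netloc, b.netloc),
        "path": (a.path, b.path),
        "query": (a.query, b.query),
    }
    return [name for name, (x, y) in components.items() if x != y]


configured = "https://pangeo-elastic.vm.fedcloud.eu/hub/oauth_callback"
# Assumed registered value, for illustration only.
registered = "https://pangeo-foss4g.vm.fedcloud.eu/hub/oauth_callback"
print(redirect_uri_mismatches(configured, registered))
```

An empty list means the two values agree component by component; any entry points at the part of the URL to fix on one side or the other.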
Trying to answer some questions:
I believe user pods end up running on worker nodes. However, for large clusters, I like to have a big flavor for the front-end too (with the master role).
I normally take a copy of the version I choose, trying to be reproducible. Last time I deployed
It should be ok, but I still prefer to stay with 20.04.
This is something that I wanted to explore myself since I haven't tried it yet. I believe the minimum number of workers can also be configured from CLI, maybe we need to ask to expose this option on IM dashboard as well.
Correct, would you open a PR with this and other suggestions? Happy to review.
Please make sure you use https://aai-dev.egi.eu/federation since you can self-approve your request to add this new service. The secrets are available on that form after the service is approved.
If you only get |
Each service (with different URIs) has its own credentials. These credentials are only available to the "service owner", the one adding the config to https://aai-dev.egi.eu/federation, I am afraid. |
Okay, trying this now!
Yes, I'm planning to do this once everything works fine. Thanks for all the answers, that's really helpful! |
@sebastian-luna-valero Thanks to your input, I created a service in https://aai-dev.egi.eu/federation/egi/services and self-approved it. I think I was able to give you access to this service. However, it is still pending (Deployment in progress status). Do you know how long this can take (it's already been about 30 min)? While waiting for the OIDC credentials to be OK, I'm trying with NativeAuthenticator. |
So the platform seems to be working (Jupyterhub and Dask-gateway); however, I don't see any scaling up when I ask for more pods. I've launched a Dask-gateway cluster and scaled it. The default platform has only one worker node with 8 CPUs and 32 GB. Here are my pods waiting:
And if I check one Pending pod details:
But I still have two nodes:
Not sure where to look to see where the problem could be. |
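As a starting point for this kind of debugging, here is a small sketch that lists the pods the scheduler could not place. It assumes the input is the parsed output of `kubectl get pods --namespace daskhub -o json`; the function name `unschedulable_pods`, the pod names, and the message text in the sample are illustrative assumptions, not output from this deployment.

```python
def unschedulable_pods(pod_list):
    """Return (pod name, scheduler message) pairs for Pending pods whose
    PodScheduled condition is False with reason Unschedulable."""
    result = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        for cond in status.get("conditions", []):
            if (
                cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"
            ):
                result.append((pod["metadata"]["name"], cond.get("message", "")))
    return result


# Illustrative sample shaped like `kubectl get pods -o json` output.
sample = {
    "items": [
        {
            "metadata": {"name": "dask-worker-example-1"},
            "status": {
                "phase": "Pending",
                "conditions": [
                    {
                        "type": "PodScheduled",
                        "status": "False",
                        "reason": "Unschedulable",
                        "message": "0/2 nodes are available: 2 Insufficient cpu.",
                    }
                ],
            },
        },
        {"metadata": {"name": "hub-example"}, "status": {"phase": "Running"}},
    ]
}

for name, message in unschedulable_pods(sample):
    print(f"{name}: {message}")
```

On a real cluster, the kubectl output could be loaded with `json.loads` and passed to the same function. If pods show up here while the node count stays constant, the problem is more likely on the autoscaling side than in the Daskhub configuration.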
I also see
When connecting to front node. Should I do something about that? |
PR opened at #28. Still waiting for https://aai-dev.egi.eu/federation/egi/services, my service is still on I also tested manual scaling using IM Dashboard on this new deployment, this worked well. |
My bad, we should use https://aai.egi.eu/federation (instead of https://aai-dev.egi.eu/federation). Then select Please try again, and this should solve the issue (i.e. after a few seconds, the service will be automatically
I am double checking why that would be the case in: grycap/clues#114 (comment)
Ideally we should update and restart all the VMs in the cluster before the initial deployment, and then periodically to apply updates to the underlying operating system. This implies downtime, so I would do it immediately before the workshop and immediately afterwards.
Great! |
This worked! Thanks a lot. So that leaves us with the Elastic functionality not working. I just tried again by scaling a Dask cluster, but I still have pods that cannot be scheduled due to insufficient resources, and no new nodes incoming. @sebastian-luna-valero maybe we should open a new issue in https://github.com/grycap/clues rather than adding a comment in an existing issue? cc @micafer. Everyone should be able to log in to the https://pangeo-elastic.vm.fedcloud.eu deployment. I've just noticed a display error using Dask-gateway when displaying the cluster object in the notebook, but it's probably some minor bug. |
Please send me an email with the detailed problem and we can try to debug the issue. |
I logged in, looks great! Thank you @guillaumeeb!! I didn't test dask yet, but I do not see the cloud bucket. Is it the same Pangeo notebook docker image as we used for the https://pangeo-foss4g.vm.fedcloud.eu/ infrastructure? |
You mean the S3 browser on the left side bar? Yes, not sure why it is not there. I used pangeo/ml-notebook in the last available version. See the YAML file in https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md. I cannot check right now, but we might have used the pangeo-notebook image in the other deployment. I thought that ml-notebook was more complete, but I may be wrong. This can be easily changed! Other than that, it's the exact same deployment of Daskhub, so there won't be more functionality. As the elastic part of Kubernetes is not working currently, there is no benefit in using this deployment instead of the other one. |
I think https://github.com/IBM/jupyterlab-s3-browser needs to be added explicitly (and we could update https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md accordingly) We are currently trying to solve the elastic k8s option, and we will report back here. In the meantime, manual scaling up and down is the best option. |
Hi, I think the foss4g configuration was based on the pangeo-notebook docker image and not on the ml-notebook one. The purpose was to avoid using too many resources. @j34ni or @annefou, can you please confirm? Once the automatic scaling up works, does the existing Daskhub need to be destroyed and re-created to benefit from it? |
Yes, it used the pangeo-notebook:latest |
Yes, we would need to redeploy to get the elasticity. |
I just redeployed the Daskhub with the pangeo/pangeo-notebook Docker image, and the So now we're in the same setup, we just need to wait and see if we manage to get elasticity working. Also, I still encounter an error when starting a dask-gateway cluster, but this does not prevent using it. I'll open an issue on the pangeo-docker repo to get some feedback. This is probably due to the last version of the image. We can pin a previous one if needed.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
File /srv/conda/envs/notebook/lib/python3.9/site-packages/IPython/core/formatters.py:921, in IPythonDisplayFormatter.__call__(self, obj)
919 method = get_real_method(obj, self.print_method)
920 if method is not None:
--> 921 method()
922 return True
File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask_gateway/client.py:1225, in GatewayCluster._ipython_display_(self, **kwargs)
1223 widget = self._widget()
1224 if widget is not None:
-> 1225 return widget._ipython_display_(**kwargs)
1226 else:
1227 from IPython.display import display
AttributeError: 'VBox' object has no attribute '_ipython_display_' |
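Since the traceback comes from the rich notebook display of the cluster object, one way to keep working in the meantime is to print a plain-text summary instead of letting the notebook render the object. The sketch below assumes the object exposes `name` and `dashboard_link` attributes, as `GatewayCluster` does; the helper name `plain_summary` and the stand-in values are hypothetical.

```python
from types import SimpleNamespace


def plain_summary(cluster):
    """Build a text summary from plain attributes, avoiding the widget
    display path that raises the AttributeError above."""
    name = getattr(cluster, "name", "<unknown>")
    link = getattr(cluster, "dashboard_link", None)
    return f"cluster={name} dashboard={link}"


# Stand-in object with made-up values, so the sketch runs without a gateway.
fake_cluster = SimpleNamespace(
    name="daskhub.example",
    dashboard_link="https://pangeo-elastic.vm.fedcloud.eu/services/dask-gateway/clusters/daskhub.example/status",
)
print(plain_summary(fake_cluster))
```

In a notebook, calling `print(plain_summary(cluster))` rather than ending a cell with `cluster` sidesteps the display hook; it does not fix the underlying incompatibility in the image.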
And I confirm that ml-notebook image does not have s3-browser installed, see pangeo-data/pangeo-docker-images#383. |
Do we want to keep |
We should replace it for now, feel free to do it! |
Sure! #29 |
Just for information, I'm going to delete the pangeo-elastic infrastructure and create a new one using operational IM Dashboard instance after discussing with @micafer. |
I just redeployed the pangeo-elastic infrastructure. I see an improvement, but there are still things to debug for elasticity to be working. |
This is an issue so that we can coordinate for creating pangeo-eosc infrastructure based on elastic Kubernetes Virtual Cluster using IM-Dashboard.