Issues Upgrading from JHub 0.9 --> 10.2 #884
Comments
Hmm, did something change with DNS / static IPs? I notice that going to https://staging.us-central1-b.gcp.pangeo.io fails, but …
In the logs for the proxy pod:
Perhaps we also disable the networkPolicy for the proxy pod, @scottyhq? Or do we need to add labels to it / the hub so they can talk to each other? Hopefully @consideRatio knows what's best here.
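If it is a NetworkPolicy / label mismatch, a quick check is to look at which policies exist and what labels the hub and proxy pods actually carry. A rough sketch (the "staging" namespace name here is an assumption):

# list the chart-managed network policies and what they select on
kubectl get networkpolicy -n staging
kubectl describe networkpolicy -n staging
# compare against the labels on the hub / proxy pods
kubectl get pods -n staging --show-labels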
Concerns raised
Not to worry I think, all config is put into
Not to worry, a detail to also allow for IPv6 in this part.
Not to worry, a security patch that is no longer needed.
Not to worry, it is about active users, I think.
Complexity reduction
Diagnosis
Feedback on #885
Explicitly setting daskhub.jupyterhub.proxy.https.enabled=true is required, so if it wasn't set before, that is perhaps the issue, as it seems you have …
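For reference, a rough sketch of that flattened key written out as nested YAML (the file name below is only illustrative; in this repo it would live in the deployment's values file):

# daskhub.jupyterhub.proxy.https.enabled=true, as nested YAML
cat >> values-https.yaml <<'EOF'
daskhub:
  jupyterhub:
    proxy:
      https:
        enabled: true
EOF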
I did deploy this change locally to staging, but it didn't fix the issue
Can try this next
On Chrome I see 'establishing secure connection...' and then eventually ERR_TIMED_OUT. On Safari things hang longer and eventually end with the 'can't establish secure connection' message.
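One way to tell whether nothing answers on port 443 at all, versus the TLS handshake itself failing (the hostname below is the GCP staging one from earlier in the thread; substitute the AWS one as needed):

# does the connection open, and does the TLS handshake complete?
curl -vk --connect-timeout 10 https://staging.us-central1-b.gcp.pangeo.io
openssl s_client -connect staging.us-central1-b.gcp.pangeo.io:443 \
  -servername staging.us-central1-b.gcp.pangeo.io </dev/null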
No
Correct, we have those 4 values.yaml files. secrets.yaml also contains load-balancer IPs; I notice that the ingress-nginx got commented out at some point... perhaps that is the issue!

daskhub:
  jupyterhub:
    proxy:
      secretToken: SECRET
      service:
        loadBalancerIP: XXXXXXX.us-west-2.elb.amazonaws.com
    auth:
      custom:
        config:
          client_id: "SECRET"
          client_secret: "SECRET"
    hub:
      services:
        dask-gateway:
          apiToken: "SECRET"
  dask-gateway:
    gateway:
      proxyToken: "SECRET"
      auth:
        type: jupyterhub
        jupyterhub:
          apiToken: "SECRET"
    # webProxy:
    #   service:
    #     loadBalancerIP: XXXXXXXXX.us-west-2.elb.amazonaws.com
    # schedulerProxy:
    #   service:
    #     loadBalancerIP: XXXXXXXXXX.us-west-2.elb.amazonaws.com
    # How do I get this IP without manually deploying?
    # Just manually deploy, I guess
# ingress-nginx:
#   controller:
#     service:
#       loadBalancerIP: "XXXXXXXXX.us-west-2.elb.amazonaws.com"

(Is ingress-nginx only used for grafana @TomAugspurger ? #849)
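One way to see what the load-balancer-backed services were actually assigned, as opposed to what the values file asks for (a sketch; the "staging" namespace name is an assumption):

# EXTERNAL-IP shows what the cloud provider actually handed out
kubectl get svc -n staging
# proxy-public is the JupyterHub chart's public-facing service
kubectl describe svc -n staging proxy-public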
running

proxy:
  networkPolicy:
    enabled: false

Big long discussion of things here
so after redeploying with (full current config in #885)

singleuser:
  networkPolicy:
    enabled: false
proxy:
  chp:
    networkPolicy:
      enabled: false
  https:
    enabled: true
hub:
  networkPolicy:
    enabled: false
OK, I'm stuck again. No idea how to debug this further. I've turned off all network policies and still am not able to reach the landing page with https enabled. Turning off https (https: enabled: false) I can get to the login page, but auth0 gets stuck because it expects https. So it seems there is some issue with the https configuration, but none of the pod logs show any obvious errors. A quick glance around suggests there are some alternative HTTPS setups (compared to when we initially set things up over a year ago!). Maybe it's worth trying those unless others have some insight into what is no longer working:
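One more place worth looking: with proxy.https.enabled=true and the chart's built-in Let's Encrypt setup, z2jh 0.10 terminates TLS in a separate autohttps (Traefik) pod, and certificate problems tend to show up only in its logs. A sketch, assuming default resource names and a "staging" namespace:

# is the autohttps pod there and healthy?
kubectl get pods -n staging -l component=autohttps
# certificate acquisition / TLS errors usually appear here
kubectl logs -n staging deploy/autohttps -c traefik --tail=100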
The staging GCP deployment seems to be fixed with this (deployed locally):

diff --git a/pangeo-deploy/values.yaml b/pangeo-deploy/values.yaml
index c191ced..b4a6618 100644
--- a/pangeo-deploy/values.yaml
+++ b/pangeo-deploy/values.yaml
@@ -16,7 +16,20 @@ daskhub:
   jupyterhub:
     # Helm config for jupyterhub goes here
     # See https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/jupyterhub/values.yaml
+    proxy:
+      https:
+        enabled: true
+      chp:
+        networkPolicy:
+          enabled: false
+      traefik:
+        networkPolicy:
+          enabled: false
     singleuser:
+      networkPolicy:
+        # Disable network security policy, perhaps causing upgrade issues.
+        # https://github.com/pangeo-data/pangeo-cloud-federation/issues/884
+        enabled: false
       cpu:
         limit: 2
         guarantee: 1

I might move those changes to just apply to GCP for now, till we get the rest sorted out. Unfortunately, I don't have much help to offer for the AWS side. I've never really understood how they do load balancers.
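Scoping those changes to GCP only would roughly mean leaving pangeo-deploy/values.yaml alone and putting the same keys into the GCP deployment's own values file instead. A sketch (the file path below is an assumption, not the repo's actual layout):

# keep the shared chart defaults untouched; override only for the GCP deployment
cat >> deployments/gcp-uscentral1b/values.yaml <<'EOF'
daskhub:
  jupyterhub:
    proxy:
      https:
        enabled: true
      chp:
        networkPolicy:
          enabled: false
      traefik:
        networkPolicy:
          enabled: false
    singleuser:
      networkPolicy:
        enabled: false
EOF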
@consideRatio - a clue at least for what is going wrong on AWS: "Error syncing load balancer: failed to ensure load balancer: LoadBalancerIP cannot be specified for AWS ELB"
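That matches how classic ELBs work: AWS hands back a DNS hostname rather than taking a fixed IP, so spec.loadBalancerIP has to stay unset. A quick check of whether the rendered service still carries it (a sketch, assuming the z2jh default service name and a "staging" namespace):

# non-empty output means the chart is still setting loadBalancerIP on the service
kubectl get svc -n staging proxy-public -o jsonpath='{.spec.loadBalancerIP}{"\n"}'
# the "Error syncing load balancer" message shows up as an event on that service
kubectl get events -n staging --field-selector involvedObject.name=proxy-public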
@salvis2 suggested the fix of removing the following config for a similar error message:
Looking around a bit more (https://docs.aws.amazon.com/eks/latest/userguide/load-balancing.html), I'm suspecting we're having issues because cluster subnets now require tags to cooperate with Kubernetes and the classic load balancer, it seems: "Public subnets must be tagged as follows so that Kubernetes knows to use only those subnets for external load balancers instead of choosing a public subnet in each Availability Zone (in lexicographical order by subnet ID). If you use eksctl or an Amazon EKS AWS CloudFormation template to create your VPC after March 26, 2020, then the subnets are tagged appropriately when they're created." We created this cluster a long while back and I don't see those tags (related issue: eksctl-io/eksctl#1982).
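For reference, the tags that doc is talking about can be added to existing subnets after the fact (placeholder subnet IDs below; this is roughly what the eksctl subnet-settings command in the next comment automates):

# public subnets: let Kubernetes place external load balancers there
aws ec2 create-tags --resources subnet-aaaa subnet-bbbb \
  --tags Key=kubernetes.io/role/elb,Value=1
# all cluster subnets: associate them with this cluster
aws ec2 create-tags --resources subnet-aaaa subnet-bbbb \
  --tags Key=kubernetes.io/cluster/pangeo,Value=shared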
Many many thanks to @consideRatio for taking the time to do some live debugging with me on this. To fix this we ended up deleting the existing load balancer (over 600 days old) and redeploying. Here is a quick summary:

# 1.16.8
eksctl utils update-legacy-subnet-settings --cluster pangeo
eksctl utils update-coredns --cluster pangeo --approve
eksctl utils update-aws-node --cluster pangeo --approve
eksctl utils update-kube-proxy --cluster pangeo --approve
# patch kube-proxy pods according to https://github.com/weaveworks/eksctl/issues/1088#issuecomment-717429367
kubectl edit daemonset kube-proxy --namespace kube-system
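The "delete the existing load balancer and redeploy" step boils down to deleting the proxy service so a fresh ELB gets provisioned, redeploying the chart, and then pointing DNS at the new hostname. A sketch, assuming the z2jh default service name and a "staging" namespace:

# removing the Service also releases the ~600-day-old ELB behind it
kubectl delete svc -n staging proxy-public
# after redeploying the chart, grab the new ELB hostname for the DNS record
kubectl get svc -n staging proxy-public \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'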
closes #884. fix AWS staging for jupyterhub 10.2
Currently unable to access staging hub on AWS with an ERR_TIMED_OUT and "your connection to this site is unsecure". @consideRatio or @yuvipanda, your guidance would be much appreciated here; I've spent a bit of time looking over issues but I'm not having any epiphanies. All pods are running and there are no obvious error messages, but the hub log is very suspicious to me, these lines in particular:
Loading /etc/jupyterhub/config/values.yaml
Hub API listening on http://0.0.0.0:8081/hub/
Using Spawner: builtins.PatchedKubeSpawner
Initialized 1 spawners in 0.119 seconds
Full diff compared to our functioning prod hub:
https://gist.github.com/scottyhq/e381d2b01e3db0a162ae317faf9a2193/revisions
ping @tjcrone @TomAugspurger