Issues Upgrading from JHub 0.9 --> 10.2 #884

Closed
scottyhq opened this issue Nov 25, 2020 · 11 comments · Fixed by #885

Comments

@scottyhq
Member

scottyhq commented Nov 25, 2020

Currently unable to access the staging hub on AWS: the browser reports ERR_TIMED_OUT and "your connection to this site is unsecure". @consideRatio or @yuvipanda, your guidance would be much appreciated here. I've spent a bit of time looking over issues but I'm not having any epiphanies.

All pods are running, and there are no obvious error messages, but the hub log looks very suspicious to me, these lines in particular:

  • No config at /etc/jupyterhub/config/values.yaml vs Loading /etc/jupyterhub/config/values.yaml
  • Hub API listening on http://:8081/hub/ vs Hub API listening on http://0.0.0.0:8081/hub/
  • Using Spawner: kubespawner.spawner.KubeSpawner-0.14.1 vs Using Spawner: builtins.PatchedKubeSpawner
  • Initialized 0 spawners in 0.002 seconds vs Initialized 1 spawners in 0.119 seconds

Full diff compared to our functioning prod hub:
https://gist.github.com/scottyhq/e381d2b01e3db0a162ae317faf9a2193/revisions

ping @tjcrone @TomAugspurger

@TomAugspurger
Member

Hmm, did something change with DNS / static IPs? I notice that going to https://staging.us-central1-b.gcp.pangeo.io fails, but kubectl -n staging proxy-public shows an external IP of 34.69.173.244, which serves the login page. Actually logging in fails on the OAuth callback, I think, since that goes through pangeo.io.

@TomAugspurger
Member

TomAugspurger commented Nov 25, 2020

In the logs for the proxy pod:

2020-11-24T23:33:02.626083552Z 23:33:02.619 [ConfigProxy] info: Adding route / -> http://hub:8081
2020-11-24T23:33:02.641676722Z 23:33:02.635 [ConfigProxy] info: Proxying http://:::8000 to http://hub:8081
2020-11-24T23:33:02.641700627Z 23:33:02.636 [ConfigProxy] info: Proxy API at http://:::8001/api/routes
2020-11-24T23:33:02.649476952Z 23:33:02.642 [ConfigProxy] info: Route added / -> http://hub:8081
2020-11-24T23:33:46.605605637Z 23:33:46.605 [ConfigProxy] error: 503 GET / connect ECONNREFUSED 10.39.255.103:8081
2020-11-24T23:33:47.630779782Z 23:33:47.630 [ConfigProxy] error: Failed to get custom error page: Error: connect ECONNREFUSED 10.39.255.103:8081
2020-11-24T23:33:47.630821921Z     at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1145:16) {
2020-11-24T23:33:47.630827036Z   errno: 'ECONNREFUSED',
2020-11-24T23:33:47.630840654Z   code: 'ECONNREFUSED',
2020-11-24T23:33:47.630845460Z   syscall: 'connect',
2020-11-24T23:33:47.630849328Z   address: '10.39.255.103',
2020-11-24T23:33:47.630853181Z   port: 8081
2020-11-24T23:33:47.630856966Z }

Perhaps we should also disable the networkPolicy for the proxy pod, @scottyhq? Or do we need to add labels to it and the hub so they can talk to each other? Hopefully @consideRatio knows what's best here.

@consideRatio
Member

consideRatio commented Nov 25, 2020

Concerns raised

  • No config at /etc/jupyterhub/config/values.yaml vs Loading /etc/jupyterhub/config/values.yaml

Not to worry, I think; all config now goes into /etc/jupyterhub/secret/values.yaml instead.

  • Hub API listening on http://:8081/hub/ vs Hub API listening on http://0.0.0.0:8081/hub/

Not to worry, this is a detail that also allows for IPv6.

  • Using Spawner: kubespawner.spawner.KubeSpawner-0.14.1 vs Using Spawner: builtins.PatchedKubeSpawner

Not to worry, a security patch no longer needed.

  • Initialized 0 spawners in 0.002 seconds vs Initialized 1 spawners in 0.119 seconds

Not to worry, it is about active users I think.

Complexity reduction

  • Let's disable all network policies to remove one variable from the equation.

Diagnosis

  • When exactly do you get ERR_TIMED_OUT, @scottyhq? Did you get it by visiting /? Can you try visiting /hub/admin directly instead, which reduces the amount of logic triggered in the hub?
  • Is the hub image customized or similar? I tried to figure that out myself, but I'm not sure about the configuration locations.
  • Is it correct that the failing deployment on staging is configured in a common.yaml, staging.yaml, and pangeo-deploy/values.yaml, and secrets.yaml, but secrets.yaml doesn't contain anything else of interest other than actual secrets?

Feedback on #885

Explicitly setting daskhub.jupyterhub.proxy.https.enabled=true is required, so if it wasn't set before, that is perhaps the issue, since the error you describe also mentions "your connection to this site is unsecure", @scottyhq. I think this is the root issue.

@scottyhq
Member Author

scottyhq commented Nov 25, 2020

Explicitly setting daskhub.jupyterhub.proxy.https.enabled=true is required

I did deploy this change locally to staging, but it didn't fix the issue

Let's disable all network policies to remove one variable from the equation.

Can try this next

Can you try visiting /hub/admin directly

On Chrome I see 'establishing secure connection...' and then eventually 'ERR_TIMED_OUT'. On Safari things hang longer and eventually I get the 'can't establish secure connection' message.
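To tell a stalled TLS handshake apart from a refused connection without a browser, a small probe can help. This is only a sketch using the Python standard library; the function name probe_tls and its result labels are made up for illustration and are not part of any JupyterHub tooling.

```python
import socket
import ssl


def probe_tls(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Classify what happens when we attempt a TLS handshake with host:port.

    Returns one of: "ok", "timeout", "refused", "tls_error", "unreachable".
    """
    context = ssl.create_default_context()
    try:
        # The timeout applies to both the TCP connect and the TLS handshake.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return "ok"
    except socket.timeout:
        # Connect or handshake never completed (what a browser shows as a timeout).
        return "timeout"
    except ConnectionRefusedError:
        # Nothing is listening on that port.
        return "refused"
    except ssl.SSLError:
        # The peer answered, but the handshake or certificate check failed.
        return "tls_error"
    except OSError:
        # DNS failure, no route to host, and similar network errors.
        return "unreachable"
```

Calling probe_tls("staging.aws-uswest2.pangeo.io") from outside the cluster would then return "timeout" for the behaviour described above, and "ok" once HTTPS works end to end.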

Is the hub image customized or similar?

No. Image: jupyterhub/k8s-hub:0.10.2

correct that the failing deployment on staging is configured in a common.yaml, staging.yaml, and pangeo-deploy/values.yaml, and secrets.yaml, but secrets.yaml doesn't contain anything else of interest other than actual secrets?

Correct, we have those 4 values.yamls. secrets.yaml also contains load-balancer IPs. I notice that the ingress-nginx section got commented out at some point... perhaps that is the issue!

daskhub:
  jupyterhub:
    proxy:
      secretToken: SECRET
      service:
        loadBalancerIP: XXXXXXX.us-west-2.elb.amazonaws.com
    auth:
      custom:
        config:
          client_id: "SECRET"
          client_secret: "SECRET" 
    hub:
      services:
        dask-gateway:
          apiToken: "SECRET"
  dask-gateway:
    gateway:
      proxyToken: "SECRET"
      auth:
        type: jupyterhub
        jupyterhub:
          apiToken: "SECRET"
    # webProxy:
    #   service:
    #     loadBalancerIP:  XXXXXXXXX.us-west-2.elb.amazonaws.com
    # schedulerProxy:
    #   service:
    #     loadBalancerIP: XXXXXXXXXX.us-west-2.elb.amazonaws.com

# How do I get this IP without manually deploying?
# Just manually deploy, I guess
# ingress-nginx:
#   controller:
#     service:
#       loadBalancerIP: "XXXXXXXXX.us-west-2.elb.amazonaws.com"

(Is ingress-nginx only used for grafana @TomAugspurger ? #849)
ingress-nginx pod logs have some suspicious messages like 2020/11/25 01:30:49 [crit] 37#37: *16789 SSL_do_handshake() failed (SSL: error:141CF06C:SSL routines:tls_parse_ctos_key_share:bad key share) while SSL handshaking, client: 192.168.163.65, server: 0.0.0.0:443

@scottyhq
Member Author

scottyhq commented Nov 25, 2020

Running helm lint locally, I get the following error when trying to disable the proxy network policy following https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/security.html#enabling-and-disabling-network-policies

    proxy:
      networkPolicy:
        enabled: false

There is a long discussion of this in jupyterhub/zero-to-jupyterhub-k8s#1842

[ERROR] templates/: template: pangeo-deploy/charts/daskhub/charts/jupyterhub/templates/NOTES.txt:42:4: executing "pangeo-deploy/charts/daskhub/charts/jupyterhub/templates/NOTES.txt" at <fail "DEPRECATION: proxy.networkPolicy has been renamed to proxy.chp.networkPolicy">: error calling fail: DEPRECATION: proxy.networkPolicy has been renamed to proxy.chp.networkPolicy
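The rename reported by helm lint can be applied mechanically to an already-loaded values mapping. The sketch below assumes the values for the jupyterhub chart have been parsed into a plain Python dict (for us that would be the mapping under daskhub.jupyterhub); migrate_proxy_network_policy is a hypothetical helper, not something the chart provides.

```python
import copy


def migrate_proxy_network_policy(values: dict) -> dict:
    """Return a copy of `values` with proxy.networkPolicy moved to
    proxy.chp.networkPolicy, the key name the 0.10 chart expects."""
    migrated = copy.deepcopy(values)  # leave the caller's dict untouched
    proxy = migrated.get("proxy")
    if isinstance(proxy, dict) and "networkPolicy" in proxy:
        policy = proxy.pop("networkPolicy")
        chp = proxy.setdefault("chp", {})
        # If proxy.chp.networkPolicy is already set explicitly, keep it.
        chp.setdefault("networkPolicy", policy)
    return migrated
```

For example, {"proxy": {"networkPolicy": {"enabled": False}}} becomes {"proxy": {"chp": {"networkPolicy": {"enabled": False}}}}.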

@scottyhq
Member Author

So after redeploying with the following (full current config in #885):

    singleuser:
       networkPolicy:
         enabled: false

    proxy:
       chp:
         networkPolicy:
           enabled: false
       https:
         enabled: true

    hub:
      networkPolicy:
        enabled: false

kubectl get networkpolicy -A shows:

NAMESPACE         NAME        POD-SELECTOR                                                 AGE
icesat2-staging   autohttps   app=jupyterhub,component=autohttps,release=icesat2-staging   18h

@scottyhq
Member Author

scottyhq commented Nov 25, 2020

OK, I'm stuck again. No idea how to debug this further. I've turned off all network policies and am still not able to reach the landing page with https enabled. With https turned off (https: enabled: false) I can get to the login page, but Auth0 gets stuck because it only accepts https.

So it seems there is some issue with the https configuration, but none of the pod logs show any obvious errors. A quick glance around suggests there are some alternative HTTPS setups (compared to when we initially set things up over a year ago!). Maybe it's worth trying these unless others have some insight into what is no longer working:

@TomAugspurger
Member

The staging GCP deployment seems to be fixed with this (deployed locally):

diff --git a/pangeo-deploy/values.yaml b/pangeo-deploy/values.yaml
index c191ced..b4a6618 100644
--- a/pangeo-deploy/values.yaml
+++ b/pangeo-deploy/values.yaml
@@ -16,7 +16,20 @@ daskhub:
   jupyterhub:
     # Helm config for jupyterhub goes here
     # See https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/jupyterhub/values.yaml
+    proxy:
+      https:
+        enabled: true
+      chp:
+        networkPolicy:
+          enabled: false
+      traefik:
+        networkPolicy:
+          enabled: false
     singleuser:
+      networkPolicy:
+        # Disable network security policy, perhaps causing upgrade issues.
+        # https://github.com/pangeo-data/pangeo-cloud-federation/issues/884
+        enabled: false
       cpu:
         limit: 2
         guarantee: 1

I might move those changes to just apply to GCP for now, till we get the rest sorted out.


Unfortunately, I don't have much help to offer for the AWS side. I've never really understood how they do load balancers.

TomAugspurger added a commit to TomAugspurger/pangeo-cloud-federation that referenced this issue Nov 28, 2020
TomAugspurger added a commit that referenced this issue Nov 28, 2020
@scottyhq
Member Author

scottyhq commented Nov 30, 2020

@consideRatio - a clue at least for what is going wrong on AWS, "Error syncing load balancer: failed to ensure load balancer: LoadBalancerIP cannot be specified for AWS ELB"

Abbreviated output of kubectl describe svc -n icesat2-staging proxy-public:

Name:                     proxy-public
Namespace:                icesat2-staging
Labels:                   app=jupyterhub
                          app.kubernetes.io/managed-by=Helm
                          chart=jupyterhub-0.10.6
                          component=proxy-public
                          heritage=Helm

Events:
  Type     Reason                  Age                   From                Message
  ----     ------                  ----                  ----                -------
  Normal   EnsuringLoadBalancer    7m6s (x339 over 27h)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  2m6s (x340 over 27h)  service-controller  Error syncing load balancer: failed to ensure load balancer: LoadBalancerIP cannot be specified for AWS ELB

@salvis2 suggested removing the following config as a fix for the similar error message Error creating load balancer, but this is a bit different since we already have an existing ELB: jupyterhub/zero-to-jupyterhub-k8s#1477 (comment)

daskhub:
  jupyterhub:
    proxy:
      service:
        loadBalancerIP: XXXXXXX.us-west-2.elb.amazonaws.com
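Note that the value configured here is an ELB hostname rather than an IP address. A quick lint over the merged values can list every loadBalancerIP entry and flag the ones that are not IPs. This is a sketch that assumes the values are already loaded into a nested dict; both helper names are hypothetical. Going by the error message, the AWS ELB integration rejects the field whatever its value, so on AWS any hit is worth a look.

```python
import ipaddress
from typing import Iterator, Tuple


def find_load_balancer_ips(values, path=()) -> Iterator[Tuple[str, str]]:
    """Yield (dotted path, value) for every loadBalancerIP key in nested dicts."""
    if isinstance(values, dict):
        for key, value in values.items():
            if key == "loadBalancerIP":
                yield ".".join(path + (str(key),)), value
            else:
                yield from find_load_balancer_ips(value, path + (str(key),))


def is_ip_address(value: str) -> bool:
    """True if `value` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False
```

With the config above, this reports daskhub.jupyterhub.proxy.service.loadBalancerIP, and is_ip_address returns False for its hostname value.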

@scottyhq
Member Author

Looking around a bit more at https://docs.aws.amazon.com/eks/latest/userguide/load-balancing.html, I suspect we're having issues because cluster subnets now require tags to cooperate with K8s and the classic load balancer. The docs say: "Public subnets must be tagged as follows so that Kubernetes knows to use only those subnets for external load balancers instead of choosing a public subnet in each Availability Zone (in lexicographical order by subnet ID). If you use eksctl or an Amazon EKS AWS CloudFormation template to create your VPC after March 26, 2020, then the subnets are tagged appropriately when they're created."

We created this cluster a long while back and I don't see kubernetes.io/role/elb: 1 tags for the subnets in the AWS console.
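Rather than eyeballing the console, the same check can be run over the JSON that aws ec2 describe-subnets returns. The sketch below assumes that shape (a list of subnets, each with SubnetId and a Tags list of Key/Value pairs) and assumes the caller passes in only the public subnets; subnets_missing_elb_tag is a made-up helper name.

```python
from typing import Dict, List

ELB_ROLE_TAG = "kubernetes.io/role/elb"


def subnets_missing_elb_tag(subnets: List[Dict]) -> List[str]:
    """Return the IDs of subnets that lack the kubernetes.io/role/elb=1 tag."""
    missing = []
    for subnet in subnets:
        # Flatten [{"Key": ..., "Value": ...}, ...] into a plain dict.
        tags = {t["Key"]: t["Value"] for t in subnet.get("Tags", [])}
        if tags.get(ELB_ROLE_TAG) != "1":
            missing.append(subnet["SubnetId"])
    return missing
```

An empty result would mean every public subnet passed in already carries the tag.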

related issue eksctl-io/eksctl#1982

@scottyhq
Member Author

scottyhq commented Dec 1, 2020

Many many thanks to @consideRatio for taking the time to do some live debugging with me on this. To fix this we ended up deleting the existing load balancer (over 600 days old) and redeploying. Here is a quick summary:

  1. To address the Error syncing load balancer message above, make sure all AWS EKS components are up to date (without bumping the k8s version):
# 1.16.8
eksctl utils update-legacy-subnet-settings --cluster pangeo
eksctl utils update-coredns --cluster pangeo --approve
eksctl utils update-aws-node --cluster pangeo --approve
eksctl utils update-kube-proxy --cluster pangeo --approve
# patch kube-proxy pods according to https://github.com/weaveworks/eksctl/issues/1088#issuecomment-717429367
kubectl edit daemonset kube-proxy --namespace kube-system
  2. Delete the existing load balancer (the proxy-public service):
kubectl delete -n icesat2-staging svc proxy-public
  3. Comment out the loadBalancerIP mapping. AWS ELBs do not need it, and the service controller rejects it (see the error above):
daskhub:
  jupyterhub:
    proxy:
      #service:
      #  loadBalancerIP: XXXXXXXXXXXXXXX.us-west-2.elb.amazonaws.com
  4. Redeploy and remap the DNS CNAME XXXXXXXXXXXXXXX.us-west-2.elb.amazonaws.com --> staging.aws-uswest2.pangeo.io:
hubploy deploy icesat2 pangeo-deploy staging

scottyhq added a commit that referenced this issue Dec 1, 2020
closes #884. fix AWS staging for jupyterhub 10.2