POC: Set up Lagoon on EKS and get a demo of the CMS up and running #6674

Closed
7 of 8 tasks
cweagans opened this issue Oct 12, 2021 · 14 comments
Labels: Blocked (issues that are blocked on factors other than blocking issues), Platform CMS Team

Comments


cweagans commented Oct 12, 2021

Description

As a CMS engineer, I would like to validate that Lagoon will be sufficient for our needs so that we can begin to evaluate the value and cost-savings that Lagoon potentially offers.

Acceptance Criteria

  • We have administrative access to a cluster for testing purposes
  • Documentation has been added concerning the details of:
    • deploying and configuring the Lagoon system on EKS
    • configuring end user tooling for Lagoon
    • creating and deploying a Lagoon project for the CMS
  • A retrospective report on the proof-of-concept has been authored and made available to the CMS team and other interested parties.

CMS Team

Please leave only the team that will do this work selected. If you're not sure, it's fine to leave both selected.

  • Platform CMS Team
  • Sitewide CMS Team

Related #6673

@mchelen-gov

Can we size this?

@jefflbrauer

Please add your planning poker estimate with ZenHub @cweagans

timcosgrove removed the "Needs refining" (issue status) label Oct 14, 2021
cweagans changed the title from "POC: Set up a new EKS cluster, install Lagoon, and get a demo of the CMS up and running" to "POC: Set up Lagoon on EKS and get a demo of the CMS up and running" Oct 25, 2021

ndouglas commented Oct 27, 2021

Lagoon Core

  • Following this guide
  • Deployed Lagoon-Core via Helm with values.yaml
    • Super preliminary values file, shouldn't be relied on as a source of truth
    • I'll try to keep it updated as I fumble through this process
    • I'll probably fail
  • Created some Route53 records:
    • lagoon-dev.cms.va.gov CNAME -> traefik-dev.vfs.va.gov
    • *.lagoon-dev.cms.va.gov CNAME ALIAS -> lagoon-dev.cms.va.gov

Screen Shot 2021-10-27 at 6 53 03 AM

I'm not sure what the ANSI art is trying to be. If I can get a source image, I'll try to create a clearer one.

Upon requesting http://api.lagoon-dev.cms.va.gov/ (HTTPS doesn't work, because reasons), I get:

Screen Shot 2021-10-27 at 6 47 16 AM

Which is actually success (at this point)! That means that the routing and ingress are working properly. Next I'll fight with Keycloak, I guess.

  • I was able to log in to Keycloak (:tada:) and set an email for the user (which I set to my A6 email).

  • I did not initially configure the email server settings -- hoping that might be beyond the scope of this PoC.

  • EDIT: Removed SES stuff -- unnecessary AFAICT.

  • I enabled the "Forgot Password" functionality and attempted to access the UI. However, I was greeted only with "Not Authenticated / Please wait while we log you in..." and nothing ever happened. I'm not sure if this is because of a lack of TLS (it shouldn't be).

  • Oh:

Screen Shot 2021-10-27 at 8 07 19 AM

I think that's fixable by overriding the keycloakAPIURL in the values file.

        - name: KEYCLOAK_API
          {{- if .Values.keycloakAPIURL }}
          value: {{ .Values.keycloakAPIURL | quote }}
          {{- else }}
          value: https://{{ index .Values.keycloak.ingress.hosts 0 "host" }}/auth
          {{- end }}

And it works:

Screen Shot 2021-10-27 at 8 19 45 AM

Well, sorta:

Screen Shot 2021-10-27 at 8 20 55 AM

Probably because this:

Screen Shot 2021-10-27 at 8 22 24 AM

So let's override this other URL:

        - name: GRAPHQL_API
          {{- if .Values.lagoonAPIURL }}
          value: {{ .Values.lagoonAPIURL | quote }}
          {{- else }}
          value: https://{{ index .Values.api.ingress.hosts 0 "host" }}/graphql
          {{- end }}

And:
Screen Shot 2021-10-27 at 8 28 32 AM
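
For anyone following along: the two template snippets above share one rule, which is to use the explicit override if it is set and otherwise derive the URL from the first ingress host. A minimal Python sketch of that rule follows. The function name and the example values are hypothetical (the Keycloak hostname in particular is invented), and the http:// override is an assumption based on HTTPS not working in this environment.

```python
def resolve_url(values, override_key, component, suffix):
    """Mirror the chart's if/else: explicit override first, else first ingress host."""
    override = values.get(override_key)
    if override:
        return override
    host = values[component]["ingress"]["hosts"][0]["host"]
    return f"https://{host}{suffix}"


# Hypothetical values, shaped like the paths the templates read.
values = {
    "keycloak": {"ingress": {"hosts": [{"host": "keycloak.lagoon-dev.cms.va.gov"}]}},
    "api": {"ingress": {"hosts": [{"host": "api.lagoon-dev.cms.va.gov"}]}},
    # Override set: the https:// default is bypassed for the API URL.
    "lagoonAPIURL": "http://api.lagoon-dev.cms.va.gov/graphql",
}

print(resolve_url(values, "keycloakAPIURL", "keycloak", "/auth"))
print(resolve_url(values, "lagoonAPIURL", "api", "/graphql"))
```

Without an override the derived URL always uses https://, which is why both values had to be set explicitly here.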

  • I added a couple of SSH keys, but then realized that I might be heading a bit out of my depth. There's a Lagoon SSH LoadBalancer-type service, which should provision a Network Load Balancer, but since we're in dev and not utility it might not necessarily be reachable. So the Lagoon CLI might take a couple more steps to get operative. Indeed, I can't see any NLBs allocated for the dev cluster aside from a jumpbox NLB, and that makes me think I should stop and work on something else until someone more familiar with the system gets online.


ndouglas commented Oct 27, 2021

Harbor

Harbor... just... kinda worked, I guess.
Screen Shot 2021-10-27 at 8 48 50 AM

I updated Lagoon-Core with the Harbor admin password, but obv didn't update the Gist.

Screen Shot 2021-10-27 at 9 11 31 AM

Lagoon Remote

I created a values.yaml file for Lagoon Remote and deployed the helm chart. It, uh, appears to have deployed successfully:

Screen Shot 2021-10-27 at 9 13 29 AM

I mean, who knows what it's actually doing, but I'll burn that bridge when I come to it.


ndouglas commented Oct 27, 2021

SSH

SSH access to the lagoon-core-ssh service is required to access Lagoon through the CLI. I thought the service had launched correctly, but upon closer inspection found that it was in Pending state. After debugging some with Eric and Elijah, we found this page, which had the answer:

    # This annotation is only required if you are creating an internal facing ELB. Remove this annotation to create public facing ELB.
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"

After editing this into the service, the NLB seemed reachable via SSH from CMS-Test Dev:

sh-4.2$ telnet dsva-vagov-dev-jumpbox-nlb-d30d20cd3ae50f82.elb.us-gov-west-1.amazonaws.com 22
Trying 10.247.96.216...
Connected to dsva-vagov-dev-jumpbox-nlb-d30d20cd3ae50f82.elb.us-gov-west-1.amazonaws.com.
Escape character is '^]'.
SSH-2.0-OpenSSH_7.4
^C^]
telnet> quit
Connection closed.

The issue from here is that this isn't cleanly accessible from our local machines. A solution is probably straightforward for someone better versed in SOCKS and so forth. I'm currently messing with ProxyJump/ProxyCommand in SSH trying to get this working 🤔

This was the magic necessary to be able to connect (not log in) from my local machine.

Host lagoon
    HostName internal-a5db579a60ddc4d94bd3bdd6cde40ef9-1394069038.us-gov-west-1.elb.amazonaws.com
    User lagoon
    ProxyCommand ssh -q -A dsva@vetsgov-dev-jumpbox-govwest-1b  nc %h %p

From here I can generate a token. However, it appears that the Lagoon CLI doesn't use ~/.ssh/config but attempts to log in to the specified hostname directly, e.g. doing its own DNS lookup and so on. This might require upstream patches.


ndouglas commented Oct 28, 2021

SSH and SOCKS5

I thought Lagoon-CLI used the Go SSH client library, but upon closer inspection it seemed to use the SSH CLI. Then, upon still closer inspection, it only seemed to use the SSH CLI under certain circumstances.

After discussing this with Elijah, Eric, and Cameron, we figured that a good course of action would be to modify the Lagoon CLI to support SOCKS5 or ProxyJump/ProxyCommand or something. Elijah opened an issue.
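
To make the idea concrete, here is a rough sketch of what "support ProxyCommand" could look like, written in Python for brevity rather than the CLI's Go. Everything in it is hypothetical (the function name, the proxy_command parameter, the placeholder hostnames). It only illustrates passing the option through to ssh as a setting instead of hard-coding it.

```python
import shlex


def build_ssh_argv(username, hostname, port, proxy_command=None):
    """Build an ssh argument list, adding -o ProxyCommand=... only when asked."""
    argv = ["ssh", "-t"]
    if proxy_command:
        # ssh expands %h and %p itself, so they are passed through untouched.
        argv += ["-o", f"ProxyCommand={proxy_command}"]
    argv += [
        "-o", "UserKnownHostsFile=/dev/null",
        "-o", "StrictHostKeyChecking=no",
        "-p", str(port),
        f"{username}@{hostname}",
    ]
    return argv


argv = build_ssh_argv(
    "lagoon",
    "ssh.example.internal",  # placeholder hostname
    22,
    proxy_command="ssh -q -A jump.example.internal nc %h %p",  # placeholder jump host
)
print(shlex.join(argv))
```

Building an argument list rather than one formatted string also sidesteps the nested quoting that a single command string needs.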

This morning, I did some tentative work in that direction. Then I started getting itchy and changed the generated SSH command in the codepath that I was fairly sure was never executed, and -- it started working 😕

diff --git a/pkg/lagoon/ssh/main.go b/pkg/lagoon/ssh/main.go
index 3b6e013..23a1ee2 100644
--- a/pkg/lagoon/ssh/main.go
+++ b/pkg/lagoon/ssh/main.go
@@ -120,7 +120,7 @@ func RunSSHCommand(lagoon map[string]string, sshService string, sshContainer str
 
 // GenerateSSHConnectionString .
 func GenerateSSHConnectionString(lagoon map[string]string, service string, container string) string {
-	connString := fmt.Sprintf("ssh -t -o \"UserKnownHostsFile=/dev/null\" -o \"StrictHostKeyChecking=no\" -p %v %s@%s", lagoon["port"], lagoon["username"], lagoon["hostname"])
+	connString := fmt.Sprintf("ssh -o \"ProxyCommand=ssh -q -A dsva@vetsgov-dev-jumpbox-govwest-1b  nc %%h %%p\" -t -o \"UserKnownHostsFile=/dev/null\" -o \"StrictHostKeyChecking=no\" -p %v %s@%s", lagoon["port"], lagoon["username"], lagoon["hostname"])
 	if service != "" {
 		connString = fmt.Sprintf("%s service=%s", connString, service)
 	}

Kinda:

🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ ./lagoon-cli login
Error: Post "http://api.lagoon-dev.cms.va.gov/graphql": dial tcp: lookup api.lagoon-dev.cms.va.gov: no such host

So we need that SOCKS5 proxy to cover everything.

But Go can import proxy information from an HTTP_PROXY environment variable, so:

🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ export HTTP_PROXY="socks5://127.0.0.1:2001/"
🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ ./lagoon-cli login
Token fetched and saved.
🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ ./lagoon-cli whoami
ID                                  	EMAIL                    	FIRSTNAME	LASTNAME	SSHKEYS 
2015e338-4c55-44f5-8217-25f77af81937	[email protected]	Nathan   	Douglas 	2	

🎉

So at this point my work is unblocked and I can go find some new obstacle to slam into at high speed.

But... why does it work? At this point in my engineering career, nothing makes me more suspicious than something that Just Works™. I did not bleed enough, I did not suffer enough for this to work.

So I git stashed my changes, rebuilt the CLI, and... it still worked. I changed to a new tab (without the exported HTTP_PROXY variable), re-ran it, and... it still worked. I removed the token completely, re-ran, and... it worked.

Something is rotten in the state of Denmark.
– Shakespeare Hamlet 1.4.???

After some poking around, I think that the answer was just to change my SSH connection info for Lagoon:

current: lagoon-dev
default: lagoon-dev
lagoons:
  amazeeio:
    graphql: https://api.lagoon.amazeeio.cloud/graphql
    hostname: ssh.lagoon.amazeeio.cloud
    ui: https://dashboard.amazeeio.cloud
    kibana: https://logs.amazeeio.cloud/
    port: "32222"
    token: ""
    version: ""
  lagoon-dev:
    graphql: http://api.lagoon-dev.cms.va.gov/graphql
    hostname: internal-a5db579a60ddc4d94bd3bdd6cde40ef9-1394069038.us-gov-west-1.elb.amazonaws.com
    ui: https://ui.lagoon-dev.cms.va.gov
    kibana: ""
    port: "22"
    token: <lemme 'lone>
    version: v2.1.0
updatecheckdisable: false
environmentfromdirectory: false

Then export the HTTP_PROXY. Then things seem to work and we can continue on our quest.

I still don't really understand why this works. internal-a5db579a60ddc4d94bd3bdd6cde40ef9-1394069038.us-gov-west-1.elb.amazonaws.com resolves to 10.247.x.y via dig. This address can't be pinged. It can, however, be SSH'ed to. So this sounds like an OSI layer thing. I suspect that there's some SOCKS5 setting somewhere that's getting picked up, but I don't know where it is.
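
A possible partial explanation, offered as an assumption rather than something verified against this cluster: ping uses ICMP while SSH uses TCP, and load balancers and security groups commonly pass one and not the other. "Can't ping, can SSH" is therefore not contradictory on its own. A small Python sketch for checking TCP reachability directly (the helper name is made up):

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling tcp_port_open with the ELB hostname and port 22 from the jumpbox and again from a laptop would show where the TCP path actually exists, independent of whether ICMP gets through.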

Fun With GraphQL

The next step is to play with Lagoon via GraphQL. Unfortunately:

Screen Shot 2021-10-28 at 10 16 27 AM

GraphiQL doesn't expose any sort of SOCKS proxy configuration.

boo-boo-booooooo

Fortunately, this is precisely the sort of suffering I've come to expect in engineering.

With this command:

curl -g \
  --socks5-hostname 127.0.0.1:2001 \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <my-token>" \
  -d '{"query":"query allProjects {allProjects {name } }"}' \
  http://api.lagoon-dev.cms.va.gov/graphql

I received the expected response:

{"data":{"allProjects":[]}}
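
Hand-escaping JSON inside -d gets painful once the query contains quotes or newlines. A minimal sketch of generating the request body with Python's json module instead. The helper name is made up, and the output is meant to be passed to curl's -d option.

```python
import json


def graphql_body(query, variables=None):
    """Serialize a GraphQL request body; json.dumps handles all quoting."""
    body = {"query": query}
    if variables is not None:
        body["variables"] = variables
    return json.dumps(body)


query = "query allProjects { allProjects { name } }"
print(graphql_body(query))
```

The same helper works for mutations whose arguments contain quoted strings, since the escaping happens at serialization time.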

With the following query:

curl -g \
  --socks5-hostname 127.0.0.1:2001 \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <lagoon-token>" \
  -d '{"query": "mutation addKubernetes {\r\n  addKubernetes(input:\r\n  {\r\n    name: \"lagoon-dev\",\r\n    consoleUrl: \"https:\/\/4FE820642ABFA95BCB6854C69A1AF5A2.gr7.us-gov-west-1.eks.amazonaws.com\",\r\n    token: \"<kubernetes-build-deploy-token>\",\r\n    routerPattern: \"${environment}.${project}.lagoon-dev.cms.va.gov\"\r\n  }){id}\r\n}"}' \
  http://api.lagoon-dev.cms.va.gov/graphql

I got the following:

{"data":{"addKubernetes":{"id":1}}}

Which might also be a sign that things are working. I'm not 100% on the legitimacy of that build-deploy token, though. My kubectl isn't working for some reason, and so I looked up the token in Lens and base64 decoded it. If there's a permission failure after this point, I might need to set that token to the base64 encoded value instead, or something like that.
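
For what it's worth, Kubernetes stores Secret data base64-encoded, so the decoded value is normally the one to hand to clients. One quick sanity check, sketched in Python, is to decode the value and confirm that the result looks like a JWT (three non-empty dot-separated parts), which is the shape service account tokens usually have. The helper name and the sample value are made up.

```python
import base64


def decode_secret_token(encoded):
    """Decode a base64 Secret value and report whether it looks like a JWT."""
    token = base64.b64decode(encoded).decode("utf-8").strip()
    looks_like_jwt = token.count(".") == 2 and all(token.split("."))
    return token, looks_like_jwt


# Sample only: base64 of the placeholder string "aaa.bbb.ccc".
sample = base64.b64encode(b"aaa.bbb.ccc").decode("ascii")
token, ok = decode_secret_token(sample)
print(ok)
```

If the check fails on the decoded value, that would suggest the wrong field was copied rather than the wrong encoding.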

Next is creating the project:

🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ lagoon add project --gitUrl git://github.com/department-of-veterans-affairs/va.gov-cms.git --openshift 1 --productionEnvironment lagoon-dev --branches "^(master|main|VACMS-6674.*)$"
Result: success
Project Name: lagoon-dev
GitURL: https://github.com/department-of-veterans-affairs/va.gov-cms.git

and it's visible upon login:

Screen Shot 2021-10-28 at 12 29 11 PM
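
A quick way to sanity-check the --branches pattern from the command above before relying on it. This assumes Lagoon treats the value as an ordinary regular expression matched against the branch name. The sample branch names, apart from master, main and VACMS-6674-lagoon, are invented.

```python
import re

BRANCHES = re.compile(r"^(master|main|VACMS-6674.*)$")

for branch in ["master", "main", "VACMS-6674-lagoon", "develop", "VACMS-6675-other"]:
    verdict = "deploy" if BRANCHES.match(branch) else "skip"
    print(f"{branch}: {verdict}")
```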

I added this deploy key:

Screen Shot 2021-10-28 at 12 32 46 PM

and deployed:

Screen Shot 2021-10-28 at 12 36 30 PM

but alas:

Screen Shot 2021-10-28 at 12 37 17 PM

Screen Shot 2021-10-28 at 12 37 25 PM

This might be failing because the logs pods are still in ImagePullBackOff:

Screen Shot 2021-10-28 at 12 38 07 PM

So it might be time for More Fun With Kubernetes™.

EDIT: Nope, just should've supplied a git:// URL instead of SSH. Sorry, hadn't read about the deploy key yet. Just making it up as I go.

That made it further:

Screen Shot 2021-10-28 at 12 47 20 PM

but without logs, my ability to figure out wut's going on is obv limited, so I probably need to fix the root issue there.


ndouglas commented Oct 28, 2021

Fun With Lagoon, Kubernetes, Docker, RDS, IDK What

So why are the logs (and only the logs) in ImagePullBackOff?

The first obstacle along the way is that kubectl stopped working. After some poking around, it appears that the same HTTP_PROXY env var that lets lagoon-cli work actually breaks kubectl EKS access. I'll press on, switching back and forth between Terminal.app tabs.
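
One way to avoid juggling tabs, sketched under the assumption that both tools only consult the environment they are started with: set HTTP_PROXY for the lagoon invocation alone and leave the parent shell (and kubectl) untouched. The wrapper names are hypothetical. The proxy address is the one used earlier in this thread.

```python
import os
import subprocess

SOCKS_PROXY = "socks5://127.0.0.1:2001/"


def env_with_proxy(base_env, proxy=SOCKS_PROXY):
    """Return a copy of base_env with HTTP_PROXY set; the original is not modified."""
    env = dict(base_env)
    env["HTTP_PROXY"] = proxy
    return env


def run_with_proxy(argv):
    """Run one command with the proxy variable set only for that child process."""
    return subprocess.run(argv, env=env_with_proxy(os.environ), check=False)


# Example (not executed here): run_with_proxy(["lagoon", "whoami"])
```

The shell equivalent is prefixing a single command with the variable assignment, which likewise scopes it to that one process.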

But now that I can kubectl, I can look at the failures a little more closely.

🔔nathan.douglas@Belmore:~/Projects/content-build$ kubectl get pods --all-namespaces | grep lagoon-build
lagoon-dev-master          lagoon-build-wl8fej                                            0/1     Error              0          24m
lagoon                     lagoon-remote-lagoon-build-deploy-bfb74bf4-mrf66               2/2     Running            0          28h
🔔nathan.douglas@Belmore:~/Projects/content-build$ kubectl describe pod -n lagoon-dev-master lagoon-build-wl8fej
Name:                 lagoon-build-wl8fej
Namespace:            lagoon-dev-master
<snip>
Events:
  Type    Reason     Age   From                                                      Message
  ----    ------     ----  ----                                                      -------
  Normal  Scheduled  24m   default-scheduler                                         Successfully assigned lagoon-dev-master/lagoon-build-wl8fej to ip-10-247-96-165.us-gov-west-1.compute.internal
  Normal  Pulling    24m   kubelet, ip-10-247-96-165.us-gov-west-1.compute.internal  Pulling image "uselagoon/kubectl-build-deploy-dind:latest"
  Normal  Pulled     24m   kubelet, ip-10-247-96-165.us-gov-west-1.compute.internal  Successfully pulled image "uselagoon/kubectl-build-deploy-dind:latest" in 1.14009574s
  Normal  Created    24m   kubelet, ip-10-247-96-165.us-gov-west-1.compute.internal  Created container lagoon-build
  Normal  Started    24m   kubelet, ip-10-247-96-165.us-gov-west-1.compute.internal  Started container lagoon-build
🔔nathan.douglas@Belmore:~/Projects/content-build$ kubectl logs -n lagoon-dev-master lagoon-build-wl8fej
Agent pid 33
Identity added: /home/.ssh/key (/home/.ssh/key)
+ set -eo pipefail
+ set -o noglob
+ REGISTRY=none.com
++ cat /var/run/secrets/kubernetes.io/serviceaccount/namespace
+ NAMESPACE=lagoon-dev-master
+ REGISTRY_REPOSITORY=lagoon-dev-master
++ cat /lagoon/version
+ LAGOON_VERSION=21.9.0
+ set +x
+ '[' false == true ']'
+ CI_OVERRIDE_IMAGE_REPO=
+ '[' branch == pullrequest ']'
+ /kubectl-build-deploy/scripts/git-checkout-pull.sh git://github.com/department-of-veterans-affairs/va.gov-cms.git origin/master
+ set -eo pipefail
+ REMOTE=git://github.com/department-of-veterans-affairs/va.gov-cms.git
+ REF=origin/master
+ git init .
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint: 
hint: 	git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint: 	git branch -m <name>
Initialized empty Git repository in /kubectl-build-deploy/git/.git/
+ git config remote.origin.url git://github.com/department-of-veterans-affairs/va.gov-cms.git
+ git fetch --depth=10 --tags --progress git://github.com/department-of-veterans-affairs/va.gov-cms.git '+refs/heads/*:refs/remotes/origin/*'
fatal: unable to connect to github.com:
github.com[0: 192.30.255.112]: errno=Operation timed out

Hmm. So it looks kinda like there's an outgoing networking issue.

When consulted, Eric and Elijah nodded sadly and explained that outgoing SSH requests are dropped by the TIC. And although GitHub can be SSH'ed to on port 443, this violates the spirit of TIC law and would get me yelled at.

Two solutions are:

  • to modify Lagoon to allow cloning over HTTPS -- which would probably take a while
  • run Lagoon Core in the Utility VPC rather than the Dev VPC -- which is what we wanted anyway, but Ops was leery about this because a lot of important things are running in the Utility cluster.

A decision on the latter probably isn't possible until Monday, so I'm kinda blocked here.

I think I'll go back and see if I can get HTTPS cloning to work. IDK why it wouldn't, but it didn't before.

EDIT: Yeah, no, definitely still doesn't work.
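
For reference, these are the remote URL shapes in play. This Python sketch only normalizes a GitHub-style remote to its HTTPS form. It says nothing about whether Lagoon's build scripts will accept the result, and the function name is made up.

```python
import re


def to_https_remote(url):
    """Rewrite git:// and SSH-style remotes to https://; leave HTTPS URLs alone."""
    if url.startswith("https://"):
        return url
    if url.startswith("git://"):
        return "https://" + url[len("git://"):]
    # ssh://git@host[:port]/owner/repo.git
    m = re.match(r"^ssh://(?:[^@/]+@)?([^/:]+)(?::\d+)?/(.+)$", url)
    if m:
        return f"https://{m.group(1)}/{m.group(2)}"
    # scp-like form: git@host:owner/repo.git (only when no scheme is present)
    if "://" not in url:
        m = re.match(r"^(?:[^@/]+@)?([^/:]+):(.+)$", url)
        if m:
            return f"https://{m.group(1)}/{m.group(2)}"
    raise ValueError(f"unrecognized remote: {url}")


print(to_https_remote("git://github.com/department-of-veterans-affairs/va.gov-cms.git"))
```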


ndouglas commented Nov 1, 2021

I'm blocked on moving much forward by the outgoing Git/SSH issue, but I can move forward with other things...

EFS

I created an EFS filesystem, dsva-vagov-lagoon-dev-cms-efs, and deployed the EFS provisioner for it with the following command:

helm upgrade --install --create-namespace --namespace lagoon-efs-provisioner -f efs-provisioner-values.yaml  lagoon-efs-provisioner stable/efs-provisioner

This created a storage class with the name lagoon-bulk. Easy enough.

That does nothing to unblock me with regard to Git/SSH, though, and I still need to figure out a couple things:

  • logging (this looks nontrivial)
  • file transfer from prod
  • database transfer from prod
  • ???

ENeal49 added the "Blocked" label Nov 1, 2021
ENeal49 closed this as completed Nov 1, 2021
ndouglas reopened this Nov 1, 2021

ndouglas commented Nov 1, 2021

The ops team, in office hours, confirmed our suspicions that this restriction on outbound SSH is pretty legit. As such, this PoC is blocked.

We have a number of options for moving forward (h/t Cameron for typing them up):

  • Press forward with Lagoon
    • Open port 22 outbound + appeal up the chain as necessary
    • Fix Lagoon to support HTTPS cloning (see this discussion)
      • Then maybe composer install issues
      • Definitely would need to fix CA trust in the build container so VA can MITM with us
    • Do we even need to worry about the dev/stage/prod vpcs? Why? Can we just run in util?
    • Move Lagoon builds to Utility then ship them off to the various VPCs?
  • Non Lagoon solutions?
    • Lots of options


ndouglas commented Nov 2, 2021

Roundabout Approaches

A few of the options above could actually be addressed. I've addressed them, sorta, and will discuss.

Modifying the Lagoon Build Deploy Image

No one has responded yet to my discussion thread about git cloning via HTTPS. However, even if they had, it wouldn't work because the DHS is MITMing the TLS.

I forked the Lagoon service images, rebuilt the kubectl image, pushed it to Docker Hub, modified the derivative kubectl-build-deploy-dind image to insert the cert, rebuilt the image, and pushed it to Docker Hub.

The second half of that was to actually alter the Lagoon configuration to use the new Docker image. I injected the override into the remote-values.yaml and updated the lagoon-remote deployment, but unfortunately the Keycloak pods went into ImagePullBackOff because we'd hit the Docker Hub pull limit.

Modifying the codebase

There are some changes that need to be made to the CMS codebase as part of a move to Lagoon. I made them in #6867, although I have no way of testing them.


ndouglas commented Nov 2, 2021

Per this exchange:

Screen Shot 2021-11-02 at 12 52 22 PM

There is no way to move forward with this PoC.


ndouglas commented Nov 3, 2021

So I got some responses on this discussion thread saying that although the build-deploy pipeline was implemented with Git/SSH in mind, that was mostly to accommodate GitHub deploy keys, and that there was no hard reason HTTPS cloning should not work.

Who's to blame?

The Tick

I mentioned that I'd injected the TIC TLS cert into a Docker container, at which point Toby pointed out that I was using an older Dockerfile -- a great catch which undoubtedly would save me some frustration.

So going back to Docker to build the new image:

#17 20.15   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#17 20.16                                  Dload  Upload   Total   Spent    Left  Speed
100 38.3M  100 38.3M    0     0  17.7M      0  0:00:02  0:00:02 --:--:-- 17.7M
#17 22.33   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#17 22.33                                  Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: github.com

I ran into this issue which seems to plague Docker for Mac. I don't want to upgrade Docker for Mac because that's caused issues with Lando in the past. Fortunately, I have about sixty LXC containers with Docker installed, so I'll just SSH into one of them and build the image and push it from there.

Well, then SSH is hanging. I can't SSH into any of said containers, or anything else on my network. SSH works with everything else on my network... except my work computer.

I attempted to find a solution for a few minutes, but being pressed for time I ended up just switching computers, SSHing into my work computer from my personal computer, grabbing the updated Dockerfile, then SSHing into an LXC container to build the kubectl and kubectl-build-deploy-dind images. After adding my SSH pub key for that machine to GitHub. And running docker login.

The CMS project's URL is HTTPS, so I can attempt to deploy the PR branch to see where my PR (see #6867) fails:

🔔nathan.douglas@Belmore:~/Projects/lagoon-stuff$ lagoon deploy branch -p lagoon-dev -b VACMS-6674-lagoon
✔ Yes
success

Now I can log into the Lagoon UI, because Keycloak is running again: it's no longer in ImagePullBackOff, which it had been stuck in because it didn't have Docker Hub credentials.

And:

Screen Shot 2021-11-03 at 8 17 16 AM

🎉

It's taking longer to fail than it has before. Which is, technically, progress.

🔔nathan.douglas@Belmore:~/Projects/va.gov-cms$ kubectl get pods --all-namespaces | grep lagoon-build

lagoon-dev-vacms-6674-lagoon   lagoon-build-vkn4ha                                            0/1     Error               0          3m12s
lagoon                         lagoon-remote-lagoon-build-deploy-758bd85997-vdkth             2/2     Running             0          18h
🔔nathan.douglas@Belmore:~/Projects/va.gov-cms$ kubectl logs -n lagoon-dev-vacms-6674-lagoon lagoon-build-vkn4ha
<snip>
HEAD is now at 040fe2b7 Fix webroot.
+ git submodule update --init --recursive --jobs=6
+ [[ -n '' ]]
+ '[' '!' -f .lagoon.yml ']'
++ cat .lagoon.yml
++ shyaml get-value environment_variables.git_sha false
+ INJECT_GIT_SHA=true
+ '[' true == true ']'
++ git rev-parse HEAD
+ LAGOON_GIT_SHA=040fe2b79034c9f31832b64bd9281d9188df7973
+ REGISTRY_SECRETS=()
+ PRIVATE_REGISTRY_COUNTER=0
+ PRIVATE_REGISTRY_URLS=()
+ PRIVATE_DOCKER_HUB_REGISTRY=0
+ PRIVATE_EXTERNAL_REGISTRY=0
+ set +x
User "lagoon/kubernetes.default.svc" set.
Cluster "kubernetes.default.svc" set.
Context "default/lagoon/kubernetes.default.svc" created.
Switched to context "default/lagoon/kubernetes.default.svc".
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Error response from daemon: Get https://harbor.lagoon-dev.cms.va.gov/v2/: x509: certificate is valid for *.ci.cms.va.gov, *.demo.cms.va.gov, *.tugboat.vfs.va.gov, tugboat.vfs.va.gov, not harbor.lagoon-dev.cms.va.gov

So something is requesting Harbor, but doing so via HTTPS and not HTTP. Since I've not specified HTTPS anywhere, this would appear to be an issue with a script somewhere.

Looking more closely, I don't see any commands issued after WARNING! Using --password via the CLI is insecure. Use --password-stdin., which is a Docker warning message. So I think the HTTPS error is from Docker attempting to log in to Harbor and failing to do so because of the certificate.

Why? Well, Docker requires some additional configuration for insecure registries -- configuration that I don't believe the kubectl-build-deploy-dind scripts perform. So I gotta do that.

The problem is that I think since this is built around Docker-in-Docker that we're using insecure registries as specified by the host, not by the container. So I think this might be doomed to fail.

<snip: I tried it anyway. It failed.>

So the only way to move forward at this point, AFAICT, is to add the insecure registry for Harbor to the /etc/docker/daemon.json file, add that to a custom AMI, and recreate the EKS cluster using that AMI.
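
For reference, the daemon.json change itself is small; the hard part is getting it onto the nodes. A hedged Python sketch of the edit, assuming the standard insecure-registries key in Docker's daemon configuration. The helper name is made up, and the example works on an in-memory dict rather than touching /etc/docker/daemon.json.

```python
import json


def add_insecure_registry(config, registry):
    """Return a copy of a daemon.json dict with registry listed under insecure-registries."""
    updated = dict(config)
    registries = list(updated.get("insecure-registries", []))
    if registry not in registries:
        registries.append(registry)
    updated["insecure-registries"] = registries
    return updated


config = add_insecure_registry({}, "harbor.lagoon-dev.cms.va.gov")
print(json.dumps(config, indent=2))
```

The helper is idempotent and preserves any other keys already in the file, which matters if it ends up in an AMI build script that may run more than once.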

So I'm blocked again.

ElijahLynn moved this from Backlog to Done in Platform CMS: Lagoon Mar 31, 2023

ndouglas commented Apr 7, 2023

LOL, I remember none of this.

Summarizing significant issues that I encountered in this PoC:

  1. Lagoon attempts to clone repos via SSH, but SSH outbound is blocked by the TIC. We need to keep Lagoon Core in the vagov-utility VPC and corresponding EKS cluster, which means outgoing connections transit the TIC. This means Lagoon needs to support HTTPS cloning.

  2. Lagoon's Docker image builds target Harbor, which is viewed as "insecure" from the perspective of EKS at this time, and we don't have authority to modify the relevant settings in the EKS cluster. So we would need either to request and justify settings modifications to support using Harbor insecurely, or to do the necessary work to make Harbor secure from the perspective of the EKS cluster.

Projects
No open projects
Status: Done
Development

No branches or pull requests

6 participants