Prod / dev divergence monitoring #315

Open
bodymindarts opened this issue Dec 7, 2021 · 20 comments

@bodymindarts
Member

bodymindarts commented Dec 7, 2021

The settings in the default monitoring/values.yml are currently somewhat out of date.
Many values are being overridden in production.

Ideally we would like to:

  • have fewer prod overrides (i.e. bring as many of the prod settings as possible into the defaults)
  • have a setup that verifiably works in the dev environment - in particular the blackbox exporter probes (which monitor the main backend) should work locally.

Here are the current production overrides - note that some values are injected via terraform templates (eg ${graphql_playground_url}) and it isn't obvious how best to set defaults for those. At least for the dev setup we should probably hard-code the values (one possible direction is sketched at the end of this comment).

prometheus:
  extraScrapeConfigs: |
    - job_name: 'prometheus-blackbox-exporter-noauth'
      metrics_path: /probe
      params:
        module: [buildParameters]
      static_configs:
        - targets:
          - ${graphql_playground_url}
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: monitoring-prometheus-blackbox-exporter:9115
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
    - job_name: 'prometheus-blackbox-exporter-auth'
      scrape_timeout: 30s
      metrics_path: /probe
      params:
        module: [walletAuth]
      static_configs:
        - targets:
          - ${graphql_playground_url}
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: monitoring-prometheus-blackbox-exporter:9115

  alerts:
    groups:
    - name: Ingress Controller
      rules:
      - alert: NGINXTooMany500s
        expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"5.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          description: Too many 5XXs
          summary: More than 5% of all requests returned 5XX
      - alert: NGINXTooMany400s
        expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"4.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          description: Too many 4XXs
          summary: More than 5% of all requests returned 4XX
    - name: ${instance_name}
      rules:
      - alert: PodRestart
        expr: increase(kube_pod_container_status_restarts_total{namespace=~'${galoy_namespace}|${bitcoin_namespace}'}[10m]) >= 2
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.container}} restarted too many times"
      - alert: PodStartupError
        for: 1m
        expr: kube_pod_container_status_waiting_reason{reason!="ContainerCreating",namespace=~'${galoy_namespace}|${bitcoin_namespace}'} == 1
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.container}} is unable to start"
      - alert: GraphqlIssue
        for: 3m
        expr: probe_success{job="prometheus-blackbox-exporter-mainnet"} == 0
        labels:
          severity: critical
        annotations:
          summary: "Graphql is down"
      - alert: GraphqlNoAuthIssue
        for: 3m
        expr: probe_success{namespace=~'${galoy_namespace}', job="prometheus-blackbox-exporter-noauth"} == 0
        labels:
          severity: critical
        annotations:
          summary: "Graphql is down"
  
  alertmanagerFiles:
    alertmanager.yml:
      global:
        pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
      route:
        group_wait: 10s
        group_interval: 10m
        receiver: slack
        repeat_interval: 6h
        routes:
        - receiver: slack-pagerduty
          matchers:
            - severity="critical"
          group_interval: 2m

prometheus-blackbox-exporter:
  secretConfig: true
  config:
    modules:
      buildParameters:
        prober: http
        timeout: 3s
        http:
          method: POST
          headers:
            Content-Type: application/json
          body: '{"query":"query buildParameters { buildParameters { id commitHash buildTime helmRevision minBuildNumberAndroid minBuildNumberIos lastBuildNumberAndroid lastBuildNumberIos }}","variables":{}}'
      walletAuth:
        prober: http
        timeout: 30s
        http:
          method: POST
          fail_if_body_matches_regexp:
            - "errors+"
          headers:
            Content-Type: application/json
          body: '{"query":"query gql_query_logged { prices { __typename id o } earnList { __typename id value completed } wallet { __typename id balance currency transactions { __typename id amount description created_at hash type usd fee feeUsd pending } } getLastOnChainAddress { __typename id } me { __typename id level username phone } maps { __typename id title coordinate { __typename latitude longitude } } nodeStats { __typename id } }","variables":{}}'

Another file containing sensitive information is also merged in:

prometheus:
  alertmanagerFiles:
    alertmanager.yml:
      global:
        slack_api_url: ${slack_api_url}
      receivers:
        - name: slack
          slack_configs:
          - channel: '#${slack_alerts_channel_name}'
            title: "{{range .Alerts}}{{.Annotations.summary}}\n{{end}}"
            send_resolved: true
        - name: slack-pagerduty
          pagerduty_configs:
          - service_key: ${pagerduty_service_key}
            send_resolved: true
          slack_configs:
          - channel: '#${slack_alerts_channel_name}'
            title: "{{range .Alerts}}{{.Annotations.summary}}\n{{end}}"
            send_resolved: true

prometheus-blackbox-exporter:
  config:
    modules:
      walletAuth:
        http:
          headers:
            Authorization: Bearer ${probe_auth_token}
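
For the dev setup, one direction could look roughly like the following sketch - the file paths, resource names and the localhost URL are assumptions for illustration, not how the module is currently wired. The idea is to render the same values template but feed it hard-coded dev values:

# illustrative only - names and paths below are assumed, not the current layout
locals {
  # hypothetical endpoint served by the local dev ingress
  graphql_playground_url = "https://localhost/graphql"
}

resource "helm_release" "monitoring" {
  name      = "monitoring"
  chart     = "../../charts/monitoring"
  namespace = "galoy-dev-monitoring"

  values = [
    templatefile("${path.module}/monitoring-values.yml.tmpl", {
      graphql_playground_url = local.graphql_playground_url
    })
  ]
}
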
@mjschmidt

mjschmidt commented Dec 10, 2021

I went ahead and stood up a quick VM in GCP and set up the environment (the setup instructions were reasonably easy to follow; the only minor snag was getting the right version of npm onto my Ubuntu machine).

I did notice, as I went to stand up the charts, that the terraform main file references a branch that may have been cleaned up recently.

make init
terraform init
Initializing modules...
Downloading git::https://github.com/GaloyMoney/galoy-infra.git?ref=fbafa3f for infra_services...
╷
│ Error: Failed to download module
│ 
│ Could not download module "infra_services" (main.tf:6) source code from "git::https://github.com/GaloyMoney/galoy-infra.git?ref=fbafa3f": error downloading 'https://github.com/GaloyMoney/galoy-infra.git?ref=fbafa3f': /usr/bin/git exited with
│ 128: Cloning into '.terraform/modules/infra_services'...
│ fatal: Remote branch fbafa3f not found in upstream origin
│ 
╵

╷
│ Error: Failed to download module
│ 
│ Could not download module "infra_services" (main.tf:6) source code from "git::https://github.com/GaloyMoney/galoy-infra.git?ref=fbafa3f": error downloading 'https://github.com/GaloyMoney/galoy-infra.git?ref=fbafa3f': /usr/bin/git exited with
│ 128: Cloning into '.terraform/modules/infra_services'...
│ fatal: Remote branch fbafa3f not found in upstream origin
│ 
╵

╷
│ Error: Failed to download module
│ 
│ Could not download module "infra_services" (main.tf:6) source code from "git::https://github.com/GaloyMoney/galoy-infra.git?ref=fbafa3f": error downloading 'https://github.com/GaloyMoney/galoy-infra.git?ref=fbafa3f': /usr/bin/git exited with
│ 128: Cloning into '.terraform/modules/infra_services'...
│ fatal: Remote branch fbafa3f not found in upstream origin

I removed the ref and instead pointed to the main branch of the project. Hopefully that will be okay; if not, it's easy to clean up the dev cluster and redeploy using the correct branch.

It looks like the main.tf file expects the honeycomb_api_key argument to be replaced by an argument called secrets; however, after replacing that argument I am encountering the following error. Unfortunately I am not familiar enough with the codebase - when I searched through the dev folder I only found four references to this honeycomb API key, and I am not sure what exactly it is used for beyond a few OpenTelemetry/Jaeger references in several files within the .terraform folder. I do know that the way I am setting that API key is not correct, and there isn't any documentation I can see on setting that variable correctly.

make deploy-services
terraform apply -target module.infra_services.helm_release.cert_manager -auto-approve
╷
│ Warning: Resource targeting is in effect
│ 
│ You are creating a plan with the -target option, which means that the result of this plan may not represent all of the changes requested by the current configuration.
│ 
│ The -target option is not for routine use, and is provided only for exceptional situations such as recovering from errors or mistakes, or when Terraform specifically suggests to use it as part of an error message.
╵
╷
│ Error: Error in function call
│ 
│   on .terraform/modules/infra_services/modules/services/variables.tf line 33, in locals:
│   33:   honeycomb_api_key           = jsondecode(var.secrets).honeycomb_api_key
│     ├────────────────
│     │ var.secrets has a sensitive value
│ 
│ Call to function "jsondecode" failed: invalid character 'd' looking for beginning of value.
╵
make: *** [Makefile:13: deploy-services] Error 1

The error is referring to the secrets = "dummy" key-value pair I added (a guess at a fix is at the end of this comment):

cat main.tf 
locals {
  name_prefix              = "galoy-dev"
  letsencrypt_issuer_email = "[email protected]"
}

module "infra_services" {
  source = "git::https://github.com/GaloyMoney/galoy-infra.git//modules/services"

  name_prefix              = local.name_prefix
  letsencrypt_issuer_email = local.letsencrypt_issuer_email
  local_deploy             = true
  cluster_endpoint         = "dummy"
  cluster_ca_cert          = "dummy"
  #honeycomb_api_key        = "dummy"
  secrets                  = "dummy"
}

module "bitcoin" {
  source = "./bitcoin"

  name_prefix = local.name_prefix
}

module "galoy" {
  source = "./galoy"

  name_prefix = local.name_prefix

  depends_on = [
    module.bitcoin
  ]
}

module "monitoring" {
  source = "./monitoring"

  name_prefix = local.name_prefix
}

module "addons" {
  source = "./addons"

  name_prefix = local.name_prefix

  depends_on = [
    module.galoy
  ]
}

provider "kubernetes" {
  experiments {
    manifest_resource = true
  }

You can view the one line change I made here:
https://github.com/mjschmidt/charts-1/tree/prometheus-updates
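
Looking at the variables.tf line in the error, var.secrets apparently gets passed straight into jsondecode, so my guess (untested on my side) is that it needs to be a JSON-encoded map rather than a bare string, i.e. something like:

  # guess: hand the module a JSON-encoded map so jsondecode(var.secrets) can parse it
  secrets = jsonencode({
    honeycomb_api_key = "dummy"
  })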

@bodymindarts
Member Author

For the terraform module download issue: you ran into hashicorp/terraform#30119.

It only affects the latest version of terraform - we had the same problem in our CI.

@bodymindarts
Member Author

As to the secrets - yeah, that's a recent refactoring that we haven't updated in dev yet. Sorry for that as well. Will fix.

@bodymindarts
Member Author

bodymindarts commented Dec 10, 2021

Alright - I've made a commit and verified that the following sequence brings up a dev environment for me locally:

 $ terraform --version
Terraform v1.0.11
on darwin_amd64

Your version of Terraform is out of date! The latest version
is 1.1.0. You can update by downloading from https://www.terraform.io/downloads.html
$ cd dev
$ make init
$ make create-cluster
$ make deploy-services
$ make deploy

Make sure you aren't using the latest terraform, as it has a bug when referencing GitHub-hosted modules.
Sorry you ran into this issue.
Not quite sure what you're using a GCP VM for - we run k8s locally for dev with https://k3d.io/.
If you get stuck again just LMK.

@mjschmidt

mjschmidt commented Dec 11, 2021

Rgr.
Finding issues is part of the development process :)

I'll go ahead and wipe my changes, downgrade my terraform, and get started with this over the weekend.
Thank you!

And I'm using the GCP VM as my local development environment.

@mjschmidt

mjschmidt commented Dec 11, 2021

Sorry, one (or two) last problems with respect to the services portion of the makefile that deploys the galoy charts to the k3s cluster.

It looks like the services I have the least experience with (lightning - I run my own node, but I have only started looking at lnd) are not starting up correctly:

$ kubectl get pods --all-namespaces

NAMESPACE              NAME                                                      READY   STATUS              RESTARTS   AGE
kube-system            local-path-provisioner-5ff76fc89d-s78vc                   1/1     Running             0          41m
galoy-dev-otel         opentelemetry-collector-86c75cdfb9-6456l                  1/1     Running             0          41m
kube-system            coredns-7448499f4d-zpmxm                                  1/1     Running             0          41m
kube-system            metrics-server-86cbb8457f-zrgzs                           1/1     Running             0          41m
galoy-dev-ingress      ingress-nginx-controller-6c9594575f-b2w89                 1/1     Running             0          41m
galoy-dev-ingress      cert-manager-cainjector-856d4df858-589hz                  1/1     Running             0          41m
galoy-dev-ingress      cert-manager-66b6d6bf59-6mvq7                             1/1     Running             0          41m
galoy-dev-ingress      cert-manager-webhook-6d94f58b9-24sn8                      1/1     Running             0          41m
galoy-dev-monitoring   monitoring-prometheus-blackbox-exporter-76778c4bd-2rbjq   1/1     Running             0          40m
galoy-dev-monitoring   monitoring-prometheus-node-exporter-ncrcd                 1/1     Running             0          40m
galoy-dev-monitoring   monitoring-kube-state-metrics-5ffd577c76-kznhl            1/1     Running             0          40m
galoy-dev-bitcoin      bitcoind-0                                                2/2     Running             0          40m
galoy-dev-bitcoin      lnd1-lndmon-7d9b9cd896-srvnn                              0/1     ContainerCreating   0          39m
galoy-dev-monitoring   monitoring-prometheus-alertmanager-7894858c4d-frz5d       1/1     Running             0          40m
galoy-dev-monitoring   monitoring-grafana-55d5759948-m962t                       1/1     Running             0          40m
galoy-dev-monitoring   monitoring-prometheus-server-0                            2/2     Running             0          40m
galoy-dev-bitcoin      lnd1-0                                                    2/3     CrashLoopBackOff    11         39m

For the lnd daemon, this looks like a secret that may need to be pre-deployed prior to startup - you mentioned that vault was being used, but I do not see a vault as part of the deployment?

Events:
  Type     Reason       Age                   From               Message
  ----     ------       ----                  ----               -------
  Normal   Scheduled    36m                   default-scheduler  Successfully assigned galoy-dev-bitcoin/lnd1-lndmon-7d9b9cd896-srvnn to k3d-k3s-default-server-0
  Warning  FailedMount  29m (x2 over 34m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[lnd-tls lnd-macaroons], unattached volumes=[lnd-tls lnd-macaroons kube-api-access-vlz77]: timed out waiting for the condition
  Warning  FailedMount  20m                   kubelet            Unable to attach or mount volumes: unmounted volumes=[lnd-tls lnd-macaroons], unattached volumes=[kube-api-access-vlz77 lnd-tls lnd-macaroons]: timed out waiting for the condition
  Warning  FailedMount  15m (x3 over 31m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[lnd-macaroons lnd-tls], unattached volumes=[lnd-macaroons kube-api-access-vlz77 lnd-tls]: timed out waiting for the condition
  Warning  FailedMount  9m32s (x21 over 36m)  kubelet            MountVolume.SetUp failed for volume "lnd-tls" : secret "lnd1-credentials" not found
  Warning  FailedMount  5m28s (x23 over 36m)  kubelet            MountVolume.SetUp failed for volume "lnd-macaroons" : secret "lnd1-credentials" not found

For the lnd container in the lnd1-0 pod, this looks like it could be because the container above never starts up? I am not sure exactly how that is working yet:

$ kubectl -n galoy-dev-bitcoin  logs lnd1-0 lnd

2021-12-11 19:01:00.752 [ERR] RPCS: [/lnrpc.Lightning/GetInfo]: wallet locked, unlock it to enable full RPC access
2021-12-11 19:01:01.049 [INF] CHRE: Primary chain is set to: bitcoin
2021-12-11 19:01:01.253 [ERR] RPCS: [/lnrpc.Lightning/GetInfo]: the RPC server is in the process of starting up, but not yet ready to accept calls
2021-12-11 19:01:01.353 [INF] LNWL: Started listening for bitcoind block notifications via ZMQ on 10.43.162.211:28332
2021-12-11 19:01:01.355 [INF] LNWL: Started listening for bitcoind transaction notifications via ZMQ on 10.43.162.211:28333
unable to create wallet controller: missing address manager namespace
2021-12-11 19:01:01.360 [ERR] LTND: unable to create chain control: missing address manager namespace
2021-12-11 19:01:01.361 [INF] LTND: Shutdown complete

unable to create chain control: missing address manager namespace

I am still interested in learning how these work, but I should be able to update the monitoring section without getting these specific containers running on my system. I can go ahead and update some of these values; I just may have a hard time verifying the changes, depending on how the galoy-specific services interact with the bitcoind and lnd setup.

I went ahead and reviewed the values files and am getting started replacing some of the overrides that seem obvious in terms of bringing dev and prod closer together, but I am going to be conservative since I am a new contributor.

More interestingly -
I thought about the terraform default variables, which is funny because the project I am currently on was also discussing the best way to handle nested variables inside of helm charts for the Prometheus operator. That would be great because things like your alert groups, rules, routes, etc. could all be handled through terraform overrides before being deployed as custom resources right into Kubernetes.
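
Purely as an illustration (and assuming the prometheus-operator CRDs were installed, which is not the case with the community prometheus chart used here), an alert group like the NGINX one above could then be templated from terraform and applied as a custom resource via the kubernetes provider's manifest resource:

resource "kubernetes_manifest" "ingress_alerts" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "ingress-alerts"
      namespace = "galoy-dev-monitoring"
    }
    spec = {
      groups = [{
        name = "Ingress Controller"
        rules = [{
          alert  = "NGINXTooMany500s"
          expr   = "100 * ( sum( nginx_ingress_controller_requests{status=~\"5.+\"} ) / sum(nginx_ingress_controller_requests) ) > 5"
          for    = "1m"
          labels = { severity = "critical" }
        }]
      }]
    }
  }
}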

Back to the ${graphql_playground_url} mention: it looks like the first link below - explicitly setting a default - could be what you're looking for:
https://discuss.hashicorp.com/t/how-to-set-a-default-value-for-a-variable-in-a-terraform-template/12817
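
i.e. (rough sketch, variable name assumed) declare the template input as a variable with a default, so prod can still override it via tfvars while dev just falls back to the default:

variable "graphql_playground_url" {
  description = "Endpoint probed by the blackbox exporter"
  # assumed dev default - prod would set its real URL in a tfvars file
  default     = "https://localhost/graphql"
}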

I also found that helm has added a tpl function, which is not particularly useful in this specific instance since Galoy is leveraging the community prom charts:
https://helm.sh/docs/howto/charts_tips_and_tricks/#using-the-tpl-function

I wonder if it could allow you to break your terraform values out into a conf file in cases where the team is writing and maintaining its own helm charts?

@bodymindarts
Member Author

Cool inputs - thanks for the ideas!

Please treat the failing LND as out of scope here. lndmon will always fail if lnd is failing, and it'll be tricky to debug your setup via this ticket. It shouldn't impact the monitoring-related work.

@mjschmidt

Okay, I had to up the size of my VM to get enough resources to run all the software. It looks like half of the software in the galoy-dev-galoy namespace is expecting a secret (kubectl get secret lnd1-credentials -o jsonpath='{.data}'), and unfortunately that secret is empty.

Perhaps another secret handled by vault?
I am sorry - I feel like I am wasting your time with this. I turned the bitcoin module back on and tried to deploy that way, and I changed the dependency of the galoy module to monitoring instead, in case it was the bitcoin module that populated that secret, but no dice.

@bodymindarts
Member Author

bodymindarts commented Dec 12, 2021

There is currently no vault in the stack. I may have mentioned it as something we want to do.

The secret that is missing gets populated by a container that needs access to values that get generated during startup: https://github.com/GaloyMoney/charts/blob/main/charts/lnd/templates/export-secrets-configmap.yaml#L8
It's part of initially bringing up lnd and it's a bit fragile under certain conditions.

It probably didn't work because of the limited resources. Try again from a clean state with enough resources available.

@krtk6160
Member

@mjschmidt You can also do kubectl logs lnd1-0 -c export-secrets to see the problem with the export-secrets container

@mjschmidt

@krtk6160 this makes a lot of sense: because lnd1-0 is not starting, the export-secrets container can't contact the lnd container inside the pod, which means the export-secrets container never sets the secrets for the galoy banking services.

I was able to verify this in the logs as well:

[lncli] rpc error: code = Unknown desc = wallet locked, unlock it to enable full RPC access
[lncli] rpc error: code = Unavailable desc = transport is closing
[lncli] rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:10009: connect: connection refused"

I could probably hack around it by editing the lnd ConfigMap, or do it properly by editing the terraform/values file for lnd, since everything seems to be set via the lnd.conf file in the lnd1 ConfigMap. But I would need to know the key-value pair lnd is looking for in terms of configuration. Any ideas? I assume it's the key for whatever sets the address manager namespace.

2021-12-13 01:03:20.416 [INF] CHRE: Primary chain is set to: bitcoin
2021-12-13 01:03:20.516 [ERR] RPCS: [/lnrpc.Lightning/GetInfo]: the RPC server is in the process of starting up, but not yet ready to accept calls
2021-12-13 01:03:20.715 [INF] LNWL: Started listening for bitcoind block notifications via ZMQ on 10.43.31.86:28332
2021-12-13 01:03:20.716 [INF] LNWL: Started listening for bitcoind transaction notifications via ZMQ on 10.43.31.86:28333
unable to create wallet controller: missing address manager namespace
2021-12-13 01:03:20.720 [ERR] LTND: unable to create chain control: missing address manager namespace
2021-12-13 01:03:20.720 [INF] LTND: Shutdown complete

unable to create chain control: missing address manager namespace

@mjschmidt

mjschmidt commented Jan 1, 2022

@krtk6160 sorry to ping you again - I was wondering if you knew what part of the code I could look at to find the answer to this.

It looks like a secret wasn't getting mounted, so I went into the bitcoin-values yaml and turned secret creation on. It didn't appear to be a port problem when I looked at how bitcoind and lnd are supposed to interact over port 18443.

@mjschmidt

So it looks like if you're using the terraform you cannot turn secret creation on in the bitcoin chart.

@krtk6160
Member

krtk6160 commented Jan 2, 2022

So it looks like if you're using the terraform you cannot turn secret creation on in the bitcoin chart.

The reason we have secrets.create=false is that we create the secret via terraform.
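
Roughly like the following on the terraform side (sketch only - the real resource names, namespace and values differ):

# illustrative sketch of creating a chart-consumed secret from terraform
resource "random_password" "bitcoind_rpcpassword" {
  length  = 32
  special = false
}

resource "kubernetes_secret" "bitcoind_rpcpassword" {
  metadata {
    name      = "bitcoind-rpcpassword"
    namespace = "galoy-dev-bitcoin"
  }

  data = {
    password = random_password.bitcoind_rpcpassword.result
  }
}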

@mjschmidt

Right, right, I got that - I am just unsure why my "lnd1-credentials" secret is not being created. From what I gather off Reddit, I think my lnd daemon not starting up is what is preventing my lnd container from being able to connect to bitcoind, while the lnd daemon is not starting due to a missing secret.

@mjschmidt

I was hoping that turning on the bitcoin "create secret" would be what created the lnd secret as well.

@bodymindarts
Member Author

@mjschmidt it would be interesting to verify whether or not you can run the setup locally with k3d as intended. If it works, you have a working example to compare with your remote VM setup - if not, then we can more easily help you debug, as you'd be running it on a setup we intend to support and can try to reproduce.

@mjschmidt

I went ahead and attempted the galoy stack on my local machine (I was worried I would have problems on Windows 10) and unfortunately got the same error that I had with the Ubuntu setup, with the lnd1-credentials kubernetes secret halting the deployment.

On the bright side, while it didn't take very long, it was a good opportunity to get docker set up on my local machine, as I rarely use Windows for development (mostly Linux).

@mjschmidt

kubectl describe pod/lnd1-lndmon-7d9b9cd896-r6c6k
...
Events:
  Type     Reason       Age                   From               Message
  ----     ------       ----                  ----               -------
  Normal   Scheduled    4m50s                 default-scheduler  Successfully assigned galoy-dev-bitcoin/lnd1-lndmon-7d9b9cd896-r6c6k to k3d-k3s-default-server-0
  Warning  FailedMount  2m47s                 kubelet            Unable to attach or mount volumes: unmounted volumes=[lnd-tls lnd-macaroons], unattached volumes=[lnd-tls lnd-macaroons kube-api-access-lxqpr]: timed out waiting for the condition
  Warning  FailedMount  40s (x10 over 4m50s)  kubelet            MountVolume.SetUp failed for volume "lnd-tls" : secret "lnd1-credentials" not found
  Warning  FailedMount  40s (x10 over 4m50s)  kubelet            MountVolume.SetUp failed for volume "lnd-macaroons" : secret "lnd1-credentials" not found
  Warning  FailedMount  30s                   kubelet            Unable to attach or mount volumes: unmounted volumes=[lnd-tls lnd-macaroons], unattached volumes=[kube-api-access-lxqpr lnd-tls lnd-macaroons]: timed out waiting for the condition

@bodymindarts
Member Author

@mjschmidt just FYI, we've also seen issues bringing up lnd locally since switching images from lncm to the Lightning Labs curated one. Will let you know when it's sorted out.
