Thanos failing to query data due to memcached #797

Closed
tumido opened this issue May 3, 2021 · 15 comments

@tumido
Member

tumido commented May 3, 2021

Grafana reporting error:

<html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html>

Thanos logs:

level=error ts=2021-05-03T15:10:51.072919873Z caller=handler.go:331 component=receive component=receive-handler err= msg="internal server error"

Thanos shard logs:

level=warn ts=2021-05-03T15:01:17.170470941Z caller=memcached_client.go:382 msg="failed to fetch items from memcached" numKeys=1 firstKey=attrs:01F4J99JM1SB8WQZPN1W6G42N6/chunks/000001 err="read tcp 10.130.3.22:58630->10.130.2.59:11211: i/o timeout"

Memcached logs:

Failed to write, and not due to blocking: Broken pipe
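
The Thanos lines above are in logfmt (`key=value` pairs, with quoted values where spaces occur). A minimal sketch for pulling out the fields when triaging many such lines follows. The helper names are hypothetical, and the rule for what counts as a "memcached timeout" is an assumption based only on the store shard warning quoted above.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split a logfmt line into a dict; quoted values keep their spaces.

    If a key appears twice (as `component` does in the receive log above),
    the last occurrence wins.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip bare words that are not key=value pairs
            fields[key] = value
    return fields


def is_memcached_timeout(fields: dict) -> bool:
    """Assumed rule: the message mentions memcached and the error is an i/o timeout."""
    return "memcached" in fields.get("msg", "") and "i/o timeout" in fields.get("err", "")


line = (
    "level=warn ts=2021-05-03T15:01:17.170470941Z caller=memcached_client.go:382 "
    'msg="failed to fetch items from memcached" numKeys=1 '
    "firstKey=attrs:01F4J99JM1SB8WQZPN1W6G42N6/chunks/000001 "
    'err="read tcp 10.130.3.22:58630->10.130.2.59:11211: i/o timeout"'
)
fields = parse_logfmt(line)
print(fields["firstKey"], is_memcached_timeout(fields))
```

Counting matches per `firstKey` or per time window would show whether the timeouts cluster around particular blocks or particular queries.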
@tumido
Member Author

tumido commented May 3, 2021

Extracted from: https://github.com/operate-first/SRE/issues/280

@tumido
Member Author

tumido commented May 3, 2021

This is a separate issue from 280. Trying to upscale memcached pod.

@tumido
Member Author

tumido commented May 3, 2021

* failed to execute action (CreateOrUpdate): discovering resource information failed for  in : groupVersion shouldn't be empty
* failed to execute action (CreateOrUpdate): Deployment.apps "opf-observatorium-thanos-query-frontend" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"query-cache", "app.kubernetes.io/instance":"opf-observatorium", "app.kubernetes.io/name":"thanos-query-frontend"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
* failed to execute action (CreateOrUpdate): StatefulSet.apps "opf-observatorium-thanos-rule" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
* failed to execute action (CreateOrUpdate): StatefulSet.apps "opf-observatorium-thanos-compact" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
* failed to execute action (CreateOrUpdate): Deployment.apps "opf-observatorium-loki-query-frontend" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"query-frontend", "app.kubernetes.io/instance":"observatorium-xyz", "app.kubernetes.io/name":"loki", "app.kubernetes.io/part-of":"observatorium"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
* failed to execute action (CreateOrUpdate): StatefulSet.apps "opf-observatorium-thanos-store-shard-0" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
* failed to execute action (CreateOrUpdate): Deployment.apps "opf-observatorium-thanos-receive-controller" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"kubernetes-controller", "app.kubernetes.io/instance":"opf-observatorium", "app.kubernetes.io/name":"thanos-receive-controller"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
* failed to execute action (CreateOrUpdate): Deployment.apps "opf-observatorium-observatorium-api" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"api", "app.kubernetes.io/instance":"observatorium-xyz", "app.kubernetes.io/name":"observatorium-api", "app.kubernetes.io/part-of":"observatorium"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
* failed to execute action (CreateOrUpdate): Deployment.apps "opf-observatorium-thanos-query" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"query-layer", "app.kubernetes.io/instance":"opf-observatorium", "app.kubernetes.io/name":"thanos-query"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
* failed to execute action (CreateOrUpdate): StatefulSet.apps "opf-observatorium-loki-querier" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
* failed to execute action (CreateOrUpdate): Deployment.apps "opf-observatorium-loki-distributor" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/component":"distributor", "app.kubernetes.io/instance":"observatorium-xyz", "app.kubernetes.io/name":"loki", "app.kubernetes.io/part-of":"observatorium", "loki.grafana.com/gossip":"true"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
* failed to execute action (CreateOrUpdate): StatefulSet.apps "opf-observatorium-loki-ingester" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
* failed to execute action (CreateOrUpdate): StatefulSet.apps "opf-observatorium-thanos-receive-default" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
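
The errors above fall into two groups: a Deployment's `spec.selector` is immutable, and a StatefulSet accepts updates only to the fields named in the message. A rough pre-flight check for a planned upgrade could be sketched as below. The function names and the sample specs are hypothetical, and the set of mutable fields is taken from the error text quoted above rather than from any particular Kubernetes version.

```python
# Fields the quoted error message says may change on an existing StatefulSet.
MUTABLE_STATEFULSET_FIELDS = {"replicas", "template", "updateStrategy"}


def forbidden_statefulset_changes(live_spec: dict, desired_spec: dict) -> list:
    """Return the top-level spec keys that differ and are not in the mutable set."""
    keys = set(live_spec) | set(desired_spec)
    return sorted(
        key
        for key in keys
        if key not in MUTABLE_STATEFULSET_FIELDS
        and live_spec.get(key) != desired_spec.get(key)
    )


def selector_changed(live_spec: dict, desired_spec: dict) -> bool:
    """A Deployment's spec.selector is immutable, so any difference is rejected."""
    return live_spec.get("selector") != desired_spec.get("selector")


# Hypothetical, heavily trimmed specs for illustration only.
live = {"replicas": 1, "serviceName": "store", "volumeClaimTemplates": [{"storage": "10Gi"}]}
desired = {"replicas": 3, "serviceName": "store", "volumeClaimTemplates": [{"storage": "50Gi"}]}
print(forbidden_statefulset_changes(live, desired))  # ['volumeClaimTemplates']
```

A real comparison would have to account for fields the API server fills in by default, so a naive diff of a manifest against the live object can report false positives. When a forbidden field does differ, the object has to be deleted and recreated instead of updated in place.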

@tumido
Member Author

tumido commented May 3, 2021

The currently deployed Observatorium version doesn't provide a way to raise the default memory limit on memcached, so we had to upgrade Observatorium to the latest image.

The upgrade wasn't smooth: we had to delete the PVCs and add anyuid to the containers.

@tumido
Member Author

tumido commented May 3, 2021

Memcached still OOMKilled when a big query is run.

@tumido tumido reopened this May 3, 2021
@HumairAK
Member

HumairAK commented May 3, 2021

So we're seeing issues on 3 fronts:

  1. We are seeing 2 types of errors in odh-prometheus: 409 Conflict errors (not sure where these are coming from) and 500 Internal Server errors, likely due to failures occurring in Observatorium (see below).
  2. Memcached in opf-observatorium continues to max out on memory and crash. It doesn't seem to survive stress tests (and I use the word stress loosely here) very well; at 1GiB of memory it crashes.
  3. Thanos Compact continues to fail its liveness/readiness probes. We suspect this may be due to a short timeout of 1s, though we are not certain.
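
One way to test the probe-timeout suspicion in item 3 is to time the health endpoint directly and compare the latency with the 1s budget. The sketch below is an assumption-laden illustration: the `probe` helper is hypothetical, and it treats any status from 200 to 399 as success, which is how Kubernetes HTTP probes judge a response.

```python
import time
import urllib.request


def probe(url: str, timeout_s: float = 1.0):
    """GET url once and return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 400
    except OSError:
        # Covers timeouts, refused connections and HTTP error statuses
        # (urllib's URLError and HTTPError are OSError subclasses).
        ok = False
    return ok, time.monotonic() - start
```

Run from inside the pod's network, something like `probe("http://localhost:10902/-/ready", timeout_s=1.0)` repeated over a few minutes would show how often the compactor misses a 1s deadline; the port and path here are assumptions and should be checked against the probe definition in the deployed manifest. If successful responses regularly take close to or over a second, raising the probe's `timeoutSeconds` is the likely fix.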

@tumido
Member Author

tumido commented May 3, 2021

Prometheus errors:

ts=2021-05-03T19:13:10.535Z caller=dedupe.go:112 component=remote level=warn remote_name=645dc0 url=http://opf-observatorium-thanos-receive.opf-observatorium.svc.cluster.local:19291/api/v1/receive msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: "
ts=2021-05-03T19:13:22.845Z caller=dedupe.go:112 component=remote level=error remote_name=645dc0 url=http://opf-observatorium-thanos-receive.opf-observatorium.svc.cluster.local:19291/api/v1/receive msg="non-recoverable error" count=494 err="server returned HTTP status 409 Conflict: conflict"
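
The two log lines show different handling per status class: the 500 is logged as "Failed to send batch, retrying", while the 409 is logged as a "non-recoverable error" for 494 samples. A minimal sketch of that split is below. The function is hypothetical and only mirrors what these logs show (5xx retried, other non-2xx dropped); special cases such as 429 are left out.

```python
def classify_remote_write_status(status: int) -> str:
    """Map an HTTP status from the receive endpoint to the sender's reaction."""
    if 200 <= status < 300:
        return "ok"
    if 500 <= status < 600:
        return "retry"  # batch stays queued and is sent again
    return "drop"  # treated as non-recoverable, samples are discarded


for status in (500, 409):
    print(status, classify_remote_write_status(status))
# 500 retry
# 409 drop
```

The practical consequence is that the 500s add load through retries while the 409s mean data is being lost, so the two need to be tracked separately.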

@HumairAK
Member

HumairAK commented May 3, 2021

Also it seems like the images being specified in the observatorium CR are not being respected.

@tumido
Member Author

tumido commented May 4, 2021

Similar issues being handled elsewhere:

Takeaway: lower the memcached memory limit so the store is not hammered by big queries, and avoid big queries.

We need to reach out to upstream to help us tune the setup. Maybe we're missing some rollup/downsampling settings somewhere.

@4n4nd 4n4nd self-assigned this May 6, 2021
@4n4nd

4n4nd commented May 7, 2021

> Also it seems like the images being specified in the observatorium CR are not being respected.

I have created an issue upstream for this: observatorium/operator#67

@4n4nd

4n4nd commented May 7, 2021

For the time being, I have scaled down the Observatorium Operator and manually updated the deployments to use the correct version of the image (quay.io/thanos/thanos:v0.20.1).

@HumairAK
Member

HumairAK commented Jul 7, 2021

can this be closed @4n4nd ?

@4n4nd

4n4nd commented Jul 7, 2021

the default image issue is still there in the operator, so let's keep this issue open to track it.

@HumairAK
Member

HumairAK commented Oct 7, 2021

@4n4nd is this issue still relevant? can it be closed? iirc you mentioned you'll be making some changes to how we deploy observatorium for smaug.

@4n4nd

4n4nd commented Oct 7, 2021

yeah we can close this

@4n4nd 4n4nd closed this as completed Oct 7, 2021
@durandom durandom transferred this issue from operate-first/operations Sep 6, 2022