Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug-1898341, bug-1898345: add "host" tag, fix metrics key prefix #6623

Merged
merged 2 commits into from
May 24, 2024

Conversation

willkg
Copy link
Contributor

@willkg willkg commented May 23, 2024

This adds a "host" tag to emitted metrics when running in GCP. This is derived from "HOSTNAME" if it exists, otherwise it defaults to socket.gethostname() like our other services.

This changes Sentry and logging to use HOSTNAME configuration variable rather than HOST_ID. This brings us in line with other services as we migrate to GCP.

This also adds "socorro." prefix to all emitted keys, but only for the GCP environments so we don't make a major change to socorro running in AWS at this time. This brings keys in line with our other services.

In order to do this, I had to create a singleton METRICS in socorro/libmarkus.py and then rework everything to use that.

To test:

  1. run the tests--they have assertions about keys and tags
  2. run socorro, process crashes, make sure metrics are emitted and look correct
    • processor, webapp, and crontabber metrics keys are NOT prefixed with socorro
    • keys do NOT include a host tag
  3. set CLOUD_PROVIDER=GCP, run socorro, process crashes, make sure metrics are emitted and look correct
    • processor, webapp, and crontabber metrics keys are prefixed with socorro
    • keys do include a host tag

@@ -12,19 +12,19 @@

LOGGING_LEVEL = CONFIG("LOGGING_LEVEL", "INFO")
LOCAL_DEV_ENV = CONFIG("LOCAL_DEV_ENV", False, cast=bool)
HOST_ID = socket.gethostname()
HOSTNAME = CONFIG("HOSTNAME", default=socket.gethostname())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're switching from HOST_ID to HOSTNAME everywhere. It's interesting that this didn't originally pull the HOST_ID from the environment--it only used socket.gethostname().

def filter(self, record):
record.host_id = HOST_ID
record.hostname = HOSTNAME
return True
Copy link
Contributor Author

@willkg willkg May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we've switched to GCP, we can remove AddHostname because we won't need it anymore. I wrote up a bug for that:

https://bugzilla.mozilla.org/show_bug.cgi?id=1898596

Copy link
Member

@relud relud May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why won't we need it anymore?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because logs in gcp are already tagged with pod and container name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we did Eliot, I remember that Harold and I talked about this but I don't remember what we covered. I had this bug where I removed the logging filter from Eliot logging:

https://bugzilla.mozilla.org/show_bug.cgi?id=1815981

It doesn't talk about why, though.

Eliot app log entries look like this:

{
  "insertId": "bk16cbjhu33my4va",
  "jsonPayload": {
    "Logger": "eliot",
    "Fields": {
      "processname": "webapp",
      "msg": "symbolicate/v5: jobs: 6, symbols: 139, time: 0.9378394290106371"
    },
    "Hostname": "eliot-b967b84dc-grnhh",
    "Type": "eliot.symbolicate_resource",
    "EnvVersion": "2.0",
    "Pid": 865,
    "Timestamp": 1716566432197194000,
    "Severity": 6
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
      "namespace_name": "symbols-prod",
      "pod_name": "eliot-b967b84dc-grnhh",
      "project_id": "moz-fx-webservices-low-prod",
      "location": "us-west1",
      "container_name": "eliot",
      "cluster_name": "webservices-low-prod"
    }
  },
  "timestamp": "2024-05-24T16:00:32.197680883Z",
  "severity": "INFO",
  "labels": {
    "compute.googleapis.com/resource_name": "gke-webservices-low--c2-prod-1-202311-c9515c07-d2ci",
    "k8s-pod/app_kubernetes_io/component": "eliot-data",
    "k8s-pod/pod-template-hash": "b967b84dc",
    "k8s-pod/env_code": "prod",
    "k8s-pod/app_kubernetes_io/name": "eliot"
  },
  "logName": "projects/moz-fx-webservices-low-prod/logs/stdout",
  "receiveTimestamp": "2024-05-24T16:00:34.452227485Z"
}

That includes a "hostname" field as well as a pod_name and container_name.

The "hostname" field gets added by the python-dockerflow mozlog formatter:

https://github.com/mozilla-services/python-dockerflow/blob/ba6936bf345f87aba0d96a9f63399e2a3aa395d3/src/dockerflow/logging.py#L115

@@ -98,8 +96,6 @@ def __init__(
self.bucket = bucket
self.dump_file_suffix = dump_file_suffix

self.metrics = markus.get_metrics(metrics_prefix)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This never got used in the crashstorage classes except for ESCrashStorage. I removed it from everywhere and for ESCrashStorage I did something different.

self.metrics = markus.get_metrics(metrics_prefix)
# Create a MetricsInterface that includes the base prefix plus the prefix passed
# into __init__
self.metrics = markus.get_metrics(build_prefix(METRICS.prefix, metrics_prefix))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The METRICS.prefix will be either "" or "socorro" depending on the cloud environment. build_prefix will append the crash storage metrics prefix (i.e. "processor.es" here) to that and that results in the prefix it uses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote up a bug in Markus to make this a little easier. It'd be nice to hone a MetricsInterface iteratively. Then we wouldn't need this wonky looking thing.

willkg/markus#142

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this causes the host tag to be lost for socorro.processor.es.* metrics in GCP. For example i got these two log lines when testing, the first with host tag working for gcs storage, the second with host tag missing for es:

processor-1      | 2024-05-24 15:20:13,012 INFO - processor - markus - Thread-2 - METRICS|2024-05-24 15:20:13|timing|socorro.processor.storage.save_processed_crash|12.225023994687945|#host:160ce9ffa6cc
processor-1      | 2024-05-24 15:20:13,039 INFO - processor - markus - Thread-1 - METRICS|2024-05-24 15:20:13|histogram|socorro.processor.es.crash_document_size|4454|

Copy link
Member

@relud relud May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, this only breaks the host tag for metrics emitted in this file, metrics emitted in the base class still get the host tag, for example the first metric here has no host tag, but the second does have it:

processor-1      | 2024-05-24 15:20:13,148 INFO - processor - markus - Thread-2 - METRICS|2024-05-24 15:20:13|histogram|socorro.processor.es.index|95.75057029724121|#outcome:successful
processor-1      | 2024-05-24 15:20:13,148 INFO - processor - markus - Thread-2 - METRICS|2024-05-24 15:20:13|timing|socorro.processor.es.save_processed_crash|136.01990399183705|#host:160ce9ffa6cc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, whoops. It's because I'm not bringing over the filters, too. Good catch!

# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at https://mozilla.org/MPL/2.0/.

"""Holds Markus utility functions and global state."""
Copy link
Contributor Author

@willkg willkg May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is effectively a copy from Antenna with some minor adjustments.

socorro/processor/pipeline.py Outdated Show resolved Hide resolved
@@ -48,7 +45,7 @@


def count_sentry_scrub_error(msg):
metrics.incr("sentry_scrub_error", 1)
METRICS.incr("webapp.sentry_scrub_error", 1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This key changes completely. Before we had webapp.crashstats.apps.sentry_scrub_error and now we've got webapp.sentry_scrub_error which matches the other services.

After this lands, we'll need to update dashboards.

],
)
else:
records = metrics_mock.filter_records(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the problem here that makes this wonky looking is that a host tag gets added and it's hard to know what the value would be. Markus doesn't have a good way of dealing with this. I wrote this up:

willkg/markus#141

I might have been able to implement something here, but Markus tries hard to allow for unstable orders and does this:

https://github.com/willkg/markus/blob/8dcc2fad9178dc03853e666c1a84e90a30411150/src/markus/testing.py#L109-L110

In the meantime, I did this ridiculous thing. We can change it later when Markus grows an AnyValue thing.

socorro/mozilla_settings.py Outdated Show resolved Hide resolved
@willkg willkg force-pushed the willkg-bug-1898341-hostname branch from 2778c64 to dc3e8e5 Compare May 24, 2024 01:50
@willkg willkg marked this pull request as ready for review May 24, 2024 01:50
@willkg willkg requested a review from a team as a code owner May 24, 2024 01:50
@willkg willkg force-pushed the willkg-bug-1898341-hostname branch from dc3e8e5 to a6c97ca Compare May 24, 2024 01:51
@willkg willkg requested a review from relud May 24, 2024 01:51
self.metrics = markus.get_metrics(metrics_prefix)
# Create a MetricsInterface that includes the base prefix plus the prefix passed
# into __init__
self.metrics = markus.get_metrics(build_prefix(METRICS.prefix, metrics_prefix))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this causes the host tag to be lost for socorro.processor.es.* metrics in GCP. For example i got these two log lines when testing, the first with host tag working for gcs storage, the second with host tag missing for es:

processor-1      | 2024-05-24 15:20:13,012 INFO - processor - markus - Thread-2 - METRICS|2024-05-24 15:20:13|timing|socorro.processor.storage.save_processed_crash|12.225023994687945|#host:160ce9ffa6cc
processor-1      | 2024-05-24 15:20:13,039 INFO - processor - markus - Thread-1 - METRICS|2024-05-24 15:20:13|histogram|socorro.processor.es.crash_document_size|4454|

@@ -1,3 +1,5 @@
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at https://mozilla.org/MPL/2.0/.

default_app_config = "crashstats.crashstats.apps.CrashstatsAppConfig"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this is doing or why it's needed for this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally moved the AppConfig so it was structured more like Tecken and copied this over. But then I abandoned that because it created issues, but I never deleted this line.

On top of that, turns out this line is deprecated since Django 3.2, so we don't need it anymore anyhow in Socorro or Tecken. I'll nix it.

willkg added 2 commits May 24, 2024 14:55
This adds a `host` tag to emitted metrics when running in GCP. This is
derived from `HOSTNAME` if it exists, otherwise it defaults to
`socket.gethostname()` like our other services.

This changes Sentry and logging to use `HOSTNAME` configuration variable
rather than `HOST_ID`. This brings us in line with other services as we
migrate to GCP.

This also adds `"socorro"` prefix to all emitted keys, but only for the
GCP environments. This brings keys in line with our other services.

In order to do this, I had to create a singleton `METRICS` and then
rework everything to use that.
Removes a vestigial default_app_config that we shouldn't have since
Django 3.2.

Fixes filters in the MetricsInterface ESCrashStorage uses.
@willkg willkg force-pushed the willkg-bug-1898341-hostname branch from b52a410 to 5a04971 Compare May 24, 2024 18:55
@willkg willkg requested a review from relud May 24, 2024 19:16
@willkg willkg merged commit 169ca35 into main May 24, 2024
2 checks passed
@willkg
Copy link
Contributor Author

willkg commented May 24, 2024

Thank you!

After this autodeploys to stage, I'll update the dashboards.

@willkg willkg deleted the willkg-bug-1898341-hostname branch May 24, 2024 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants