bug-1898341, bug-1898345: add "host" tag, fix metrics key prefix #6623
@@ -12,19 +12,19 @@

 LOGGING_LEVEL = CONFIG("LOGGING_LEVEL", "INFO")
 LOCAL_DEV_ENV = CONFIG("LOCAL_DEV_ENV", False, cast=bool)
-HOST_ID = socket.gethostname()
+HOSTNAME = CONFIG("HOSTNAME", default=socket.gethostname())
We're switching from `HOST_ID` to `HOSTNAME` everywhere. It's interesting that this didn't originally pull the `HOST_ID` from the environment--it only used `socket.gethostname()`.
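For readers who don't have the settings module open, here is a minimal stdlib-only sketch of the lookup order described above: take `HOSTNAME` from the environment if it is set, otherwise fall back to `socket.gethostname()`. The `resolve_hostname` helper is hypothetical; the PR itself uses the project's `CONFIG` helper, whose handling of empty values may differ.

```python
import os
import socket


def resolve_hostname(environ=None):
    """Return HOSTNAME from the environment, else the socket hostname.

    Hypothetical helper for illustration only. An empty HOSTNAME is treated
    as unset here, which is an assumption, not documented behavior.
    """
    environ = os.environ if environ is None else environ
    return environ.get("HOSTNAME") or socket.gethostname()
```

Passing a plain dict instead of `os.environ` makes the fallback easy to exercise in a test without touching the real environment.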
     def filter(self, record):
-        record.host_id = HOST_ID
+        record.hostname = HOSTNAME
         return True
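As context for the hunk above, this is roughly what a logging filter of this shape looks like in isolation. It is a sketch, not the project's code: the class name `AddHostname` comes from the discussion, while the `HOSTNAME` stand-in and the format string are assumptions for illustration.

```python
import logging
import socket

# Stand-in for the configured value; the PR reads it from configuration.
HOSTNAME = socket.gethostname()


class AddHostname(logging.Filter):
    """Attach the hostname to every record so formatters can use %(hostname)s."""

    def filter(self, record):
        record.hostname = HOSTNAME
        # Returning True keeps the record; the filter only annotates it.
        return True


def build_logger(name="example"):
    """Create a logger whose handler annotates records with the hostname."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.addFilter(AddHostname())
    handler.setFormatter(logging.Formatter("%(hostname)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

The filter is attached to the handler rather than the logger so that records propagated from child loggers are annotated too.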
Once we've switched to GCP, we can remove `AddHostname` because we won't need it anymore. I wrote up a bug for that:
why won't we need it anymore?
because logs in gcp are already tagged with pod and container name?
When we did Eliot, I remember that Harold and I talked about this but I don't remember what we covered. I had this bug where I removed the logging filter from Eliot logging:
https://bugzilla.mozilla.org/show_bug.cgi?id=1815981
It doesn't talk about why, though.
Eliot app log entries look like this:
{
  "insertId": "bk16cbjhu33my4va",
  "jsonPayload": {
    "Logger": "eliot",
    "Fields": {
      "processname": "webapp",
      "msg": "symbolicate/v5: jobs: 6, symbols: 139, time: 0.9378394290106371"
    },
    "Hostname": "eliot-b967b84dc-grnhh",
    "Type": "eliot.symbolicate_resource",
    "EnvVersion": "2.0",
    "Pid": 865,
    "Timestamp": 1716566432197194000,
    "Severity": 6
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
      "namespace_name": "symbols-prod",
      "pod_name": "eliot-b967b84dc-grnhh",
      "project_id": "moz-fx-webservices-low-prod",
      "location": "us-west1",
      "container_name": "eliot",
      "cluster_name": "webservices-low-prod"
    }
  },
  "timestamp": "2024-05-24T16:00:32.197680883Z",
  "severity": "INFO",
  "labels": {
    "compute.googleapis.com/resource_name": "gke-webservices-low--c2-prod-1-202311-c9515c07-d2ci",
    "k8s-pod/app_kubernetes_io/component": "eliot-data",
    "k8s-pod/pod-template-hash": "b967b84dc",
    "k8s-pod/env_code": "prod",
    "k8s-pod/app_kubernetes_io/name": "eliot"
  },
  "logName": "projects/moz-fx-webservices-low-prod/logs/stdout",
  "receiveTimestamp": "2024-05-24T16:00:34.452227485Z"
}
That includes a "hostname" field as well as a pod_name and container_name.
The "hostname" field gets added by the python-dockerflow mozlog formatter:
@@ -98,8 +96,6 @@ def __init__(
         self.bucket = bucket
         self.dump_file_suffix = dump_file_suffix
-
-        self.metrics = markus.get_metrics(metrics_prefix)
This never got used in the crashstorage classes except for `ESCrashStorage`. I removed it from everywhere and for `ESCrashStorage` I did something different.
socorro/external/es/crashstorage.py
Outdated
-        self.metrics = markus.get_metrics(metrics_prefix)
+        # Create a MetricsInterface that includes the base prefix plus the prefix passed
+        # into __init__
+        self.metrics = markus.get_metrics(build_prefix(METRICS.prefix, metrics_prefix))
The `METRICS.prefix` will be either `""` or `"socorro"` depending on the cloud environment. `build_prefix` will append the crash storage metrics prefix (i.e. `"processor.es"` here) to that and that results in the prefix it uses.
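To make the prefix behavior concrete, here is a small sketch of a helper that joins prefix parts the way the comment describes. The real `build_prefix` is not shown in this hunk, so the body below is an assumption about its behavior (skip empty parts, join with dots), not a copy of it.

```python
def build_prefix(*parts):
    """Join non-empty prefix parts with '.', dropping stray leading/trailing dots.

    Illustrative reimplementation only; the project's helper may differ.
    """
    cleaned = [part.strip(".") for part in parts if part]
    return ".".join(part for part in cleaned if part)


# Base prefix is "" outside GCP and "socorro" in GCP, per the comment above.
aws_prefix = build_prefix("", "processor.es")
gcp_prefix = build_prefix("socorro", "processor.es")
```

With an empty base prefix the result is just `processor.es`, and with `"socorro"` as the base it becomes `socorro.processor.es`, so the key emitted outside GCP stays unchanged.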
I wrote up a bug in Markus to make this a little easier. It'd be nice to hone a `MetricsInterface` iteratively. Then we wouldn't need this wonky looking thing.
This causes the host tag to be lost for `socorro.processor.es.*` metrics in GCP. For example, I got these two log lines when testing: the first has the host tag working for GCS storage, the second is missing the host tag for ES:
processor-1 | 2024-05-24 15:20:13,012 INFO - processor - markus - Thread-2 - METRICS|2024-05-24 15:20:13|timing|socorro.processor.storage.save_processed_crash|12.225023994687945|#host:160ce9ffa6cc
processor-1 | 2024-05-24 15:20:13,039 INFO - processor - markus - Thread-1 - METRICS|2024-05-24 15:20:13|histogram|socorro.processor.es.crash_document_size|4454|
Note: this only breaks the host tag for metrics emitted in this file; metrics emitted in the base class still get the host tag. For example, the first metric here has no host tag, but the second does:
processor-1 | 2024-05-24 15:20:13,148 INFO - processor - markus - Thread-2 - METRICS|2024-05-24 15:20:13|histogram|socorro.processor.es.index|95.75057029724121|#outcome:successful
processor-1 | 2024-05-24 15:20:13,148 INFO - processor - markus - Thread-2 - METRICS|2024-05-24 15:20:13|timing|socorro.processor.es.save_processed_crash|136.01990399183705|#host:160ce9ffa6cc
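Spotting a missing tag by eye in log output is error-prone, so a quick script can help when testing locally. This is a sketch that assumes the `METRICS|timestamp|type|key|value|#tags` layout seen in the lines quoted above, with multiple tags separated by commas (the comma separator is an assumption); `parse_metrics_line` and `has_host_tag` are hypothetical helpers, not part of Markus or Socorro.

```python
def parse_metrics_line(line):
    """Parse a 'METRICS|...' log line into a dict, or return None if it isn't one."""
    marker = "METRICS|"
    start = line.find(marker)
    if start == -1:
        return None
    # Everything after the marker: timestamp|type|key|value|#tags
    fields = line[start + len(marker):].strip().split("|", 4)
    if len(fields) != 5:
        return None
    timestamp, kind, key, value, tag_field = fields
    tags = [tag for tag in tag_field.lstrip("#").split(",") if tag]
    return {"timestamp": timestamp, "type": kind, "key": key, "value": value, "tags": tags}


def has_host_tag(line):
    """Return True if the metrics line carries a 'host:<value>' tag."""
    parsed = parse_metrics_line(line)
    return parsed is not None and any(tag.startswith("host:") for tag in parsed["tags"])
```

Piping the processor's output through `has_host_tag` and printing the keys that fail gives a quick list of the metrics that lost the tag.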
Oh, whoops. It's because I'm not bringing over the filters, too. Good catch!
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at https://mozilla.org/MPL/2.0/.

"""Holds Markus utility functions and global state."""
This is effectively a copy from Antenna with some minor adjustments.
@@ -48,7 +45,7 @@


 def count_sentry_scrub_error(msg):
-    metrics.incr("sentry_scrub_error", 1)
+    METRICS.incr("webapp.sentry_scrub_error", 1)
This key changes completely. Before we had `webapp.crashstats.apps.sentry_scrub_error` and now we've got `webapp.sentry_scrub_error` which matches the other services.
After this lands, we'll need to update dashboards.
],
)
else:
records = metrics_mock.filter_records(
So the problem here that makes this wonky looking is that a `host` tag gets added and it's hard to know what the value would be. Markus doesn't have a good way of dealing with this. I wrote this up:
I might have been able to implement something here, but Markus tries hard to allow for unstable orders and does this:
In the meantime, I did this ridiculous thing. We can change it later when Markus grows an `AnyValue` thing.
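To illustrate the kind of "any value" matcher being wished for here, the sketch below defines a placeholder that compares equal to any tag string with a given key. `AnyTagValue` is a hypothetical name for this example, not a reference to an existing Markus API, and it deliberately ignores the tag-sorting concern mentioned above.

```python
class AnyTagValue:
    """Placeholder that equals any 'key:value' tag string with a matching key.

    Hypothetical test helper for illustration; it does not support ordering,
    so it won't survive code that sorts tags before comparing them.
    """

    def __init__(self, key):
        self.key = key

    def __eq__(self, other):
        if isinstance(other, AnyTagValue):
            return self.key == other.key
        return isinstance(other, str) and other.startswith(self.key + ":")

    def __repr__(self):
        return f"<AnyTagValue {self.key}>"


# The emitted host value varies per machine, so the test only pins the key.
emitted_tags = ["host:160ce9ffa6cc"]
matches = emitted_tags == [AnyTagValue("host")]
```

Because `str.__eq__` returns `NotImplemented` for unknown types, Python falls back to `AnyTagValue.__eq__`, so the placeholder works on either side of the comparison.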
webapp/crashstats/__init__.py
Outdated
@@ -1,3 +1,5 @@
 # This Source Code Form is subject to the terms of the Mozilla Public
 # License, v. 2.0. If a copy of the MPL was not distributed with this
 # file, You can obtain one at https://mozilla.org/MPL/2.0/.
+
+default_app_config = "crashstats.crashstats.apps.CrashstatsAppConfig"
I don't understand what this is doing or why it's needed for this PR.
I originally moved the AppConfig so it was structured more like Tecken and copied this line over. I later abandoned that because it created issues, but I never deleted this line.
On top of that, it turns out `default_app_config` has been deprecated since Django 3.2, so we don't need it anymore in Socorro or Tecken anyhow. I'll nix it.
This adds a `host` tag to emitted metrics when running in GCP. This is derived from `HOSTNAME` if it exists, otherwise it defaults to `socket.gethostname()` like our other services.

This changes Sentry and logging to use `HOSTNAME` configuration variable rather than `HOST_ID`. This brings us in line with other services as we migrate to GCP.

This also adds `"socorro"` prefix to all emitted keys, but only for the GCP environments. This brings keys in line with our other services.

In order to do this, I had to create a singleton `METRICS` and then rework everything to use that.

Removes a vestigial default_app_config that we shouldn't have since Django 3.2. Fixes filters in the MetricsInterface ESCrashStorage uses.
Thank you! After this autodeploys to stage, I'll update the dashboards.
This adds a "host" tag to emitted metrics when running in GCP. This is derived from "HOSTNAME" if it exists, otherwise it defaults to `socket.gethostname()` like our other services.

This changes Sentry and logging to use `HOSTNAME` configuration variable rather than `HOST_ID`. This brings us in line with other services as we migrate to GCP.

This also adds `"socorro."` prefix to all emitted keys, but only for the GCP environments so we don't make a major change to socorro running in AWS at this time. This brings keys in line with our other services.

In order to do this, I had to create a singleton `METRICS` in `socorro/libmarkus.py` and then rework everything to use that.

To test:

- check the `socorro` prefix and the `host` tag
- set `CLOUD_PROVIDER=GCP`, run socorro, process crashes, make sure metrics are emitted and look correct; check the `socorro` prefix and the `host` tag