Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Opentelemetry plugin] Specify api.ft.com metrics for underneath APIs #1191

Open
gyss opened this issue Sep 6, 2024 · 11 comments
Open

[Opentelemetry plugin] Specify api.ft.com metrics for underneath APIs #1191

gyss opened this issue Sep 6, 2024 · 11 comments
Assignees
Labels
enhancement New feature or request package: opentelemetry Related to the OpenTelemetry package

Comments

@gyss
Copy link
Contributor

gyss commented Sep 6, 2024

We'd like to have added a hook in the Opentelemetry plugin that adds specificity to the api.ft.com net_peer_name attribute. So we can select the metrics specific to an API that lives inside the API Gateway

What problem does this feature solve?

Due to the ongoing Graphite migration, we need to replace some of the system's Graphite based alerts with Opentelemetry.

Some of these alerts do need to trigger when a specific service is not working as it should. For example, let's see this alert in next-myft-api

	{
		name: 'Insights API is serving topic recommendations',
		id: 'snr-recs-availability',
		type: 'graphiteThreshold',
		metric: 'asPercent(sumSeries(next.heroku.myft-api.*.fetch.insights-api-topic-recommendations.response.status_{5xx,4xx}.count), sumSeries(next.heroku.myft-api.*.fetch.insights-api-topic-recommendations.response.status_200.count))',
		threshold: 10,
		severity: 2,
		businessImpact: 'Topic recommendations in myft pages or in some daily digest email might be missing.',
		technicalSummary: 'Calls to insight API to get topic recommendations are returning 4xx and 5xx results',
		panicGuide: 'Check if there is any issue with Insights API with Search and Recommendation team. You can get more details on Splunk with the query `index=heroku source="next-myft-api" sourcetype="heroku:app" level=error path="/v2/recommendation/user/*/concept"`. It is possible to modify the trafic to get all recommendations from neo4j: in Doppler, set up the env PERCENT_SNR_TOPIC_RECS to 0 and make sure FORCE_SNR_TOPIC_RECS is false. That will stop the calls to Insight API.',
	}

Graphite health check

The metrics inside next.heroku.myft-api.*.fetch.insights-api-topic-recommendations correspond to the requests done to the service in this URL https://api.ft.com/snr/v1/insights/topic-recommendations/, which lives inside of the API Gateway. But when creating the Opentelemetry Grafana panel for this alert, we find that net_peer_name doesn't allow us to select topic-recommendations, only api.ft.com because this label is domain based.

Screenshot 2024-09-06 at 09 58 26

Link to Grafana panel

This means that if next-myft-api uses many services that are inside API Gateway (which it does), all the metrics are bundled together and we can't tell where the 5xx or 4xx are coming from

Ideal solution

As suggested by Sam Parkinson here, we could add a hook in the Opentelemetry plugin to add the first part of the path to the net_peer_name attribute. In this case it would be api.ft.com/snr, which works for us because next-myft-api doesn't use any other API from snr. But maybe we could consider adding the last part of the path, eg : api.ft.com/topic-recommendations

Anyways, this is a problem shared in many repositories (and probably teams), and it'd be amazing to have this solved in the Opentelemetry plugin

Alternatives

An alternative is to code this OTel hook in every system that needs this granularity.

@gyss gyss added the enhancement New feature or request label Sep 6, 2024
@rowanmanning rowanmanning added the package: opentelemetry Related to the OpenTelemetry package label Sep 6, 2024
@rowanmanning rowanmanning moved this from 📥 Inbox to 📝 Planned in Reliability Kit Roadmap Sep 6, 2024
@gyss gyss changed the title [Opentelemtry plugin] Specify api.ft.com metrics for underneath APIs [Opentelemetry plugin] Specify api.ft.com metrics for underneath APIs Sep 9, 2024
@rowanmanning rowanmanning moved this from 📝 Planned to 🏗 In Progress in Reliability Kit Roadmap Sep 16, 2024
@rowanmanning rowanmanning self-assigned this Sep 16, 2024
@rowanmanning
Copy link
Member

Hiya, status update on where I've got to so far. This is hard! I'm encountering two issues:

HTTP vs Undici

I set up a test app locally which makes requests via both Undici and node-fetch – these are the two most common ways we make client requests from our apps. I can get the Undici request to log but I don't seem to be able to hook into the HTTP (node-fetch) request at all. How is this working?? @sjparkinson this is making me question everything - are we definitely getting metrics from node-fetch? 😂

Here's what I've been trying (I've also tried every other option under @opentelemetry/instrumentation-http):

getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-http': {
        requestHook: (span, request) => {
            console.log('HTTP REQUEST HOOK', request.url);
        }
    },
    '@opentelemetry/instrumentation-undici': {
        requestHook: (span, request) => {
            console.log('UNDICI REQUEST HOOK', request.path);
        }
    }
    // ...
})

When I make the following requests:

const nFetch = require('node-fetch');

const bulbasaur = await fetch('https://pokeapi.co/api/v2/pokemon/bulbasaur'); // native fetch
const squirtle = await nFetch('https://pokeapi.co/api/v2/pokemon/squirtle'); // node-fetch

I get only this log:

UNDICI REQUEST HOOK /api/v2/pokemon/bulbasaur

I might be missing something, maybe there's a different instrumentation library that handles HTTP client requests? Seems unlikely though.

Custom attributes on existing metrics

Turns out that the instrumentations for Node.js are quite strict. Ignoring the HTTP issue I decided to try and add custom metrics to native fetches. Attributes would get added to the http.client.request.duration metric.

If I try adding a custom attribute that isn't explicitly supported or even if I add the one experimental attribute (url.template) then I see nothing. So adding the following:

getNodeAutoInstrumentations({
    // ...
    '@opentelemetry/instrumentation-undici': {
        startSpanHook(request) {
            return {
                ['url.template']: '/hello-world'
            };
        }
    }
    // ...
})

The metric that gets exported is like this:

{
    // ...
    attributes: [
        {
            'http.response.status_code': 200,
            'http.request.method': 'GET',
            'server.address': 'pokeapi.co',
            'server.port': 443,
            'url.scheme': 'https'
        }
    ]
}

However if I reuse one of the metrics that's already exported then it works:

getNodeAutoInstrumentations({
    // ...
    '@opentelemetry/instrumentation-undici': {
        startSpanHook(request) {
            return {
                ['server.address']: 'why.wont.you.work'
            };
        }
    }
    // ...
})

This exports the metric:

{
    // ...
    attributes: [
        {
            'http.response.status_code': 200,
            'http.request.method': 'GET',
            'server.address': 'why.wont.you.work',
            'server.port': 443,
            'url.scheme': 'https'
        }
    ]
}

Very frustrating. If this is a hard limitation at the moment then I'm not sure what we can do to get paths into our metrics.

@sjparkinson
Copy link
Contributor

If I try adding a custom attribute that isn't explicitly supported or even if I add the one experimental attribute (url.template) then I see nothing.

I wonder if that is true for attributes in the semantic conventions like peer.service? Or just no new attributes to the map.

@sjparkinson this is making me question everything - are we definitely getting metrics from node-fetch? 😂

No idea! Though I have wondered if some systems are a bit light on client request metrics, worth a browse of https://grafana.ft.com/d/HzGduSwIz/node-js?orgId=1&var-workspace=I7o8Aa1Sz&var-job=next-myft-api&var-environment=production&var-cloud_provider=heroku&var-cloud_region=All&var-http_method=All&var-http_route=All&var-http_errors=All&viewPanel=13 for a system with well known dependencies.

But my understanding for node-fetch is that https://github.com/open-telemetry/opentelemetry-js/blob/main/experimental/packages/opentelemetry-instrumentation-http/src/http.ts is covering it, as it's a wrapper.

@jonnangle
Copy link

It's a bit light on detail but this issue suggests that load order might be relevant.

@sjparkinson
Copy link
Contributor

Some thoughts on this (assuming it is possible).

There's a helpful list of API endpoints under Tyk at https://apigateway.in.ft.com/check-apis, 179 of them today.

I think there are two options for adding the detail we're looking for:

  1. Using a map of URLs to system codes, add a peer.service attribute to requests

    e.g. a client request to https://api.ft.com/enrichedcontent/... would have peer.service set to enriched-content-read-api.

  2. Update the net.peer.name (at some point this becomes server.address FYI) attribute to include more of the listen path

    e.g. a client request to https://api.ft.com/content/a21abcbe-ae52-40c0-8e8d-858372065e2a would have net.peer.name updated to api.ft.com/content/{id}.

I don't really have a strong list of pros and cons, but my gut feeling is 2 would be most intuitive, and perhaps a little easier to maintain (let's try not to reinvent the next-metrics list right?).

rowanmanning added a commit that referenced this issue Sep 18, 2024
This creates a test app for the work in #1191. It illustrates the issues
I'm seeing and makes it easier for others to try things out.

You can test this wth the following:

In one terminal window, run:

```
node ./packages/opentelemetry/test/end-to-end/scripts/run.js
```

This starts a test app on port 4001 which has OpenTelemetry metrics
enabled. I've overridden the OpenTelemetry package internals to log
metrics to `stdout` instead of sending to a collector.

Once you've done this, run the following:

```
curl http://localhost:4001
```

You should see some metrics get logged. Note two things about them:

  1. There is never a metric logged for the node-fetch request

  2. No custom attributes come through on the logged metrics
@rowanmanning
Copy link
Member

@sjparkinson I've opened a draft PR here which includes a bunch of manual edits and a test app to illustrate the issue: #1202.

  1. Still not getting anything from node-fetch, very interested if you can find a way!

  2. Setting net.peer.name doesn't seem to work

@sjparkinson
Copy link
Contributor

sjparkinson commented Sep 18, 2024

Still not getting anything from node-fetch, very interested if you can find a way!

Reading through how the SDK wraps and how node-fetch is written, I wonder if it's related to node-fetch using node:http and node:https as imports?

https://github.com/node-fetch/node-fetch/blob/8b3320d2a7c07bce4afc6b2bf6c3bbddda85b01f/src/index.js#L9-L10

(I'm sort of seeing this as maybe a good opportunity on figuring out how to write an auto-instrumentation package.)

@sjparkinson
Copy link
Contributor

@rowanmanning
Copy link
Member

possibly for the latest node-fetch but I'm explicitly using a v2 version that uses CommonJS 😄

@sjparkinson
Copy link
Contributor

sjparkinson commented Sep 19, 2024

Still not getting anything from node-fetch, very interested if you can find a way!

I am also finding the same with a vanilla express app. Traces are captured for the POST request by the @opentelemetry/instrumentation-net package, but nothing from @opentelemetry/instrumentation-http.

Not sure why though. Something to fix somewhere!

I have asked in the #otel-js CNCF channel, join link if you're interested. Can't say I always get support, but worth asking.

@sjparkinson
Copy link
Contributor

sjparkinson commented Sep 19, 2024

I have however been able to use node-fetch@v2 fine with the http example at https://github.com/open-telemetry/opentelemetry-js/tree/main/examples/http. Bit of a fiddle but I installed node-fetch in the client and was able to get the expected trace.

Screenshot 2024-09-19 at 11 41 43
diff --git a/examples/http/client.js b/examples/http/client.js
index 168d43392..0cf4c675f 100644
--- a/examples/http/client.js
+++ b/examples/http/client.js
@@ -2,7 +2,7 @@
 
 const api = require('@opentelemetry/api');
 const tracer = require('./tracer')('example-http-client');
-const http = require('http');
+const fetch = require('node-fetch');
 
 /** A function which makes requests and handles response. */
 function makeRequest() {
@@ -10,18 +10,11 @@ function makeRequest() {
   // the span, which is created to track work that happens outside of the
   // request lifecycle entirely.
   tracer.startActiveSpan('makeRequest', (span) => {
-    http.get({
-      host: 'localhost',
-      port: 8080,
-      path: '/helloworld',
-    }, (response) => {
-      const body = [];
-      response.on('data', (chunk) => body.push(chunk));
-      response.on('end', () => {
-        console.log(body.toString());
+    fetch('http://localhost:8080/helloworld')
+      .then((response) => {
+        response.text();
         span.end();
       });
-    });
   });
 
   // The process must live for at least the interval past any traces that
diff --git a/examples/http/package.json b/examples/http/package.json
index 64a05ed43..3d8638469 100644
--- a/examples/http/package.json
+++ b/examples/http/package.json
@@ -5,10 +5,10 @@
   "description": "Example of HTTP integration with OpenTelemetry",
   "main": "index.js",
   "scripts": {
-    "zipkin:server": "cross-env EXPORTER=zipkin node ./server.js",
-    "zipkin:client": "cross-env EXPORTER=zipkin node ./client.js",
-    "jaeger:server": "cross-env EXPORTER=jaeger node ./server.js",
-    "jaeger:client": "cross-env EXPORTER=jaeger node ./client.js",
+    "zipkin:server": "EXPORTER=zipkin node ./server.js",
+    "zipkin:client": "EXPORTER=zipkin node ./client.js",
+    "jaeger:server": "EXPORTER=jaeger node ./server.js",
+    "jaeger:client": "EXPORTER=jaeger node ./client.js",
     "align-api-deps": "node ../../scripts/align-api-deps.js"
   },
   "repository": {
@@ -37,7 +37,8 @@
     "@opentelemetry/resources": "1.26.0",
     "@opentelemetry/sdk-trace-base": "1.26.0",
     "@opentelemetry/sdk-trace-node": "1.26.0",
-    "@opentelemetry/semantic-conventions": "1.27.0"
+    "@opentelemetry/semantic-conventions": "1.27.0",
+    "node-fetch": "^2.7.0"
   },
   "homepage": "https://github.com/open-telemetry/opentelemetry-js/tree/main/examples/http",
   "devDependencies": {

So it does seem like an issue with auto-instrimentation as the issue Jon linked to suggests?

@rowanmanning
Copy link
Member

"HTTP vs Undici" from my original comment is resolved in v2.0.11 of the package but we still need to work out how we want to augment the metrics with our paths. "Custom attributes on existing metrics" is still blocking and we'll either need to find a workaround or try to get the OpenTelemetry maintainers to support custom attributes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request package: opentelemetry Related to the OpenTelemetry package
Projects
Status: 🏗 In Progress
Development

No branches or pull requests

4 participants