[ObsUX] Make Metrics data sources within APM transparent to avoid confusion with overlapping metrics in the UI #170632

MiriamAparicio · 2023-11-06T11:35:19Z

Description of the problem

The Metrics tab populates its charts (e.g. memory usage (avg)) from the APM agent, whilst the Infrastructure tab shows a table of metrics populated by metricbeat. This is confusing for customers.

Possible solutions

(to be discussed)

For now, while other solutions are discussed, we can just inform customers about where the data is captured (e.g. tooltip, banner, ...)
If metricbeat is running on the host, we should use the cpu and memory captured by it, and only fall back to the cpu/memory captured by the APM agent when metricbeat data is not available. (For language-specific runtime metrics, like event loop delay in Node.js or the number of JVM threads in Java, we should always show the value from the APM agent because metricbeat does not capture these.)

Related issues

[Infrastructure Observability] Infrastructure metrics data should pull from APM if no agent/beat data is available

✔️ Acceptance criteria

Draft - TBC during refinement

1. Must Have

Must be delivered in this issue in order for the release to be valuable

Name	Description	Notes
TBC	...	...

2. Should Have

Name	Description	Notes
TBC	...	...

3. Could Have

Would be nice to have but not critical

Name	Description	Notes
TBC	...	...

4. Will Not Have (for now)

Explicitly will not be looked at within this issue

Name	Description	Notes
TBC	...	...

MiriamAparicio · 2023-11-06T11:36:22Z

cc @roshan-elastic

elasticmachine · 2023-11-06T11:36:58Z

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

roshan-elastic · 2023-11-06T13:59:28Z

@smith - do you think this is something that would fit in the team backlog or do you think this needs a project to be prioritised to try and improve this?

smith · 2023-11-06T15:40:32Z

Let's keep this scoped to a short-term solution that explains to the user why the data might be different, as @MiriamAparicio described above.

roshan-elastic · 2023-11-06T17:44:25Z

@MiriamAparicio - thanks for raising this, a very good problem description and next steps.

I hope you don't mind but I renamed it slightly to reflect that we'll try and focus on your first suggestion - making it clear what each metric really means/where it comes from.

I also added in a draft Acceptance Criteria.

My hope is that the solution/ACE can be figured out during refinement if that works?

@smith - OK?

sorenlouv · 2023-11-06T18:25:58Z

we'll try and focus on your first suggestion - making it clear what each metric really means/where it comes from.

I don't understand why we would want to present the user with two different values for memory and cpu. Are there any good reasons for them to be different, other than they were captured through different means? If so, what are they? If we can clearly articulate the difference and when one would need to use one over the other, I can somewhat understand why we'd have both. If not I suggest we should use the metricbeat value, and use the APM agent value as fallback.

roshan-elastic · 2023-11-07T10:35:36Z

we'll try and focus on your first suggestion - making it clear what each metric really means/where it comes from.

I don't understand why we would want to present the user with two different values for memory and cpu. Are there any good reasons for them to be different, other than they were captured through different means? If so, what are they? If we can clearly articulate the difference and when one would need to use one over the other, I can somewhat understand why we'd have both. If not I suggest we should use the metricbeat value, and use the APM agent value as fallback.

Hey @sqren, you're right - there isn't a need for them to be different from a user POV.

My main thinking here was whether we can really solve for this without significant work that we likely can't prioritise right now. Having said that, if you can think of a way to elegantly handle this without a lot of work - I'm happy for us to spend some time refining this to try.

I do like your idea, it's pretty smart. I do have a concern but let me check I understand first.

To recap your suggestion:

If a user is running metricbeat on all of the hosts...we show the metricbeat data in the 'metrics' tab (making it consistent with the 'infrastructure' tab and the infrastructure views in general)
If a user is not running metricbeat on the hosts, we show the APM data as a back-up...inconsistent but the user wouldn't know because the metricbeat data doesn't exist (so it's better than nothing)

My concern would be what happens if some of the hosts run metricbeat and some don't - what do we show in the 'metrics' tab?

sorenlouv · 2023-11-08T15:46:47Z

To recap your suggestion:

Yeah, my thinking is that we first fetch the metric (cpu, memory) from the infra indices. If that doesn't yield any results we fetch from the apm indices. We can start doing this from within the APM app (we already have data clients to access infra and apm indices). The better solution would be to have this encapsulated somewhere (OAM?) so that we can just call a function getCpuForHost(hostId) and it will return the right value.
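A minimal sketch of the fallback described above. The two lookup functions are hypothetical stand-ins for queries against the infra and APM indices (they are not existing Kibana APIs), and `getCpuForHost` is only the name suggested in this comment. Returning the source alongside the value is an added assumption, so the UI could say where the number came from.

```typescript
// Hypothetical lookup: returns the latest CPU value for a host, or undefined
// when the underlying indices hold no data for that host.
type CpuLookup = (hostId: string) => number | undefined;

interface CpuResult {
  value: number;
  source: "infra" | "apm";
}

// Prefer the infra (metricbeat) value; only fall back to the APM agent value
// when the infra lookup yields nothing.
function makeGetCpuForHost(fromInfra: CpuLookup, fromApm: CpuLookup) {
  return function getCpuForHost(hostId: string): CpuResult | undefined {
    const infraValue = fromInfra(hostId);
    if (infraValue !== undefined) {
      return { value: infraValue, source: "infra" };
    }
    const apmValue = fromApm(hostId);
    return apmValue === undefined ? undefined : { value: apmValue, source: "apm" };
  };
}

// Illustrative in-memory data standing in for the two sets of indices.
const infraCpu = new Map<string, number>([["host-a", 0.42]]);
const apmCpu = new Map<string, number>([
  ["host-a", 0.37],
  ["host-b", 0.15],
]);

const getCpuForHost = makeGetCpuForHost(
  (id) => infraCpu.get(id),
  (id) => apmCpu.get(id),
);
```

With this data, `host-a` resolves to the infra value and `host-b` to the APM value; a host unknown to both sources resolves to `undefined`.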

sorenlouv · 2023-11-08T15:50:01Z

My concern would be what happens if some of the hosts run metricbeat and some don't - what do we show in the 'metrics' tab?

Yes, good point. I suggest that if we detect any metricbeat data for the selected service, we use that for all hosts. I think we should treat it as a configuration error if the customer has a service running across multiple hosts, and some but not all are running metricbeat.

roshan-elastic · 2023-11-09T12:20:23Z

Hey @sqren, I like your thinking here...I think I got ahead of myself with the acceptance criteria here.

What do you think about me just deleting the acceptance criteria for now and you/the team/me would have time to think of possibilities during refinement?

That way, you have the freedom to propose some solutions and the acceptance criteria would be based on that?

sorenlouv · 2023-11-13T19:00:34Z

@roshan-elastic SGTM 👍

crespocarlos · 2023-11-20T14:12:12Z

@roshan-elastic @sqren

If a user is not running metricbeat on the hosts, we show the APM data as a back-up...inconsistent but the user wouldn't know because the metricbeat data doesn't exist (so it's better than nothing)

Since APM data is inconsistent, wouldn't it make more sense to prompt users to install metricbeat or deploy an agent to those hosts?

crespocarlos · 2023-11-20T14:43:24Z

Also, the inconsistency will be evident when we integrate the Asset Details flyout in the Infra table?!

sorenlouv · 2023-11-22T17:01:58Z

Since APM data is inconsistent, wouldn't it make more sense to prompt users to install metricbeat or deploy an agent to those hosts?

My intention was that if the user has metricbeat running for some hosts but not all, the hosts without metricbeat will not show up at all. We should only fall back to APM data, if there are no hosts with metricbeat data. We can improve this down the line by letting the user know that we have discovered hosts that do not have metricbeat - this should also take into account hosts discovered via other means than APM agents (eg filebeat).

roshan-elastic · 2023-11-23T15:14:38Z

@sqren @crespocarlos

Playing this back for my understanding, for the 'infrastructure' and 'metrics' tabs in APM:

1. If none of the APM-detected hosts run metricbeat, we only show APM data
2. If all of the APM-detected hosts run metricbeat, we only show the metricbeat data
3. If some of the APM-detected hosts run metricbeat, we'll only show those which are running metricbeat (and discard all hosts without it from both tabs)
4. In the future, we can help users plug those gaps by prompting them on how to onboard hosts discovered via APM with metricbeat/system integration
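The first three cases above can be sketched as one small function. This is only an illustration: the `DetectedHost` shape and the `hasMetricbeat` flag are assumptions (in practice that flag would come from checking the infra indices), not existing Kibana types.

```typescript
// Assumed shape: a host discovered via APM, plus whether metricbeat data
// exists for it.
interface DetectedHost {
  name: string;
  hasMetricbeat: boolean;
}

interface HostSelection {
  source: "metricbeat" | "apm";
  hosts: string[];
}

function selectHosts(detected: DetectedHost[]): HostSelection {
  const withMetricbeat = detected.filter((h) => h.hasMetricbeat);
  if (withMetricbeat.length === 0) {
    // No host runs metricbeat: show APM data for every detected host.
    return { source: "apm", hosts: detected.map((h) => h.name) };
  }
  // All or some hosts run metricbeat: show metricbeat data only, which
  // drops any host that has no metricbeat data.
  return { source: "metricbeat", hosts: withMetricbeat.map((h) => h.name) };
}
```

Note that the all-metricbeat and mixed cases share one branch; the only difference is that hosts get dropped in the mixed case.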

Thoughts
If so, this does sound sensible to me, the only concern I would have would be around (3) where there is a mix of hosts which run metricbeat and those which don't (I don't have any numbers on how often this happens).

My worry is that once a user has at least 1 APM-detected host that runs metricbeat/agent, will they lose all of the metrics for the hosts which they previously had via APM-detected hosts but now are being excluded?

Idea...
I'm wondering whether it might be worth pursuing still trying to leverage the APM data so there isn't a drastic difference from running no hosts with metricbeat vs running one - I'd imagine the APM metric data would be helpful (even if it doesn't match metricbeat perfectly).

e.g. as soon as we go to option (3), we still show the APM data but flag it, show them how to filter it out and also provide instructions on how to onboard them with elastic agent/metricbeat?
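A rough sketch of that alternative, under the same kind of assumptions: the `ApmHost` and `HostRow` shapes and the `needsOnboarding` field are hypothetical names used only for illustration. Hosts without metricbeat are kept and flagged instead of being discarded, and a separate filter lets the user hide them.

```typescript
// Assumed input shape: a host discovered via APM, plus whether metricbeat
// data exists for it.
interface ApmHost {
  name: string;
  hasMetricbeat: boolean;
}

interface HostRow {
  name: string;
  source: "metricbeat" | "apm";
  // True only in the mixed case, for hosts that still rely on APM data.
  needsOnboarding: boolean;
}

function toHostRows(detected: ApmHost[]): HostRow[] {
  const anyMetricbeat = detected.some((h) => h.hasMetricbeat);
  return detected.map((h): HostRow => ({
    name: h.name,
    source: h.hasMetricbeat ? "metricbeat" : "apm",
    needsOnboarding: anyMetricbeat && !h.hasMetricbeat,
  }));
}

// Optional filter so the user can hide the flagged APM-only hosts.
function hideFlagged(rows: HostRow[]): HostRow[] {
  return rows.filter((r) => !r.needsOnboarding);
}
```

Compared with discarding hosts outright, nothing disappears when the first host gets metricbeat; the flagged rows are where onboarding instructions could be attached.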

More complexity...Containers vs Hosts
One more added layer of complexity is how this all works with matching the host.hostname detected by APM to the actual host.name detected by elastic agent/beats etc? If the app is running in a container on a host, I'm wondering whether the host.hostname detected by APM will be the container name and won't match the host.name of the host (detected by beats/elastic agent - assuming they run agent on the host itself and not in the container).

I'm not sure how this plays into the handling of everything...

I'm thinking a list of potential use cases would be quite helpful so we could map out what would happen?

Different teams for APM vs hosts (plugging the gap may not be quick): One thing to consider is that the team that wants to instrument with APM is usually different from the team that wants to deploy agent/metricbeat to the hosts. So if there are some hosts with and some without agent/metricbeat running on them, I would imagine it will be hard to get that gap plugged, because different teams are involved.

crespocarlos · 2023-11-27T09:58:59Z

@roshan-elastic, @sqren

Wouldn't discarding hosts, as proposed in option 3, cause more confusion than solving the issue?

I'm wondering whether it might be worth pursuing still trying to leverage the APM data so there isn't a drastic difference from running no hosts with metricbeat vs running one - I'd imagine the APM metric data would be helpful (even if it doesn't match metricbeat perfectly).

We also need to consider that we will soon integrate the asset details flyout into the Infrastructure table. So what we're discussing here will solve the mismatches in APM UI, but the problem will still exist in Infra UIs.

botelastic · 2024-05-25T10:57:52Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

smith · 2024-05-26T00:58:19Z

We're fixing this with an entity-based view, so closing this issue.

sorenlouv · 2024-05-26T21:07:55Z

We're fixing this with an entity-based view, so closing this issue.

Just curious: if we have two different CPU values for the same host, how will the entity model solve the problem of deciding which value to use?

MiriamAparicio added the apm:infrastructure-integration label Nov 6, 2023

botelastic bot added the needs-team Issues missing a team label label Nov 6, 2023

MiriamAparicio added the Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team label Nov 6, 2023

botelastic bot removed the needs-team Issues missing a team label label Nov 6, 2023

roshan-elastic changed the title ~~[ObsUX] Metrics data discrepancy between Metrics tab and Infrastructure tab~~ [ObsUX] Make Metrics data sources transparent to avoid confusion with overlapping metrics in the UI Nov 6, 2023

roshan-elastic added the needs-refinement A reason and acceptance criteria need to be defined for this issue label Nov 6, 2023

roshan-elastic changed the title ~~[ObsUX] Make Metrics data sources transparent to avoid confusion with overlapping metrics in the UI~~ [ObsUX] Make Metrics data sources within APM transparent to avoid confusion with overlapping metrics in the UI Nov 6, 2023

botelastic bot added the stale Used to mark issues that were closed for being stale label May 25, 2024

smith closed this as not planned May 26, 2024
