
[ObsUX] Make Metrics data sources within APM transparent to avoid confusion with overlapping metrics in the UI #170632

Closed
MiriamAparicio opened this issue Nov 6, 2023 · 19 comments
Labels
apm:infrastructure-integration
needs-refinement (A reason and acceptance criteria need to be defined for this issue)
stale (Used to mark issues that were closed for being stale)
Team:obs-ux-infra_services (Observability Infrastructure & Services User Experience Team)

Comments

@MiriamAparicio
Contributor

MiriamAparicio commented Nov 6, 2023

Description of the problem

The Metrics tab populates its charts (e.g. memory usage (avg)) from the APM agent, whilst the Infrastructure tab shows a table of metrics populated by metricbeat. This is confusing for customers.

[Screenshot 2023-11-06 at 18 08 51] [Screenshot 2023-11-06 at 18 11 43]

Possible solutions

(to be discussed)

  • For now, while other solutions are discussed, we can simply tell customers where the data is captured (e.g. tooltip, banner, ...)
  • If metricbeat is running on the host, we should use the cpu and memory it captures, and only fall back to the cpu/memory captured by the APM agent when no metricbeat data is available. (For language-specific runtime metrics, like event loop delay in Node.js or the number of JVM threads in Java, we should always show the APM agent's value, because metricbeat does not capture these.)
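
To make the second option concrete, here is a minimal sketch of the per-metric choice. Every name in it (`MetricSource`, `SHARED_HOST_METRICS`, `pickMetricSource`) is hypothetical and only illustrates the rule; none of it is taken from the Kibana codebase.

```typescript
// Hypothetical names throughout; this only illustrates the proposed rule.
type MetricSource = "metricbeat" | "apm-agent";

// Metrics that both metricbeat and the APM agent can report.
const SHARED_HOST_METRICS = ["cpu", "memory"];

function pickMetricSource(metric: string, hostHasMetricbeat: boolean): MetricSource {
  const isShared = SHARED_HOST_METRICS.indexOf(metric) !== -1;
  // Runtime metrics (e.g. Node.js event loop delay, JVM thread count) are
  // only captured by the APM agent, so they never switch source.
  if (!isShared) {
    return "apm-agent";
  }
  // Shared metrics prefer metricbeat and fall back to the APM agent.
  return hostHasMetricbeat ? "metricbeat" : "apm-agent";
}
```

For example, `pickMetricSource("cpu", true)` picks metricbeat, while a runtime metric stays on the APM agent even when metricbeat is present.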

Related issues

[Infrastructure Observability] Infrastructure metrics data should pull from APM if no agent/beat data is available

✔️ Acceptance criteria

Draft - TBC during refinement

1. Must Have

Must be delivered in this issue in order for the release to be valuable

Name | Description | Notes
TBC | ... | ...

2. Should Have

Name | Description | Notes
TBC | ... | ...

3. Could Have

Would be nice to have but not critical

Name | Description | Notes
TBC | ... | ...

4. Will Not Have (for now)

Explicitly will not be looked at within this issue

Name | Description | Notes
TBC | ... | ...
botelastic bot added the needs-team label (Issues missing a team label) on Nov 6, 2023
@MiriamAparicio
Contributor Author

cc @roshan-elastic

MiriamAparicio added the Team:obs-ux-infra_services label (Observability Infrastructure & Services User Experience Team) on Nov 6, 2023
@elasticmachine
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

botelastic bot removed the needs-team label (Issues missing a team label) on Nov 6, 2023
@roshan-elastic

@smith - do you think this is something that would fit in the team backlog or do you think this needs a project to be prioritised to try and improve this?

@smith
Contributor

smith commented Nov 6, 2023

Let's keep this issue for a short-term solution that explains to the user why the data might differ, as @MiriamAparicio described above.

roshan-elastic changed the title from "[ObsUX] Metrics data discrepancy between Metrics tab and Infrastructure tab" to "[ObsUX] Make Metrics data sources transparent to avoid confusion with overlapping metrics in the UI" on Nov 6, 2023
roshan-elastic added the needs-refinement label (A reason and acceptance criteria need to be defined for this issue) on Nov 6, 2023
@roshan-elastic

@MiriamAparicio - thanks for raising this, a very good problem description and next steps.

I hope you don't mind but I renamed it slightly to reflect that we'll try and focus on your first suggestion - making it clear what each metric really means/where it comes from.

I also added in a draft Acceptance Criteria.

My hope is that the solution/ACE can be figured out during refinement if that works?

@smith - OK?

roshan-elastic changed the title from "[ObsUX] Make Metrics data sources transparent to avoid confusion with overlapping metrics in the UI" to "[ObsUX] Make Metrics data sources within APM transparent to avoid confusion with overlapping metrics in the UI" on Nov 6, 2023
@sorenlouv
Member

sorenlouv commented Nov 6, 2023

> we'll try and focus on your first suggestion - making it clear what each metric really means/where it comes from.

I don't understand why we would want to present the user with two different values for memory and cpu. Are there any good reasons for them to be different, other than they were captured through different means? If so, what are they? If we can clearly articulate the difference and when one would need to use one over the other, I can somewhat understand why we'd have both. If not I suggest we should use the metricbeat value, and use the APM agent value as fallback.

@roshan-elastic

> we'll try and focus on your first suggestion - making it clear what each metric really means/where it comes from.

> I don't understand why we would want to present the user with two different values for memory and cpu. Are there any good reasons for them to be different, other than they were captured through different means? If so, what are they? If we can clearly articulate the difference and when one would need to use one over the other, I can somewhat understand why we'd have both. If not I suggest we should use the metricbeat value, and use the APM agent value as fallback.

Hey @sqren, you're right - there isn't a need for them to be different from a user POV.

My main thinking here was whether we can really solve for this without significant work that we likely can't prioritise right now. Having said that, if you can think of a way to elegantly handle this without a lot of work - I'm happy for us to spend some time refining this to try.

I do like your idea, it's pretty smart. I do have a concern but let me check I understand first.

To recap your suggestion:

  • If a user is running metricbeat on all of the hosts...we show the metricbeat data in the 'metrics' tab (making it consistent with the 'infrastructure' tab and the infrastructure views in general)
  • If a user is not running metricbeat on the hosts, we show the APM data as a back-up...inconsistent but the user wouldn't know because the metricbeat data doesn't exist (so it's better than nothing)

My concern would be what happens if some of the hosts run metricbeat and some don't - what do we show in the 'metrics' tab?

@sorenlouv
Member

> To recap your suggestion:

Yeah, my thinking is that we first fetch the metric (cpu, memory) from the infra indices. If that doesn't yield any results we fetch from the apm indices. We can start doing this from within the APM app (we already have data clients to access infra and apm indices). The better solution would be to have this encapsulated somewhere (OAM?) so that we can just call a function getCpuForHost(hostId) and it will return the right value.
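
A minimal sketch of that infra-first lookup, under stated assumptions: the two lookup functions are hypothetical stand-ins for the infra and APM data clients, and they are synchronous only to keep the example short (real index queries would be asynchronous). Only the name `getCpuForHost` comes from the comment above; its shape here is a guess.

```typescript
// Hypothetical sketch: the lookups stand in for the infra and APM data clients.
type CpuLookup = (hostId: string) => number | undefined;

interface CpuReading {
  value: number;
  source: "infra" | "apm";
}

function makeGetCpuForHost(fromInfra: CpuLookup, fromApm: CpuLookup) {
  return function getCpuForHost(hostId: string): CpuReading | undefined {
    // Prefer the infra (metricbeat) indices.
    const infraValue = fromInfra(hostId);
    if (infraValue !== undefined) {
      return { value: infraValue, source: "infra" };
    }
    // Fall back to the APM indices only when infra yields nothing.
    const apmValue = fromApm(hostId);
    return apmValue === undefined ? undefined : { value: apmValue, source: "apm" };
  };
}

// Usage with in-memory stand-ins for the two index families (made-up values).
const infraCpu: Record<string, number> = { "host-a": 0.42 };
const apmCpu: Record<string, number> = { "host-a": 0.4, "host-b": 0.1 };
const getCpuForHost = makeGetCpuForHost(
  (id) => infraCpu[id],
  (id) => apmCpu[id]
);
```

Returning the source alongside the value keeps the origin of each number visible to the caller, which is the transparency this issue asks for.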

@sorenlouv
Member

sorenlouv commented Nov 8, 2023

> My concern would be what happens if some of the hosts run metricbeat and some don't - what do we show in the 'metrics' tab?

Yes, good point. I suggest that if we detect any metricbeat data for the selected service, we use that for all hosts. I think we should treat it as a configuration error if the customer has a service running across multiple hosts, and some but not all are running metricbeat.
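
As a rough sketch of that service-level decision (the `ServiceHost` type and `sourceForService` function are hypothetical names, not existing code):

```typescript
// Hypothetical types and names, for illustration only.
interface ServiceHost {
  name: string;
  hasMetricbeatData: boolean;
}

function sourceForService(hosts: ServiceHost[]): "metricbeat" | "apm" {
  // A single host reporting through metricbeat switches the whole service over.
  return hosts.some((h) => h.hasMetricbeatData) ? "metricbeat" : "apm";
}
```

Under this rule a service with partial metricbeat coverage is treated as a metricbeat service, which matches the "configuration error" framing above.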

@roshan-elastic

Hey @sqren, I like your thinking here...I think I got ahead of myself with the acceptance criteria here.

What do you think about me just deleting the acceptance criteria for now and you/the team/me would have time to think of possibilities during refinement?

That way, you have the freedom to propose some solutions and the acceptance criteria would be based on that?

@sorenlouv
Member

@roshan-elastic SGTM 👍

@crespocarlos
Contributor

@roshan-elastic @sqren

> If a user is not running metricbeat on the hosts, we show the APM data as a back-up...inconsistent but the user wouldn't know because the metricbeat data doesn't exist (so it's better than nothing)

Since APM data is inconsistent, wouldn't it make more sense to prompt users to install metricbeat or deploy an agent to those hosts?

@crespocarlos
Contributor

Also, the inconsistency will be evident when we integrate the Asset Details flyout in the Infra table?!

@sorenlouv
Member

sorenlouv commented Nov 22, 2023

> Since APM data is inconsistent, wouldn't it make more sense to prompt users to install metricbeat or deploy an agent to those hosts?

My intention was that if the user has metricbeat running for some hosts but not all, the hosts without metricbeat will not show up at all. We should only fall back to APM data, if there are no hosts with metricbeat data. We can improve this down the line by letting the user know that we have discovered hosts that do not have metricbeat - this should also take into account hosts discovered via other means than APM agents (eg filebeat).

@roshan-elastic

@sqren @crespocarlos

Playing this back for my understanding, for the 'infrastructure' and 'metrics' tabs in APM:

  1. If none of the APM-detected hosts run metricbeat, we only show APM data
  2. If all of the APM-detected hosts run metricbeat, we only show the metricbeat data
  3. If some of the APM-detected hosts run metricbeat, we'll only show those which are running metricbeat (and discard all hosts without it from both tabs)
  4. In the future, we can help users plug those gaps by prompting them on how to onboard hosts discovered via APM with metricbeat/system integration
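
The rules above could be sketched roughly as follows. All types and names are hypothetical; the `hidden` list is just one way to keep track of the hosts that a future onboarding prompt (rule 4) would target.

```typescript
// Hypothetical types and names; not taken from the Kibana codebase.
interface DetectedHost {
  name: string;
  runsMetricbeat: boolean;
}

interface HostSelection {
  source: "apm" | "metricbeat";
  shown: string[];
  // Hosts hidden from both tabs; candidates for a future onboarding prompt (rule 4).
  hidden: string[];
}

function selectHosts(detected: DetectedHost[]): HostSelection {
  const withBeat = detected.filter((h) => h.runsMetricbeat).map((h) => h.name);
  const withoutBeat = detected.filter((h) => !h.runsMetricbeat).map((h) => h.name);

  if (withBeat.length === 0) {
    // Rule 1: no host runs metricbeat, so show APM data for every host.
    return { source: "apm", shown: withoutBeat, hidden: [] };
  }
  // Rules 2 and 3: show only metricbeat hosts; with full coverage `hidden` is empty.
  return { source: "metricbeat", shown: withBeat, hidden: withoutBeat };
}
```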

Thoughts
If so, this does sound sensible to me, the only concern I would have would be around (3) where there is a mix of hosts which run metricbeat and those which don't (I don't have any numbers on how often this happens).

My worry is that once a user has at least 1 APM-detected host that runs metricbeat/agent, will they lose all of the metrics they previously had for the APM-detected hosts that are now being excluded?

Idea...
I'm wondering whether it might be worth pursuing still trying to leverage the APM data so there isn't a drastic difference from running no hosts with metricbeat vs running one - I'd imagine the APM metric data would be helpful (even if it doesn't match metricbeat perfectly).

e.g. as soon as we go to option (3), we still show the APM data but flag it, show them how to filter it out and also provide instructions on how to onboard them with elastic agent/metricbeat?
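
A minimal sketch of this alternative, where APM-only hosts stay visible but are flagged once the service has mixed coverage. The `HostRow` shape and both function names are assumptions made for illustration.

```typescript
// Hypothetical types and names, for illustration only.
interface ApmDetectedHost {
  name: string;
  runsMetricbeat: boolean;
}

interface HostRow {
  name: string;
  source: "metricbeat" | "apm";
  // True only when APM-sourced rows sit next to metricbeat-sourced ones.
  flagged: boolean;
}

function annotateHosts(detected: ApmDetectedHost[]): HostRow[] {
  const anyMetricbeat = detected.some((h) => h.runsMetricbeat);
  return detected.map((h): HostRow => ({
    name: h.name,
    source: h.runsMetricbeat ? "metricbeat" : "apm",
    flagged: anyMetricbeat && !h.runsMetricbeat,
  }));
}

// Lets the user filter the flagged (APM-only) rows out again.
function withoutFlagged(rows: HostRow[]): HostRow[] {
  return rows.filter((r) => !r.flagged);
}
```

The flagged rows are also the natural place to attach onboarding instructions for elastic agent/metricbeat.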

More complexity...Containers vs Hosts
One more added layer of complexity is how this all works with matching the host.hostname detected by APM to the actual host.name detected by elastic agent/beats etc? If the app is running in a container on a host, I'm wondering whether the host.hostname detected by APM will be the container name and won't match the host.name of the host (detected by beats/elastic agent - assuming they run agent on the host itself and not in the container).
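
To illustrate the concern, here is a small sketch of an exact-name join between the two sides. It assumes, as the question above wonders, that the APM side may report a container identifier; the plain string arrays are hypothetical stand-ins for the `host.hostname` values from APM documents and the `host.name` values from beats/agent documents.

```typescript
// Hypothetical sketch: string arrays stand in for field values from each data source.
interface HostMatch {
  matched: string[];
  unmatched: string[];
}

function matchApmHosts(apmHostnames: string[], infraHostNames: string[]): HostMatch {
  // Exact string comparison: a container name on the APM side never matches
  // the name of the machine it runs on.
  const matched = apmHostnames.filter((n) => infraHostNames.indexOf(n) !== -1);
  const unmatched = apmHostnames.filter((n) => infraHostNames.indexOf(n) === -1);
  return { matched, unmatched };
}
```

With made-up inputs such as `matchApmHosts(["web-01", "3f2a9c1d7b6e"], ["web-01", "worker-02"])`, the container-style name lands in `unmatched`, so any fallback rule keyed on the host name would treat it as a host without metricbeat.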

I'm not sure how this plays into the handling of everything...

I'm thinking a list of potential use cases would be quite helpful so we could map out what would happen?

Different teams for APM vs hosts: plugging the gap may not be quick
One thing to consider is that the team that wants to instrument with APM is usually different from the team that deploys agent/metricbeat to the hosts. So if there are some hosts with agent/metricbeat running on them and some without, I would imagine it will be hard to get that gap plugged, because different teams are involved.

@crespocarlos
Contributor

@roshan-elastic, @sqren

Wouldn't discarding hosts, as proposed in option 3, cause more confusion than solving the issue?

> I'm wondering whether it might be worth pursuing still trying to leverage the APM data so there isn't a drastic difference from running no hosts with metricbeat vs running one - I'd imagine the APM metric data would be helpful (even if it doesn't match metricbeat perfectly).

We also need to consider that we will soon integrate the asset details flyout into the Infrastructure table. So what we're discussing here will solve the mismatches in APM UI, but the problem will still exist in Infra UIs.

@botelastic

botelastic bot commented May 25, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

botelastic bot added the stale label (Used to mark issues that were closed for being stale) on May 25, 2024
@smith
Contributor

smith commented May 26, 2024

We're fixing this with an entity-based view, so closing this issue.

smith closed this as not planned (won't fix, can't repro, duplicate, stale) on May 26, 2024
@sorenlouv
Member

> We're fixing this with an entity-based view, so closing this issue.

Just curious: if we have two different CPU values for the same host, how will the entity model solve the problem of deciding which value to use?

Projects
None yet
Development

No branches or pull requests

6 participants