Integrate the Pyroscope agent in the Cassandra/DSE builds to enable continuous profiling #462

adejanovski · 2024-03-25T10:33:21Z

Flamegraphs are often the best (if not the only) way to properly identify what's causing performance issues in Cassandra.
Grafana Pyroscope is a continuous profiling database which allows displaying flamegraphs in Grafana and would be a great addition to our toolbelt.

We should add the Pyroscope Java agent to our builds, disabled by default (see the PYROSCOPE_AGENT_ENABLED env variable) and fully configurable through env variables.
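
As a rough illustration of the "disabled by default" behaviour described above, here is a minimal sketch of startup logic that only loads the agent when PYROSCOPE_AGENT_ENABLED is explicitly set to a truthy value. It is not the project's actual implementation: the jar path and the function name `pyroscope_jvm_opts` are hypothetical, and the only thing assumed about the JVM is the standard `-javaagent:` option.

```python
import os

# Hypothetical location: where the build is assumed to place the agent jar.
AGENT_JAR = "/opt/pyroscope/pyroscope.jar"

# Values treated as "enabled"; anything else (or an unset variable) means disabled.
TRUTHY = {"1", "true", "yes", "on"}


def pyroscope_jvm_opts(env=None, agent_jar=AGENT_JAR):
    """Return the extra JVM options that load the agent, or [] when it is disabled."""
    env = os.environ if env is None else env
    # Default to "false" so the agent stays off unless the user opts in.
    value = env.get("PYROSCOPE_AGENT_ENABLED", "false").strip().lower()
    if value not in TRUTHY:
        return []
    return [f"-javaagent:{agent_jar}"]


if __name__ == "__main__":
    # Print the options so a startup script could append them to its JVM flags.
    print(" ".join(pyroscope_jvm_opts()))
```

With this shape, leaving the variable unset produces no extra JVM options, so the default build would behave exactly as it does today. The remaining agent settings would still come from whatever environment variables the agent itself reads.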

Definition of Done

The Pyroscope agent is added to our builds and disabled by default

┆Issue is synchronized with this Jira Story by Unito

burmanm · 2024-03-25T11:36:06Z

I don't think this provides users with anything interesting. What on earth would users do with thread profiling of Cassandra? It doesn't reveal much of useful information even, given how Cassandra is architected.

If the user is a Cassandra developer, then perhaps they might get something useful out of it, but not otherwise.

adejanovski · 2024-03-25T12:41:47Z

> It doesn't reveal much of useful information even, given how Cassandra is architected

My experience with diagnosing Cassandra performance issues contradicts this. It is VERY useful.
It can tell you if compaction is killing your performance, if it's GC, if it's tombstones, etc., even in cases where metrics and logs are misleading.

Miles-Garnsey · 2024-03-25T23:31:19Z

Seconded, I've also used flame charts to diagnose performance problems.

My only reservation with this is that I think we'd want to have a good understanding of any performance impacts caused by running tracing continuously. It might be more interesting to sample traces periodically.

NB: if we had a service mesh we could be examining network traces too, which would possibly be even more useful...

adejanovski · 2024-03-26T10:25:01Z

> My only reservation with this is that I think we'd want to have a good understanding of any performance impacts caused by running tracing continuously. It might be more interesting to sample traces periodically.

Yeah, the impact of continuous profiling needs to be evaluated. I guess we can tune the profiling intervals to avoid profiling all the time.

> NB: if we had a service mesh we could be examining network traces too, which would possibly be even more useful...

The service mesh is something we should explore to see what benefits we could get out of it (easy TLS orchestration being one) and what drawbacks it would impose on us (higher latencies being one).
