Integrate the Pyroscope agent in the Cassandra/DSE builds to enable continuous profiling #462

Open
1 task
adejanovski opened this issue Mar 25, 2024 · 4 comments

Comments

@adejanovski
Contributor

adejanovski commented Mar 25, 2024

Flamegraphs are often the best (if not the only) way to properly identify what's causing performance issues in Cassandra.
Grafana Pyroscope is a continuous profiling database which allows displaying flamegraphs in Grafana and would be a great addition to our toolbelt.

We should add the Pyroscope Java agent to our builds, disabled by default (see the PYROSCOPE_AGENT_ENABLED env variable) and fully configurable through env variables.
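
As a rough illustration of the "disabled by default, enabled through an env variable" idea, here is a minimal sketch of how a startup wrapper might decide whether to attach the agent. The jar path and the PYROSCOPE_AGENT_JAR override are hypothetical names chosen for this example, not something the builds define today; the agent itself is assumed to read the rest of its settings from the environment.

```python
import os

# Hypothetical install location of the agent jar inside the image.
DEFAULT_AGENT_JAR = "/opt/pyroscope/pyroscope.jar"


def pyroscope_jvm_opts(env=None):
    """Return extra JVM options for the agent, or [] when it is disabled."""
    env = os.environ if env is None else env
    # Disabled unless the variable is explicitly set to a truthy value.
    enabled = env.get("PYROSCOPE_AGENT_ENABLED", "false").strip().lower()
    if enabled not in ("1", "true", "yes"):
        return []
    # PYROSCOPE_AGENT_JAR is an assumed override, used here for illustration.
    jar = env.get("PYROSCOPE_AGENT_JAR", DEFAULT_AGENT_JAR)
    return ["-javaagent:" + jar]


if __name__ == "__main__":
    # Print the options so a shell wrapper could append them to JVM_OPTS.
    print(" ".join(pyroscope_jvm_opts()))
```

With this shape, an image that never sets PYROSCOPE_AGENT_ENABLED starts Cassandra exactly as before, so the agent adds no overhead unless someone opts in.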

Definition of Done

┆Issue is synchronized with this Jira Story by Unito

@burmanm
Contributor

burmanm commented Mar 25, 2024

I don't think this provides users with anything interesting. What on earth would users do with thread profiling of Cassandra? It doesn't even reveal much useful information, given how Cassandra is architected.

If the user is a Cassandra developer, then perhaps they might get something useful out of it, but not otherwise.

@adejanovski
Contributor Author

> It doesn't even reveal much useful information, given how Cassandra is architected

My experience with diagnosing Cassandra performance issues contradicts this. It is VERY useful.
It can tell you whether compaction, GC, tombstones, etc. are killing your performance, even in cases where metrics and logs are misleading.

@Miles-Garnsey
Member

Miles-Garnsey commented Mar 25, 2024

Seconded, I've also used flame charts to diagnose performance problems.

My only reservation with this is that I think we'd want to have a good understanding of any performance impacts caused by running tracing continuously. It might be more interesting to sample traces periodically.

NB: if we had a service mesh we could be examining network traces too, which would possibly be even more useful...

@adejanovski
Contributor Author

> My only reservation with this is that I think we'd want to have a good understanding of any performance impacts caused by running tracing continuously. It might be more interesting to sample traces periodically.

Yeah, the impact of continuous profiling needs to be evaluated. I guess we can tune the profiling intervals to avoid profiling all the time.
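
To make the "avoid profiling all the time" idea concrete, here is a small sketch that computes when profiling would be active if it ran for a fixed window at the start of every period, and what fraction of time that covers. The function and its parameters are hypothetical and only illustrate the arithmetic; they are not part of the Pyroscope agent.

```python
def profiling_windows(period_s, window_s, total_s):
    """Return the (start, end) offsets of profiling windows and the duty cycle.

    Profiling is assumed to run for window_s seconds at the start of every
    period_s seconds, over a total span of total_s seconds.
    """
    if total_s <= 0:
        raise ValueError("total_s must be positive")
    if not 0 < window_s <= period_s:
        raise ValueError("window_s must be in (0, period_s]")
    windows = []
    start = 0
    while start < total_s:
        # Clip the last window so it never runs past the end of the span.
        windows.append((start, min(start + window_s, total_s)))
        start += period_s
    profiled = sum(end - begin for begin, end in windows)
    return windows, profiled / total_s


if __name__ == "__main__":
    # Profile for 30 s out of every 5 minutes, over 15 minutes.
    windows, duty = profiling_windows(period_s=300, window_s=30, total_s=900)
    print(windows, duty)
```

The duty cycle gives a first-order upper bound on how much of the profiler's overhead a node would see, which is the number that would need to be measured against real workloads.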

> NB: if we had a service mesh we could be examining network traces too, which would possibly be even more useful...

The service mesh is something we should explore to see what benefits we could get out of it (easy TLS orchestration being one) and what drawbacks it would impose (higher latencies being one).

3 participants