Plugin: System Latency Heat Map #1464

Open · 3 tasks done

tomfanara opened this issue Oct 2, 2024 · 11 comments

@tomfanara commented Oct 2, 2024

🔖 Summary

Building on the Backstage graph plugin interface, we are proposing a system latency heat map. OpenTelemetry traces would be used to track pub/sub messaging and service invocations and their latency. This kind of monitoring matters in distributed architectures: detecting contention by digging through Grafana and Prometheus dashboards is time consuming. The C4 model (https://c4model.com/) fits perfectly as the basis for a heat map overlay that lets engineers see how latency accumulates across the system. See attached diagram.

[Attached image: latency-heatmap diagram]

🌐 Project website (if applicable)

No response

✌️ Context

The plugin should take advantage of existing plugins as dependencies, such as the graph plugins that ship with Backstage. The system latency heat map is an overlay on top of the graph showing a gradient latency map. Other plugins, such as ones that use and expose OpenTelemetry metrics, are ideal for obtaining timing on endpoint-to-endpoint synchronous (REST) calls and asynchronous (pub/sub) messaging between components, dependencies, resources, and APIs. Beyond that first phase, phase 2 will add RAG-AI-predicted latency using a backpropagation model trained on known tolerable latencies. These could be annotated in the component, API, and resource catalog-info.yaml files (a rough sketch of such annotations is below). More to add as we discuss as a team!

Please come and join; everyone's input is invited, and let's have fun!
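To make the annotation idea concrete, here is a minimal sketch of a catalog-info.yaml carrying the proposed thresholds. The latencyheatmap/* annotation names and the component itself are illustrative, not a finalized schema:

```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service   # hypothetical component
  annotations:
    # Proposed plugin annotations; names are illustrative, not final.
    latencyheatmap/processingthreshold: 300ms  # acceptable processing time per request
    latencyheatmap/slo-p99: 500ms              # tolerable p99 latency for phase 2 predictions
spec:
  type: service
  lifecycle: production
  owner: team-payments
```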

👀 Have you spent some time to check if this plugin request has been raised before?

  • I checked and didn't find a similar issue

✍️ Are you willing to maintain the plugin?

🏢 Have you read the Code of Conduct?

Are you willing to submit a PR?

No, but I'm happy to collaborate on a PR with someone else

@tomfanara (Author)

I am thinking of starting simple: get the OpenTelemetry data first and create an alerts table to be added to a frontend card. A rough sketch of such a card is below.
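As a minimal sketch of what that card could look like, using InfoCard and Table from @backstage/core-components; the alert shape and props are placeholders, not a finished design:

```tsx
import React from 'react';
import { InfoCard, Table, TableColumn } from '@backstage/core-components';

// Placeholder shape for a latency alert; the real plugin would fill this
// from OpenTelemetry trace data.
type LatencyAlert = {
  service: string;
  endpoint: string;
  p99Ms: number;
  thresholdMs: number;
};

const columns: TableColumn<LatencyAlert>[] = [
  { title: 'Service', field: 'service' },
  { title: 'Endpoint', field: 'endpoint' },
  { title: 'p99 (ms)', field: 'p99Ms' },
  { title: 'Threshold (ms)', field: 'thresholdMs' },
];

export const LatencyAlertsCard = ({ alerts }: { alerts: LatencyAlert[] }) => (
  <InfoCard title="Latency alerts">
    <Table<LatencyAlert>
      options={{ search: false, paging: false }}
      columns={columns}
      data={alerts}
    />
  </InfoCard>
);
```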

@tomfanara (Author) commented Oct 3, 2024

Any thoughts from the community would be great. I can go solo, but I'd prefer to collaborate.

@nia-potato

Really hope this can take off. When I first looked at the catalog graph, I wondered whether we could do something like what you presented.

@Phiph commented Oct 5, 2024

I'd love to see this take off!

In my development with Backstage, I've enjoyed viewing it as a system that reads and presents data really well. Most of the config elements are chosen via catalog annotations by component owners, as with DevOps dashboards, GitHub Actions, and the New Relic dashboard.

What do you think about using catalog annotations to decide which service a component uses to report its status?

For example, we could add a processor that looks for the following annotation:

newrelic.com/APP_ID: 1129082

which is then used by a handler to call the New Relic app reporting API.
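A sketch of how that might look, assuming the CatalogProcessor interface from @backstage/plugin-catalog-node; the processor name is made up, and the log line stands in for whatever would actually call the New Relic API:

```ts
import { Entity } from '@backstage/catalog-model';
import {
  CatalogProcessor,
  CatalogProcessorEmit,
} from '@backstage/plugin-catalog-node';
import { LocationSpec } from '@backstage/plugin-catalog-common';

// Hypothetical processor that picks up the New Relic annotation during
// catalog processing and hands the app id off to a reporting client.
export class NewRelicAppProcessor implements CatalogProcessor {
  getProcessorName(): string {
    return 'NewRelicAppProcessor';
  }

  async postProcessEntity(
    entity: Entity,
    _location: LocationSpec,
    _emit: CatalogProcessorEmit,
  ): Promise<Entity> {
    const appId = entity.metadata.annotations?.['newrelic.com/APP_ID'];
    if (appId) {
      // Stand-in for calling the New Relic app reporting API (not shown).
      console.log(`entity ${entity.metadata.name} maps to New Relic app ${appId}`);
    }
    return entity;
  }
}
```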

@tomfanara (Author) commented Oct 5, 2024

Hi @Phiph, you must be psychic, LOL. Yes, we are planning to do it exactly that way: an annotation like latencyheatmap/processingthreshold: 300ms, i.e. the total acceptable processing time for one thread on a service (component). It will also report the consuming/providing response times between services for both pub/sub and REST invocations. You could set acceptable SLA, SLI, and SLO thresholds in the annotation or in app-config.

Services in a distributed architecture often accumulate latency, so it will be important to see it from a graph perspective.

So we are starting this weekend: setting up the repo in community-plugins, then building a simple frontend screen with an alerts card, and then looking at how we can get the OpenTelemetry data.

Thanks for your thoughts; more are welcome, and we need help!

@nia-potato

I'd also like to add a small suggestion on the UI side: instead of a green/red/yellow circle indicating the status of the connected entity, we could keep it more minimal and change the color of the edge and the entity box to indicate the entity's status based on telemetry.
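As a tiny sketch of that idea, a helper mapping measured latency against its threshold to a stroke color for the edge and entity box; the cut-offs and palette are placeholders:

```ts
// Map a measured latency to a color for the graph edge and entity box.
// Ratio cut-offs and hex palette are placeholders, not a final design.
export function latencyColor(latencyMs: number, thresholdMs: number): string {
  const ratio = latencyMs / thresholdMs;
  if (ratio < 0.75) return '#4caf50'; // comfortably under threshold
  if (ratio < 1.0) return '#ffb300';  // approaching threshold
  return '#e53935';                   // threshold breached
}
```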

@Phiph commented Oct 6, 2024

Ha, thanks @tomfanara. I don't primarily use Backstage for distributed microservice architectures, but there are components that call other components, which we map out using the catalog graph's [DependsOn] or [consumesAPI] relations, so being able to present a RAG (red/amber/green) status would be ideal for any user looking at the catalog.

I'm not too sure how I feel about letting component owners set the processingthreshold themselves. It gives teams their own choice of what green means, but from an organisation perspective you may want to set the bar. Maybe there should be sensible, config-driven defaults?

I also use Soundcheck in my instance of Backstage, so an API that can just give me this information for use in certification would also be of interest.

@tomfanara (Author) commented Oct 6, 2024

@Phiph, I have never used Soundcheck. I just checked it out, and it is something we (my company) also need to adopt. We will eventually use scorecards to see how our templates (scaffolding) are used and to govern standards.

Yes, I like your idea of a config with a default global tolerance setting such as conservative, moderate, or aggressive; a sketch of that shape follows below.
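A hypothetical app-config.yaml section for those defaults; all keys and preset values are illustrative, not a finalized schema:

```yaml
# Hypothetical app-config.yaml section for the latency heat map plugin.
latencyHeatmap:
  tolerance: moderate        # one of: conservative | moderate | aggressive
  presets:
    conservative:
      processingThresholdMs: 150
    moderate:
      processingThresholdMs: 300
    aggressive:
      processingThresholdMs: 600
```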

Also, I like @nia-potato's comments above, and I incorporated a rough look and feel in the diagram above. All good!

@tomfanara (Author) commented Oct 6, 2024

The following link shows various applications of RAG (retrieval-augmented generation): https://bizbrolly.com/practical-uses-of-retrieval-augmented-generation-rag-in-ai/. We would be a candidate for data analysis and reporting as a way of predicting latency. However, the first phase of latencyheatmap is to do query averaging on traces just to see real-time issues; a sketch of that averaging is below.
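As a sketch of that first-phase averaging, assuming spans have already been fetched from an OpenTelemetry backend into a simple array; the span shape here is a placeholder:

```ts
// Placeholder span shape; real spans would come from an OpenTelemetry
// backend such as Jaeger or Tempo.
type Span = {
  serviceName: string;
  durationMs: number;
};

// Average span duration per service over a window of traces.
export function averageLatencyByService(spans: Span[]): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const span of spans) {
    const entry = totals.get(span.serviceName) ?? { sum: 0, count: 0 };
    entry.sum += span.durationMs;
    entry.count += 1;
    totals.set(span.serviceName, entry);
  }
  return new Map(
    [...totals].map(([service, { sum, count }]) => [service, sum / count]),
  );
}
```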

Also, I think you could ask your service catalog which systems are in latency trouble and have it report back using LLMs for context, then use an SLM (fewer parameters) to search the predicted analytics, thus augmenting the data for retrieval.

This plugin can serve as a good knowledge share on how to apply RAG AI to systems analysis.

@tomfanara (Author)

As a result of observing concerning latency, we would then look at increasing replicas by scaling the microservice(s) with KEDA. KEDA is a horizontal scaling technology for Kubernetes that creates additional pods to handle throughput. There is also vertical scaling, by increasing memory or CPU cores. Typically, in microservices, unbounded thread pools or event loops scale themselves to the number of CPU cores. A minimal KEDA example is below.
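For illustration, a minimal KEDA ScaledObject that scales a deployment on a Prometheus latency metric; the deployment name, Prometheus address, and query are placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payment-service-scaler
spec:
  scaleTargetRef:
    name: payment-service        # placeholder deployment name
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090  # placeholder address
        # Scale out when p99 request latency exceeds 0.3s; query is illustrative.
        query: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="payment-service"}[5m])) by (le))
        threshold: "0.3"
```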

@vinzscam (Member) commented Oct 8, 2024

It looks like this would be very helpful. Would anyone like to be assigned?
