Overlay execution info messages in timeline view #3429
Closed
hamersaw
started this conversation in
RFC Incubator
cc @pradithya regarding this issue - I think this solution, in combination with the runtime metrics integration, would achieve your vision. Thoughts?
Motivation
The timeline UI view is marginally useful for debugging performance, but has a lot of room for improvement. Integrating the runtime metrics breakdown proposed in the performance observability RFC is a step in the right direction, partitioning node executions into a collection of categorized time-series. This representation will help with the "what" but misses a lot of the "why". For example, if a particular execution has a large amount of frontend plugin overhead, this means that Flyte started the Task but the backend service has not yet reported that it has started. K8s gurus will be quick to identify that there may be scheduling contention, large image pull times, or a few other likely scenarios. However, this explanation is not easily available to the user, even though FlytePropeller has the information. We currently store a singular "reason" for the current execution status, but may be better off tracking a time-series of reasons to better explain the execution.
Proposal
This proposal outlines a solution for overlaying a collection of human-readable messages in the timeline view. The exact representation is VERY open for debate, but I envision something similar to Jaeger (time-series telemetry data with events), which uses a single tick mark that displays a message on hover. This solution supplies the "why", an explanation of the reported execution status, to complement the "what" in the runtime breakdown of the execution time-series. The goal will be to balance utility with simplicity, displaying a "useful" number of messages to improve context.
Implementation
Currently, FlyteAdmin maintains a singular "reason" within the task execution metadata. This is updated in-place on each event from FlytePropeller, meaning the old "reasons" are not persisted. At the risk of over-simplifying this, we will need to transition to maintaining a collection of "reasons" with associated timestamps. This will require updates in the following repositories:
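To make the shift from a single "reason" to a timestamped collection concrete, here is a minimal sketch in Python. It is not Flyte code: `Reason`, `TaskExecutionMetadata`, `apply_event`, and the example messages are hypothetical names chosen for illustration, and dropping consecutive duplicate messages is an assumption about how to keep the number of messages "useful".

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Reason:
    """One timestamped, human-readable explanation (hypothetical type)."""
    occurred_at: datetime
    message: str


@dataclass
class TaskExecutionMetadata:
    """Hypothetical stand-in for the task execution metadata kept by FlyteAdmin."""
    # Current behavior: a single value, overwritten on every event.
    reason: Optional[str] = None
    # Proposed behavior: an append-only history of reasons with timestamps.
    reasons: List[Reason] = field(default_factory=list)

    def apply_event(self, occurred_at: datetime, message: str) -> None:
        # Keep the latest message in the singular field for existing readers.
        self.reason = message
        # Assumption: skip consecutive duplicates so repeated events
        # do not flood the timeline with identical messages.
        if self.reasons and self.reasons[-1].message == message:
            return
        self.reasons.append(Reason(occurred_at, message))


if __name__ == "__main__":
    start = datetime(2023, 1, 1, tzinfo=timezone.utc)
    meta = TaskExecutionMetadata()
    # Illustrative messages only; real messages would come from FlytePropeller.
    meta.apply_event(start, "Unschedulable: insufficient cpu")
    meta.apply_event(start + timedelta(seconds=30), "Pulling image")
    for r in meta.reasons:
        print(r.occurred_at.isoformat(), r.message)
```

With the singular field alone, only "Pulling image" would survive; the list keeps both entries, which is what a timeline overlay would need to draw one tick mark per message.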
Open Questions
It is currently possible to skip phases if the execution progresses before FlytePropeller detects and processes the intermediate stage.
We could use event buffers to send multiple events in one update -> probably the better solution.
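The event buffer idea can be sketched as follows. This is only an illustration under stated assumptions: `ReasonBuffer`, `record`, `flush`, and the `send` callback are hypothetical names, not part of FlytePropeller or FlyteAdmin, and where such a buffer would live is left open. The point is that every observed message is queued with its own timestamp and delivered as one batch, so an intermediate stage is not lost when the execution moves on before the next update is sent.

```python
from datetime import datetime, timedelta
from typing import Callable, List, Tuple

# (timestamp, message) pair; a hypothetical shape for a buffered reason.
Entry = Tuple[datetime, str]


class ReasonBuffer:
    """Collects reasons between updates and sends them as a single batch."""

    def __init__(self, send: Callable[[List[Entry]], None]) -> None:
        self._send = send
        self._pending: List[Entry] = []

    def record(self, occurred_at: datetime, message: str) -> None:
        # Queue the message instead of overwriting the previous one.
        self._pending.append((occurred_at, message))

    def flush(self) -> int:
        """Send all queued entries in timestamp order; return how many were sent."""
        if not self._pending:
            return 0
        batch = sorted(self._pending, key=lambda entry: entry[0])
        self._send(batch)
        # Cleared only after send returns, so a failed send keeps the entries.
        self._pending = []
        return len(batch)


if __name__ == "__main__":
    delivered: List[List[Entry]] = []
    buffer = ReasonBuffer(delivered.append)
    start = datetime(2023, 1, 1)
    # Illustrative messages only.
    buffer.record(start, "Unschedulable: insufficient cpu")
    buffer.record(start + timedelta(seconds=30), "Pulling image")
    print(buffer.flush(), "entries sent in", len(delivered), "batch")
```

Sorting by timestamp before sending is a design choice made here so the receiver can append entries without reordering; a real implementation might rely on arrival order instead.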