A note for the community
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
Often we see that one or two of the Vector pods running as agents spike in CPU and memory. Looking at the Kubernetes events, we see that the pod was OOMKilled. (Vector is basically not able to handle even a small spike in load.)
When the pod comes back up, its CPU usage is very high even though it is not pushing any logs via the kafka sink, and the disk buffer keeps growing. The buffer size is still well under the configured maximum, and there is still disk space available on the underlying AWS EC2 instance.
In the Vector logs I can see this error:
2024-06-12T00:41:37.459225Z ERROR sink{component_kind="sink" component_id=kafka component_type=kafka}: vector_buffers::variants::disk_v2::writer: Last written record was unable to be deserialized. Corruption likely. reason="invalid data: check failed for struct member payload: pointer out of bounds: base 0x7fb675fffff4 offset 1868849526 not in range 0x7fb673e1f000..0x7fb676000000"
To sum it up:
There was a slight increase in load, which caused Vector to get OOMKilled for some reason
When Vector comes back up, it logs the deserialization error shown above
The disk buffer keeps growing once Vector is back online
Vector's CPU usage is far too high even though it is not processing anything
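For concreteness, the setup described above is a kafka sink draining Kubernetes logs through a disk buffer. The reporter's actual configuration is not included in the issue, so the following is only a minimal sketch of that kind of setup; the component names, broker address, topic, and buffer size are placeholder assumptions.

# Minimal sketch of a kafka sink with a disk buffer (hypothetical values,
# not the reporter's actual configuration)
sources:
  kubernetes_logs:
    type: kubernetes_logs          # agent-style log collection from the node

sinks:
  kafka:
    type: kafka
    inputs:
      - kubernetes_logs
    bootstrap_servers: "kafka-broker:9092"   # placeholder broker address
    topic: "vector-logs"                     # placeholder topic
    encoding:
      codec: json
    buffer:
      type: disk             # the disk_v2 buffer referenced in the error above
      max_size: 1073741824   # 1 GiB cap (placeholder); the report says usage stayed well below the configured max
      when_full: block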
Configuration
Version
0.38.0
Debug Output
No response
Example Data
No response
Additional Context
Vector is running as an agent on EKS with Kubernetes version 1.29.
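For readers less familiar with the failure mode: OOMKilled means the container was killed for exceeding its memory limit, which is why even a short spike in load is enough to restart the agent. The limits used on this cluster are not given in the report; a hypothetical resource block for the agent container might look like this.

# Hypothetical resource limits for the Vector agent container;
# the actual values used in this cluster are not in the report
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi   # exceeding this limit produces the OOMKilled event described above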
Attaching Screenshots
References
No response
Thanks for opening this. I think it is the same issue as #18336 so I'll close this one as a duplicate, but I appreciate all of the additional details you provided in your report!