When a cn node in a k8s cluster crashes and restarts, non-transactional stream load starts to get stuck #49950

Closed
snippins opened this issue Aug 19, 2024 · 3 comments
Labels
type/bug Something isn't working

Comments


snippins commented Aug 19, 2024

Steps to reproduce the behavior (Required)

Wait for a cn node to restart or crash (manually evicting a cn node does not seem to cause this problem).

Expected behavior (Required)

Non-transactional stream loads work.

Real behavior (Required)

Non-transactional stream loads start to time out (I know this from checking the fe-proxy logs). At first they still work for some tables and stop working for others, then gradually all stream loads stop working, and then queries start to get slower.

In the fe-proxy logs, I saw that the fe might still be trying to send requests to the old IP of the restarted cn.

StarRocks version (Required)

3.3.1-2b87854

Workarounds:

For now, every time this happens, I have to manually evict the fe leader node for things to gradually return to normal.

I suspect this is related to #40229. In k8s environments the IPs are not static, so could that be causing the problem?

@snippins snippins added the type/bug Something isn't working label Aug 19, 2024
@kevincai kevincai self-assigned this Aug 19, 2024
@kevincai
Contributor

@snippins do you have detailed logs for this issue, such as the fe leader log and the fe-proxy log?

@snippins
Author

Sorry, we found the actual reason the cns were crashing: the default configuration uses 90% of the disk for cache, but sometimes 2 cns were started on the same k8s node, so a cn would crash because there was not enough disk space. We therefore applied pod affinity settings to avoid this. Since then no cn crashes have happened, so we did not investigate the stream load problem any further.
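
The disk arithmetic behind that crash can be sketched as follows. This is only an illustration: the 500 GiB disk size and the helper name `cache_demand_gib` are hypothetical, and the sketch assumes each cn sizes its cache as 90% of the node's total disk without accounting for the other cn.

```python
def cache_demand_gib(disk_gib: float, cn_count: int, cache_fraction: float = 0.9) -> float:
    """Total disk space the cn caches on one k8s node would try to claim.

    Assumption: every cn independently claims `cache_fraction` of the
    node's total disk, unaware of any other cn on the same node.
    """
    return disk_gib * cache_fraction * cn_count


disk_gib = 500.0  # hypothetical node disk size, not taken from the report

one_cn = cache_demand_gib(disk_gib, 1)
two_cns = cache_demand_gib(disk_gib, 2)

# One cn stays within the disk; two cns on the same node ask for more
# space than the disk has.
print(f"1 cn:  {one_cn:.0f} GiB wanted, fits: {one_cn <= disk_gib}")
print(f"2 cns: {two_cns:.0f} GiB wanted, fits: {two_cns <= disk_gib}")
```

Under these assumptions, any two cns on one node overcommit the disk regardless of its size, which is consistent with scheduling at most one cn per k8s node as the fix.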

@kevincai
Contributor

Thanks for the update.

Closing this issue for now.
