Content node taking too long to come back up after restart #32396
Comments
Do you use |
@hmusum Warning searchnode proton.persistence.util Slow processing of message Put (BucketId(0x300000000000000), id:default:docid, timestamp 1726503690000164, size 19266). processing time: 58.0 s (>=5.0) (Repeated 107 times since 1726503752.512184)
Let me know if you need more logs. |
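When many of these warnings are logged, it can help to pull out the processing time and repeat count to see how far above the 5.0 s threshold the node is running. The following is a minimal sketch that assumes the message text matches the warning quoted above; the function name `parse_slow_message` and the regular expression are illustrative and are not part of Vespa.

```python
import re

# Assumption: the message has the same shape as the warning quoted in this thread.
SLOW_MESSAGE = re.compile(
    r"Slow processing of message (?P<op>\w+) "
    r"\(BucketId\((?P<bucket>0x[0-9a-fA-F]+)\), "
    r".*?size (?P<size>\d+)\)\. "
    r"processing time: (?P<seconds>[\d.]+) s \(>=(?P<threshold>[\d.]+)\)"
    r"(?: \(Repeated (?P<repeats>\d+) times)?"
)


def parse_slow_message(line):
    """Return the fields of a slow-processing warning, or None if the line does not match."""
    match = SLOW_MESSAGE.search(line)
    if match is None:
        return None
    return {
        "op": match["op"],
        "bucket": match["bucket"],
        "size": int(match["size"]),
        "seconds": float(match["seconds"]),
        "threshold": float(match["threshold"]),
        # The "Repeated N times" suffix is optional; treat a missing one as 0.
        "repeats": int(match["repeats"]) if match["repeats"] else 0,
    }


if __name__ == "__main__":
    sample = (
        "Slow processing of message Put (BucketId(0x300000000000000), "
        "id:default:docid, timestamp 1726503690000164, size 19266). "
        "processing time: 58.0 s (>=5.0) "
        "(Repeated 107 times since 1726503752.512184)"
    )
    print(parse_slow_message(sample))
```

Running this over the full log and sorting by `seconds` would show whether the slow operations are concentrated in particular buckets or spread across the feed.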
Sounds like you have a setup that cannot handle the feed load. Hardware, local or remote disk, number of documents, the number and type of indexes, paged attributes, tensors/embeddings, and HNSW indexes are some of the things that might affect this. You might want to look into https://docs.vespa.ai/en/performance/sizing-feeding.html to see what you can do about this, if anything. Hard to say without knowing more about the application, schema details, HW etc. |
@hmusum Will reducing the number of schemas help? Also, a week ago reindexing succeeded once after a node restart, and around that time I found some deadlock-related entries in the logs from proton.deadlock.detector. Also when content node is restarted |
@hmusum @kkraune Reindexing has not completed for the schema in the last 2 weeks. The only way to keep ingestion working was to stop the reindexing using the reindex API. RAM usage on the host where the content node runs is around 60%, so we have significant RAM available as well. How do we find where the actual bottleneck is during reindexing? |
To avoid tx log replay on startup, let vespa-stop-services complete before stopping the container. How many documents are in this schema, do you use HNSW indexes (how many), and are the vector attributes paged? |
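The shutdown order described above can be sketched as a small script. This is only an illustration under stated assumptions: Vespa runs in a Docker container (the name `vespa` is hypothetical), `vespa-stop-services` is at the default install path `/opt/vespa/bin`, and the helper names are made up for this example.

```python
import subprocess

# Assumption: default Vespa install path inside the container.
VESPA_STOP = "/opt/vespa/bin/vespa-stop-services"


def clean_shutdown_commands(container):
    """Return the commands for a clean shutdown, in the order they must run."""
    return [
        # 1. Stop the Vespa services inside the container and wait for them.
        ["docker", "exec", container, VESPA_STOP],
        # 2. Only then stop the container itself.
        ["docker", "stop", container],
    ]


def clean_shutdown(container, run=subprocess.run):
    """Run each step to completion; abort if a step fails."""
    for command in clean_shutdown_commands(container):
        # subprocess.run blocks until the command exits, and check=True raises
        # CalledProcessError on a non-zero exit code, so the container is not
        # stopped if vespa-stop-services fails.
        run(command, check=True)


if __name__ == "__main__":
    clean_shutdown("vespa")  # hypothetical container name
```

The `run` parameter exists only so the sequence can be exercised without Docker. The point of the ordering is that the services get to finish shutting down before the container is stopped, rather than being cut off by the container runtime's stop timeout.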
@bratseth We stop the Vespa services before stopping the container. But can you help with why reindexing is getting stuck? All ingestion requests time out as well unless we stop the reindexing using the reindex API. |
Is it reindexing that is the problem, or is it replaying the tx log at startup? How many documents are in this schema, do you use HNSW indexes (how many), and are the vector attributes paged? |
@bratseth So we think reindexing is the root cause of all this and want to understand what is the bottleneck for reindexing and why it is getting stuck for one schema. |
Are the fields that are HNSW indexed paged? |
@bratseth No, none of the HNSW-indexed fields are paged. |
I really think your problem is that you don't do a clean shutdown, leading to tx log replay on startup. Building 2000 indexes will take time.
Did you do this? |
Describe the bug
We have a cluster with 1 container and 1 content node and almost 85 schemas. We have observed that when we restart the content node, or when it is restarted for any other reason, it takes a long time to come back up.
The cluster also stores and reads its data on an encrypted path.
vespa version : 8.388.11
Expected behavior
The content node should be up within a few minutes at most, not take hours.
Environment
Vespa version: 8.388.11
Additional context
I will share the logs separately.