
Content node taking too long to come back up after restart #32396

Open
shubh9194 opened this issue Sep 13, 2024 · 12 comments

Comments

@shubh9194

Describe the bug
We have a cluster with 1 container node and 1 content node, and almost 85 schemas. We have observed that when we restart the content node, or when it is restarted for any reason, it takes a long time to come back up.
The cluster also stores and reads data from an encrypted path.

Vespa version: 8.388.11

Expected behavior
The content node should be back up within a few minutes at most, not hours.


Environment:

  • Infrastructure: Podman

Vespa version
8.388.11

Additional context

Will share the logs separately.

@hmusum
Member

hmusum commented Sep 13, 2024

Do you use vespa-stop-services when you restart? Read the doc about vespa-stop-services and the prepareRestart command, which might reduce startup time. If you do that and see that prepareRestart times out, Vespa 8.406.26 and later have a higher timeout for this command, as we have seen that content nodes with a lot of data might need more time for this.
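For reference, a minimal sketch of the restart sequence described here, assuming a standard self-hosted install with the Vespa tools on the PATH inside the node:

```sh
# Let this finish before stopping or restarting the container/host; it runs
# prepareRestart, which flushes in-memory structures so there is little
# transaction log left to replay on the next start.
vespa-stop-services

# ...restart the container/host...

vespa-start-services
```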

@shubh9194
Author

shubh9194 commented Sep 16, 2024

@hmusum
I am updating the Vespa version to the one you suggested.
In the meantime, can you please help with this as well?
Reindexing is also taking a lot of time, and sometimes it seems to be stuck with no visible progress.
In the content node logs, I don't see any errors, but I do see a lot of warnings like the one below:

Warning searchnode proton.persistence.util Slow processing of message Put (BucketId(0x300000000000000), id:default:docid, timestamp 1726503690000164, size 19266). processing time: 58.0 s (>=5.0) (Repeated 107 times since 1726503752.512184)

Let me know if you need more logs.

@hmusum
Member

hmusum commented Sep 16, 2024

Sounds like you have a setup that cannot handle the feed load.

Hardware, local or remote disk, the number of documents, the number and type of indexes, paged attributes, tensors/embeddings, and HNSW indexes are some of the things that can affect this. You might want to look into https://docs.vespa.ai/en/performance/sizing-feeding.html to see what you can do about it. It is hard to say more without knowing the application, schema details, hardware, etc.

@shubh9194
Author

shubh9194 commented Sep 17, 2024

@hmusum
We have a single container node and a single content node, on different hosts.
The content node host has 62 GB RAM and around 400 GB of disk.
Right now we have around 90 schemas deployed in the cluster, of which only around 30 have documents so far.
There are around 1.5M documents in total, and one schema holds more than 70% of them.

Will reducing the number of schemas help?

Also, a week back reindexing succeeded once after a node restart, and around that time I found some deadlock-related entries in the logs:

proton.deadlock.detector warning: Thread Modified bucket checker 0x10308644200 has gone 296151 milliseconds without registering a tick. Global slack not expended yet. Warning for now. Attempting to dump stack of thread... Backtrace:
/lib64/libc.so.6(syscall+0x1d) [0x7ff21aa919bd]
/lib64/libstdc++.so.6(std::__atomic_futex_unsigned_base::_M_futex_wait_until(unsigned int*, unsigned int, bool, std::chrono::duration<long, std::ratio<1l, 1l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)+0xe1) [0x7ff21b4777c1]
/opt/vespa/sbin/vespa-proton-bin(proton::PersistenceEngine::getModifiedBuckets(document::BucketSpace) const+0x454) [0x5ffe24]
/opt/vespa/lib64/libstorage.so(storage::ModifiedBucketChecker::requestModifiedBucketsFromProvider(document::BucketSpace)+0x32) [0x7ff229234932]
/opt/vespa/lib64/libstorage.so(storage::ModifiedBucketChecker::tick()+0x1a3) [0x7ff229234d3]
/opt/vespa/lib64/libstorage.so(storage::ModifiedBucketChecker::run(storage::framework::ThreadHandle&)+0xdb) [0x7ff229234cb]
/opt/vespa/lib64/libstorage.so(storage::framework::defaultimplementation::ThreadImpl::run()+0x20) [0x7ff22937b390]
/lib64/libstdc++.so.6(+0xc2b23) [0x7ff21b479b23]
/opt/vespa/lib64/vespa/malloc/libvespamalloc.so(+0x94de) [0x7ff22b2f64de]
/lib64/libpthread.so.0(+0x81ca)
/lib64/libc.so.6(clone+0x43) [0x7ff21aa91e73]

Also, when the content node is restarted, the transaction log replay progress shown in the logs is very slow. We also see high CPU usage that persists until the content node is up: CPU usage jumps from around 10% to 60% or even 80%, along with a high page-out rate.
Let me know if you need any more logs or info.
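A minimal way to follow that replay progress from the host, assuming the container is named "vespa" (an assumed name) and the default log location $VESPA_HOME/logs/vespa/vespa.log:

```sh
# Tail the Vespa log inside the container and filter for replay messages;
# the exact wording of the messages varies between Vespa versions.
podman exec vespa tail -F /opt/vespa/logs/vespa/vespa.log | grep -i replay
```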

@shubh9194
Author

shubh9194 commented Sep 18, 2024

@hmusum @kkraune
We have reduced the number of schemas to fewer than 50 and updated the Vespa version to 8.406.26.
Previously, to make sure the cluster kept working, we used the reindex API to stop the reindexing of one of the schemas.
After upgrading, I started the reindexing again, but it always gets stuck and we don't see any progress.
During that time, CPU usage and the page-out rate are very high.
And if we stop the Vespa process and the Podman container during this time and try to restart them, the Vespa services take a long time to come back up, as transaction log replay is very slow for the schema that has the most documents.

Reindexing for that schema has not completed in the last 2 weeks. The only way to keep ingestion working was to stop the reindexing using the reindex API.
We don't see any errors in the logs, only a lot of warnings like:
Warning searchnode proton.persistence.util Slow processing of message Put (BucketId(0x300000000000000), id:default:docid, timestamp 1726503690000164, size 19266). processing time: 58.0 s (>=5.0) (Repeated 107 times since 1726503752.512184)

RAM usage on the host running the content node is around 60%, so we do have significant RAM available as well.

How do we find where the actual bottleneck is during reindexing?
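One way to inspect reindexing progress is the reindexing status endpoint on the config server; a hedged sketch, assuming the default tenant/application/instance names and the config server on port 19071:

```sh
# Returns per-cluster, per-document-type reindexing status and progress.
curl -s "http://localhost:19071/application/v2/tenant/default/application/default/environment/prod/region/default/instance/default/reindexing"
```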

@bratseth
Member

To avoid tx replay on startup, let vespa-stop-services complete before stopping the container.

How many documents are in this schema, do you use HNSW indexes (how many), and are the vector attributes paged?
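A sketch of that ordering from the host side, assuming the node runs in a Podman container named "vespa" (an assumed name):

```sh
# Let vespa-stop-services finish inside the container first, then stop the
# container with a generous timeout so nothing is killed mid-flush.
podman exec vespa vespa-stop-services
podman stop -t 300 vespa
```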

@shubh9194
Author

@bratseth We stop the Vespa services before stopping the container.
Will confirm whether the Vespa services are stopped properly.

But can you help with why reindexing is getting stuck? All ingestion is timing out as well unless we stop the reindexing using the reindex API.
During reindexing we see the logs filled with this warning message:
Warning searchnode proton.persistence.util Slow processing of message Put (BucketId(0x300000000000000)

@bratseth
Member

Is it reindexing that is the problem, or is it replaying the tx log at startup?

How many documents are in this schema, do you use HNSW indexes (how many), and are the vector attributes paged?

@shubh9194
Author

shubh9194 commented Sep 19, 2024

@bratseth
We observed that this issue started with reindexing, and the tx log startup issue is also seen only when the node is restarted during reindexing.
At any other time, after a node restart the node comes back up in a few minutes.

So we think reindexing is the root cause of all this, and we want to understand what the bottleneck for reindexing is and why it gets stuck for one schema.
That schema has around 1.36M documents.
And yes, this schema has more than 2000 index fields and a good number of paged attributes.

@bratseth
Member

Are the fields that are HNSW indexed paged?

@shubh9194
Author

@bratseth No, none of the HNSW-indexed fields are paged.

@bratseth
Member

bratseth commented Nov 6, 2024

I really think your problem is that you don't do a clean shutdown, leading to tx log replay on startup. Building 2000 indexes will take time.

Will confirm whether the Vespa services are stopped properly.

Did you do this?
