Assertion '_features_pos == no_skip.get_features_pos()' fails #32406
If the difference between the two values is a multiple of 4Gi, then it might be a 32-bit overflow of the feature position when writing a chunk of the posting list for that word. If you have a very common word that occurs many times in most documents (e.g. an indexed field that is an array of string or weighted set of string with many long values), such that the average feature length for that word is 2 KiB or more per document, then the total feature length for that word across 256Ki documents can become larger than 512 MiB (4 Gib, i.e. 2^32 bits). If the difference between the two values is not a multiple of 4Gi, then the problem is likely elsewhere, e.g. something went wrong when reading features from the file, causing the decoding to be out of sync with the encoding, probably due to some unhandled corner case.
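For concreteness, here is a minimal standalone sketch of the arithmetic described above (not Vespa's code; all names and constants are illustrative). Accumulating 2 KiB of features per document, counted in bits, across 256Ki documents lands exactly on 2^32, so a hypothetical 32-bit position counter wraps to 0 while a 64-bit one keeps the true value:

```cpp
// Illustrative only: shows how a 32-bit bit-offset counter wraps once the
// accumulated feature length reaches 4 Gib (2^32 bits).
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t features_pos_32 = 0;                 // hypothetical 32-bit counter
    uint64_t features_pos_64 = 0;                 // what a 64-bit counter would hold
    const uint64_t avg_features_bits = 2048 * 8;  // 2 KiB of features per document, in bits
    const uint64_t num_docs = 256 * 1024;         // 256Ki documents containing the word

    for (uint64_t doc = 0; doc < num_docs; ++doc) {
        features_pos_32 += avg_features_bits;     // silently wraps modulo 2^32
        features_pos_64 += avg_features_bits;
    }
    // 16 Kib/doc * 256Ki docs = 2^32 bits: the 32-bit counter is back at 0.
    printf("32-bit pos: %u, 64-bit pos: %llu\n",
           features_pos_32, (unsigned long long)features_pos_64);
    return 0;
}
```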
Thanks @toregge for the quick response & the insights. We reproduced the problem on the same data after switching from ext4 to xfs, and we have some details that might help with figuring this out: only one field seems to be affected, a metadata field of type
Our index state is the following:
bash-4.4# ls -alh
total 24K
drwxr-xr-x 8 vespa vespa 140 Sep 18 09:19 .
drwxr-xr-x 6 vespa vespa 76 Sep 16 23:31 ..
drwxr-xr-x 20 vespa vespa 4.0K Sep 18 02:51 index.flush.26
drwxr-xr-x 20 vespa vespa 4.0K Sep 18 03:51 index.flush.27
drwxr-xr-x 20 vespa vespa 4.0K Sep 18 04:55 index.flush.28
drwxr-xr-x 20 vespa vespa 4.0K Sep 18 08:25 index.flush.29
drwxr-xr-x 20 vespa vespa 4.0K Sep 18 01:50 index.fusion.25
drwxr-xr-x 20 vespa vespa 4.0K Sep 18 08:54 index.fusion.29
bash-4.4# du -hs *
570M index.flush.26
544M index.flush.27
524M index.flush.28
113M index.flush.29
12G index.fusion.25
13G index.fusion.29
We also compiled a custom version of vespa-index-inspect with some added debug logs, our insights from that are the following:
The difference between _features_pos and no_skip.get_features_pos() seems to suggest that this is not an integer overflow problem. I'll try to get
The 32-bit overflow problem described by @toregge is fixed by #32425. This is available in Vespa 8.411.13, which was just released to open source. @lukasmoellerch Can you please try out this version to see if it fixes your problem? (Based on your detailed description, we are uncertain if your problem is the same, but it is worth checking out)
@geirst & @toregge - thank you so much for looking into this! Running this with lldb didn't really add more value: it seems to be failing on some common (but not super common) word. The first ~28k words are totally fine; it's one specific word + doc_id causing this. Ignoring the assertion and continuing the read process also doesn't help: the reader goes completely out of control after this. In the meantime we mitigated the issue on our side by restricting the amount of data we index into this metadata field (faulty behavior on our side that arguably blew up the average cardinality of the field). We haven't hit any assertion failure since, but at the same time this makes reproduction more difficult. We'll upgrade to 8.411.13 as soon as possible and will report back in case we still encounter the assertion error. I'd be fine with closing this in the meantime - I totally understand that it is difficult to investigate further based on the limited information we are able to provide.
Describe the bug
The following assertion fails:
void search::diskindex::Zc4PostingReaderBase::L1Skip::check(const search::diskindex::Zc4PostingReaderBase::NoSkipBase&, bool, bool): Assertion '_features_pos == no_skip.get_features_pos()' failed.
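For readers unfamiliar with this kind of check, the following is a hedged sketch of the general pattern (not Vespa's actual implementation; the types and names are invented for illustration): the posting list is decoded along two paths, the skip structure and the sequential reader, and the assertion demands that both track the same bit offset into the features data. A mismatch means decoding has gone out of sync with encoding.

```cpp
// Invented illustration of the cross-check pattern; not Vespa source code.
#include <cassert>
#include <cstdint>

struct StreamCursor {
    uint64_t features_pos = 0;              // bit offset into the features data
    void advance(uint64_t bits) { features_pos += bits; }
};

// Mirrors the spirit of L1Skip::check(): the position recorded in the skip
// structure must equal the position the sequential (no-skip) reader reached.
void cross_check(const StreamCursor& l1_skip, const StreamCursor& no_skip) {
    assert(l1_skip.features_pos == no_skip.features_pos);
}
```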
To Reproduce
I sadly don't currently have a way to reproduce this issue in isolation, and I also can't fully disclose the schema definition the data was indexed with. We saw this happen repeatedly on this setup with a specific dataset.
The general timeline was the following: we indexed several million documents into the cluster, and eventually, during a disk index fusion, the assertion failure occurred. We then upgraded the cluster to v8.398.4, removed the schema, and fed the same data into the cluster, seeing the same assertion failure happen once again after a couple million documents.
Expected behavior
The assertion should not fail.
Environment (please complete the following information):
AKS node image: AKSUbuntu-2204gen2containerd-202406.07.0
Vespa version
v8.398.4