Assertion '_features_pos == no_skip.get_features_pos()' fails #32406

Open
lukasmoellerch opened this issue Sep 16, 2024 · 4 comments

lukasmoellerch commented Sep 16, 2024

Describe the bug
The following assertion fails:

void search::diskindex::Zc4PostingReaderBase::L1Skip::check(const search::diskindex::Zc4PostingReaderBase::NoSkipBase&, bool, bool): Assertion '_features_pos == no_skip.get_features_pos()' failed.
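
For context, a conceptual sketch (not Vespa's actual code) of the invariant the assertion enforces: the feature-file position recorded in an L1 skip entry must agree with the position reached by sequentially decoding the features.

#include <cassert>
#include <cstdint>

struct NoSkipBase {
    uint64_t features_pos = 0;  // position reached by sequential decoding
    uint64_t get_features_pos() const { return features_pos; }
};

struct L1Skip {
    uint64_t _features_pos = 0;  // position recorded when the skip entry was written
    void check(const NoSkipBase& no_skip) const {
        // fails when writer and reader disagree on where the features are
        assert(_features_pos == no_skip.get_features_pos());
    }
};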

To Reproduce
Unfortunately, I don't currently have an isolated way to reproduce this issue, and I cannot fully disclose the schema definition the data was indexed with. We saw this happen repeatedly on this setup with a specific dataset.

The general timeline was the following: we indexed several million documents into the cluster, and eventually the assertion failure occurred during a disk index fusion. We then upgraded the cluster to v8.398.4, removed the schema, and fed the same data into the cluster, seeing the same assertion failure happen once again after a couple million documents.

Expected behavior
It shouldn't fail

Environment (please complete the following information):

  • OS: Ubuntu 22.04
  • Infrastructure: Azure Virtual Machine with Kubernetes
  • Version: 22.04 (AKSUbuntu-2204gen2containerd-202406.07.0)

Vespa version
v8.398.4


@geirst geirst added this to the soon milestone Sep 17, 2024

toregge commented Sep 17, 2024

If the difference between the two values is a multiple of 4 Gi, it might be a 32-bit overflow of the feature position when writing a chunk of the posting list for that word. If you have a very common word that occurs many times in most documents (e.g. an indexed field that is an array of string or weighted set of string with many long values), such that the average feature length for that word is 2 KiB or more per document, then the total feature length for that word across 256Ki documents can exceed 512 MiB (4 Gib).
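
To illustrate the arithmetic, a minimal sketch (plain C++, not Vespa code), assuming, per the 4 Gib figure above, that feature positions are counted in bits: a 32-bit counter accumulating 2 KiB of features per document wraps exactly after 256Ki documents.

#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t docs = 256 * 1024;        // 256Ki documents
    const uint64_t bits_per_doc = 2048 * 8;  // 2 KiB of features per document, in bits

    uint32_t pos32 = 0;  // hypothetical 32-bit feature position
    uint64_t pos64 = 0;  // 64-bit reference
    for (uint64_t d = 0; d < docs; ++d) {
        pos32 += static_cast<uint32_t>(bits_per_doc);  // wraps modulo 2^32
        pos64 += bits_per_doc;
    }
    // prints: 64-bit pos: 4294967296 bits, 32-bit pos: 0 bits
    std::printf("64-bit pos: %llu bits, 32-bit pos: %u bits\n",
                (unsigned long long)pos64, pos32);
    return 0;
}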

The max-occurrences setting for the field can be used to limit the number of indexed occurrences of a word from each string, but it has limited effect when many strings are indexed for that document field, and it also affects the handling of phrases and near operators.
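
A sketch of how that setting would look in the schema (the field name and limit are placeholders; see the schema reference for exact placement):

field <field_name> type array<string> {
    indexing: index | summary
    match {
        # hypothetical limit on indexed occurrences of a word per string
        max-occurrences: 100
    }
}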

If the difference between the two values is not a multiple of 4 Gi, then the problem is likely elsewhere, e.g. something went wrong when reading features from the file, causing the decoding to get out of sync with the encoding, probably due to some unhandled corner case.
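
A quick way to apply this diagnostic to the two values from the assertion message (a standalone sketch; the positions are placeholders to fill in):

#include <cstdint>
#include <cstdio>

int main() {
    // placeholders -- fill in the two positions from the failed assertion
    uint64_t l1_pos = 0;       // _features_pos
    uint64_t no_skip_pos = 0;  // no_skip.get_features_pos()

    uint64_t diff = no_skip_pos > l1_pos ? no_skip_pos - l1_pos
                                         : l1_pos - no_skip_pos;
    // a 32-bit wrap-around shows up as a difference that is a multiple of 2^32
    bool wrap = diff != 0 && diff % (1ULL << 32) == 0;
    std::printf("diff=%llu, multiple of 4 Gi: %s\n",
                (unsigned long long)diff, wrap ? "yes" : "no");
    return 0;
}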

lukasmoellerch (Author) commented

Thanks @toregge for the quick response & the insights.

We reproduced the problem on the same data after switching from ext4 to xfs, and we have some details that might help with figuring this out: only one field seems to be affected, a metadata field of type array<string> with bm25 and bolding enabled:

indexfield[5].name <field_name>
indexfield[5].datatype STRING
indexfield[5].collectiontype ARRAY
indexfield[5].averageelementlen 512
indexfield[5].interleavedfeatures true
indexfield[5].prefix false
indexfield[5].phrases false
indexfield[5].positions true

Our index state is the following:

bash-4.4# ls -alh
total 24K
drwxr-xr-x  8 vespa vespa  140 Sep 18 09:19 .
drwxr-xr-x  6 vespa vespa   76 Sep 16 23:31 ..
drwxr-xr-x 20 vespa vespa 4.0K Sep 18 02:51 index.flush.26
drwxr-xr-x 20 vespa vespa 4.0K Sep 18 03:51 index.flush.27
drwxr-xr-x 20 vespa vespa 4.0K Sep 18 04:55 index.flush.28
drwxr-xr-x 20 vespa vespa 4.0K Sep 18 08:25 index.flush.29
drwxr-xr-x 20 vespa vespa 4.0K Sep 18 01:50 index.fusion.25
drwxr-xr-x 20 vespa vespa 4.0K Sep 18 08:54 index.fusion.29
bash-4.4# du -hs *
570M	index.flush.26
544M	index.flush.27
524M	index.flush.28
113M	index.flush.29
12G	index.fusion.25
13G	index.fusion.29

index.fusion.25 seems to be the last complete disk index. Trying to read index.fusion.29 with vespa-index-inspect showpostings --transpose --indexdir index.fusion.29 --field <field_name> --docidlimit 1 causes a different assertion to fail: Assertion 'static_cast<int64_t>(hhSize) <= fileSize' failed. This is very likely because 25 is broken and the fusion died while merging into 29 (for which it has to read 25).

Through vespa-index-inspect on index.fusion.25 we found out the following:

  • vespa-index-inspect dumpwords --indexdir index.fusion.25/ --field <field_name> runs successfully
  • The most common word appears 281085 times
  • vespa-index-inspect showpostings <common_word> --indexdir index.fusion.25 --field <field_name> --docidlimit 1 runs successfully

We also compiled a custom version of vespa-index-inspect with some added debug logs, our insights from that are the following:

  • vespa-index-inspect showpostings --transpose --indexdir index.fusion.25 --field <field_name> --docidlimit 1 fails:
L1Skip::check: _features_pos=41363409108, no_skip.get_features_pos()=41369242707
L1Skip::check: _zc_buf._valI=0x20891b9, _zc_buf._valE=0x209c3e4
vespa-index-inspect: /root/rpmbuild/BUILD/vespa-8.398.4/searchlib/src/vespa/searchlib/diskindex/zc4_posting_reader_base.cpp:108: void search::diskindex::Zc4PostingReaderBase::L1Skip::check(const search::diskindex::Zc4PostingReaderBase::NoSkipBase&, bool, bool): Assertion `_features_pos == no_skip.get_features_pos()' failed.
Aborted (core dumped)

The difference between _features_pos and no_skip.get_features_pos() (41369242707 - 41363409108 = 5833599, which is not a multiple of 4 Gi) suggests that this is not the integer overflow problem. I'll try to get vespa-index-inspect running under lldb; maybe that will give us more interesting insights.
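
For reference, a minimal way to run this under lldb (assuming debug symbols are available; the command line mirrors the failing invocation above):

lldb -- vespa-index-inspect showpostings --transpose --indexdir index.fusion.25 --field <field_name> --docidlimit 1
(lldb) run
(lldb) bt                 # backtrace after the abort
(lldb) frame select <N>   # pick the L1Skip::check frame from the backtrace
(lldb) frame variable     # inspect locals there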

geirst commented Sep 19, 2024

The 32-bit overflow problem described by @toregge is fixed by #32425. This is available in Vespa 8.411.13, which was just released to open source.

@lukasmoellerch Can you please try out this version to see if it fixes your problem? (Based on your detailed description, we are uncertain whether your problem is the same, but it is worth checking.)

lukasmoellerch (Author) commented

@geirst & @toregge - thank you so much for looking into this!

Running this under lldb didn't really add more value: it seems to be failing on some common (but not super common) word. The first ~28k words are totally fine; it's one specific word + doc_id causing this. Ignoring the assertion and continuing the read process also doesn't help; the reader goes completely out of control after this.

In the meantime we mitigated the issue on our side by restricting the amount of data we index into this metadata field (faulty behavior on our side had arguably blown up the field's average cardinality). We haven't hit any assertion failure since, but at the same time this makes reproduction more difficult. We'll upgrade to 8.411.13 as soon as possible and will report back in case we still encounter the assertion error.

I'd be fine with closing this in the meantime - I totally understand that it is difficult to further investigate based on the limited information that we are able to provide.
