Should not read past end for CBOR string values #518

knutwannheden · 2024-10-12T05:55:39Z

In CBOR a string data item's initial byte (and optional additional information bytes) encode the string's length in number of UTF-8 bytes.

The parser should not have to buffer up more bytes than indicated by that length, as it must make sure to not read past that, as subsequent bytes might either not be available (end-of-stream) or should be treated as belonging to the next data item. The parser should throw an exception when the last byte(s) form an incomplete UTF-8 sequence.

knutwannheden · 2024-10-12T08:14:08Z

This PR causes the test ParseInvalidUTF8String236Test.testShortString236EndsWithPartialUTF8 to fail. So it looks like this code was added to deal with invalid CBOR data items.

In my use case this is causing a bit of an issue because the parser is trying to read beyond where it should, which I didn't think should be required with CBOR, as data items are self-terminating.

knutwannheden · 2024-10-12T09:00:05Z

I will have a look at that failing test, but I am wondering if it would not be better if the parser can attempt some kind of error recovery once it detects that a data item was incorrect. But given the current comment regarding surrogates, I was thinking that the code to buffer 3 extra bytes could just be deleted.

A bit more about my use case: I read CBOR data in a streaming manner straight from a socket and when reading a string as the last data item from the server, I've experienced that this operation can hang, because the server didn't send and data after that. So now the client is blocked and the server in turn is waiting for the client to send another request. I could solve this somehow at the application protocol level, but when reading the code and the comment regarding surrogate pairs, this looked like a bug to me.

cowtowncoder · 2024-10-13T04:32:04Z

Yes, the problem is the case of invalid UTF-8 content in which code points would suggest needing to read more bytes than what CBOR String indicates. The issue was found by OSSFuzz project fuzzer.
I agree that reads should never be done beyond actual indicated length (I think this is what you are saying) -- we just need to figure out how this can be done without compromising integrity of decoder.

As to error recovery: I think at first it's just important to enforce invariants (no reads past end) and UTF-8 decoder integrity (no partial/incomplete characters).

And the comment about surrogate pairs: that sounds wrong: surrogate pairs are wrt Java's in-memory char representation: and although JSON has weird rules about allowing pairs, UTF-8 for CBOR (and Smile) shouldn't really be concerned with those.
But the general point is relevant still, wrt minimum bytes needed being more than just 1.

knutwannheden · 2024-10-13T20:09:26Z

Yes, the problem is the case of invalid UTF-8 content in which code points would suggest needing to read more bytes than what CBOR String indicates. The issue was found by OSSFuzz project fuzzer.
I agree that reads should never be done beyond actual indicated length (I think this is what you are saying) -- we just need to figure out how this can be done without compromising integrity of decoder.

Yes, you understood me correctly. I agree that if the string data item indicated the wrong length (number of bytes), then the parser isn't required to be able to recover. I think we should only have to catch the ArrayIndexOutOfBoundsException and wrap it with a JsonParseException, whenever it occurs (as is the case in the failing test). As the UTF-8 code points are themselves also self-delimiting, we should never get any incomplete characters.

knutwannheden · 2024-10-13T20:17:39Z

Linking original issue: #316

knutwannheden · 2024-10-13T20:40:31Z

For an incorrect byte sequence like 0x61, 0xdb, 0xa0 where the first byte indicates that it is a string with one byte and the last byte is an empty map data item and the middle UTF-8 byte requires a second byte (where 0xa0 would be legal), the parser must throw an exception rather than just returning a single string data item. As far as I can tell the parser already has logic for this.

Adding an ArrayIndexOutOfBoundsException exception handler isn't the most elegant solution, but maybe it is acceptable for this rather rare situation.

knutwannheden · 2024-10-13T20:57:54Z

FWIW, cbor-x seems to be a popular CBOR implementation for JS and it looks like it suffers from the problem I describe. So the following code would end up printing a single data item and silently ignore the fact that the length of the string was actually wrong:

import { decodeMultiple } from 'cbor-x';
console.log(decodeMultiple(new Uint8Array([0x61, 0xdb, 0xa0])));

cowtowncoder · 2024-10-14T01:53:41Z

Quick note on ArrayIndexOutOfBoundsException: that unfortunately would not work in general, because:

While broken UTF-8 character might be the last thing in document, it is not necessarily at the end of input buffer (byte[]), so read would just be done (incorrectly) past the end of content but not outside array buffer
But more importantly, invalid character (and token) may be followed by other tokens

So the check needs to be done wrt end of content for the token.

knutwannheden · 2024-10-14T06:46:48Z

I've now "resurrected" that _reportTruncatedUTF8InString() method and copied the code that was already present for detecting the same type of issue with map keys.

I also added one more test case, so that we have one test case for the end-of-stream and one test case, where there are subsequent data items.

cowtowncoder · 2024-10-15T03:52:41Z

@knutwannheden Ok: looks good, would be happy to merge. Thank you a lot for providing the fix!

But 2 things before merging.

First: this is based against 2.19 branch which is ok (it is the current default branch for new 2.x versions). But it will be months until 2.19.0 is released -- would it make sense to re-base against 2.18 (2.18.1 will be released within 1-2 weeks)?
I can probably do this post-merge too (with cherry-picking) actually, if github UI doesn't allow this easily.

Second: before merging, I need to get CLA (unless you have already sent one earlier -- this is one-time thing). It's from https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf and the usual way is to print it, fill & sign, scan/photo, email to cla at fasterxml dot com.
Once this is in, I can proceed (to whatever branch is decided as target).

Looking forward to merging this fix!

knutwannheden · 2024-10-15T04:02:25Z

If we can get this into 2.18.1, that would be even better. I will rebase and force push the PR. I will also sign the CLA.

Thanks for your help on this!

cowtowncoder · 2024-10-15T04:04:53Z

Perfect!

The string parsing now no longer reads more bytes than indicated by the data item. For any incomplete UTF-8 code points an exception with a corresponding message is thrown.

knutwannheden · 2024-10-17T13:15:32Z

I've now changed the merge branch of this PR (never used the edit button of the PR to do that before) and I've also sent an email with a signed CLA.

cowtowncoder · 2024-10-18T04:00:41Z

Quick note: CLA received so that's good.

Hoping to do final code review tomorrow, get PR merged.

knutwannheden changed the title ~~Only read bytes required to complete CBOR string~~ No reading past end for CBOR strings Oct 14, 2024

cowtowncoder approved these changes Oct 15, 2024

View reviewed changes

cowtowncoder added the cbor label Oct 15, 2024

Only read number of bytes indicated by string data item

5bfa87a

The string parsing now no longer reads more bytes than indicated by the data item. For any incomplete UTF-8 code points an exception with a corresponding message is thrown.

knutwannheden force-pushed the text-byte-length branch from 5581a07 to 5bfa87a Compare October 17, 2024 13:10

knutwannheden changed the base branch from 2.19 to 2.18 October 17, 2024 13:14

cowtowncoder changed the title ~~No reading past end for CBOR strings~~ Should not read past end for CBOR string values Oct 19, 2024

cowtowncoder merged commit 265b3c4 into FasterXML:2.18 Oct 19, 2024
4 checks passed

cowtowncoder added a commit that referenced this pull request Oct 19, 2024

Update release notes wrt #518

fcbf9ec

cowtowncoder added this to the 2.18.1 milestone Oct 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should not read past end for CBOR string values #518

Should not read past end for CBOR string values #518

knutwannheden commented Oct 12, 2024 •

edited

Loading

knutwannheden commented Oct 12, 2024

knutwannheden commented Oct 12, 2024

cowtowncoder commented Oct 13, 2024

knutwannheden commented Oct 13, 2024 •

edited

Loading

knutwannheden commented Oct 13, 2024

knutwannheden commented Oct 13, 2024

knutwannheden commented Oct 13, 2024

cowtowncoder commented Oct 14, 2024

knutwannheden commented Oct 14, 2024

cowtowncoder commented Oct 15, 2024

knutwannheden commented Oct 15, 2024

cowtowncoder commented Oct 15, 2024

knutwannheden commented Oct 17, 2024

cowtowncoder commented Oct 18, 2024

Should not read past end for CBOR string values #518

Should not read past end for CBOR string values #518

Conversation

knutwannheden commented Oct 12, 2024 • edited Loading

knutwannheden commented Oct 12, 2024

knutwannheden commented Oct 12, 2024

cowtowncoder commented Oct 13, 2024

knutwannheden commented Oct 13, 2024 • edited Loading

knutwannheden commented Oct 13, 2024

knutwannheden commented Oct 13, 2024

knutwannheden commented Oct 13, 2024

cowtowncoder commented Oct 14, 2024

knutwannheden commented Oct 14, 2024

cowtowncoder commented Oct 15, 2024

knutwannheden commented Oct 15, 2024

cowtowncoder commented Oct 15, 2024

knutwannheden commented Oct 17, 2024

cowtowncoder commented Oct 18, 2024

knutwannheden commented Oct 12, 2024 •

edited

Loading

knutwannheden commented Oct 13, 2024 •

edited

Loading