Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should not read past end for CBOR string values #518

Merged
merged 1 commit into from
Oct 19, 2024

Conversation

knutwannheden
Copy link
Contributor

@knutwannheden knutwannheden commented Oct 12, 2024

In CBOR a string data item's initial byte (and optional additional information bytes) encode the string's length in number of UTF-8 bytes.

The parser should not have to buffer up more bytes than indicated by that length, as it must make sure to not read past that, as subsequent bytes might either not be available (end-of-stream) or should be treated as belonging to the next data item. The parser should throw an exception when the last byte(s) form an incomplete UTF-8 sequence.

@knutwannheden
Copy link
Contributor Author

This PR causes the test ParseInvalidUTF8String236Test.testShortString236EndsWithPartialUTF8 to fail. So it looks like this code was added to deal with invalid CBOR data items.

In my use case this is causing a bit of an issue because the parser is trying to read beyond where it should, which I didn't think should be required with CBOR, as data items are self-terminating.

@knutwannheden
Copy link
Contributor Author

I will have a look at that failing test, but I am wondering if it would not be better if the parser can attempt some kind of error recovery once it detects that a data item was incorrect. But given the current comment regarding surrogates, I was thinking that the code to buffer 3 extra bytes could just be deleted.

A bit more about my use case: I read CBOR data in a streaming manner straight from a socket and when reading a string as the last data item from the server, I've experienced that this operation can hang, because the server didn't send and data after that. So now the client is blocked and the server in turn is waiting for the client to send another request. I could solve this somehow at the application protocol level, but when reading the code and the comment regarding surrogate pairs, this looked like a bug to me.

@cowtowncoder
Copy link
Member

Yes, the problem is the case of invalid UTF-8 content in which code points would suggest needing to read more bytes than what CBOR String indicates. The issue was found by OSSFuzz project fuzzer.
I agree that reads should never be done beyond actual indicated length (I think this is what you are saying) -- we just need to figure out how this can be done without compromising integrity of decoder.

As to error recovery: I think at first it's just important to enforce invariants (no reads past end) and UTF-8 decoder integrity (no partial/incomplete characters).

And the comment about surrogate pairs: that sounds wrong: surrogate pairs are wrt Java's in-memory char representation: and although JSON has weird rules about allowing pairs, UTF-8 for CBOR (and Smile) shouldn't really be concerned with those.
But the general point is relevant still, wrt minimum bytes needed being more than just 1.

@knutwannheden
Copy link
Contributor Author

knutwannheden commented Oct 13, 2024

Yes, the problem is the case of invalid UTF-8 content in which code points would suggest needing to read more bytes than what CBOR String indicates. The issue was found by OSSFuzz project fuzzer.
I agree that reads should never be done beyond actual indicated length (I think this is what you are saying) -- we just need to figure out how this can be done without compromising integrity of decoder.

Yes, you understood me correctly. I agree that if the string data item indicated the wrong length (number of bytes), then the parser isn't required to be able to recover. I think we should only have to catch the ArrayIndexOutOfBoundsException and wrap it with a JsonParseException, whenever it occurs (as is the case in the failing test). As the UTF-8 code points are themselves also self-delimiting, we should never get any incomplete characters.

@knutwannheden
Copy link
Contributor Author

Linking original issue: #316

@knutwannheden
Copy link
Contributor Author

For an incorrect byte sequence like 0x61, 0xdb, 0xa0 where the first byte indicates that it is a string with one byte and the last byte is an empty map data item and the middle UTF-8 byte requires a second byte (where 0xa0 would be legal), the parser must throw an exception rather than just returning a single string data item. As far as I can tell the parser already has logic for this.

Adding an ArrayIndexOutOfBoundsException exception handler isn't the most elegant solution, but maybe it is acceptable for this rather rare situation.

@knutwannheden
Copy link
Contributor Author

FWIW, cbor-x seems to be a popular CBOR implementation for JS and it looks like it suffers from the problem I describe. So the following code would end up printing a single data item and silently ignore the fact that the length of the string was actually wrong:

import { decodeMultiple } from 'cbor-x';
console.log(decodeMultiple(new Uint8Array([0x61, 0xdb, 0xa0])));

@cowtowncoder
Copy link
Member

Quick note on ArrayIndexOutOfBoundsException: that unfortunately would not work in general, because:

  1. While broken UTF-8 character might be the last thing in document, it is not necessarily at the end of input buffer (byte[]), so read would just be done (incorrectly) past the end of content but not outside array buffer
  2. But more importantly, invalid character (and token) may be followed by other tokens

So the check needs to be done wrt end of content for the token.

@knutwannheden
Copy link
Contributor Author

I've now "resurrected" that _reportTruncatedUTF8InString() method and copied the code that was already present for detecting the same type of issue with map keys.

I also added one more test case, so that we have one test case for the end-of-stream and one test case, where there are subsequent data items.

@knutwannheden knutwannheden changed the title Only read bytes required to complete CBOR string No reading past end for CBOR strings Oct 14, 2024
@cowtowncoder
Copy link
Member

@knutwannheden Ok: looks good, would be happy to merge. Thank you a lot for providing the fix!

But 2 things before merging.

First: this is based against 2.19 branch which is ok (it is the current default branch for new 2.x versions). But it will be months until 2.19.0 is released -- would it make sense to re-base against 2.18 (2.18.1 will be released within 1-2 weeks)?
I can probably do this post-merge too (with cherry-picking) actually, if github UI doesn't allow this easily.

Second: before merging, I need to get CLA (unless you have already sent one earlier -- this is one-time thing). It's from https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf and the usual way is to print it, fill & sign, scan/photo, email to cla at fasterxml dot com.
Once this is in, I can proceed (to whatever branch is decided as target).

Looking forward to merging this fix!

@knutwannheden
Copy link
Contributor Author

If we can get this into 2.18.1, that would be even better. I will rebase and force push the PR. I will also sign the CLA.

Thanks for your help on this!

@cowtowncoder
Copy link
Member

Perfect!

The string parsing now no longer reads more bytes than indicated by the data item. For any incomplete UTF-8 code points an exception with a corresponding message is thrown.
@knutwannheden knutwannheden changed the base branch from 2.19 to 2.18 October 17, 2024 13:14
@knutwannheden
Copy link
Contributor Author

I've now changed the merge branch of this PR (never used the edit button of the PR to do that before) and I've also sent an email with a signed CLA.

@cowtowncoder
Copy link
Member

Quick note: CLA received so that's good.

Hoping to do final code review tomorrow, get PR merged.

@cowtowncoder cowtowncoder changed the title No reading past end for CBOR strings Should not read past end for CBOR string values Oct 19, 2024
@cowtowncoder cowtowncoder merged commit 265b3c4 into FasterXML:2.18 Oct 19, 2024
4 checks passed
cowtowncoder added a commit that referenced this pull request Oct 19, 2024
@cowtowncoder cowtowncoder added this to the 2.18.1 milestone Oct 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants