Should not read past end for CBOR string values #518
This PR causes the test In my use case this is causing a bit of an issue because the parser is trying to read beyond where it should, which I didn't think should be required with CBOR, as data items are self-terminating. |
I will have a look at that failing test, but I am wondering whether it would be better if the parser attempted some kind of error recovery once it detects that a data item is incorrect. But given the current comment regarding surrogates, I was thinking that the code to buffer 3 extra bytes could just be deleted. A bit more about my use case: I read CBOR data in a streaming manner straight from a socket, and when reading a string as the last data item from the server, I've experienced that this operation can hang, because the server didn't send any data after that. So now the client is blocked and the server in turn is waiting for the client to send another request. I could solve this somehow at the application protocol level, but when reading the code and the comment regarding surrogate pairs, this looked like a bug to me. |
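To illustrate the streaming scenario described above, here is a minimal Java sketch. It is not Jackson's implementation: the class `BoundedTextRead` and the method `readShortText` are hypothetical, and the sketch only handles text strings whose length (0 to 23) is encoded directly in the initial byte. It consumes exactly the initial byte plus the declared number of payload bytes, so it never blocks waiting for data beyond the string, and it decodes strictly so that malformed UTF-8 raises an exception.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class BoundedTextRead {
    // Hypothetical helper: reads one CBOR text string whose length (0..23)
    // is encoded directly in the initial byte.
    public static String readShortText(InputStream in) throws IOException {
        int initial = in.read();
        if (initial < 0) {
            throw new EOFException("no data item available");
        }
        int majorType = initial >> 5; // top 3 bits
        int info = initial & 0x1F;    // low 5 bits: the length, for short strings
        if (majorType != 3 || info > 23) {
            throw new IOException("sketch only handles short text strings");
        }
        // Read exactly `info` payload bytes and nothing more.
        byte[] payload = new byte[info];
        int got = 0;
        while (got < info) {
            int n = in.read(payload, got, info - got);
            if (n < 0) {
                throw new EOFException("stream ended inside string");
            }
            got += n;
        }
        // Strict decoding: report malformed input instead of replacing it.
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(payload))
                .toString();
    }

    public static void main(String[] args) throws IOException {
        // 0x62 'h' 'i' is the text string "hi"; 0x01 is the next data item.
        InputStream in = new ByteArrayInputStream(new byte[] {0x62, 'h', 'i', 0x01});
        System.out.println(readShortText(in)); // prints "hi"
        System.out.println(in.available());    // prints 1: the next item is untouched
    }
}
```

With a socket stream in place of the `ByteArrayInputStream`, the same loop would return as soon as the last payload byte arrives, without waiting for the peer to send anything further.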
Yes, the problem is the case of invalid UTF-8 content in which code points would suggest needing to read more bytes than what CBOR String indicates. The issue was found by OSSFuzz project fuzzer. As to error recovery: I think at first it's just important to enforce invariants (no reads past end) and UTF-8 decoder integrity (no partial/incomplete characters). And the comment about surrogate pairs: that sounds wrong: surrogate pairs are wrt Java's in-memory |
Yes, you understood me correctly. I agree that if the string data item indicated the wrong length (number of bytes), then the parser isn't required to be able to recover. I think we should only have to catch the |
Linking original issue: #316 |
For an incorrect byte sequence like Adding an |
FWIW, cbor-x seems to be a popular CBOR implementation for JS, and it looks like it suffers from the problem I describe. So the following code would end up printing a single data item and silently ignore the fact that the length of the string was actually wrong:
import { decodeMultiple } from 'cbor-x';
console.log(decodeMultiple(new Uint8Array([0x61, 0xdb, 0xa0]))); |
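To make the three bytes in this example concrete, the following Java sketch decodes the initial byte and compares the declared string length with what the UTF-8 lead byte requires. The class `MalformedExample` and the method `utf8SequenceLength` are hypothetical names used for illustration only; they are not part of cbor-x or Jackson, and the helper deliberately ignores finer points such as overlong encodings.

```java
public class MalformedExample {
    // Number of bytes a UTF-8 sequence needs, judged only by its lead byte.
    // Returns -1 for bytes that cannot start a sequence (e.g. continuation bytes).
    public static int utf8SequenceLength(int leadByte) {
        if ((leadByte & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((leadByte & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((leadByte & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((leadByte & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        int[] data = {0x61, 0xDB, 0xA0};
        int majorType = data[0] >> 5;         // 3 = text string
        int declaredLength = data[0] & 0x1F;  // 1 byte of UTF-8 content
        int needed = utf8SequenceLength(data[1]);
        System.out.println("major type = " + majorType);             // 3
        System.out.println("declared length = " + declaredLength);   // 1
        System.out.println("bytes needed by lead byte = " + needed); // 2
    }
}
```

The string claims one byte of content, but that byte (`0xDB`) announces a two-byte sequence. The third byte, `0xA0`, happens to have the bit pattern of a UTF-8 continuation byte while also being a complete CBOR data item on its own (an empty map), which is why a decoder that reads past the declared length can silently swallow it.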
Quick note on
So the check needs to be done wrt end of content for the token. |
I've now "resurrected" that I also added one more test case, so that we have one test case for the end-of-stream and one test case, where there are subsequent data items. |
@knutwannheden Ok: looks good, would be happy to merge. Thank you a lot for providing the fix! But 2 things before merging. First: this is based against the 2.19 branch, which is ok (it is the current default branch for new 2.x versions). But it will be months until 2.19.0 is released -- would it make sense to re-base against 2.18 (2.18.1 will be released within 1-2 weeks)? Second: before merging, I need to get a CLA (unless you have already sent one earlier -- this is a one-time thing). It's from https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf and the usual way is to print it, fill & sign, scan/photo, email to Looking forward to merging this fix! |
If we can get this into 2.18.1, that would be even better. I will rebase and force push the PR. I will also sign the CLA. Thanks for your help on this! |
Perfect! |
The string parsing now no longer reads more bytes than indicated by the data item. For any incomplete UTF-8 code points an exception with a corresponding message is thrown.
Force-pushed from 5581a07 to 5bfa87a.
I've now changed the merge branch of this PR (never used the edit button of the PR to do that before) and I've also sent an email with a signed CLA. |
Quick note: CLA received so that's good. Hoping to do final code review tomorrow, get PR merged. |
In CBOR, a string data item's initial byte (together with any length bytes that follow it) encodes the string's length as a number of UTF-8 bytes.
The parser should not buffer more bytes than that length indicates and must not read past it, because subsequent bytes might either be unavailable (end of stream) or belong to the next data item. When the last byte(s) of the string form an incomplete UTF-8 sequence, the parser should throw an exception.
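The behaviour described above can be sketched as a small bounded decoder in Java. This is an illustration under stated assumptions, not the code merged in this PR: the class `BoundedUtf8` and its `decode` method are hypothetical, the use of `IOException` is an assumption, and checks for overlong encodings and surrogate code points are omitted for brevity. The key point is that the multi-byte check is made against the end of the string's content, not against the end of the buffer.

```java
import java.io.IOException;

public class BoundedUtf8 {
    // Decodes buf[start, start + len) as UTF-8 and never touches a byte
    // at or beyond start + len, even if the buffer holds more data.
    public static String decode(byte[] buf, int start, int len) throws IOException {
        int end = start + len;
        StringBuilder sb = new StringBuilder(len);
        int i = start;
        while (i < end) {
            int b = buf[i++] & 0xFF;
            int extra; // number of continuation bytes the lead byte asks for
            int cp;    // code point being assembled
            if (b < 0x80) {
                extra = 0; cp = b;
            } else if ((b & 0xE0) == 0xC0) {
                extra = 1; cp = b & 0x1F;
            } else if ((b & 0xF0) == 0xE0) {
                extra = 2; cp = b & 0x0F;
            } else if ((b & 0xF8) == 0xF0) {
                extra = 3; cp = b & 0x07;
            } else {
                throw new IOException("invalid UTF-8 lead byte 0x" + Integer.toHexString(b));
            }
            // Check against the end of the string content, not the buffer.
            if (end - i < extra) {
                throw new IOException("incomplete UTF-8 sequence: need " + extra
                        + " more byte(s), but " + (end - i) + " left in string");
            }
            for (int k = 0; k < extra; k++) {
                int c = buf[i++] & 0xFF;
                if ((c & 0xC0) != 0x80) {
                    throw new IOException("invalid UTF-8 continuation byte 0x" + Integer.toHexString(c));
                }
                cp = (cp << 6) | (c & 0x3F);
            }
            if (cp > 0x10FFFF) {
                throw new IOException("code point out of range");
            }
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 'h' followed by the two-byte encoding of U+00E9.
        byte[] ok = {'h', (byte) 0xC3, (byte) 0xA9};
        System.out.println(decode(ok, 0, ok.length).length()); // prints 2

        // The buffer holds two bytes, but the string content is only the first.
        byte[] bad = {(byte) 0xDB, (byte) 0xA0};
        try {
            decode(bad, 0, 1);
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In the second call, the byte after the string would be a syntactically valid continuation byte, yet the decoder rejects the input because that byte lies outside the declared length and may belong to the next data item.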