Clarify specification around endianness and header padding #306
e3e11f7
to
8dc7f5f
Compare
Hi @akx, I'm totally fine with adding a test and the endianness clarification. Currently the space padding is used to get alignment; it is a "trick" and an implementation detail, not something that should ever be required. Are you OK with the proposed change about padding?
I'm totally okay with specifying that only trailing whitespace (or null, but that's presently not accepted by serde-json) padding is allowed, but that doesn't reflect the reality of the current canonical implementation (which evidently allows leading whitespace padding). Whether it's okay that this implementation is more lenient than the spec is up to you. If we do hammer it down in the spec that only trailing spacing is allowed, I would suggest we add a check that the first byte of the header is `{`. I'm just boarding a plane so I'm not able to reply immediately 😁
Educated guess but this prevents polyglot files because all C-strings are null terminated.
That's actually a neat idea.
Or maybe just that NUL isn't in the JSON spec at all whereas the documented ws characters are, and it's easier for a parser to just consume them when outside a string literal regardless of other parser state :) Slightly sleep-addled idea - flights are being late - but the
8dc7f5f
to
f036926
Compare
@Narsil Okiedoke, I amended the second commit to
WDYT?
LGTM just the nit of removing other paddings in the spec.
* note the data must begin with a `{` byte
* note the data may be padded at the end with whitespace
30dbf94
to
ea86a53
Compare
Co-authored-by: Nicolas Patry <[email protected]>
ea86a53
to
75e932a
Compare
@Narsil Rebased + ran
I'm making a release without this in, so we can ship the deepspeed fix (DS shares tensors without them actually being shared tensors). I will then go ahead and merge this, and advertise it to other safetensors implementors to see whether what we're doing here is actually breakage-free for them. (It shouldn't break anything; if it does, we can just revert.)
What does this PR do?
This PR clarifies the readme around the spec a little; the endianness of the header size was not specified.
Further, as discussed in #291 (comment), `serde_json` (and in fact any spec-compliant JSON parser, I suppose) allows padding the JSON data with whitespace characters, so note that, and add a test for that, too.

EDIT: As discussed in the comments below, this also nails down the spec that the padding may only be trailing.
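For readers who want to see the clarified rules in one place, here is a minimal Python sketch of a header reader and writer. It is not the library's implementation. It assumes the header size is an unsigned little-endian 64-bit integer (as the safetensors README describes), and the helper names `build_header` and `parse_header`, as well as the 8-byte alignment default, are made up for illustration.

```python
import json
import struct


def build_header(metadata: dict, align: int = 8) -> bytes:
    """Serialize a header and pad it with trailing spaces (hypothetical helper).

    The alignment value is an assumption for this sketch; padding is an
    implementation detail, not something a reader may rely on.
    """
    raw = json.dumps(metadata, separators=(",", ":")).encode("utf-8")
    # Pad at the end only, so the first header byte stays `{`.
    raw += b" " * ((-len(raw)) % align)
    # Header size N: assumed unsigned little-endian 64-bit integer.
    return struct.pack("<Q", len(raw)) + raw


def parse_header(data: bytes) -> dict:
    """Read the size prefix and the JSON header from a buffer (hypothetical helper)."""
    if len(data) < 8:
        raise ValueError("buffer too short to contain the header size")
    (n,) = struct.unpack("<Q", data[:8])
    header = data[8 : 8 + n]
    if len(header) != n:
        raise ValueError("buffer shorter than the declared header size")
    # Reject leading padding: the header must begin with a `{` byte.
    if not header.startswith(b"{"):
        raise ValueError("header must begin with a '{' byte")
    # json.loads tolerates trailing whitespace, so trailing padding is accepted.
    return json.loads(header)
```

A round trip such as `parse_header(build_header({"__metadata__": {"format": "pt"}}))` should return the original dictionary, while a buffer whose header starts with a space is rejected with `ValueError`.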
As an aside: it's not very clear for a contributor that you shouldn't modify `README.md` directly... 😁