Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add CidFromReader #126

Closed
rvagg opened this issue Jun 21, 2021 · 9 comments
Closed

add CidFromReader #126

rvagg opened this issue Jun 21, 2021 · 9 comments
Labels
need/analysis Needs further analysis before proceeding

Comments

@rvagg
Copy link
Member

rvagg commented Jun 21, 2021

We now have 2 CID decoder functions in go-car that really belong here and we should move (and properly test) them. The basic functionality is that I have either an io.Reader or just a []byte but I don't know how big the CID is but I'm pretty sure I know where the start of it is. I should be able to extract the CID and get an offset to the end of the parsed CID bytes.

ReadCid(buf []byte) (cid.Cid, int, error) - read a Cid from buf and tell me the offset after read: https://github.com/ipld/go-car/blob/71cfa2fc2a619d646606373c5946282934270bd4/util/util.go#L22

ReadCid(store io.ReaderAt, at int64) (cid.Cid, int, error) - read a Cid from store and tell me the offset after read: https://github.com/ipld/go-car/blob/wip/v2/v2/internal/io/cid.go (wip/v2 branch, not yet in master).

We have decodeFirst(bytes) in js-multiformats to serve a very similar purpose. Having it in the core library has uncovered some other uses outside of CAR decoding too.

Suggestions for more explicit naming of these functions welcome!

@rvagg rvagg added the need/triage Needs initial labeling and prioritization label Jun 21, 2021
@ipfs ipfs deleted a comment from welcome bot Jun 21, 2021
@mvdan
Copy link
Contributor

mvdan commented Jun 21, 2021

I think we should just expose the []byte based API; one can always write the ReaderAt version in terms of it, assuming we can somewhat reliably predict how many bytes the CID will be. I guess technically the hash could be many many bytes, but in practice I imagine one could read a chunk of bytes (like 1024) and be pretty sure that the whole CID is contained within that read buffer.

@rvagg
Copy link
Member Author

rvagg commented Jun 21, 2021

True, although the increasing use of identity CIDs makes this a bit messy. But, I now recall that I coupled decodeFirst() with another utility function in js-multiformats, inspectBytes() which could tell you how many you needed if you're in a situation where you might need to be concerned: https://github.com/multiformats/js-multiformats/blob/2fb9b6815d4b9cbb8212e2f1975b355d2d370e4b/src/cid.js#L312-L324

It only needs the first few bytes, a maximum of maybe 10. So it's a nice combo that could be used to safely deal with the flexibility challenges.

A maximal implementation could look like this and could also be used internally by a ReadCid() to do part of the decoding since it needs all of these things which are gleaned from the first few bytes anyway (at least this is what we do in JS now):

InspectBytes(buf []byte) (version int, codec int, multihashCode int, digestSize int, multihashSize int, size int)

That's probably too many return values and probably should be a struct and then I guess there'll be arguments about too many allocations, so a minimal form could just return the size I suppose.

@rvagg
Copy link
Member Author

rvagg commented Jun 21, 2021

Thanks to @ribasushi for reminding me of a relevant thread on identity lengths just now! multiformats/multihash#130

@mvdan
Copy link
Contributor

mvdan commented Jun 21, 2021

What you say makes sense; an "inspect" API for the cases where one wants to support potentially huge CIDs, and a "read" API for the cases where one doesn't need to support them - so erroring with e.g. io.ErrUnexpectedEOF would be fine.

arguments about too many allocations

Returning a non-pointer struct shouldn't allocate; it should be practically equivalent to returning the list of parameters. It's more of a question of which is nicer in terms of API. You could also deduplicate the types in the return parameters, like:

InspectBytes(buf []byte) (version, codec, multihashCode, digestSize, multihashSize, size int)

@aschmahmann aschmahmann added status/ready Ready to be worked need/analysis Needs further analysis before proceeding and removed need/triage Needs initial labeling and prioritization status/ready Ready to be worked labels Jun 25, 2021
@Stebalien
Copy link
Member

We already have CidFromBytes which I believe is what you're looking for. We could also add a CidFromReader if you need it to work with an io.Reader.

@rvagg
Copy link
Member Author

rvagg commented Jun 26, 2021

mm, you're right, I thought that was just for a strictly correct length number of bytes but that seems to do mostly what's needed here. We should try and use it in go-car to see what the limits are and what else we need to add.

@mvdan
Copy link
Contributor

mvdan commented Jul 1, 2021

Confirming that CidFromBytes works for the cases where we're decoding from a buffer: ipld/go-car#131

That said, I think a CidFromReader would still be very useful. We need that for stream-like reading of CARv1 files, for example: https://ipld.io/specs/transport/car/carv1/#cid

Right now we implement that in an internal package inside go-car, which is not ideal. Plus, we didn't implement CIDv0 decoding in that copy, so that's biting @raulk when he tries to use our carv2 module 🤦

I'll send a PR for CidFromReader shortly.

@mvdan
Copy link
Contributor

mvdan commented Jul 1, 2021

I'm also not discarding having an "inspect" API in the future, but it's not useful for go-car right now, and it doesn't consume a fixed number of bytes, so... It's not clear to me that it's a clear win and needed right now. I'd put it off until someone actually needs it.

@mvdan mvdan changed the title TODO: implement ReadCid(bytes) and ReadCid(io.ReaderAt) add CidFromReader Jul 1, 2021
mvdan added a commit that referenced this issue Jul 2, 2021
And reuse a CidFromBytes test for it, which includes both CIDv0 and
CIDv1 cases as inputs.

Fixes #126.
@mvdan
Copy link
Contributor

mvdan commented Jul 2, 2021

Now at #127.

mvdan added a commit that referenced this issue Jul 2, 2021
And reuse a CidFromBytes test for it, which includes both CIDv0 and
CIDv1 cases as inputs.

Fixes #126.
mvdan added a commit that referenced this issue Jul 2, 2021
And reuse a CidFromBytes test for it, which includes both CIDv0 and
CIDv1 cases as inputs.

Fixes #126.
mvdan added a commit that referenced this issue Jul 2, 2021
And reuse a CidFromBytes test for it, which includes both CIDv0 and
CIDv1 cases as inputs.

Fixes #126.
mvdan added a commit that referenced this issue Jul 14, 2021
And reuse two CidFromBytes tests for it, which includes both CIDv0 and
CIDv1 cases as inputs, as well as some inputs that should error.

Fixes #126.
mvdan added a commit that referenced this issue Jul 14, 2021
And reuse two CidFromBytes tests for it, which includes both CIDv0 and
CIDv1 cases as inputs, as well as some inputs that should error.

Fixes #126.
@mvdan mvdan closed this as completed in c4c8760 Jul 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/analysis Needs further analysis before proceeding
Projects
None yet
Development

No branches or pull requests

4 participants