Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cid implementation roundup #70

Merged
merged 7 commits into from
Aug 27, 2018
Merged

cid implementation roundup #70

merged 7 commits into from
Aug 27, 2018

Conversation

warpfork
Copy link
Member

It's been discussed in several issues and PRs already that we might want to explore various ways of implementing CIDs for maximum performance and ease-of-use because they show up extremely often. Current CIDs are pointers, which generally speaking means you can't get one without a malloc; and also, they're not particularly well-suited for use in map keys.

This branch is to attempt to consolidate all the proposals so far (#38, #64, #47, more?) -- and do so in a single branch which can be checked out and contains all the proposals at once, because this will make it easy to do benchmarks and compare all of the various ways we could implement this in one place (and also easier for humans to track what the latest of each proposal is, since they're all in one place). (I ended up re-writing much of it because some of the existing work was a pain to cherry-pick apart, either because it was stale and had conflicts with latest master or because it mixed other topics like multibase, but this does borrow heavily from @dignifiedquire and @kevina 's code on the other branches.)

There are two/three major implementations here: one as struct (familiar), one as typedef of string, and one as an interface (implemented by the other two).

I composed a README.md with a summary of what I learned. Some of it became obvious just from thrashing the code around; some of it is supported with benchmarks (also included in the diff).

There's no attempt here to change the implementation of multihash... however, it's likely that doing so could further shift some of the performance numbers quite a bit. Skim the alloc counts in the benchmarks for impressions.

This whole hullabaloo is compartmentalized under a _rsrch directory... which means it won't typically be built or tested by e.g. go test ./... unless you're already in the _rsrch directory. It also does not change any of the existing packages. Therefore, it is completely safe to merge to master as sheer documentation of our research, even if we choose not to pursue any of these refactors at this time.

Hopefully this is good further food for thought and can be iterated on to get us closer to a committal choice about this stuff. (Or at least merged as research, rather than abandoned to drift as an out-of-sync branch. I dunno if you can tell, but those are my least favorite thing.)

It's been discussed in several issues and PRs already that we might want to
explore various ways of implementing CIDs for maximum performance and
ease-of-use because they show up extremely often.  Current CIDs are pointers,
which generally speaking means you can't get one without a malloc; and also,
they're not particularly well-suited for use in map keys.

This branch is to attempt to consolidate all the proposals so far -- and do so
in a single branch which can be checked out and contains all the proposals at
once, because this will make it easy to do benchmarks and compare all of the
various ways we could implement this in one place (and also easier for humans
to track what the latest of each proposal is, since they're all in one place).

To start with: a Cid implementation backed by a string; and matching interface.

(I'm also taking this opportunity to be as minimalistic as possible in what
I port over into these experimental new Cid implementations.  This might not
last; but as long as all this work is to be done, it's a more convenient time
than usual to see what can be stripped down and still get work done.)

More to come.
Added back in some of the parser methods.  (These were previously named "Cast"
and I think that's silly and wrong so I fixed it.)

Functions are named overly-literally with their type (e.g. ParseCidString and
ParseCidStruct rather than ParseCid or even just Parse) because for this
research package I don't want to bother with many sub-packages.  (Maybe I'll
regret this, but at the moment it seems simpler to hold back on sub-packages.)

Functions that produce Cids are literal with their return types, as well.
Part of the purpose of this research package is going to be to concretely
benchmark exactly how much performance overhead there is to using interfaces
(which will likely cause a lot of boxing and unboxing in practice) -- since we
want to explore where this boxing happens and how much it costs, it's important
that none of our basic implementation functions do the boxing!

The entire set of codec enums came along in this commit.  Ah well; they would
have eventually anyway, I guess.  But it's interesting to note the only thing
that dragged them along so far is the reference to 'DagProtobuf' when
constructing v0 CIDs; otherwise, this enum is quite unused here.
Using interfaces as a map's key type can cause some things that were otherwise
compile-time checks to be pushed off into runtime checks instead.
This is a pretty major "caveat emptor" if you use interface-keyed maps.
Right now this is mostly this is to document the behavior of interface-keyed
maps.  I suspect some of those caveats might be non-obvious to a lot of folks.
And writeup on same.

tl;dr interfaces are not cheap if you're already at the scale where you
started caring about whether or not you have pointers.
@ghost ghost assigned warpfork Aug 24, 2018
@ghost ghost added the status/in-progress In progress label Aug 24, 2018
@warpfork
Copy link
Member Author

As a result of this, I'm mostly coming around to being a pretty vigorous advocate for type Cid string.

It might be nice to have a type CidInterface interface{...} at the same time... but its usage should be minimized. In particular, I think map[CidInterface]* is a really bad idea and should be avoided at all costs. (func CanonicalizeCid(CidInterface) Cid would be a fine thing to have, though.)

@kevina kevina self-assigned this Aug 24, 2018
@kevina
Copy link
Contributor

kevina commented Aug 24, 2018

In particular, I think map[CidInterface]* is a really bad idea and should be avoided at all costs.

I agree and it was never my intention to propose such usage. That is what CidString is for. It is an implementation of a Cid as a string for use in maps.

@Stebalien
Copy link
Member

Let's do it!

@kevina
Copy link
Contributor

kevina commented Aug 24, 2018

@Stebalien

Let's do it!

What?

I would like a chance to review these benchmarks.

In addition, there where several things about @dignifiedquire that I did not like and changed when I implemented my version of it in #65. We do not have to go with an interface, but I would like my changes to the String representation to go in.

We should really also change Multihash to be a string if a Cid is string, otherwise we we be creating a lot of needless alloc. We should also consider enhancing the multibase interface to (in addition) provide functions that take a string so that we can avoid an alloc there, but I see that as a lower priority.

I would be happy to take charge of this.

@Stebalien
Copy link
Member

Let's do it!

What?

We've been grumbling about this for ages and it's a significant usability (can't use CIDs in maps) and performance (allocation/conversion) issue.

We should really also change Multihash to be a string if a Cid is string, otherwise we we be creating a lot of needless alloc.

I agree.

but I would like my changes to the String representation to go in.

Which changes?

I would be happy to take charge of this.

Let's finish the in-progress endeavors first. Are they blocked on something?

@kevina
Copy link
Contributor

kevina commented Aug 25, 2018

Which changes?

There in #65. I separate out the String stuff from just the interface stuff.

@Stebalien
Copy link
Member

Ah, carrying the multibase with the Cid? I've commented on that PR. I understand that it's a solution to the CID base issue but I'm far from convinced that it's the right one (or even a palatable one).

@kevina
Copy link
Contributor

kevina commented Aug 25, 2018

Ah, carrying the multibase with the Cid? I've commented on that PR. I understand that it's a solution to the CID base issue but I'm far from convinced that it's the right one (or even a palatable one).

I will create a new P.R. with just the string stuff, stay tuned.... see #71

@kevina
Copy link
Contributor

kevina commented Aug 25, 2018

@warpfork I would be interested in how much the extra allocations cost us when we have to get the multihash in the string implementation. Also see #71 which is a continuation and optimization of #47. In particular it avoids an alloc when calling Type() which could be significant depending on how the Cid is used.

@whyrusleeping
Copy link
Member

I think theres no reason not to merge this, ya? Let's just do so and use the information in the benchmarks to move forward

@warpfork
Copy link
Member Author

Okay, so I'm gonna get ready to hit the merge button. Reminder, from a purely workflow management perspective, the goal in this PR was to make something mergeable which does not need to be the full final solution -- note the absolute lack of change in the main package -- and can also be kept in history as research notes for why we choose what we do. It shouldn't make merge conflicts with anything else in progress either.

P.S. yeah the uvarint impl speciated for strings in #71... 👍 I bet that's going to be impactful. I didn't write benchmarks here yet either for how much so, but there's no way evading the byte slice conversion there isn't gonna be helpful.

@warpfork warpfork merged commit 5ddbe21 into master Aug 27, 2018
@ghost ghost removed the status/in-progress In progress label Aug 27, 2018
@warpfork warpfork deleted the rsrch branch August 27, 2018 22:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants