-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cid implementation roundup #70
Conversation
It's been discussed in several issues and PRs already that we might want to explore various ways of implementing CIDs for maximum performance and ease-of-use because they show up extremely often. Current CIDs are pointers, which generally speaking means you can't get one without a malloc; and also, they're not particularly well-suited for use in map keys. This branch is to attempt to consolidate all the proposals so far -- and do so in a single branch which can be checked out and contains all the proposals at once, because this will make it easy to do benchmarks and compare all of the various ways we could implement this in one place (and also easier for humans to track what the latest of each proposal is, since they're all in one place). To start with: a Cid implementation backed by a string; and matching interface. (I'm also taking this opportunity to be as minimalistic as possible in what I port over into these experimental new Cid implementations. This might not last; but as long as all this work is to be done, it's a more convenient time than usual to see what can be stripped down and still get work done.) More to come.
Added back in some of the parser methods. (These were previously named "Cast" and I think that's silly and wrong so I fixed it.) Functions are named overly-literally with their type (e.g. ParseCidString and ParseCidStruct rather than ParseCid or even just Parse) because for this research package I don't want to bother with many sub-packages. (Maybe I'll regret this, but at the moment it seems simpler to hold back on sub-packages.) Functions that produce Cids are literal with their return types, as well. Part of the purpose of this research package is going to be to concretely benchmark exactly how much performance overhead there is to using interfaces (which will likely cause a lot of boxing and unboxing in practice) -- since we want to explore where this boxing happens and how much it costs, it's important that none of our basic implementation functions do the boxing! The entire set of codec enums came along in this commit. Ah well; they would have eventually anyway, I guess. But it's interesting to note the only thing that dragged them along so far is the reference to 'DagProtobuf' when constructing v0 CIDs; otherwise, this enum is quite unused here.
Using interfaces as a map's key type can cause some things that were otherwise compile-time checks to be pushed off into runtime checks instead. This is a pretty major "caveat emptor" if you use interface-keyed maps.
Right now this is mostly this is to document the behavior of interface-keyed maps. I suspect some of those caveats might be non-obvious to a lot of folks.
And writeup on same. tl;dr interfaces are not cheap if you're already at the scale where you started caring about whether or not you have pointers.
As a result of this, I'm mostly coming around to being a pretty vigorous advocate for It might be nice to have a |
I agree and it was never my intention to propose such usage. That is what CidString is for. It is an implementation of a Cid as a string for use in maps. |
Let's do it! |
What? I would like a chance to review these benchmarks. In addition, there where several things about @dignifiedquire that I did not like and changed when I implemented my version of it in #65. We do not have to go with an interface, but I would like my changes to the String representation to go in. We should really also change Multihash to be a string if a Cid is string, otherwise we we be creating a lot of needless alloc. We should also consider enhancing the multibase interface to (in addition) provide functions that take a string so that we can avoid an alloc there, but I see that as a lower priority. I would be happy to take charge of this. |
We've been grumbling about this for ages and it's a significant usability (can't use CIDs in maps) and performance (allocation/conversion) issue.
I agree.
Which changes?
Let's finish the in-progress endeavors first. Are they blocked on something? |
There in #65. I separate out the String stuff from just the interface stuff. |
Ah, carrying the multibase with the Cid? I've commented on that PR. I understand that it's a solution to the CID base issue but I'm far from convinced that it's the right one (or even a palatable one). |
I will create a new P.R. with just the string stuff, |
@warpfork I would be interested in how much the extra allocations cost us when we have to get the multihash in the string implementation. Also see #71 which is a continuation and optimization of #47. In particular it avoids an alloc when calling Type() which could be significant depending on how the Cid is used. |
I think theres no reason not to merge this, ya? Let's just do so and use the information in the benchmarks to move forward |
Okay, so I'm gonna get ready to hit the merge button. Reminder, from a purely workflow management perspective, the goal in this PR was to make something mergeable which does not need to be the full final solution -- note the absolute lack of change in the main package -- and can also be kept in history as research notes for why we choose what we do. It shouldn't make merge conflicts with anything else in progress either. P.S. yeah the uvarint impl speciated for strings in #71... 👍 I bet that's going to be impactful. I didn't write benchmarks here yet either for how much so, but there's no way evading the byte slice conversion there isn't gonna be helpful. |
It's been discussed in several issues and PRs already that we might want to explore various ways of implementing CIDs for maximum performance and ease-of-use because they show up extremely often. Current CIDs are pointers, which generally speaking means you can't get one without a malloc; and also, they're not particularly well-suited for use in map keys.
This branch is to attempt to consolidate all the proposals so far (#38, #64, #47, more?) -- and do so in a single branch which can be checked out and contains all the proposals at once, because this will make it easy to do benchmarks and compare all of the various ways we could implement this in one place (and also easier for humans to track what the latest of each proposal is, since they're all in one place). (I ended up re-writing much of it because some of the existing work was a pain to cherry-pick apart, either because it was stale and had conflicts with latest master or because it mixed other topics like multibase, but this does borrow heavily from @dignifiedquire and @kevina 's code on the other branches.)
There are two/three major implementations here: one as struct (familiar), one as typedef of string, and one as an interface (implemented by the other two).
I composed a README.md with a summary of what I learned. Some of it became obvious just from thrashing the code around; some of it is supported with benchmarks (also included in the diff).
There's no attempt here to change the implementation of multihash... however, it's likely that doing so could further shift some of the performance numbers quite a bit. Skim the alloc counts in the benchmarks for impressions.
This whole hullabaloo is compartmentalized under a
_rsrch
directory... which means it won't typically be built or tested by e.g.go test ./...
unless you're already in the_rsrch
directory. It also does not change any of the existing packages. Therefore, it is completely safe to merge to master as sheer documentation of our research, even if we choose not to pursue any of these refactors at this time.Hopefully this is good further food for thought and can be iterated on to get us closer to a committal choice about this stuff. (Or at least merged as research, rather than abandoned to drift as an out-of-sync branch. I dunno if you can tell, but those are my least favorite thing.)