
Thread splitting is expensive due to OT extension setup time #40

Open
jackdoerner opened this issue Aug 27, 2017 · 4 comments
Comments

@jackdoerner
Contributor

Under the current codebase, if you split the YGC protocol, you have to wait something like 300ms for OT extensions to be set up, which is a very long time, comparatively speaking. I am opening this thread to discuss solutions before actually implementing anything.

@samee previously suggested that it might be possible to make the threads share a single OT setup, but alternate rows. It seems to me that repeated splitting could cause problems there. Admittedly, it shouldn't be a common case, but it still seems like one we should handle gracefully. The alternative AFAICT is to have the threads share a single OT setup and use an atomic increment (GCC has a builtin for this) to ensure two threads don't end up using the same row. Thoughts or alternatives?
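To make the atomic-increment alternative concrete, here is a minimal sketch using the GCC builtin mentioned above. The names (`shared_row_counter`, `next_ot_row`) are illustrative only, not identifiers from the codebase:

```c
/* Sketch: threads sharing one OT setup claim rows via an atomic counter.
 * shared_row_counter and next_ot_row are hypothetical names, not part
 * of the actual codebase. */
#include <stdint.h>

static uint64_t shared_row_counter = 0;

/* Each call returns a unique row index, even under concurrent callers;
 * __sync_fetch_and_add is the GCC builtin referred to above. */
static inline uint64_t next_ot_row(void)
{
    return __sync_fetch_and_add(&shared_row_counter, 1);
}
```

The builtin compiles down to a single locked read-modify-write on x86, so two threads can never observe the same row index.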

On a semi-related note, COT was lost somewhere along the way. Is it worth re-enabling?

@samee
Owner

samee commented Aug 27, 2017

COT is worth re-enabling, but let's dig into history to see how it got lost.

Atomic increments can be slow: at the end of the day it's a point of serialization. You can still try it to check performance, since I believe it is a very small change.

It would be much nicer if we could minimize the use of shared structures as much as possible: ideally, a centralized structure would be used only once, up front, to partition the counter space among all interested threads. Hence my prior suggestion of alternating rows, but as you pointed out, that can get out of hand as well.

I think it is worth reconsidering the split paradigm a bit here, in the same way that fork isn't the best way to do multi-processing. Maybe we only allow splits to go one level deep: we would only have split_n functions that split protocols n-way, but forbid further splits from the children. This allows us to have one single function that knows exactly how the counter space gets divided up among threads. We can have each thread count up by n from some offset.
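The "count up by n from some offset" scheme could look something like the following sketch (all names here are illustrative, not proposed API): with n threads, thread i uses rows i, i+n, i+2n, ..., so no coordination at all is needed after the initial split.

```c
/* Sketch of the split_n counter partition: thread i of n strides
 * through the counter space without touching shared state.
 * RowCursor and its functions are hypothetical names. */
#include <stdint.h>

typedef struct {
    uint64_t next;   /* next row this thread will use */
    uint64_t stride; /* total number of threads, n */
} RowCursor;

static inline RowCursor row_cursor_init(uint64_t thread_id, uint64_t n)
{
    RowCursor c = { thread_id, n };
    return c;
}

static inline uint64_t row_cursor_next(RowCursor *c)
{
    uint64_t r = c->next;
    c->next += c->stride;
    return r;
}
```

Because each thread's sequence is disjoint by construction, there are no atomics and no serialization point; the trade-off is exactly the one discussed above, namely that the partition is fixed at split time.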

What do you think?

@jackdoerner
Contributor Author

jackdoerner commented Sep 2, 2017

RE: COT: OK. I probably should read that paper in full before I look too closely at the code, but if I find some time, I'll take a look (and if you get to it first, that's OK too).

Atomic increments can be slow

In principle, yes; I think the trick here would be to allocate indices to threads in blocks of maybe 2^10 or so, rather than involving the multithreading primitives on each OT.

Also, IIRC Intel has single-instruction atomic add-and-fetch instructions, which are slower than their non-atomic equivalents but much faster than locks. Of course, this doesn't help on other architectures, but I'm hoping that GCC's builtin is smart enough to use the appropriate equivalent on each. On that note: thoughts on GCC builtins? I can imagine that we might want to retain the option to use CompCert or Clang in the future, but I also sort of think we should worry about crossing those bridges when we come to them.
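The block-allocation trick amounts to paying for one atomic operation per 2^10 indices instead of one per OT. A rough sketch, with all names hypothetical:

```c
/* Sketch of block-wise index allocation: each thread grabs a block of
 * 2^10 indices with a single atomic op, then hands them out locally.
 * global_nonce, NonceBlock, and next_nonce are illustrative names. */
#include <stdint.h>

#define NONCE_BLOCK (1u << 10)

static uint64_t global_nonce = 0; /* shared across all threads */

typedef struct {
    uint64_t cur, end; /* half-open range [cur, end) owned by one thread */
} NonceBlock;

static inline uint64_t next_nonce(NonceBlock *b)
{
    if (b->cur == b->end) {              /* local block exhausted */
        b->cur = __sync_fetch_and_add(&global_nonce, NONCE_BLOCK);
        b->end = b->cur + NONCE_BLOCK;   /* one atomic per 1024 nonces */
    }
    return b->cur++;                     /* thread-local, no contention */
}
```

A thread initializes its `NonceBlock` to `{0, 0}` so that the first call immediately claims a fresh block; after that, 1023 out of every 1024 calls touch only thread-local state.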

I think it is worth reconsidering the split paradigm a bit here

I agree with you, but I'm not quite sold on your proposed alternative. It matches the way we ideally want protocols to be used (i.e. mad recursive splitting should probably be frowned upon), and it's easy to reason about, which is really nice, but in practice it may prevent libraries from working. Consider my case: I am splitting the protocol inside an ORAM library. If an application programmer decides to initialize an ORAM inside a protocol that has already been forked, then it will fail, and there doesn't seem to be any mechanism to "do the right thing" other than falling back to a single protocol.

For now, I'll do a performance check on the shared-OT approach and report back.

@jackdoerner
Contributor Author

OK, I have a candidate implementation over here. Note: so far it doesn't actually split the OT object when it splits threads, but it does have most of the necessary functionality: it uses atomics, does batched nonce allocations, reference-counts the master nonce, etc. I cannot detect a performance difference between this code and the original on my laptop.
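Roughly, the reference-counting piece works like this (a sketch with illustrative names, not the actual identifiers in the candidate code): the last thread to release the shared OT state is the one that frees it.

```c
/* Sketch: reference-counting shared OT state across threads using
 * GCC atomic builtins. SharedOt and the function names are
 * hypothetical, not taken from the candidate implementation. */
#include <stdlib.h>

typedef struct {
    int refcount;
    /* ... shared OT-extension state would live here ... */
} SharedOt;

static SharedOt *shared_ot_acquire(SharedOt *s)
{
    __sync_fetch_and_add(&s->refcount, 1);
    return s;
}

static void shared_ot_release(SharedOt *s)
{
    /* sub_and_fetch returns the new value: exactly one thread sees 0 */
    if (__sync_sub_and_fetch(&s->refcount, 1) == 0)
        free(s);
}
```

The atomic decrement guarantees that exactly one releasing thread observes the count reaching zero, so the state is freed exactly once regardless of release ordering.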

Although I haven't actually added a function call to split the OT object, I do foresee a problem: synchronizing nonce allotments between threads. There are two obvious solutions AFAICT:

  1. Send the nonce with each OT. It looks like the nonces start from 0 in the original code, so I don't think this should have any security implications, but it adds eight bytes of communication to each OT. Whether this is meaningful depends on how many OTs you're doing.
  2. Give the nonce distributor a network channel so the sender can tell the receiver how to allot nonces. This has a much lower network cost (probably negligible), but it's more complicated to implement, more prone to error, and won't work with stdio.
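For option 1, the eight-byte overhead would amount to prefixing each OT message with its 64-bit nonce, roughly like this (the message layout and names are illustrative, not an actual wire format):

```c
/* Sketch of option 1: prefix each OT message with its 64-bit nonce so
 * the receiver never needs to track allotments. ot_msg_pack is a
 * hypothetical helper, not part of the codebase. */
#include <stdint.h>
#include <string.h>

/* Serialize the nonce little-endian, then append the OT payload.
 * Returns the total message length: 8 bytes of nonce + payload. */
static size_t ot_msg_pack(uint8_t *out, uint64_t nonce,
                          const uint8_t *payload, size_t len)
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(nonce >> (8 * i));
    memcpy(out + 8, payload, len);
    return 8 + len;
}
```

The receiver reads the first eight bytes to learn which nonce the sender used, so the two sides never need to agree on allotments out of band; the cost is the extra eight bytes per OT noted above.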

Thoughts?

@jackdoerner
Contributor Author

I started another issue to deal with correlated OT: #43. Note that the implementation referenced by that issue depends upon the code I've written for this one.
