-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scrypt: Adds parallel feature and max_memory argument #178
base: master
Are you sure you want to change the base?
Conversation
Thanks for opening this PR. Definitely looks like interesting work! But it might take some time to review, just FYI. |
No worries. Regarding the failing automated test: looks like it's because of an unused argument which occurs when the parallel feature is off. I can fix that once the exact API is settled. |
Is this still being pursued? |
This is fine but needs a rebase and the test failures fixed |
I can rebase and fix this up when I get a chance, but I haven't seen any discussion on the API from the maintainers. |
@fpgaminer I think the API is mostly fine. I like the explicit parameterization of Regarding error handling, you might add an |
...gated under the `alloc` feature. Allocates a `Vec<u8>` to return an individual `XofReader::read` invocation into. Unlike `ExtensibleOutput::finalize_vec`, it doesn't consume the reader and can be called an unlimited number of times.
WARNING: Do NOT merge yet; discussion is required.
Overview
Hello! This is my first contribution to this project. I hope it is helpful. I welcome comments and am happy to update my pull request as needed.
This pull request adds parallelism to the scrypt implementation, gated behind a new crate feature
parallel
. It uses rayon internally. The originalscrypt
function should work exactly as it did before, and does not use parallelism. Instead a newscrypt_parallel
function is added which allows the user to specifynum_threads
.scrypt_parallel
also features a newmax_memory
argument that can be used to limit the memory usage of scrypt. Setting this to 1GB, for example, will makescrypt
use less than 1GB regardless of scrypt params and number of threads (though it may use slightly more than 1GB on the order of hundreds of bytes; specifically the storage of B and temporary variables is not counted).New tests and benchs were added to cover the new features.
Implementation details
The parallelism is straightforward and just uses rayon across the
p
parameter.max_memory
is a bit more involved. This is accomplished through the use of a memory-compute trade off for the scrypt algorithm.romix::scrypt_ro_mix
has been modified with alog_f
argument. Atlog_f = 0
it operates just as it did before. Atlog_f=1
it uses half the memory at the cost of more compute. And so forth.It's easiest to understand
log_f
in terms of the size ofV
and the computational cost ofscrypt_ro_mix
.V
must ben >> log_f
blocks in length. The total BlockMix operations performed byscrypt_ro_mix
is equal toops(n, log_f) = 2 * n + 0.5 * n * (2**log_f - 1)
.The addition of this
log_f
functionality allows us to implement themax_memory
argument. This means anyscrypt
parameters can be computed even on machines that otherwise don't have the memory for it. It's also very useful for parallelism as the memory usage of scrypt normally scales linearly with the number of threads used (since each thread needs its ownV
).scrypt_log_f
is the new workhorse, implementing the higher level scrypt algorithm, with the addition ofnum_threads
to control parallelism andlog_f
to control memory usage.scrypt
now callsscrypt_log_f
internally withlog_f = 0
andnum_threads = 1
to emulate previous behavior.scrypt_parallel
callsscrypt_log_f
internally as well, but it calculateslog_f
automatically from the providedmax_memory
parameter.It's always better to use all available cores on a machine to compute scrypt, even if that means increasing
log_f
to make that possible. Consider for example computingscrypt(log_n=20, r=8, p=4)
on a four core, 1GB machine. That machine only has enough memory to computescrypt(log_n=20, r=8, p=1)
, so withoutlog_f
it can only use a single core. Increasinglog_f
to 2 allows the use of all four cores. It turns out that, despitelog_f
increasing the amount of work, it's still overall faster as that compute is offset by the additional cores that can work.(See the section "log_f proof" for the proof. Note that we ignore the overhead bytes needed for B and temps, to simplify.)
So
scrypt_parallel
can usemax_memory
to automatically setlog_f
to the optimal value, assumingnum_threads
is less than or equal to the number of cores on the machine.Benchmarks
Speed of scrypt remains unaffected by these changes (within the margin of error).
Discussion
I don't think the API I picked for the new
scrypt_*
functions is ideal. I'm not sure what the ideal API is for the new features. So feedback would be great!Since
log_f
always has an optimal value based on desired memory usage and number of threads, it seems likescrypt_parallel
presents the more useful API to the user. Hence why I hidscrypt_log_f
.But maybe some users might want direct control of
log_f
for some reason?And
max_memory
is also kind of a tricky argument. It requires the user to know how much memory is available on the machine. The sysinfo crate provides these kinds of stats, so I guess I would expect users to use something like that, possibly dividing it by some factor so they don't use all of the memory, and then feeding that asmax_memory
.I thought about including that kind of functionality directly in this crate. But the sysinfo crate didn't look very straightforward and robust, so I don't personally feel comfortable including it in a cryptography crate. I guess we could put it behind a feature flag?
Of course an API user is welcome to just set
max_memory
to a huge value if they "don't care", and get the same behavior asscrypt
but with the addition of parallelism.Finally, the addition of the
max_memory
argument adds another failure case toscrypt_parallel
. There are situations wheremax_memory
is too small for the given parameters, soscrypt_parallel
needs to error out. Since there were no existing errors I just put in apanic
for now. Should a new error be added?It's important to note that the existing
scrypt
parameter would also panic in low memory situations, due to OOM. So it's more likescrypt_parallel
makes that situation explicit. (Thoughscrypt_parallel
can still OOM since it doesn't currently account for all allocations.)log_f proof
Side Note: SSE
I didn't do it in this pull request, but while digging into the code and such I also noticed that the use of SSE would make scrypt run significantly faster. I figured I'd put a note here about it, just so there's note about it somewhere, even though it's not relevant to this pull request.
Implementing scrypt using SSE is relatively straightforward. Colin Percival's scrypt implementation in tarsnap has a good example. The easiest way to do this is to just directly port that implementation over to Rust using
core::arch
. I did that and saw a nice improvement from46ns
per salsa20_8 call to33ns
per salsa20_8 call. That should translate toscrypt
as a whole running in ~70% of the time!But
core::arch
would introduceunsafe
into this crate and isn't future proof. I tried to find a way to re-write the existing Rust code in such a way that the compiler itself could infer SIMD instructions. See this thread: https://users.rust-lang.org/t/can-the-compiler-infer-sse-instructions/59976While the compiler is or will be capable of doing that, for some permutations of the code, those optimization passes are really unstable now, for example varying their outputs depending on whether you use
rotate_left
or a manual implementation.Again, this isn't relavant to this pull request. Just wanted to put those findings down somewhere.
TODOs
These todos should block merging until resolved:
scrypt_parallel
return an error on low memory, or panic?romix
Final Note
Thank you for taking the time to review my pull request. I hope it's useful! Sorry for the long pull request notes, but I wanted to be sure to explain everything thoroughly, as this is a security sensitive codebase.