Comparison with waylonflinn/weblas #126

Closed · mike1808 opened this issue Jul 11, 2017 · 22 comments

@mike1808 commented Jul 11, 2017

Hey, you guys really rock with this project! Did you compare the performance of some popular kernels against waylonflinn/weblas? It would be very interesting to see how fast or slow your library is for these kernels:

  • sscal - Matrix (and Vector) Scale (with addition)
  • sgemm - Matrix Multiply
  • sdwns - Matrix (and Image) Downsample (for Max Pooling)
  • sclmp - Matrix clamp (for ReLU)
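
For reference, a minimal sketch of how a matrix multiply like sgemm can be expressed with gpu.js's documented `createKernel`/`setOutput` API; the fixed 512×512 size and variable names are illustrative assumptions, not thread content:

```js
// Minimal sgemm-style kernel in gpu.js (sketch; 512x512 assumed).
const { GPU } = require('gpu.js');
const gpu = new GPU();

const sgemm = gpu.createKernel(function (a, b) {
  let sum = 0;
  for (let i = 0; i < 512; i++) {
    sum += a[this.thread.y][i] * b[i][this.thread.x];
  }
  return sum;
}).setOutput([512, 512]);

// Usage: const c = sgemm(a, b); where a and b are 512x512 numeric arrays.
```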
@waylonflinn commented Jul 11, 2017

I'm the author of the linked library. I'd also be interested in seeing a detailed comparison.

Here's a (very hastily done) comparison for matrix multiply (gemm) on a 512x512 matrix. Time is given in milliseconds.

| library | time  |
| ------- | ----- |
| gpu.js  | 85 ms |
| weblas  | 14 ms |

Time for gpu.js is from gpu.rocks.

This is a very interesting library with a lot of flexibility. I'd love to see how performance compares on this benchmark across a range of matrix sizes.
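
A minimal sketch of such a sweep, reusing the `gpu` instance from the sketch above; the size list and `randomMatrix` helper are assumptions, and kernels are compiled outside the timed region so only execution is measured:

```js
// Hypothetical benchmark sweep across matrix sizes.
function randomMatrix(n) {
  return Array.from({ length: n }, () =>
    Float32Array.from({ length: n }, () => Math.random()));
}

for (const n of [64, 128, 256, 512, 1024]) {
  const multiply = gpu.createKernel(function (a, b) {
    let sum = 0;
    for (let i = 0; i < this.constants.n; i++) {
      sum += a[this.thread.y][i] * b[i][this.thread.x];
    }
    return sum;
  }, { constants: { n: n }, output: [n, n] });

  const a = randomMatrix(n);
  const b = randomMatrix(n);
  const t0 = performance.now(); // time execution only, not compilation
  multiply(a, b);
  console.log(n + 'x' + n + ': ' + (performance.now() - t0).toFixed(2) + ' ms');
}
```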

@robertleeplummerjr (Member)

Some factors to consider:

  • GPU build time (compilation, kernel generation, etc. take some time)
  • Transferring data to and from the CPU, and converting between an array and a texture, takes additional time
  • The hard truth is that a single matrix transformation will likely not be much more performant, if at all. But if you stack a bunch of matrices, do all their transformations on the GPU, and make a single transfer between the CPU and GPU, you'll see a substantial gain in performance (as sketched below).
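
A hedged sketch of that stacking idea, assuming gpu.js's pipeline option (`pipeline: true`, which returns a texture rather than an array); the kernel bodies and sizes are illustrative:

```js
// Chain two kernels on the GPU; the intermediate never touches the CPU.
const scale = gpu.createKernel(function (m, alpha) {
  return m[this.thread.y][this.thread.x] * alpha;
}, { output: [512, 512], pipeline: true });

const relu = gpu.createKernel(function (m) {
  return Math.max(m[this.thread.y][this.thread.x], 0);
}, { output: [512, 512], pipeline: true });

const tex = scale(input, 0.5); // one upload of the 512x512 host array `input`
const out = relu(tex);         // texture in, texture out
const host = out.toArray();    // the single download at the end
```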

Firefox, for example, runs about 120-130 times faster than CPU mode when using textures.

Another factor we are considering adding as an optional setting is floating-point precision. Currently we have 32-bit floating-point precision; we could allow for lower precision. In neural nets, for example, this can be reduced to, say, 16 bits or even 8 bits, and the net can compensate for it (imagine looking at a blurry picture of yourself and still knowing it is you), which (I'm not a mathematician) should be about an order of magnitude faster.
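
A CPU-side illustration of the reduced-precision idea (not gpu.js code; the linear scheme and function names are assumptions):

```js
// Hypothetical linear 8-bit quantization over a known [min, max] range.
function quantize(f32, min, max) {
  const scale = 255 / (max - min);
  return Uint8Array.from(f32, (v) => Math.round((v - min) * scale));
}

function dequantize(u8, min, max) {
  const scale = (max - min) / 255;
  return Float32Array.from(u8, (v) => v * scale + min);
}

// Roundtrip example: values survive to within (max - min) / 255 ≈ 0.008.
const w = Float32Array.of(-0.9, 0.1, 0.73);
console.log(dequantize(quantize(w, -1, 1), -1, 1));
```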

We are working very hard to get v1 finished up. I've got a job and family, and was only recently added to the team, but we are making great progress (before work, during breaks, during lunch, etc.).

@fuzzie360 (Member)

Hi @waylonflinn, thanks for the quick matchup!

I'm a big fan of the BLAS and LAPACK libraries, so I found reading your GLSL code really eye-opening.

Looks like we at gpu.js have our work cut out for us. Even with the theoretical speedup of a vectorizing SIMD compiler (which doesn't exist yet) bringing 85 ms / 4 = 21.25 ms, it looks like we would not even scratch weblas's timings!

Do you mind if we borrow your `encode_float` for an alternative fast implementation? It seems useful to have this as a configurable option.
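
For context, a sketch of the general technique `encode_float` implements (not weblas's exact code): WebGL 1 only guarantees `readPixels` with RGBA/UNSIGNED_BYTE, so the shader packs each result float's IEEE-754 bits into the four color channels and the host reinterprets the raw bytes:

```js
// Host-side decode; assumes `gl`, `width`, `height`, and that the shader
// packed the bytes in the client's (little-endian) byte order.
const bytes = new Uint8Array(width * height * 4);
gl.readPixels(0, 0, width, height, gl.RGBA, gl.UNSIGNED_BYTE, bytes);
const floats = new Float32Array(bytes.buffer); // one float per RGBA texel
```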

@robertleeplummerjr (Member)

@waylonflinn I totally missed that your library is gpu based, lol. Been a fun morning. Very interesting though!

@robertleeplummerjr (Member)

@waylonflinn looking at http://waylonflinn.github.io/DeepBeliefSDK/, one word: fantastic

It has been a dream of mine for some time to see a convolutional neural net like this in the browser/js.

@waylonflinn commented Jul 11, 2017

@fuzzie360 please feel free to use `encode_float`. If you do end up using it, I have an open issue for testing here: waylonflinn/weblas#11. Any help would be greatly appreciated!

@waylonflinn

@robertleeplummerjr You might also be interested in: https://github.com/transcranial/keras-js

I've been collaborating with the author to make full use of weblas. It's still in the early stages, but I have high expectations for it!

@robertleeplummerjr (Member)

very cool

@fuzzie360 (Member)

@waylonflinn I've encountered the numerical stability issues myself and gotten around them: https://github.com/gpujs/gpu.js/blob/develop/src/backend/web-gl/shader-frag.js#L45-L82

I've verified it to work on notorious GPUs like the Intel HD 2000. I've actually been thinking of moving from my safe implementation back to the unsafe implementation, by detecting the special rounding characteristics of the GPU and choosing the correct implementation accordingly (a hypothetical shape for this is sketched below).
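
A hypothetical shape for that detection (the `runEncoder` helper and probe value are invented here for illustration): run each encoder variant on a value that is sensitive to the GPU's rounding mode and keep whichever round-trips exactly:

```js
// runEncoder(kind, x) is assumed to run a 1x1 kernel that encodes and then
// decodes x using the chosen float-encode variant ('safe' or 'unsafe').
function pickEncoder(runEncoder) {
  const probe = Math.fround(1 / 3); // not exactly representable in binary
  if (runEncoder('unsafe', probe) === probe) return 'unsafe';
  if (runEncoder('safe', probe) === probe) return 'safe';
  return 'cpu'; // neither round-trips bit-exactly on this GPU
}
```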

@waylonflinn

@fuzzie360 very nice! I'm hoping that universal support for floating point textures in WebGL 2.0 will remove the need for the float encode altogether.
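
A quick capability check along those lines (a sketch; `EXT_color_buffer_float` is the standard gate for float render targets in WebGL 2):

```js
// With WebGL 2 + EXT_color_buffer_float, a kernel can render to an RGBA32F
// target and read floats back directly, skipping the byte encoding entirely.
const gl2 = document.createElement('canvas').getContext('webgl2');
const canReadFloats = !!(gl2 && gl2.getExtension('EXT_color_buffer_float'));
console.log(canReadFloats ? 'direct float readback' : 'keep encode_float');
```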

@robertleeplummerjr (Member)

@waylonflinn do you have the specific source for your benchmark here #126 (comment)?

@robertleeplummerjr (Member)

To answer the original question from @mike1808: not yet, as we are still mostly in alpha. The libraries you mention are very focused in what they solve, whereas gpu.js is very open-ended.

@waylonflinn

@robertleeplummerjr I ran the benchmarks this morning on my personal development machine. You can replicate this with the command `npm run benchmark`, as described in the benchmarks section of the weblas README.

@robertleeplummerjr (Member) commented Jul 17, 2017

I was able to find and fix a flaw in our compilation that gives us a 300% boost over previous benchmarks. @waylonflinn We're coming for you!
😋

@robertleeplummerjr (Member)

Yoohoo, @waylonflinn... #206 (comment)

Jaws theme plays...

@robertleeplummerjr (Member)

Note: the performance here isn't really fair, as it is showing off texture mode, which is like pipeline mode in weblas (which I would love to see the numbers on, and totally expect to be faster than gpu.js). But look at those numbers!

512 x 512 matrix multiplication:

3 milliseconds

@robertleeplummerjr (Member)

Landed in dev today, fyi.

@waylonflinn

Very much fast.

I'm still working out how to do reliable benchmarks in pipeline mode for weblas. Every time I do it, I get results that seem impossibly fast. I'll try to work something up and post it here for comparison.

@robertleeplummerjr (Member)

@waylonflinn I very much look forward to it, and possibly collaborating in the future!

@fuzzie360 (Member) commented Oct 24, 2017

Hi @robertleeplummerjr, sorry it's been a long time since I last checked in. But I need to say this: you really cannot use benchmark.js to test the timing for texture mode, as it is not a fair representation.

This is what is being timed by benchmark.js in texture mode:

```
timing  +------+
cpu     +------+             +-----------+
copying        +---+    +----+
gpu                +----+
```

What you really want to time is this, if you don't want to take into account the time taken to retrieve the data back to the CPU:

```
timing  +---------------+
cpu     +------+             +-----------+
copying        +---+    +----+
gpu                +----+
```

There is really no way to do that with a single kernel launch (e.g. a single matrix multiplication). To measure real-world texture-mode performance, you need to do something like raising a matrix to a power (multiple matrix multiplications in texture mode, then getting the result back), as sketched below.
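
A hedged sketch of that matrix-power benchmark, again assuming gpu.js's pipeline option and a prebuilt `gpu` instance; only the final `toArray()` pays the readback cost:

```js
// Raise a 512x512 host matrix `a` to the 10th power entirely on the GPU.
const multiply = gpu.createKernel(function (x, y) {
  let sum = 0;
  for (let i = 0; i < 512; i++) {
    sum += x[this.thread.y][i] * y[i][this.thread.x];
  }
  return sum;
}, { output: [512, 512], pipeline: true });

const t0 = performance.now();
let result = multiply(a, a);    // first multiply: host arrays in, texture out
for (let k = 2; k < 10; k++) {
  result = multiply(result, a); // texture in, texture out
}
const host = result.toArray();  // the single readback, included in the timing
console.log((performance.now() - t0).toFixed(2) + ' ms for a^10');
```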

@robertleeplummerjr (Member)

Yeah, I know it really isn't fair, which I did mention. What I was trying to convey is that it is just really fast. In the case of machine learning, which is where gpu.js has my fascination, once values are on the GPU they don't need to come back to the CPU unless you want to see an output or check the error rate.

@robertleeplummerjr (Member)

Nice comparisons!
