This repository has been archived by the owner on Apr 28, 2020. It is now read-only.

Benchmark behaviour on defined class is slow. #3

Open
bon opened this issue Sep 1, 2016 · 8 comments

@bon

bon commented Sep 1, 2016

The benchmarks provided are for methods on the built-in Lisp types number, fixnum and double-float. To test the behaviour on defined classes we added a simple boxing class and found that performance degraded when the inlined-generic-functions were actually inlined. We measured the following numbers of processor cycles for the four methods in playground.lisp, respectively:

     333,033
     331,839
   2,144,814
     585,272

Experiment run on SBCL 1.3.5.24.

See bon@8b6e4d5

So my question is: does this indicate that inlined-generic-functions only give a speedup on built-in types and not on defined classes?

@guicho271828
Owner

It seems normal-plus is running without boxing, right?

@bon
Author

bon commented Sep 1, 2016

Correct! Fixed in bon@76d1eb6

Processor cycles are now

    588,650
    586,253
  1,889,394
    550,351

@guicho271828
Owner

phew.

@guicho271828
Owner

I just tested your version. On my machine, the result is still in favor of the inlined version.

Evaluation took:
  0.001 seconds of real time
  0.004000 seconds of total run time (0.004000 user, 0.000000 system)
  400.00% CPU
  638,640 processor cycles
  131,024 bytes consed

Evaluation took:
  0.000 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  100.00% CPU
  608,634 processor cycles
  163,808 bytes consed

Evaluation took:
  0.003 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  0.00% CPU
  4,543,020 processor cycles
  655,184 bytes consed

Evaluation took:
  0.000 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  100.00% CPU
  389,169 processor cycles
  163,808 bytes consed

What explains this difference? In your results the inlined generic function performs better, but not by much.
I use SBCL 1.3.8 via Roswell on:

$ uname -a
Linux guicho-x61 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:01:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/cpuinfo
...
model name  : Intel(R) Core(TM)2 Duo CPU     T7100  @ 1.80GHz
...

@bon
Author

bon commented Sep 3, 2016

For me the cycle counts vary wildly from run to run. Sometimes the IGF is a little quicker, sometimes slower. One example is shown below.

But the more interesting question is: why did the IGF show a 10x speedup on numbers but hardly any difference on defined classes? Of course I would be very happy to see a 10x speedup on defined classes too!

$ cat /proc/cpuinfo  | ag 'model name' | head -1
model name  : Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
$ uname -a
Linux tie 4.7.2-1-ARCH #1 SMP PREEMPT Sat Aug 20 23:02:56 CEST 2016 x86_64 GNU/Linux
$ ros use sbcl
$ ~/.roswell/impls/x86-64/linux/sbcl/1.3.9/bin/sbcl --version
SBCL 1.3.9
$ ros run
$ rlwrap ros run
* (ql:quickload :inlined-generic-function)

...

* (load "benchmark.lisp")

...

Evaluation took:
  0.000 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  100.00% CPU
  424,334 processor cycles
  131,024 bytes consed

Evaluation took:
  0.000 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  100.00% CPU
  362,358 processor cycles
  163,792 bytes consed

Evaluation took:
  0.001 seconds of real time
  0.000000 seconds of total run time (0.000000 user, 0.000000 system)
  0.00% CPU
  2,060,160 processor cycles
  655,200 bytes consed

Evaluation took:
  0.000 seconds of real time
  0.003333 seconds of total run time (0.003333 user, 0.000000 system)
  100.00% CPU
  493,287 processor cycles
  163,792 bytes consed

@guicho271828
Owner

guicho271828 commented Sep 3, 2016

The 10x speedup is not achieved because of missing type information and the cost of slot access.

  1. The contents slot of box is not typed, so the (+ (contents a) b) part always calls the generic +, not the optimized machine code. You should check the disassembly.
  2. The accessor contents is a normal generic function, so the slot access is slow.
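
To make the two points above concrete, here is a minimal sketch, not the code from the linked commit. The `box` class is a hypothetical stand-in for the boxing class in playground.lisp, and `typed-box`, `box-plus` and `typed-box-plus` are names invented for illustration. It contrasts an untyped CLOS slot read through a generic accessor with a struct slot declared `fixnum`, and disassembles both so the generated code can be compared.

```lisp
;; Illustration only: all names below are assumptions, not taken from the repository.
(defpackage :box-sketch (:use :cl))
(in-package :box-sketch)

;; Hypothetical stand-in for the boxing class: untyped slot, generic accessor.
(defclass box ()
  ((contents :initarg :contents :accessor contents)))

;; Alternative with a typed slot and a non-generic struct reader.
(defstruct (typed-box (:constructor make-typed-box (contents)))
  (contents 0 :type fixnum :read-only t))

(defun box-plus (a b)
  ;; The compiler knows nothing about the type of (contents a) here.
  (declare (optimize (speed 3)) (type fixnum b))
  (+ (contents a) b))

(defun typed-box-plus (a b)
  ;; Both operands of + are known to be fixnums here.
  (declare (optimize (speed 3)) (type typed-box a) (type fixnum b))
  (+ (typed-box-contents a) b))

;; Compare the generated code of the two functions.
(disassemble 'box-plus)
(disassemble 'typed-box-plus)
```

On SBCL one would expect the first disassembly to contain calls for the accessor and for the generic addition routine, and the second to use inline fixnum arithmetic, but the exact output depends on the implementation and version.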

Imagine the total cost is 10X for a normal GF and X for an IGF. The two factors above add overheads A and B to both, giving 10X+A+B versus X+A+B. A 10x speedup is then clearly not achievable, since A+B can be very large.
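
The cost model above can be worked through with made-up numbers. In this small sketch the function name `overall-speedup` and the input values are purely illustrative, not measurements from this thread; `overhead` stands for A+B.

```lisp
;; Illustration only: hypothetical helper, arbitrary inputs.
(defun overall-speedup (x overhead)
  "Ratio (10X + overhead) / (X + overhead), where overhead stands for A+B."
  (/ (+ (* 10 x) overhead)
     (+ x overhead)))

(overall-speedup 1 0)   ; => 10      no overhead: the full 10x
(overall-speedup 1 9)   ; => 19/10   overhead of 9X: under 2x
(overall-speedup 1 90)  ; => 100/91  overhead of 90X: barely above 1x
```

The larger the shared overhead is relative to X, the closer the ratio gets to 1, however fast the dispatch itself becomes.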

@guicho271828
Owner

I updated my environment and noticed that the examples in playground.lisp have become slow. It looks like the functions are being prevented from being inlined.

@guicho271828
Owner

(push :inline-generic-function *features*) still successfully forces the functions to be inlined, but I don't like this solution...
