Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PCG works slowly and type warning questions #7

Closed
sunoru opened this issue Jun 8, 2016 · 16 comments
Closed

PCG works slowly and type warning questions #7

sunoru opened this issue Jun 8, 2016 · 16 comments

Comments

@sunoru
Copy link
Member

sunoru commented Jun 8, 2016

I tested one common PCG declared and initialized as:

using RNG.PCG
r = PermutedCongruentialGenerator(PCGStateSetseq, PCG_XSH_RR, UInt64)
srand(r, (42 % UInt64, 54 % UInt64))

It passed all the tests of the TestU01 big crush battery as expected, but the long CPU time frightened me. It took over 21 hours on my laptop.

========= Summary results of BigCrush =========

 Version:          TestU01 1.2.3
 Generator:
 Number of statistics:  160
 Total CPU time:   21:32:37.44

 All tests were passed

For comparison, the test results of Base.Random.rand are:

========= Summary results of BigCrush =========

 Version:          TestU01 1.2.3
 Generator:
 Number of statistics:  160
 Total CPU time:   03:49:04.43
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
 80  LinearComp, r = 0              1 - eps1
 81  LinearComp, r = 29             1 - eps1
 ----------------------------------------------
 All other tests were passed

Obviously it's unacceptable with such slow speed. I've tried to use @code_warntype to find where could be the bottleneck, and it shows some strange positions of codes are analyzed to return Any, which was thought to be optimized. For example,

julia> @code_warntype rand(r, UInt32)
Variables:
  #self#::Base.Random.#rand
  pcg::RNG.PCG.PermutedCongruentialGenerator{RNG.PCG.PCGStateSetseq{StateUIntType<:Union{UInt128,UInt16,UInt32,UInt64,UInt8},MethodType<:Union{Val{:RXS_M_XS},Val{:XSH_RR},Val{:XSH_RS},Val{:XSL_RR_RR},Val{:XSL_RR}}},Val{:XSH_RR},UInt64,UInt32}
  #unused#::Type{UInt32}

Body:
  begin 
      $(Expr(:meta, :inline)) # /home/sunoru/.julia/v0.5/RNG/src/./PCG/main.jl, line 40:
      return (top(typeassert))((RNG.PCG.pcg_random)((top(getfield))(pcg::RNG.PCG.PermutedCongruentialGenerator{RNG.PCG.PCGStateSetseq{StateUIntType<:Union{UInt128,UInt16,UInt32,UInt64,UInt8},MethodType<:Union{Val{:RXS_M_XS},Val{:XSH_RR},Val{:XSH_RS},Val{:XSL_RR_RR},Val{:XSL_RR}}},Val{:XSH_RR},UInt64,UInt32},:state)::RNG.PCG.PCGStateSetseq{StateUIntType<:Union{UInt128,UInt16,UInt32,UInt64,UInt8},MethodType<:Union{Val{:RXS_M_XS},Val{:XSH_RR},Val{:XSH_RS},Val{:XSL_RR_RR},Val{:XSL_RR}}})::Any,$(Expr(:static_parameter, 4)))::UInt32
  end::UInt32

How can I fix this problem?

@sunoru
Copy link
Member Author

sunoru commented Jun 8, 2016

By the way, I thought it may be an issue of RNGTest.wrappedRNG at first, but the time test also shows the problem.

julia> function testn(n, rng)
       for i = 1:n
       rand(rng);
       end
       end

julia> @time testn(100000, Base.Random.GLOBAL_RNG)
  0.024788 seconds (3.13 k allocations: 131.090 KB)

julia> @time testn(1000000, Base.Random.GLOBAL_RNG)
  0.007132 seconds (4 allocations: 160 bytes)

julia> @time testn(10000000, Base.Random.GLOBAL_RNG)
  0.065980 seconds (4 allocations: 160 bytes)

julia> @time testn(100000, r)
  0.026353 seconds (200.00 k allocations: 3.052 MB)

julia> @time testn(1000000, r)
  0.126789 seconds (2.00 M allocations: 30.518 MB, 10.10% gc time)

julia> @time testn(10000000, r)
  0.786922 seconds (20.00 M allocations: 305.176 MB, 2.50% gc time)

The allocations are also...

@simonbyrne
Copy link

Well, it is called BigCrush for a reason..

But those timings do look slow: have you tried profiling?

@sunoru
Copy link
Member Author

sunoru commented Jun 8, 2016

Yes, the backtraces seem usual.

@simonbyrne
Copy link

What does @profile testn(10000000, r); Profile.print() give?

@sunoru
Copy link
Member Author

sunoru commented Jun 8, 2016

476 ./event.jl:46; (::Base.REPL.##1#2{Base.REPL.REPLBackend})()
 476 ./REPL.jl:62; eval_user_input(::Any, ::Base.REPL.REPLBackend)
  476 ./boot.jl:236; eval(::Module, ::Any)
   476 ./profile.jl:16; anonymous
    4   ./REPL[4]:2; testn(::Int64, ::RNG.PCG.PermutedCongruentialGenerator{RNG.PCG.PCGStateSetseq{StateUIntTyp...
    469 ./REPL[4]:3; testn(::Int64, ::RNG.PCG.PermutedCongruentialGenerator{RNG.PCG.PCGStateSetseq{StateUIntTyp...
     4  /home/sunoru/.julia/v0.5/RNG/src/./PCG/bases.jl:0; pcg_random(::RNG.PCG.PCGStateSetseq{UInt64,Val{:XSH_RR}})
     2  /home/sunoru/.julia/v0.5/RNG/src/./PCG/bases.jl:231; pcg_random(::RNG.PCG.PCGStateSetseq{UInt64,Val{:XSH_RR}})
     10 /home/sunoru/.julia/v0.5/RNG/src/./PCG/bases.jl:232; pcg_random(::RNG.PCG.PCGStateSetseq{UInt64,Val{:XSH_RR}})
     24 /home/sunoru/.julia/v0.5/RNG/src/./PCG/bases.jl:233; pcg_random(::RNG.PCG.PCGStateSetseq{UInt64,Val{:XSH_RR}})

@simonbyrne
Copy link

Hmm that's not much help, I guess because of the inlining.

Since the code is quite short, the best option might be to write it out all the code in 1 function and see if you can make it faster, then compare it with what you have here.

@sunoru
Copy link
Member Author

sunoru commented Jun 8, 2016

All right

@sunoru
Copy link
Member Author

sunoru commented Jun 8, 2016

Actually I have looked at the code_llvm of pcg_random and found the functions inlined correctly. Directly calling pcg_random doesn't help either.

And I noticed that with the convention of UInt to Float, rand takes as twice time as that output UInt directly. But it's still not at the same order of magnitude as Base.Random.GLOBAL_RNG.

@sunoru
Copy link
Member Author

sunoru commented Jun 8, 2016

code_native(RNG.PCG.pcg_random, (RNG.PCG.PCGStateSetseq{UInt64,Val{:XSH_RR}},))
        .text
Filename: bases.jl
Source line: 0
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 231
        movq    (%rdi), %rcx
Source line: 206
        movabsq $6364136223846793005, %rax # imm = 0x5851F42D4C957F2D
        imulq   %rcx, %rax
        addq    8(%rdi), %rax
        movq    %rax, (%rdi)
Source line: 50
        movq    %rcx, %rax
        shrq    $18, %rax
        xorq    %rcx, %rax
        shrq    $27, %rax
        shrq    $59, %rcx
Source line: 12
        movl    %ecx, %edx
        negl    %edx
        movl    %eax, %esi
        shrl    %cl, %esi
        movb    %dl, %cl
        shll    %cl, %eax
        orl     %esi, %eax
        popq    %rbp
        retq
        nop

I don't have much knowledge about assembly but I think this seems simple and clear enough?


Oh, OK, then I found that the pcg_rotr function is not optimized to rorq when inlined in other functions...

@simonbyrne
Copy link

I don't think it's just the rorq. Here's what I had in mind:
https://gist.github.com/simonbyrne/51c1f29205c8d3374030d28b932f0f00

After warmup, I get:

julia> @time foo(0x185706b82c2e03f8, 10_000_000)
  0.021174 seconds (5 allocations: 176 bytes)
0x004c507bc616b135

julia> @time bar(r,10_000_000)
  0.592998 seconds (20.00 M allocations: 305.176 MB, 4.28% gc time)
0x004c507bc616b135

which suggests there is some sort of type instability or codegen problem here

I think you're making your code way more complex than it needs to be, and this could be causing performance problems somewhere (it certainly makes it harder to find the performance problems). I would suggest:

  • Rather than try to be clever with symbols in parametric types, just declare a new type for each generator (i.e. type PcgXshRr ...). While type parameters are incredibly powerful, they do add complexity: my rough rule of thumb is that you should try to limit yourself to 2 parameters.
  • Get rid of the PermutedCongruentialGenerator wrapper type: does it serve any purpose? It seems like it might be better served by an abstract type (e.g. AbstractPCGRNG).
  • As I mentioned in the chat last week, I don't think any of the code actually requires @eval.

@sunoru
Copy link
Member Author

sunoru commented Jun 9, 2016

wow, this is amazing... I didn't realize the parametric types could affect so much on performance...

@simonbyrne
Copy link

simonbyrne commented Jun 9, 2016 via email

@sunoru
Copy link
Member Author

sunoru commented Jun 9, 2016

Yes, I'm trying to figure it out, and rewriting some codes to make it more clear. I have a problem while replacing the @eval, how to declare the unit types to return if I don't use @eval?

@simonbyrne
Copy link

The output type is usually half the state, isn't it? So one way would be to define a function

halfwidth(::Type{UInt64}) = UInt32
halfwidth(::Type{UInt32}) = UInt16

etc., and then call this inside the function, i.e.

O = halfwidth(typeof(state))
rotr((((state >> p1) $ state) >> p2) % O,
     (state >> p3) % O)

@sunoru
Copy link
Member Author

sunoru commented Jun 9, 2016

OK, I see.

@sunoru
Copy link
Member Author

sunoru commented Jun 14, 2016

julia> @time foo(0x185706b82c2e03f8, 10_000_000)
  0.064160 seconds (5 allocations: 176 bytes)
0x004c507bc616b135

julia> @time bar(r,10_000_000)
  0.034669 seconds (5 allocations: 176 bytes)
0x004c495e0178e5df

It seems fixed now.

@sunoru sunoru closed this as completed Jul 6, 2016
sunoru pushed a commit that referenced this issue Jan 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants