Optimizing Hamiltonian constructor for GPU acceleration #41

Open
naezzell opened this issue Dec 8, 2020 · 1 comment

@naezzell (Member) commented Dec 8, 2020

In a standard anneal, a user will use the standard_driver function

function standard_driver(num_qubit; sp = false)
    res = ""
    for idx = 1:num_qubit
        res = res * "I"^(idx - 1) * "X" * "I"^(num_qubit - idx) * "+"
    end
    q_translate(res[1:end-1], sp = sp)
end
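For reference (a worked example, not from the original issue): with num_qubit = 2 the loop builds the Pauli string "XI+IX+", the trailing "+" is dropped, and q_translate returns the dense matrix σx ⊗ I + I ⊗ σx:

H = standard_driver(2)   # 4×4 Array{Complex{Float64},2} representing σx ⊗ I + I ⊗ σx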

This generates a matrix of type Array{Complex{Float64},2}. While we've shown that casting this as a CuArray, i.e. cu(standard_driver(n)), is sufficient for a speed-up, it is not optimal. Ideally, the GPU should only deal with Float32s, and perhaps, even better, with real numbers only.
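A minimal sketch of the proposed single-precision path (assuming CUDA.jl is installed; ComplexF32 is the standard Julia alias for Complex{Float32}):

using CUDA

H = standard_driver(10)              # Array{Complex{Float64},2} on the CPU
H_gpu   = CuArray(H)                 # double-precision device copy
H_gpu32 = CuArray(ComplexF32.(H))    # convert to Complex{Float32} before the upload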

Furthermore, the DenseHamiltonian constructor performs "scalar operations" by indexing the m array (see the constructor below):

function DenseHamiltonian(funcs, mats; unit = :h, EIGS = EIGEN_DEFAULT)
    if any((x) -> size(x) != size(mats[1]), mats)
        throw(ArgumentError("Matrices in the list do not have the same size."))
    end
    if is_complex(funcs, mats)
        mats = complex.(mats)
    end
    hsize = size(mats[1])
    # use static array for size smaller than 100
    if hsize[1] <= 10
        mats = [SMatrix{hsize[1],hsize[2]}(unit_scale(unit) * m) for m in mats]
    else
        mats = unit_scale(unit) * mats
    end
    cache = similar(mats[1])
    EIGS = EIGS(cache)
    DenseHamiltonian{eltype(mats[1])}(funcs, mats, cache, hsize, EIGS)
end

Scalar indexing can be turned off with CUDA.allowscalar(false) (or GPUArrays.allowscalar(false), or something along those lines).
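A small demonstration of what the flag does (a sketch; CUDA.rand and allowscalar are standard CUDA.jl calls):

using CUDA
CUDA.allowscalar(false)   # any scalar getindex/setindex! on a CuArray now throws

A = CUDA.rand(4, 4)
sum(A)                    # fine: executes as a bulk GPU reduction
# A[1, 1]                 # would raise the "scalar getindex is disallowed" error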

Questions / things to resolve:
1.) Does converting matrices to Array{Complex{Float32},2} before casting them as CuArrays help GPU performance? If so, add this support (see the benchmark sketch after this list).
2.) Is there any speed to be gained by representing complex numbers as two real numbers instead of a Complex type? Does CUDA handle that for us?
3.) Does CUDA.allowscalar(false) actually help us? If not, is there a way to remove scalar operations from the DenseHamiltonian constructor in the first place, so that scalar operations don't occur on the GPU?
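As promised in question 1, a micro-benchmark sketch (assumes CUDA.jl and BenchmarkTools.jl are available, and standard_driver as defined above):

using CUDA, BenchmarkTools

H   = standard_driver(12)          # 4096×4096 Array{Complex{Float64},2}
d64 = CuArray(H)                   # Complex{Float64} device copy
d32 = CuArray(ComplexF32.(H))      # Complex{Float32} device copy

# CUDA.@sync blocks until the kernel finishes, so @btime measures real GPU time
@btime CUDA.@sync $d64 * $d64
@btime CUDA.@sync $d32 * $d32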

@naezzell naezzell added the GPU improvements related to GPU acceleration label Dec 8, 2020
@SuperElephant (Collaborator) commented:

I tried to disable scalar indexing in try_gpu_accel.jl (by just adding CUDA.allowscalar(false) at line 64). However, it turns out this is not as trivial as adding a single line: it produces the errors below. I suppose some changes to the DenseHamiltonian function are needed in order to disable scalar indexing.

ERROR: LoadError: scalar getindex is disallowed
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] assertscalar(::String) at /home1/chaoxian/.julia/packages/GPUArrays/uaFZh/src/host/indexing.jl:41
 [3] getindex at /home1/chaoxian/.julia/packages/GPUArrays/uaFZh/src/host/indexing.jl:96 [inlined]
 [4] macro expansion at /home1/chaoxian/.julia/packages/StaticArrays/l7lu2/src/convert.jl:46 [inlined]
 [5] unroll_tuple at /home1/chaoxian/.julia/packages/StaticArrays/l7lu2/src/convert.jl:43 [inlined]
 [6] _convert at /home1/chaoxian/.julia/packages/StaticArrays/l7lu2/src/convert.jl:35 [inlined]
 [7] convert at /home1/chaoxian/.julia/packages/StaticArrays/l7lu2/src/convert.jl:32 [inlined]
 [8] StaticArrays.SArray{Tuple{8,8},T,2,L} where L where T(::CuArray{Complex{Float64},2}) at /home1/chaoxian/.julia/packages/StaticArrays/l7lu2/src/convert.jl:7
 [9] (::OpenQuantumBase.var"#180#182"{Symbol,Tuple{Int64,Int64}})(::CuArray{Complex{Float64},2}) at /home1/chaoxian/.julia/packages/OpenQuantumBase/YkEiX/src/base_util.jl:0
 [10] iterate at ./generator.jl:47 [inlined]
 [11] collect(::Base.Generator{Array{CuArray{Complex{Float64},2},1},OpenQuantumBase.var"#180#182"{Symbol,Tuple{Int64,Int64}}}) at ./array.jl:665
 [12] DenseHamiltonian(::Array{Function,1}, ::Array{CuArray{Complex{Float64},2},1}; unit::Symbol, EIGS::typeof(EIGEN_DEFAULT)) at /home1/chaoxian/.julia/packages/OpenQuantumBase/YkEiX/src/hamiltonian/dense_hamiltonian.jl:41
 [13] anneal_spin_glass_gpu(::Int64, ::Int64) at /home1/chaoxian/final_project/accelqat/cuda/try_gpu_accel_ds.jl:58
 [14] top-level scope at ./util.jl:175
 [15] include(::Module, ::String) at ./Base.jl:377
 [16] exec_options(::Base.JLOptions) at ./client.jl:288
 [17] _start() at ./client.jl:484
in expression starting at /home1/chaoxian/final_project/accelqat/cuda/try_gpu_accel_ds.jl:67

Just a reminder: as shown in the stacktrace, line 41 of dense_hamiltonian.jl will need changes. There might also be more changes needed beyond that, as mentioned by @naezzell:

mats = [SMatrix{hsize[1],hsize[2]}(unit_scale(unit) * m) for m in mats]
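One possible direction (a hypothetical sketch, not a maintained fix): converting a CuArray into an SMatrix reads every element with scalar getindex, so the static-array fast path could simply be skipped for GPU inputs:

# hypothetical change: only build SMatrix caches for CPU arrays,
# leaving CuArrays in bulk form so no scalar indexing occurs
if hsize[1] <= 10 && !(mats[1] isa CUDA.CuArray)
    mats = [SMatrix{hsize[1],hsize[2]}(unit_scale(unit) * m) for m in mats]
else
    mats = unit_scale(unit) * mats
end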
