-
Notifications
You must be signed in to change notification settings - Fork 2
PARALUTION Exercise
PARALUTION is a library which enables you to perform various sparse iterative solvers and preconditioners on multi/many-core CPU and GPU devices. Based on C++, it provides generic and flexible design which allows seamless integration with other scientific software packages. The project is available on http://www.paralution.com
ssh -Y todi
module swap PrgEnv-cray PrgEnv-gnu
module load cmake
module load cudatoolkit
If you did not allocate a compute node, please do so
salloc -N 1
wget "http://www.paralution.com/downloads/paralution-0.3.0.tar.gz"
tar zxvf paralution-0.3.0.tar.gz
cd paralution-0.3.0
mkdir build
cd build/
cmake ..
make -j
cd ..
Now the library is ready!
(in ~/paralution-0.3.0/)
mkdir mat
cd mat
wget "http://www.cise.ufl.edu/research/sparse/MM/McRae/ecology2.tar.gz"
wget "http://www.cise.ufl.edu/research/sparse/MM/AMD/G3_circuit.tar.gz"
wget "http://www.cise.ufl.edu/research/sparse/MM/Schmid/thermal2.tar.gz"
wget "ftp://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/laplace/gr_30_30.mtx.gz"
tar zxvf ecology2.tar.gz
tar xvf G3_circuit.tar.gz
tar xvf thermal2.tar.gz
gzip -d gr_30_30.mtx.gz
cd ..
ecology2, G3_circuit, thermal2 are large matrices
gr_30_30 is a small matrix
(in ~/paralution-0.3.0/)
aprun build/bin/cg mat/gr_30_30.mtx
You should get similar output:
lukarski@todi4:~/paralution-0.3.0> aprun build/bin/cg mat/gr_30_30.mtx Number of GPU devices in the sytem: 1 Paralution platform is initialized OpenMP threads:1 No MKL support Selected GPU devices: 0 ------------------------------------------------ Device number: 0 Device name: Tesla K20X totalGlobalMem: 5759 MByte clockRate: 732000 computeMode: 3 ECCEnabled: 1 ------------------------------------------------ No OpenCL support ReadFileMTX: filename=mat/gr_30_30.mtx; reading... ReadFileMTX: filename=mat/gr_30_30.mtx; done LocalMatrix name=mat/gr_30_30.mtx; rows=900; cols=900; nnz=7744; prec=64bit; format=CSR; host backend={CPU(OpenMP)}; accelerator backend={GPU(CUDA)}; current=CPU(OpenMP) *** warning: LocalMatrix::SymbolicPower() is performed on the host *** warning: LocalMatrix::ILUpFactorize() is performed on the host LocalMatrix name=mat/gr_30_30.mtx; rows=900; cols=900; nnz=7744; prec=64bit; format=CSR; host backend={CPU(OpenMP)}; accelerator backend={GPU(CUDA)}; current=GPU(CUDA) PCG solver starts, with preconditioner: Multicolored ILU preconditioner (power(q)-pattern method), ILU(1,2) number of colors = 9; ILU nnz = 13260 IterationControl criteria: abs tol=1e-15; rel tol=1e-06; div tol=1e+08; max iter=1000000 IterationControl initial residual = 30 IterationControl RELATIVE criteria has been reached: res norm=1.17875e-05; rel val=3.92915e-07; iter=17 PCG ends Solver execution:0.111718 sec Application 2089884 resources: utime ~2s, stime ~1s, Rss ~271540, inblocks ~3957, outblocks ~7571
#include <iostream> #include <paralution.hpp> #include <sys/time.h> using namespace paralution; int main(int argc, char* argv[]) { init_paralution(); info_paralution(); // Time measurement struct timeval now; double tick, tack; gettimeofday(&now, NULL); tick = now.tv_sec*1000000.0+(now.tv_usec); // Do some work ... gettimeofday(&now, NULL); tack = now.tv_sec*1000000.0+(now.tv_usec); std::cout << "Time:" << (tack-tick)/1000000.0 << std::endl; stop_paralution(); }
g++ -fopenmp test.cpp -o test build/lib/libparalution.a -Ibuild/inc/ -lcudart -lcublas -lcusparse -L$CUDATOOLKIT_HOME/lib64
lukarski@todi4:~/paralution-0.3.0> g++ -fopenmp test.cpp -o test build/lib/libparalution.a -Ibuild/inc/ -lcudart -lcublas -lcusparse -L$CUDATOOLKIT_HOME/lib64 lukarski@todi4:~/paralution-0.3.0> aprun ./test mat/gr_30_30.mtx Number of GPU devices in the sytem: 1 Paralution platform is initialized OpenMP threads:16 No MKL support Selected GPU devices: 0 ------------------------------------------------ Device number: 0 Device name: Tesla K20X totalGlobalMem: 5759 MByte clockRate: 732000 computeMode: 3 ECCEnabled: 1 ------------------------------------------------ No OpenCL support Time:0 Application 2090083 resources: utime ~2s, stime ~1s, Rss ~269892, inblocks ~3257, outblocks ~5792 lukarski@todi4:~/paralution-0.3.0>
- Read the matrix from a file (MTX format)
- Create vectors x and y (with the corresponding sizes)
- Set x=1; y=0;
- Do SpMV
LocalVector<double> x, y; LocalMatrix<double> A; A.ReadFileMTX(std::string(argv[1])); x.Allocate("x", A.get_ncol()); y.Allocate("y", A.get_nrow()); x.Ones(); y.Zeros(); A.Apply(x, &y);
Just before the SpMV (A.Apply(x, &y);
) move all the objects to the accelerator:
A.MoveToAccelerator(); x.MoveToAccelerator(); y.MoveToAccelerator();
- Build a SpMV test program
- Offload the SpMV operation to the GPU
- Perform 100-1000 SpMV and measure their time
- Test various matrices
- Test various matrix formats (via Matrix.ConvertToXXX(); XXX=[CSR,MCSR,DIA,ELL,HYB])
- Run the SpMV test without GPU (1): run the same program (without recompiling it) on the login-node with a the small test matrix -- the platform will detect no GPU devices and will force the execution on the host.
- Run the SpMV test without GPU (2): unload the cudatoolkit module, recompile the code and run the program again -- the accelerator (GPU) backend will be disabled during the compilation and the .MoveToAccelerator() calls will be ignored.
You can control the OpenMP threads with bash variable
export OMP_NUM_THREADS=16
or within your code by the set_omp_threads_paralution(int nthreads)
function.