PARALUTION Exercise

Preliminaries

PARALUTION is a library which enables you to perform various sparse iterative solvers and preconditioners on multi/many-core CPU and GPU devices. Based on C++, it provides generic and flexible design which allows seamless integration with other scientific software packages. The project is available on http://www.paralution.com

PARALUTION User Manual (PDF)

PARALUTION Doxygen

Compile the library on TODI (with GPU support)

Log in to TODI and load GNU/cmake/CUDA modules

ssh -Y todi

module swap PrgEnv-cray PrgEnv-gnu

module load cmake

module load cudatoolkit

If you did not allocate a compute node, please do so

salloc -N 1

Get PARALUTION on TODI

wget "http://www.paralution.com/downloads/paralution-0.3.0.tar.gz"

Unpack it

tar zxvf paralution-0.3.0.tar.gz

Build it

cd paralution-0.3.0

mkdir build

cd build/

cmake ..

make -j

cd ..

Now the library is ready!

Download some test matrices

(in ~/paralution-0.3.0/)

mkdir mat

cd mat

wget "http://www.cise.ufl.edu/research/sparse/MM/McRae/ecology2.tar.gz"

wget "http://www.cise.ufl.edu/research/sparse/MM/AMD/G3_circuit.tar.gz"

wget "http://www.cise.ufl.edu/research/sparse/MM/Schmid/thermal2.tar.gz"

wget "ftp://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/laplace/gr_30_30.mtx.gz"

tar zxvf ecology2.tar.gz

tar xvf G3_circuit.tar.gz

tar xvf thermal2.tar.gz

gzip -d gr_30_30.mtx.gz

cd ..

ecology2, G3_circuit, thermal2 are large matrices

gr_30_30 is a small matrix

Run a CG test with PARALUTION

(in ~/paralution-0.3.0/)

aprun build/bin/cg mat/gr_30_30.mtx

You should get similar output:

lukarski@todi4:~/paralution-0.3.0> aprun build/bin/cg mat/gr_30_30.mtx
Number of GPU devices in the sytem: 1
Paralution platform is initialized
OpenMP threads:1
No MKL support
Selected GPU devices: 0
 ------------------------------------------------
Device number: 0
Device name: Tesla K20X
totalGlobalMem: 5759 MByte
clockRate: 732000
computeMode: 3
ECCEnabled: 1
 ------------------------------------------------
No OpenCL support
ReadFileMTX: filename=mat/gr_30_30.mtx; reading...
ReadFileMTX: filename=mat/gr_30_30.mtx; done
LocalMatrix name=mat/gr_30_30.mtx; rows=900; cols=900; nnz=7744; prec=64bit; format=CSR; host backend={CPU(OpenMP)}; accelerator backend={GPU(CUDA)}; current=CPU(OpenMP)
*** warning: LocalMatrix::SymbolicPower() is performed on the host
*** warning: LocalMatrix::ILUpFactorize() is performed on the host
LocalMatrix name=mat/gr_30_30.mtx; rows=900; cols=900; nnz=7744; prec=64bit; format=CSR; host backend={CPU(OpenMP)}; accelerator backend={GPU(CUDA)}; current=GPU(CUDA)
PCG solver starts, with preconditioner:
Multicolored ILU preconditioner (power(q)-pattern method), ILU(1,2)
number of colors = 9; ILU nnz = 13260
IterationControl criteria: abs tol=1e-15; rel tol=1e-06; div tol=1e+08; max iter=1000000
IterationControl initial residual = 30
IterationControl RELATIVE criteria has been reached: res norm=1.17875e-05; rel val=3.92915e-07; iter=17
PCG ends
Solver execution:0.111718 sec
Application 2089884 resources: utime ~2s, stime ~1s, Rss ~271540, inblocks ~3957, outblocks ~7571

A PARALUTION skeleton file (test.cpp in ~/paralution-0.3.0/)

#include <iostream>
#include <paralution.hpp>
#include <sys/time.h>

using namespace paralution;

int main(int argc, char* argv[]) {

  init_paralution();

  info_paralution();

  // Time measurement
  struct timeval now;
  double tick, tack;

  gettimeofday(&now, NULL);
  tick = now.tv_sec*1000000.0+(now.tv_usec);

  // Do some work ...

  gettimeofday(&now, NULL);
  tack = now.tv_sec*1000000.0+(now.tv_usec);

  std::cout << "Time:" << (tack-tick)/1000000.0 << std::endl;


  stop_paralution();

}

Compile the skeleton file

g++ -fopenmp test.cpp -o test build/lib/libparalution.a -Ibuild/inc/ -lcudart -lcublas -lcusparse -L$CUDATOOLKIT_HOME/lib64

lukarski@todi4:~/paralution-0.3.0> g++ -fopenmp test.cpp -o test build/lib/libparalution.a -Ibuild/inc/ -lcudart -lcublas -lcusparse -L$CUDATOOLKIT_HOME/lib64
lukarski@todi4:~/paralution-0.3.0> aprun ./test mat/gr_30_30.mtx 
Number of GPU devices in the sytem: 1
Paralution platform is initialized
OpenMP threads:16
No MKL support
Selected GPU devices: 0
 ------------------------------------------------
Device number: 0
Device name: Tesla K20X
totalGlobalMem: 5759 MByte
clockRate: 732000
computeMode: 3
ECCEnabled: 1
 ------------------------------------------------
No OpenCL support
Time:0
Application 2090083 resources: utime ~2s, stime ~1s, Rss ~269892, inblocks ~3257, outblocks ~5792
lukarski@todi4:~/paralution-0.3.0>

SpMV (Host)

Read the matrix from a file (MTX format)
Create vectors x and y (with the corresponding sizes)
Set x=1; y=0;
Do SpMV

  LocalVector<double> x, y;
  LocalMatrix<double> A;

  A.ReadFileMTX(std::string(argv[1]));
  x.Allocate("x", A.get_ncol());
  y.Allocate("y", A.get_nrow());

  x.Ones();
  y.Zeros();

  A.Apply(x, &y);

SpMV (Accelerator)

Just before the SpMV (A.Apply(x, &y);) move all the objects to the accelerator:

  A.MoveToAccelerator();
  x.MoveToAccelerator();
  y.MoveToAccelerator();

Your tasks:

Build a SpMV test program
Offload the SpMV operation to the GPU
Perform 100-1000 SpMV and measure their time
Test various matrices
Test various matrix formats (via Matrix.ConvertToXXX(); XXX=[CSR,MCSR,DIA,ELL,HYB])
Run the SpMV test without GPU (1): run the same program (without recompiling it) on the login-node with a the small test matrix -- the platform will detect no GPU devices and will force the execution on the host.
Run the SpMV test without GPU (2): unload the cudatoolkit module, recompile the code and run the program again -- the accelerator (GPU) backend will be disabled during the compilation and the .MoveToAccelerator() calls will be ignored.

Tips

OpenMP threads control

You can control the OpenMP threads with bash variable

export OMP_NUM_THREADS=16

or within your code by the set_omp_threads_paralution(int nthreads) function.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly