Home
Welcome to the FoMICS Autumn School GPU_Libraries_2013 wiki!
This page contains exercises and other information for the FoMICS Autumn School 2013 on GPU-enabled libraries, given at the Università della Svizzera italiana (USI) on Sep. 14-15, 2013, as a prelude to the Domain Decomposition 22 conference.
Instructions for wireless access will be given at the start of the course.
As mentioned in the course requirements, all participants should bring along an X-windows-capable laptop. You will do all the hands-on training on "Todi", a Cray XK7 system running Linux: http://user.cscs.ch/hardware/todi_cray_xk7/index.html
- Log into the CSCS front-end ela.cscs.ch
ssh -Y <userXX>@ela.cscs.ch # -Y allows X protocol
The user number XX and the password will be provided in the course.
- Log into Todi:
ssh -Y todi # -Y allows X protocol
At this point you are on a login node; cross-compiled code will not run here, but only on a "compute node".
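On Cray systems the compiler wrappers cc (C), CC (C++), and ftn (Fortran) produce executables targeted at the compute nodes. A minimal sketch of compile-and-run, assuming a hypothetical source file hello.c (the actual exercise sources will differ):
cc -o hello hello.c # the Cray wrapper picks the compiler from the loaded PrgEnv module
aprun -n 1 ./hello # aprun launches the executable on a compute node (requires an allocation, see the salloc step below)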
- Load the module with the GIT version control system (you may need this to check out software for exercises)
module load git
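With the module loaded you can check out code in the usual way; the repository address below is only a placeholder, and the actual URL will be given with each exercise:
git --version # confirm git is available
git clone <exercise-repository-URL> # replace with the URL given in the course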
- Swap in the GNU programming environment
module swap PrgEnv-cray PrgEnv-gnu
Now load the support for NVIDIA GPUs:
module load craype-accel-nvidia35
Note: this will load the CUDA toolkit version 5.0, which is not the newest, but is sufficient for our exercises.
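As an optional sanity check, you can confirm which environment and toolkit are now active:
module list # should list PrgEnv-gnu and craype-accel-nvidia35
nvcc --version # reports the CUDA toolkit release (5.0)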
- When ready to do an exercise, allocate 1 compute node for your work (valid for one hour). This should suffice for one exercise.
salloc -N 1
With this allocation you should be able to compile and run software interactively. One node provides 16 CPU cores, which can be used, e.g., for 16 MPI processes with a single thread each, one process with 16 threads, eight processes with two threads each, and so on. The salloc command (part of the SLURM batch system) inherits the environment already defined, so it is not necessary to load or swap modules again.
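For example, on a single 16-core node the configurations just mentioned correspond to aprun launches like the following (aprun is the application launcher on the Cray XK7; ./app stands for any exercise executable):
aprun -n 16 ./app # 16 MPI processes, one thread each
OMP_NUM_THREADS=16 aprun -n 1 -d 16 ./app # one process with 16 OpenMP threads
OMP_NUM_THREADS=2 aprun -n 8 -d 2 ./app # eight processes with two threads each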
- Later in the course you may experiment with running the code across multiple nodes, e.g.,
salloc -N 4
Four nodes will support up to 64 MPI processes, or a combination of processes and threads (but still with a maximum of 16 threads per node, since threads must share a memory space).
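A corresponding sketch on a four-node allocation (again with ./app as a stand-in for an exercise executable):
aprun -n 64 ./app # 64 MPI processes, 16 per node
OMP_NUM_THREADS=4 aprun -n 16 -N 4 -d 4 ./app # 16 processes, four per node, four threads each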