Feature Highlight: Multi GPU Training
In Minerva, a `device` is a concept for any computing resource. Normally, a `device` is the CPU, GPU#0, GPU#1, and so on. But in Minerva's design, a `device` can go beyond a single computing resource. For example, one could write a `device` for hybrid use of CPU and GPU, or even one that spans multiple machines. Although we currently only use a `device` to represent a single GPU or CPU, the concept is easy to extend in the future.
Minerva exposes the following interfaces for creating and switching `device`s.
In C++:
```cpp
uint64_t MinervaSystem::CreateCpuDevice();
uint64_t MinervaSystem::CreateGpuDevice(int which);
void MinervaSystem::SetDevice(uint64_t devid);
```
In Python:
```python
devid = owl.create_cpu_device()
devid = owl.create_gpu_device(gpuid)
owl.set_device(devid)
```
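As a quick taste of the Python interface, here is a minimal sketch that mixes a CPU device and a GPU device in one script. It uses only the calls above plus `owl.zeros`, which also appears in the examples later on this page; initialization boilerplate is omitted, as in the other snippets here, and the array shapes are arbitrary.

```python
import owl

cpu = owl.create_cpu_device()   # the host CPU as a device
gpu = owl.create_gpu_device(0)  # the first GPU card

owl.set_device(cpu)
a = owl.zeros([100, 200])       # allocated and computed on the CPU

owl.set_device(gpu)
b = owl.zeros([100, 200])       # allocated and computed on GPU#0
```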
- Creation: The `create_xxx_device` functions return an internal unique id that represents the device.
- Switch: The `set_device` function tells Minerva that all following computations should be executed on the given device, until another `set_device` call. For example:

  ```python
  gpu0 = owl.create_gpu_device(0)
  gpu1 = owl.create_gpu_device(1)
  owl.set_device(gpu0)
  x = owl.zeros([100, 200])
  owl.set_device(gpu1)
  y = owl.zeros([100, 200])
  ```

  Here, `x` is created on GPU#0 while `y` is created on GPU#1.
Let us look at the example again:
```python
gpu0 = owl.create_gpu_device(0)
gpu1 = owl.create_gpu_device(1)
owl.set_device(gpu0)
x = owl.zeros([100, 200])
owl.set_device(gpu1)
y = owl.zeros([100, 200])
z = x + y
```
We now understand that the first `owl.zeros` and the second `owl.zeros` will be executed on different cards. But it seems the two cards are used one after another rather than simultaneously. How can we utilize multiple GPUs at the same time?
The answer is Lazy Evaluation.
Recall that in Feature-Highlight: Dataflow engine, we introduced how Minerva parallelizes code using lazy evaluation and a dataflow engine. In short, if two operations are independent, they will be executed at the same time. Therefore, in the above example, `x` and `y` are not only created on two different cards, they are also created concurrently.
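To make the laziness concrete, here is a rough sketch that continues the example above (reusing `gpu0` and `gpu1`). It assumes that pulling a result back to the host, shown here as a hypothetical `to_numpy()` call, is what forces evaluation; until that point, the operations are only recorded in the dataflow graph, so the two independent `owl.zeros` calls can be scheduled onto the two GPUs at the same time.

```python
owl.set_device(gpu0)
x = owl.zeros([100, 200])  # recorded lazily; targeted at GPU#0
owl.set_device(gpu1)
y = owl.zeros([100, 200])  # recorded lazily; targeted at GPU#1

z = x + y                  # still lazy; depends on both x and y

# Hypothetical: converting to a host array forces the dataflow engine
# to evaluate the graph; the two independent zeros operations then run
# concurrently on GPU#0 and GPU#1.
result = z.to_numpy()
```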
Also note that for `z = x + y`, `x` is on GPU#0 while `y` and `z` are on GPU#1. How is the data transferred? In fact, Minerva handles the data transmission transparently, so you do not need to worry about it.
Using the above concepts, it is easy to apply data parallelism to train a neural network on multiple GPUs.
Data parallelism works by first dispatching partitions of each mini-batch to different training units (e.g., one GPU per partition); each unit trains on its part of the mini-batch with the same weights; the units then accumulate the gradients generated during training, update the weights, and start the next mini-batch. This paradigm of parallelism is called Bulk Synchronous Parallel. We will show how to express such parallelism in a few lines of code using Minerva's API.
Suppose we have a single-card training algorithm as follows (pseudo-code):
```python
train_set = load_train_set(minibatch_size=256)
for epoch in range(MAX_EPOCH):
    for mbidx in range(len(train_set)):
        (data, label) = train_set[mbidx]
        grad = ff_and_bp(data, label)
        update(grad)
```
Recall the training algorithm structure above. We can convert it to use data parallelism as follows:
```python
gpus = [owl.create_gpu_device(i) for i in range(num_gpu)]
train_set = load_train_set(minibatch_size=256/num_gpu)
for epoch in range(MAX_EPOCH):
    for mbidx in range(0, len(train_set), num_gpu):
        grads = []
        for i in range(num_gpu):
            owl.set_device(gpus[i])  # calculate each gradient on a different GPU
            (data, label) = train_set[mbidx + i]
            grad_each = ff_and_bp(data, label)
            grads.append(grad_each)
        owl.set_device(gpus[0])      # (optional) choose GPU#0 for the update
        grad = accumulate(grads)
        update(grad)
```
In the above example, each GPU takes charge of forward and backward propagation on one small slice of the mini-batch. The gradients are accumulated and the weights are updated on GPU#0. As before, you do not need to worry about data transmission among different GPUs; it is handled automatically for you.
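The `accumulate` and `update` calls above are pseudo-code. As a rough illustration only, assuming each gradient is a list of owl arrays (one per layer) and that `weights`, `lr`, and `num_gpu` are defined elsewhere (hypothetical names, not Minerva API), they might look like this; the real implementations live in the files listed below.

```python
def accumulate(grads):
    # grads holds one per-layer gradient list per GPU; sum them
    # element-wise. Minerva moves arrays to the current device
    # (GPU#0 here) transparently when they are added.
    total = list(grads[0])
    for g in grads[1:]:
        total = [t + x for t, x in zip(total, g)]
    return total

def update(grad):
    # Plain SGD step over the accumulated gradient.
    for i in range(len(weights)):
        weights[i] = weights[i] - lr * grad[i] / num_gpu
```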
We implement almost the same logic in all our multi-GPU training algorithms. You can find them here:
- mnist_cnn_2gpu.cpp (in /path/to/minerva/apps).
- mnist_cnn.py (in /path/to/minerva/owl/apps/mnist).
- owl.net.trainer.py (in /path/to/minerva/owl/owl/net/trainer.py)