No training speed improvement can be obtained by using multi-gpus with mxnet as the backend #79
Can you provide the full code for your experiment? Sometimes multi-GPU won't give any boost and can even slow down training because of the overhead of hardware communication.
Ok, my code is shown as follows:
The training data is very large (>100 GB), so I divide it into 20 file pairs and load the data periodically in each epoch using a Keras generator. Is that related to the training speed with multiple GPUs? Thanks! @kevinthesun
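Roughly, the generator works like this (a minimal sketch; the file names and array shapes are placeholders, not my exact code):

import numpy as np

# Placeholder: 20 (input, target) file pairs; names are illustrative only
file_pairs = [('x_%02d.npy' % i, 'y_%02d.npy' % i) for i in range(20)]

def data_generator(file_pairs, batch_size=128):
    # Keras generators must loop forever
    while True:
        for x_file, y_file in file_pairs:
            x = np.load(x_file)  # assumed shape (n_samples, 2056)
            y = np.load(y_file)  # assumed shape (n_samples, 257)
            for start in range(0, len(x), batch_size):
                yield x[start:start + batch_size], y[start:start + batch_size]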
@Wendison You can benchmark pure training time without data IO to see whether data IO is the bottleneck.
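For example, something like this (a rough sketch only: the array sizes are arbitrary, model is your compiled model from below, and depending on the Keras version the epoch argument may be nb_epoch instead of epochs):

import time
import numpy as np

# Synthetic data matching the model's input/output shapes, kept entirely
# in memory so no file IO is involved (sizes are assumptions for the test)
x = np.random.rand(12800, 2056).astype('float32')
y = np.random.rand(12800, 257).astype('float32')

start = time.time()
model.fit(x, y, batch_size=128, epochs=1, verbose=0)
print('pure training time per epoch: %.2f s' % (time.time() - start))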
Hi, I have some questions about training speed when using multiple GPUs with MXNet as the backend for Keras. According to https://mxnet.incubator.apache.org/how_to/multi_devices.html: "By default, MXNet partitions a data batch evenly among the available GPUs. Assume a batch size b and assume there are k GPUs, then in one iteration each GPU will perform forward and backward on b/k examples. The gradients are then summed over all GPUs before updating the model." My understanding is that, with the batch size b fixed, each GPU computes gradients on only b/k examples, which should take less time than computing gradients on all b examples with a single GPU. So, for the same batch size, each weight update with multiple GPUs should be faster than with a single GPU. However, in my experiments I found that training with multiple GPUs is actually slower than training with a single GPU.
Below is part of my code, where I use a fully-connected network:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD

model = Sequential()
model.add(Dropout(0.1, input_shape=(2056,)))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(257))
model.summary()

opt = SGD()

# one MXNet device context per GPU
NUM_GPU = 4
gpu_list = []
for i in range(NUM_GPU):
    gpu_list.append('gpu(%d)' % i)

batch_size = 128
# my_loss is a custom loss function defined elsewhere in my code
model.compile(loss=my_loss, optimizer=opt, context=gpu_list)
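With this configuration each GPU only sees a small slice of every batch (a sketch of the arithmetic implied by the even split described in the MXNet docs quoted above):

# Per-iteration workload under the even split across devices
batch_size = 128
NUM_GPU = 4
per_gpu_examples = batch_size // NUM_GPU
print('examples per GPU per iteration: %d' % per_gpu_examples)  # 32
# The gradients from all 4 GPUs are then summed (communication overhead)
# before the weights are updated.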
I don't know whether my understanding is right. Why is there no speed improvement with multiple GPUs? Can anyone answer my questions? Thanks!