Unable to use BatchNormalization with multiple GPUs in MXNet backend #63
Comments
Good catch. We are not going to fix this until we let the kvstore accept string key IDs instead of integer IDs.
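For context, here is a simplified sketch of the two functions named in the debug output below, paraphrased from my reading of mxnet/model.py around 0.10.x (the real code has extra bookkeeping). Both key the kvstore by the parameter's integer position, so the two enumerations must agree on which parameter each index refers to.

def _initialize_kvstore(kvstore, param_arrays, arg_params, param_names,
                        update_on_kvstore):
    for idx, param_on_devs in enumerate(param_arrays):
        # key = integer index; the registered shape comes from arg_params
        kvstore.init(idx, arg_params[param_names[idx]])
        if update_on_kvstore:
            kvstore.pull(idx, param_on_devs, priority=-idx)

def _update_params_on_kvstore(param_arrays, grad_arrays, kvstore):
    for index, (arg_list, grad_list) in enumerate(zip(param_arrays, grad_arrays)):
        if grad_list[0] is None:
            continue
        # key = the same integer index; the pushed gradient must match the
        # shape that init() registered under that index
        kvstore.push(index, grad_list, priority=-index)
        kvstore.pull(index, arg_list, priority=-index)

The trace below shows index 4 registered as a (64,) gamma but pushed with a (256,64,1,1) convolution gradient, i.e. the two loops disagree about what parameter index 4 refers to; string keys (parameter names) would remove that ambiguity.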
Thanks Mu. Most convolutional and fully connected networks require BatchNormalization. Can we still go ahead with the Keras beta release with this as a known issue?
Another possible fix is to not push the batchnorm parameters into the kvstore; they don't actually need to be synchronized.
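If that route were taken, a rough sketch of what the filtering could look like; this is a hypothetical illustration only, and is_batchnorm_param and the hook point are not existing MXNet or Keras code:

def is_batchnorm_param(name):
    # The Keras-MXNet batchnorm params in the trace are named like
    # "batchnormalization_1_gamma", "batchnormalization_1_running_mean", ...
    return "batchnormalization" in name

def update_params_skipping_batchnorm(param_arrays, grad_arrays, param_names, kvstore):
    for index, (arg_list, grad_list) in enumerate(zip(param_arrays, grad_arrays)):
        if grad_list[0] is None or is_batchnorm_param(param_names[index]):
            continue  # keep batchnorm params local to each device, no kvstore sync
        kvstore.push(index, grad_list, priority=-index)
        kvstore.pull(index, arg_list, priority=-index)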
@mli Can you please clarify what is causing this problem?
Summary
Unable to use BatchNormalization with the MXNet backend when using multiple GPUs. After debugging the issue, I found that there is a mismatch in the shape of a batchnorm parameter in the KVStore: in mxnet/model.py, the KVStore is initialized with a (64,) shape but is then updated with a (256,64,1,1) shape.
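For reference, the mismatch can be shown with the kvstore API alone. A minimal sketch follows; the key number and shapes are chosen to mirror the failure below rather than taken from the benchmark script, and the exact point where MXNet raises may differ from the device-kvstore trace:

import mxnet as mx

kv = mx.kv.create('device')          # multi-GPU training goes through the 'device' kvstore
kv.init(4, mx.nd.ones((64,)))        # key 4 registered with shape (64,), like batchnormalization_1_gamma

grad = mx.nd.ones((256, 64, 1, 1))   # a gradient of a different shape pushed under the same key
kv.push(4, grad)                     # should fail the same from.shape() == to->shape() check
mx.nd.waitall()                      # force any asynchronous error to surface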
Stacktrace and Debug messages
Below are the stack trace and my debug messages from the "initialize_kvstore" and "update_params_on_kvstore" functions. Observe that there is a mismatch in the param shape at index 4.
In initialize kvstore
kvstore - <mxnet.kvstore.KVStore object at 0x7fbfdb6729d0>
len of param_arrays - 304
len of arg_params - 304
len of param_names - 304
update_on_kvstore - True
Index - 0
Param name - normal1
Arg params - <NDArray 3x7x7 @cpu(0)>
Index - 1
Param name - convolution2d_1_b
Arg params - <NDArray 1 @cpu(0)>
Index - 2
Param name - batchnormalization_1_running_mean
Arg params - <NDArray 1 @cpu(0)>
Index - 3
Param name - batchnormalization_1_running_std
Arg params - <NDArray 1 @cpu(0)>
Index - 4
Param name - batchnormalization_1_gamma
Arg params - <NDArray 1 @cpu(0)>
arg_params in idx 4 - <NDArray 64 @cpu(0)>
param name at idx 4 - batchnormalization_1_gamma
In update_params_on_kvstore
param_arrays - 304
grad_arrays - 304
kvstore - <mxnet.kvstore.KVStore object at 0x7fbfdb6729d0>
Index - 0
arg_list[0] <NDArray 64x3x7x7 @gpu(0)>
Current index - 0
Index - 1
arg_list[0] <NDArray 64 @gpu(0)>
Current index - 1
Index - 2
arg_list[0] <NDArray 64 @gpu(0)>
Current index - 2
Index - 3
arg_list[0] <NDArray 64 @gpu(0)>
Current index - 3
Index - 4
arg_list[0] <NDArray 256x64x1x1 @gpu(0)>
Current index - 4
Len of arg_list - 16
Len of grad_list - 16
Arg list[0] - <NDArray 256x64x1x1 @gpu(0)>
Grad list[0] - <NDArray 256x64x1x1 @gpu(0)>
[16:14:11] /home/ubuntu/mxnet/dmlc-core/include/dmlc/./logging.h:304: [16:14:11] src/ndarray/ndarray.cc:319: Check failed: from.shape() == to->shape() operands shape mismatchfrom.shape = (256,64,1,1) to.shape=(64,)
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fc120b0a46c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayEPS0_i+0x546) [0x7fc12154c056]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7kvstore10CommDevice6ReduceEiRKSt6vectorINS_7NDArrayESaIS3_EEi+0x384) [0x7fc121925de4]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7kvstore12KVStoreLocal4PushERKSt6vectorIiSaIiEERKS2_INS_7NDArrayESaIS7_EEi+0x175) [0x7fc121928015]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(MXKVStorePush+0x7b0) [0x7fc1218cbbc0]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fc0b1ef8e40]
[bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fc0b1ef88ab]
[bt] (7) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7fc0ba1083df]
[bt] (8) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7fc0ba10cd82]
[bt] (9) python(PyObject_Call+0x43) [0x4b0cb3]
Traceback (most recent call last):
File "/home/ubuntu/keras_benchmarks/test_cifar_resnet.py", line 131, in
run_time, memory_usage = profile(train_model)
File "/home/ubuntu/keras_benchmarks/profiler.py", line 84, in profile
func_to_profile()
File "/home/ubuntu/keras_benchmarks/test_cifar_resnet.py", line 125, in train_model
validation_data=(X_test, Y_test))
File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.2-py2.7.egg/keras/engine/training.py", line 1559, in fit_generator
class_weight=class_weight)
File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.2-py2.7.egg/keras/engine/training.py", line 1322, in train_on_batch
outputs = self.train_function(ins)
File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.2-py2.7.egg/keras/engine/training.py", line 1959, in train_function
self._mod.update()
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/module/bucketing_module.py", line 408, in update
self._curr_module.update()
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/module/module.py", line 575, in update
self._kvstore)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/model.py", line 132, in _update_params_on_kvstore
kvstore.push(index, grad_list, priority=-index)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/kvstore.py", line 162, in push
ctypes.c_int(priority)))
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/base.py", line 85, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [16:14:11] src/ndarray/ndarray.cc:319: Check failed: from.shape() == to->shape() operands shape mismatchfrom.shape = (256,64,1,1) to.shape=(64,)
Note: I used the ResNet-50 architecture on the CIFAR dataset with batch size 32.
@mli @piiswrong @madjam @bhavinthaker