[WIP] Added Reshape, DepthConcat (Inception), SpatialCrossMapLRN, Normalize and SpatialLPPooling layers. #9
base: master
Conversation
Maybe not? The InceptionV3 model provided by Apple also doesn't have any weights in its ReLU layers. It's still strange that a lot of values are 0 after those layers.
After debugging some more, it seems that the very first batch norm layer already produces different results than the PyTorch version. Not sure if this is relevant? Also, the results are only about 10% off.
I think 10% is a big enough difference. Maybe you could try testing the layer outputs using numpy.testing.assert_array_almost_equal?
Thanks for the suggestion, that function seems really useful!
Could this be due to bad rounding? If I add the second layer (the batch norm one), this is the assertion result:
While the individual items of the array are only off by about 10%, the whole array is off by 99.9%. Could this be due to the error in the first layer? Or is there something wrong with the batch norm implementation / conversion?
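A small sketch of the suggested check, with made-up arrays standing in for the PyTorch and converted CoreML layer outputs. One point worth noting: when `assert_array_almost_equal` fails, its message reports the *percentage of mismatched elements*, not an error magnitude, so per-element errors of ~10% can show up as a "99.9% mismatch".

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

# Hypothetical layer outputs; in the real check these would come from the
# PyTorch model and the converted CoreML model respectively.
pytorch_out = np.array([1.000, 2.000, 3.000])
coreml_out = np.array([1.001, 1.999, 3.002])

# Passes: every element agrees within 1.5 * 10**-2.
assert_array_almost_equal(pytorch_out, coreml_out, decimal=2)

# Would raise AssertionError: 3.002 vs 3.000 exceeds the decimal=3
# tolerance, and the message lists the fraction of mismatched elements.
# assert_array_almost_equal(pytorch_out, coreml_out, decimal=3)
```

Running the per-layer outputs of both models through this with increasing `decimal` values narrows down which layer first diverges and by how much.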
Interesting, so the first nan values "appear" after the 18th layer (a DepthConcat one). They appear in filters 384 to 479, all of them in the "first" pixel:
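The pattern above can be located programmatically. A sketch, assuming the layer output is a channels × height × width array (the shape and nan pattern below are simulated to mirror the observation):

```python
import numpy as np

# Simulate the observed pattern: nans in filters 384-479, first pixel only.
out = np.zeros((480, 8, 8))
out[384:480, 0, 0] = np.nan

# np.argwhere lists the (channel, row, col) index of every nan element.
bad = np.argwhere(np.isnan(out))
print(bad[:, 0].min(), bad[:, 0].max())  # 384 479
print(np.unique(bad[:, 1:], axis=0))     # [[0 0]] -- only the first pixel
```

Because the bad filters form one contiguous channel range, this points at a single branch of the DepthConcat rather than the whole layer.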
The network in question:
So I removed the batch norm layers from the 18th layer (the DepthConcat layer) and the nan values no longer occurred. Maybe something is off with the batch norm layers? They shouldn't produce nan, right? Removing the batch norm layers from the 20th layer as well finally gives a somewhat meaningful output:
After some more debugging, it seems like running_var is a nan array for some batch norm layers in the 18th and 20th DepthConcat layers. Do you have any suggestions as to why that might be the case?
I finally got it to work! The problem was that I had to run a random tensor through the model in torch to get it to load in PyTorch (see pytorch/pytorch#956). However, when forwarding the random tensor, I used a batch size of 1. This creates a division by zero, because torch's variance calculation assumes batch size > 1 (https://discuss.pytorch.org/t/nan-when-i-use-batch-normalization-batchnorm1d/322/14). When forwarding a random batch with batch size = 2, the CoreML model outputs a "correct" prediction. The prediction is still ~2% off, though. Do you think that's due to rounding or something else? Anyway, sorry for flooding this PR with walls of text 😬. I will clean up the code over the next few days, and then it should be ready to be merged.
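The division by zero can be reproduced without any model: the running variance a batch norm layer tracks is the *unbiased* variance, which divides by (n − 1), so a single sample per channel yields 0/0 = nan. A minimal numpy sketch of that mechanism:

```python
import numpy as np

# Unbiased variance (ddof=1) divides by (n - 1). With one sample the divisor
# is zero, so the result is nan -- the same mechanism that poisoned
# running_var when a batch of size 1 was forwarded through the torch model.
single = np.array([0.7])
print(np.var(single, ddof=1))  # nan (0/0, with a RuntimeWarning)

# With two samples the unbiased variance is well defined.
pair = np.array([0.7, 1.3])
print(np.var(pair, ddof=1))    # 0.18
```

Once a nan lands in running_var, every subsequent forward pass through that layer propagates it, which matches the nan arrays observed after the DepthConcat layers.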
I am trying to load the OpenFace model into PyTorch, but I do not know how to save it in PyTorch after loading it.
I didn't; I converted it into a CoreML model.
These layers are used in the OpenFace project, which I also included as a sample.
Unfortunately, in the current state the model compiles fine, but the prediction is always a nan array. Please leave some feedback on what you think I am doing wrong. (So this is kind of an issue and a PR :)
Because the conversion does not work yet, this is still WIP. The scripts could use some cleanup as well.