This repository has been archived by the owner on Sep 25, 2023. It is now read-only.

[WIP] Added Reshape, DepthConcat (Inception), SpatialCrossMapLRN, Normalize and SpatialLPPooling layers. #9

Open

@galli-leo galli-leo commented Oct 6, 2017

These layers are used in the openface project, which I also included as a sample.

Unfortunately, in the current state the model compiles fine, but the prediction is always a nan array.

Please leave some feedback on what you think I am doing wrong. (So this is kinda an issue and a PR :)

Because the conversion does not work, this is still WIP. The scripts could use some cleanup as well.

@galli-leo galli-leo changed the title Added Reshape, DepthConcat (Inception), SpatialCrossMapLRN, Normalize and SpatialLPPooling layers. [WIP] Added Reshape, DepthConcat (Inception), SpatialCrossMapLRN, Normalize and SpatialLPPooling layers. Oct 6, 2017
@galli-leo
Author

galli-leo commented Oct 6, 2017

Regarding the nan issue: I think it has to do with a lot of values becoming 0. This seems to happen especially after ReLU layers. After digging through the coreml code, shouldn't the ReLU layers have weights? Maybe because the weights aren't set, a lot of the values are negative before the ReLU, which turns them into zeros? Then a later layer probably has a problem with the many 0 values.

Maybe not? The InceptionV3 model provided by apple also doesn't have any weights in their ReLU layers. Still strange that a lot of values are 0 after those layers.
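For reference, a ReLU has no learned weights and simply clamps negative inputs to zero, so many zeros after it are expected and are not NaN by themselves. A minimal NumPy sketch, assuming roughly zero-centered pre-activations (the random input below is a stand-in, not data from the openface model):

```python
import numpy as np


def relu(x):
    """Element-wise ReLU: negative inputs become 0, the rest pass through."""
    return np.maximum(x, 0.0)


# Hypothetical pre-activation values, drawn from a zero-centered distribution.
rng = np.random.default_rng(0)
pre_activation = rng.standard_normal(10_000)
post_activation = relu(pre_activation)

# Share of outputs that are exactly zero after the ReLU.
zero_fraction = float(np.mean(post_activation == 0.0))
print("fraction of zeros after ReLU:", zero_fraction)
print("any NaN after ReLU:", bool(np.isnan(post_activation).any()))
```

With zero-centered inputs, about half of the outputs end up as zeros while none are NaN, so the zeros alone probably do not explain the failure.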

@galli-leo
Author

After debugging some more, it seems that the very first batch norm layer already produces results that differ from the pytorch version. Not sure if this is relevant? Also, the results are only about 10% off.

@opedge
Member

opedge commented Oct 10, 2017

I think 10% is a big enough difference. Maybe you could try testing the layer outputs using numpy.testing.assert_array_almost_equal?
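A rough sketch of how such a per-layer check could look. The helper name `outputs_match` and the sample arrays are made up for illustration; only `numpy.testing.assert_array_almost_equal` and its `decimal` parameter come from NumPy:

```python
import numpy as np
from numpy.testing import assert_array_almost_equal


def outputs_match(reference, converted, decimal=6):
    """Return (True, "") if the arrays agree to `decimal` places,
    otherwise (False, <NumPy's mismatch report>)."""
    try:
        assert_array_almost_equal(reference, converted, decimal=decimal)
    except AssertionError as err:
        return False, str(err)
    return True, ""


# Hypothetical layer outputs: one nearly identical pair, one pair 10% apart.
reference = np.array([1.943880, -0.113829, -1.804606])
ok, _ = outputs_match(reference, reference + 1e-9)
ok_bad, report = outputs_match(reference, reference * 1.1)

print(ok, ok_bad)
print(report)
```

Running the check layer by layer should show the first layer at which the converted model starts to diverge from the original.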

@galli-leo
Author

Thanks for the suggestion, that function seems really useful!
So if I only use the first layer, the assertion still fails, but the arrays aren't visibly different:

Arrays are not almost equal to 6 decimals
(mismatch 2.49701605903%)
 x: array([[[  1.943880e+00,  -1.138287e-01,  -1.804606e+00, ...,
          -3.091014e+00,  -3.372690e+00,  -1.128708e+00],
        [  2.956385e+00,   5.726140e-01,  -2.016481e+00, ...,...
 y: array([[[  1.943880e+00,  -1.138287e-01,  -1.804606e+00, ...,
          -3.091014e+00,  -3.372690e+00,  -1.128708e+00],
        [  2.956385e+00,   5.726140e-01,  -2.016481e+00, ...,...

Could this be due to bad rounding?

If I add the second layer (the Batch Norm one), this is the assertion result:

Arrays are not almost equal to 6 decimals

(mismatch 99.9715169271%)
 x: array([[[ 1.829906,  0.397447, -0.779576, ..., -1.6751  , -1.871186,
         -0.309054],
        [ 2.534754,  0.875309, -0.927071, ..., -2.028347, -3.363676,...
 y: array([[[  1.661550e+00,   3.555828e-01,  -7.175032e-01, ...,
          -1.533949e+00,  -1.712720e+00,  -2.885307e-01],
        [  2.304157e+00,   7.912476e-01,  -8.519741e-01, ...,...

While the individual items of the array are only off by about 10%, the whole array is reported as 99.9% mismatched. Could this be due to the error in the first layer? Or is there something wrong with the Batch Norm implementation / conversion?
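One note on reading those reports: the mismatch percentage counts how many elements differ beyond the tolerance, not how far off they are, so a uniform 10% error can mark nearly every element as mismatched. A small sketch using the first three values printed above; the helper `compare` is hypothetical, and its tolerance follows the documented criterion `abs(desired - actual) < 1.5 * 10**(-decimal)`:

```python
import numpy as np


def compare(reference, converted, decimal=6):
    """Return (percentage of elements outside the tolerance,
    largest error relative to `converted`)."""
    reference = np.asarray(reference, dtype=np.float64)
    converted = np.asarray(converted, dtype=np.float64)
    abs_err = np.abs(reference - converted)
    tol = 1.5 * 10.0 ** (-decimal)
    mismatch_pct = 100.0 * float(np.mean(abs_err >= tol))
    # Guard against division by zero for entries that are exactly 0.
    denom = np.maximum(np.abs(converted), np.finfo(np.float64).tiny)
    max_rel_err = float(np.max(abs_err / denom))
    return mismatch_pct, max_rel_err


# First three entries of x and y from the batch norm comparison above.
x = [1.829906, 0.397447, -0.779576]
y = [1.661550, 0.3555828, -0.7175032]
mismatch_pct, max_rel_err = compare(x, y)
print("mismatched elements (%):", mismatch_pct)
print("largest relative error:", max_rel_err)
```

Here every element counts as mismatched even though each is only around 10% off, which matches the pattern in the assertion output.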

@galli-leo
Author

galli-leo commented Oct 11, 2017

Interesting, so the first nan values "appear" after the 18th layer (a DepthConcat one). They appear in filters 384 to 479, and all of them are in the "first" pixel:

>>> numpy.argwhere(numpy.isnan(output))
array([[384,   1,   1],
       [385,   1,   1],
       [386,   1,   1],
...
       [477,   1,   1],
       [478,   1,   1],
       [479,   1,   1]])

The network in question:

(nn.Sequential {
  [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> output]
  (0): nn.SpatialConvolution(3 -> 64, 7x7, 2, 2, 3, 3)
  (1): nn.SpatialBatchNormalization
  (2): nn.ReLU
  (3): nn.SpatialMaxPooling(3x3, 2, 2, 1, 1)
  (4): nn.SpatialCrossMapLRN
  (5): nn.SpatialConvolution(64 -> 64, 1x1)
  (6): nn.SpatialBatchNormalization
  (7): nn.ReLU
  (8): nn.SpatialConvolution(64 -> 192, 3x3, 1, 1, 1, 1)
  (9): nn.SpatialBatchNormalization
  (10): nn.ReLU
  (11): nn.SpatialCrossMapLRN
  (12): nn.SpatialMaxPooling(3x3, 2, 2, 1, 1)
  (13): nn.DepthConcat
  (14): nn.DepthConcat
  (15): nn.DepthConcat
  (16): nn.DepthConcat
  (17): nn.DepthConcat
  (18): nn.DepthConcat
}, (1L, 736L, 3L, 3L))

So I removed the Batch Norm layers from the 18th layer (DepthConcat layer) and the nan values did not occur anymore. Maybe something is off with the Batch Norm layers? Because they shouldn't produce nan, right?

Removing the Batch Norm layers again from the 20th layer finally gives a somewhat meaningful output:

'Prediction: ', {u'output': array([ 0.05325931, -0.16461314, -0.01190347,  0.03564042, -0.08715865,
        0.1428728 , -0.08130208,  0.0138449 ,  0.01105048, -0.11602819,
        0.03985004,  0.01261801,  0.02080127, -0.11630452, -0.06753173,
        0.01370422, -0.08911988,  0.00598601, -0.00617075,  0.00460384,
...
        0.09547406, -0.00316235,  0.11150559, -0.02214371, -0.04056014,
       -0.0322006 , -0.07185692,  0.09166312, -0.04801789, -0.02850034,
        0.01139403,  0.05652488,  0.01659156])})

After some more debugging, it seems like running_var is a nan array for some batch norm layers inside the 18th and 20th DepthConcat layers. Do you have any suggestions why that might be the case?
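A quick way to narrow this down is to scan every stored running statistic for non-finite values. The sketch below assumes the statistics have already been exported as NumPy arrays into a dict; the key names are invented for illustration and are not taken from the openface model:

```python
import numpy as np


def find_bad_running_stats(stats):
    """Return the names of entries whose arrays contain NaN or infinite values."""
    return [
        name
        for name, values in stats.items()
        if not np.all(np.isfinite(np.asarray(values, dtype=np.float64)))
    ]


# Hypothetical exported statistics; the layer names are placeholders.
stats = {
    "layer6.running_var": np.array([0.8, 1.2, 0.5]),
    "layer18.branch3.running_var": np.array([np.nan, np.nan, np.nan]),
}
bad = find_bad_running_stats(stats)
print(bad)
```

Any name this reports points at a batch norm layer whose statistics were already broken before the conversion to coreml.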

@galli-leo
Author

I finally got it to work! The problem was that I had to run a random tensor through the model in torch to get it to load in pytorch (see pytorch/pytorch#956). However, when forwarding the random tensor, I used a batchsize of 1. This creates a division by zero, because torch's variance calculation assumes batchsize > 1 (https://discuss.pytorch.org/t/nan-when-i-use-batch-normalization-batchnorm1d/322/14). When forwarding a random batch with batchsize = 2, the coreml model outputs a "correct prediction". The prediction is still ~2% off though. Do you think that's due to rounding or something else?

Anyways, sorry for flooding this pr with walls of text 😬 . I will clean up the code over the next few days, and then it should be ready to be merged.
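To illustrate the batchsize = 1 problem: the sketch below assumes the running variance is updated with the unbiased (n - 1) estimator, as the linked thread suggests, and uses NumPy's `ddof=1` as a stand-in for what torch does internally. The (batch, channels) shape is a simplification with one value per channel per sample:

```python
import warnings

import numpy as np


def unbiased_channel_variance(batch):
    """Per-channel variance with the n - 1 denominator; `batch` has shape (N, C)."""
    with warnings.catch_warnings():
        # NumPy warns about the 0 / 0 division when N == 1; silence it here.
        warnings.simplefilter("ignore", RuntimeWarning)
        return np.var(batch, axis=0, ddof=1)


rng = np.random.default_rng(0)
var_n1 = unbiased_channel_variance(rng.standard_normal((1, 4)))  # batchsize 1
var_n2 = unbiased_channel_variance(rng.standard_normal((2, 4)))  # batchsize 2

print("batchsize 1, all NaN:", bool(np.isnan(var_n1).all()))
print("batchsize 2, all finite:", bool(np.isfinite(var_n2).all()))
```

With a single sample the denominator n - 1 is zero, so every channel's variance comes out as NaN, while two samples already give finite values.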

@duag

duag commented Oct 11, 2017

I am trying to load the Openface model into Pytorch, but I do not know how to save it in Pytorch after loading it.
As far as I understand, the model is a legacy.nn model, which is not supported by Pytorch's save function, so I have to use something like this converter:
https://github.com/clcarwin/convert_torch_to_pytorch
Yet none of the Torch to Pytorch converters I found supports nn.DepthConcat.
How did you manage to save the Openface model as a Pytorch model?

@galli-leo
Author

I didn't, I converted it into a coreml model.
