[WIP] Added Reshape, DepthConcat (Inception), SpatialCrossMapLRN, Normalize and SpatialLPPooling layers. #9

galli-leo · 2017-10-06T09:43:21Z

These layers are used in the openface project, which I also included as a sample.

Unfortunately, in its current state the model compiles fine, but the prediction is always an array of nan values.

Please leave some feedback on what you think I am doing wrong. (So this is kind of an issue and a PR at once :)

Because the conversion does not work, this is still WIP. The scripts could use some cleanup as well.

galli-leo · 2017-10-06T13:53:44Z

Regarding the nan issue: I think it has to do with a lot of values becoming 0, which seems to happen especially after ReLU layers. After digging through the coreml code, shouldn't the ReLU layers have weights? Maybe because the weights aren't set, a lot of the values are negative before the ReLU, which turns them into 0? A later layer then probably has a problem with that many 0 values.

Maybe not? The InceptionV3 model provided by Apple also doesn't have any weights in its ReLU layers. It is still strange that so many values are 0 after those layers.

galli-leo · 2017-10-10T14:43:22Z

After debugging some more, it seems that the very first batch norm layer already has different results than the pytorch version. Not sure if this is relevant? Also the results are only about 10% off.

opedge · 2017-10-10T21:40:10Z

I think 10% is a big enough difference. Maybe you could try testing the layer outputs using numpy.testing.assert_array_almost_equal?
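
A minimal sketch of that check, assuming the output of one layer has already been exported from both frameworks as NumPy arrays. The names `reference_output` and `converted_output` are placeholders and the values are made up; `decimal` sets the tolerance.

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

# Placeholder arrays standing in for one layer's output from each framework.
reference_output = np.array([[1.943880, -0.113829], [2.956385, 0.572614]])
converted_output = reference_output + 1e-8  # tiny floating point noise

# Passes: every element agrees to 6 decimal places (the default).
assert_array_almost_equal(converted_output, reference_output, decimal=6)

# A 10% deviation fails the same check and raises AssertionError.
try:
    assert_array_almost_equal(reference_output * 1.1, reference_output, decimal=6)
    deviation_detected = False
except AssertionError:
    deviation_detected = True
```

Running the check layer by layer shows where the two models start to diverge.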

galli-leo · 2017-10-11T07:24:26Z

Thanks for the suggestion, that function seems really useful!
So if I only use the first layer, the assertion still fails, but the arrays aren't visibly different:

Arrays are not almost equal to 6 decimals
(mismatch 2.49701605903%)
 x: array([[[  1.943880e+00,  -1.138287e-01,  -1.804606e+00, ...,
          -3.091014e+00,  -3.372690e+00,  -1.128708e+00],
        [  2.956385e+00,   5.726140e-01,  -2.016481e+00, ...,...
 y: array([[[  1.943880e+00,  -1.138287e-01,  -1.804606e+00, ...,
          -3.091014e+00,  -3.372690e+00,  -1.128708e+00],
        [  2.956385e+00,   5.726140e-01,  -2.016481e+00, ...,...

Could this be due to bad rounding?

If I add the second layer (the Batch Norm one), this is the assertion result:

Arrays are not almost equal to 6 decimals
(mismatch 99.9715169271%)
 x: array([[[ 1.829906,  0.397447, -0.779576, ..., -1.6751  , -1.871186,
         -0.309054],
        [ 2.534754,  0.875309, -0.927071, ..., -2.028347, -3.363676,...
 y: array([[[  1.661550e+00,   3.555828e-01,  -7.175032e-01, ...,
          -1.533949e+00,  -1.712720e+00,  -2.885307e-01],
        [  2.304157e+00,   7.912476e-01,  -8.519741e-01, ...,...

While the individual items of the array are only off by about 10%, the whole array is reported as 99.9% mismatched. Could this be due to the error in the first layer? Or is there something wrong with the Batch Norm implementation / conversion?
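
Note that the mismatch percentage in the assertion message is the share of elements that differ by more than the tolerance, not the size of the difference. A uniform 10% error therefore marks nearly every element as mismatched. A small sketch that reports both numbers, using made-up arrays rather than the actual layer outputs (the `compare` helper is hypothetical):

```python
import numpy as np

def compare(actual, desired, decimal=6):
    """Return (fraction of mismatched elements, largest relative error)."""
    actual = np.asarray(actual, dtype=np.float64)
    desired = np.asarray(desired, dtype=np.float64)
    abs_err = np.abs(actual - desired)
    # Same tolerance rule as numpy.testing.assert_array_almost_equal.
    mismatched = abs_err >= 1.5 * 10.0 ** (-decimal)
    # Guard against division by zero for elements that are exactly 0.
    rel_err = abs_err / np.maximum(np.abs(desired), np.finfo(np.float64).tiny)
    return mismatched.mean(), rel_err.max()

rng = np.random.default_rng(0)
desired = rng.normal(size=(64, 8, 8))
actual = desired * 1.1  # every element is off by 10%

mismatch_fraction, max_rel_err = compare(actual, desired)
print(f"mismatched: {mismatch_fraction:.2%}, max relative error: {max_rel_err:.2%}")
```

If the relative error is the quantity of interest, `numpy.testing.assert_allclose` with an `rtol` argument tests it directly.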

galli-leo · 2017-10-11T08:02:30Z

Interesting, so the first nan values "appear" after the 18th layer (a DepthConcat one). They appear in filters 384 to 479, all of them in the "first" pixel:

>>> numpy.argwhere(numpy.isnan(output))
array([[384,   1,   1],
       [385,   1,   1],
       [386,   1,   1],
...
       [477,   1,   1],
       [478,   1,   1],
       [479,   1,   1]])

The network in question:

(nn.Sequential {
  [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> output]
  (0): nn.SpatialConvolution(3 -> 64, 7x7, 2, 2, 3, 3)
  (1): nn.SpatialBatchNormalization
  (2): nn.ReLU
  (3): nn.SpatialMaxPooling(3x3, 2, 2, 1, 1)
  (4): nn.SpatialCrossMapLRN
  (5): nn.SpatialConvolution(64 -> 64, 1x1)
  (6): nn.SpatialBatchNormalization
  (7): nn.ReLU
  (8): nn.SpatialConvolution(64 -> 192, 3x3, 1, 1, 1, 1)
  (9): nn.SpatialBatchNormalization
  (10): nn.ReLU
  (11): nn.SpatialCrossMapLRN
  (12): nn.SpatialMaxPooling(3x3, 2, 2, 1, 1)
  (13): nn.DepthConcat
  (14): nn.DepthConcat
  (15): nn.DepthConcat
  (16): nn.DepthConcat
  (17): nn.DepthConcat
  (18): nn.DepthConcat
}, (1L, 736L, 3L, 3L))

So I removed the Batch Norm layers from the 18th layer (DepthConcat layer) and the nan values did not occur anymore. Maybe something is off with the Batch Norm layers? Because they shouldn't produce nan, right?

Removing the Batch Norm layers again from the 20th layer finally gives a somewhat meaningful output:

'Prediction: ', {u'output': array([ 0.05325931, -0.16461314, -0.01190347,  0.03564042, -0.08715865,
        0.1428728 , -0.08130208,  0.0138449 ,  0.01105048, -0.11602819,
        0.03985004,  0.01261801,  0.02080127, -0.11630452, -0.06753173,
        0.01370422, -0.08911988,  0.00598601, -0.00617075,  0.00460384,
...
        0.09547406, -0.00316235,  0.11150559, -0.02214371, -0.04056014,
       -0.0322006 , -0.07185692,  0.09166312, -0.04801789, -0.02850034,
        0.01139403,  0.05652488,  0.01659156])})

After some more debugging, it seems like running_var is a nan array for some batch norm layers in the 18th and 20th DepthConcat layers. Do you have any suggestions why that might be the case?
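
One way to narrow this down is to scan every parameter array for nan before converting. The sketch below assumes the parameters have been gathered into a plain dict of NumPy arrays; the layer names and the `find_nan_params` helper are hypothetical and not part of the converter.

```python
import numpy as np

def find_nan_params(params):
    """Return the names of parameter arrays that contain at least one nan."""
    return [name for name, values in params.items()
            if np.isnan(np.asarray(values, dtype=np.float64)).any()]

# Hypothetical parameters for two batch norm layers.
params = {
    "bn_a.running_mean": np.zeros(4),
    "bn_a.running_var": np.ones(4),
    "bn_b.running_mean": np.zeros(4),
    "bn_b.running_var": np.full(4, np.nan),
}

bad = find_nan_params(params)
print(bad)
```

A nan that is already present in the source parameters points at how the model was saved rather than at the conversion.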

galli-leo · 2017-10-11T10:57:39Z

I finally got it to work! The problem was that I had to run a random tensor through the model in torch to get it to load in pytorch (see pytorch/pytorch#956). However, when forwarding the random tensor, I used a batch size of 1. This creates a division by zero, because torch does some additional calculation for the variance that assumes a batch size > 1 (https://discuss.pytorch.org/t/nan-when-i-use-batch-normalization-batchnorm1d/322/14). When forwarding a random batch with a batch size of 2, the coreml model outputs a "correct prediction". The prediction is still ~2% off though. Do you think that's due to rounding or something else?
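
The division by zero can be reproduced outside of torch. This is a NumPy illustration only, assuming the running variance is updated from an unbiased estimate that divides by n - 1, where n is the number of values seen per channel. It is not torch's actual code, and `unbiased_channel_var` is a made-up helper.

```python
import warnings
import numpy as np

def unbiased_channel_var(batch):
    """Unbiased per-channel variance of a (batch, channels) array."""
    with warnings.catch_warnings():
        # NumPy warns when n - 1 == 0; silence it to keep the output readable.
        warnings.simplefilter("ignore", RuntimeWarning)
        return np.var(batch, axis=0, ddof=1)

rng = np.random.default_rng(0)
var_n1 = unbiased_channel_var(rng.normal(size=(1, 3)))  # batch size 1
var_n2 = unbiased_channel_var(rng.normal(size=(2, 3)))  # batch size 2

print(np.isnan(var_n1).all())     # True: 0 / 0 for every channel
print(np.isfinite(var_n2).all())  # True
```

With a single sample the sum of squared deviations is 0 and so is n - 1, so every channel ends up as nan; with two samples the estimate is finite.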

Anyways, sorry for flooding this pr with walls of text 😬 . I will clean up the code over the next few days, and then it should be ready to be merged.

duag · 2017-10-11T14:15:27Z

I am trying to load the Openface model into Pytorch, but I do not know how to save it in Pytorch after loading it.
As far as I understand, the model is a legacy.nn model, which is not supported by Pytorch's save function, so I have to use something like this converter:
https://github.com/clcarwin/convert_torch_to_pytorch
Yet none of the Torch to Pytorch converters I found supports nn.DepthConcat.
How did you manage to save the Openface model as a Pytorch model?

galli-leo · 2017-10-11T17:22:30Z

I didn't, I converted it into a coreml model.

First commit still not working :(

7d5c1aa

galli-leo changed the title ~~Added Reshape, DepthConcat (Inception), SpatialCrossMapLRN, Normalize and SpatialLPPooling layers.~~ [WIP] Added Reshape, DepthConcat (Inception), SpatialCrossMapLRN, Normalize and SpatialLPPooling layers. Oct 6, 2017

galli-leo mentioned this pull request Oct 31, 2017

CoreML conversion fails: "ValueError: Only channel and sequence concatenation are supported." vsyw/Keras-OpenFace#1

Open
