Weight initialization in deep learning depends on the activation function and on the size of the model (network).
Note:
- The code to calculate the parameters of the weight initialization formulas, i.e. the fan-in $n_j$ and fan-out $n_{j+1}$ (a usage sketch follows this list):
```python
import numpy as np

# shape is the shape of the weight tensor W.
# For a dense layer, shape is (layer_i nodes, layer_i+1 nodes);
# for a CNN layer, shape is (num_filters, c_input, h_kernel, w_kernel).
n_j = shape[0] if len(shape) == 2 else np.prod(shape[1:])   # fan-in
n_jPlus = shape[1] if len(shape) == 2 else shape[0]          # fan-out
```
- Normally, the input image is normalized, which helps keep training more stable.
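As a usage sketch of the fan-in/fan-out computation above (a minimal example; the helper name `fan_in_fan_out` and the concrete shapes are only illustrative):
```python
import numpy as np

def fan_in_fan_out(shape):
    """Return (n_j, n_jPlus) for a dense (2-D) or conv (4-D) weight shape."""
    n_j = shape[0] if len(shape) == 2 else int(np.prod(shape[1:]))  # fan-in
    n_jPlus = shape[1] if len(shape) == 2 else shape[0]             # fan-out
    return n_j, n_jPlus

print(fan_in_fan_out((784, 256)))     # dense 784 -> 256: fan-in 784, fan-out 256
print(fan_in_fan_out((64, 3, 3, 3)))  # conv, 64 filters of 3x3x3: fan-in 27, fan-out 64
```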
If the weight values are too small, the outputs of the activation function shrink layer by layer toward 0, so the signal (and its gradient) vanishes.
If the weight values are too large, the outputs of the activation function saturate (near $\pm 1$ for tanh, near 0/1 for sigmoid), where the gradient is again close to 0.
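Both failure modes are easy to reproduce. The sketch below (the layer width, depth, and weight scales are arbitrary assumptions) pushes random data through a stack of tanh layers with weights drawn at two different scales:
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 256))          # a batch of random inputs

for scale in (0.01, 1.0):                    # too small vs. too large
    h = x
    for _ in range(10):                      # 10 tanh layers, all 256 -> 256
        W = rng.standard_normal((256, 256)) * scale
        h = np.tanh(h @ W)
    print(f"scale={scale}: activation std after 10 layers = {h.std():.4f}")
# scale=0.01: the std collapses toward 0 (vanishing signal);
# scale=1.0:  the activations pile up near +/-1 (saturation).
```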
Use Xavier initialization, with the formula
$W \sim U\!\left[-\frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}},\ \frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}}\right]$
If we use a normal distribution to initialize the parameters instead, we get
$W \sim \mathcal{N}\!\left(0,\ \frac{2}{n_j+n_{j+1}}\right)$
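A minimal NumPy sketch of both Xavier variants (the helper name `xavier_init` is illustrative), reusing the fan-in/fan-out computation from above:
```python
import numpy as np

def xavier_init(shape, distribution="uniform", rng=None):
    """Xavier/Glorot initialization for a dense (2-D) or conv (4-D) weight shape."""
    rng = rng or np.random.default_rng()
    n_j = shape[0] if len(shape) == 2 else int(np.prod(shape[1:]))  # fan-in
    n_jPlus = shape[1] if len(shape) == 2 else shape[0]             # fan-out
    if distribution == "uniform":
        limit = np.sqrt(6.0 / (n_j + n_jPlus))
        return rng.uniform(-limit, limit, size=shape)
    std = np.sqrt(2.0 / (n_j + n_jPlus))
    return rng.normal(0.0, std, size=shape)

W_dense = xavier_init((784, 256))               # uniform variant
W_conv  = xavier_init((64, 3, 3, 3), "normal")  # normal (Gaussian) variant
```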
However, the Xavier derivation assumes a roughly linear (or symmetric, tanh-like) activation, while ReLU zeroes half of its inputs and so halves the variance of the signal at every layer. Therefore, He et al. introduced a ReLU variant called PReLU, together with an initialization adapted to that activation. PReLU is defined as
$f(y_i)=\begin{cases} y_i &\text{if } y_i>0 \\ a_i y_i &\text{if } y_i\leq 0 \end{cases}$
where $a_i$ is a learnable coefficient that controls the slope on the negative side ($a_i = 0$ recovers plain ReLU). For this activation, the corresponding initialization is $W \sim \mathcal{N}\!\left(0,\ \frac{2}{(1+a_i^2)\,n_j}\right)$, which reduces to the usual He initialization $\mathcal{N}\!\left(0,\ \frac{2}{n_j}\right)$ in the plain ReLU case.
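A minimal sketch of PReLU and this initialization in NumPy (the names `prelu` and `he_init` are illustrative, and `a` is the initial value of the slope coefficient):
```python
import numpy as np

def prelu(y, a):
    """PReLU: returns y where y > 0, and a * y elsewhere (a is learned during training)."""
    return np.where(y > 0, y, a * y)

def he_init(shape, a=0.0, rng=None):
    """He initialization generalized for PReLU: Var[W] = 2 / ((1 + a^2) * n_j)."""
    rng = rng or np.random.default_rng()
    n_j = shape[0] if len(shape) == 2 else int(np.prod(shape[1:]))  # fan-in
    std = np.sqrt(2.0 / ((1.0 + a ** 2) * n_j))
    return rng.normal(0.0, std, size=shape)

W = he_init((784, 256), a=0.25)   # 0.25 is a common initial slope for PReLU
h = prelu(np.random.default_rng(0).standard_normal((32, 784)) @ W, a=0.25)
```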