困学纪闻 Commentary: Feedforward Neural Networks at a Glance, or the Cretaceous of Backpropagation

How does information propagate forward through a feedforward neural network?

\[\begin{array}{l}{\mathbf{z}^{(l)}=W^{(l)} \cdot \mathbf{a}^{(l-1)}+\mathbf{b}^{(l)},} \\ {\mathbf{a}^{(l)}=f_{l}\left(\mathbf{z}^{(l)}\right)}\end{array} \]

\[\mathbf{x}=\mathbf{a}^{(0)} \rightarrow \mathbf{z}^{(1)} \rightarrow \mathbf{a}^{(1)} \rightarrow \mathbf{z}^{(2)} \rightarrow \cdots \rightarrow \mathbf{a}^{(L-1)} \rightarrow \mathbf{z}^{(L)} \rightarrow \mathbf{a}^{(L)} \]
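In code, this forward pass fits in a few lines of NumPy. A minimal sketch follows; the layer sizes, the choice of tanh for \(f_l\), and the helper name `forward` are illustrative assumptions, not anything fixed by the equations:

```python
import numpy as np

def forward(x, weights, biases, f=np.tanh):
    """Forward pass: z^(l) = W^(l) a^(l-1) + b^(l),  a^(l) = f_l(z^(l))."""
    a = x                      # a^(0) = x
    zs, activations = [], [a]
    for W, b in zip(weights, biases):
        z = W @ a + b          # net input of the current layer
        a = f(z)               # activation of the current layer
        zs.append(z)
        activations.append(a)
    return zs, activations

# Example: a 3 -> 4 -> 2 network with random parameters
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [rng.standard_normal(4), rng.standard_normal(2)]
zs, activations = forward(rng.standard_normal(3), weights, biases)
```

Caching every \(\mathbf{z}^{(l)}\) and \(\mathbf{a}^{(l)}\) here is deliberate: the backward pass below needs both.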

The Backpropagation Algorithm

Figure: the chain rule of differentiation

The Three Main Courses

Compute the partial derivatives of the loss with respect to the parameters \(W^{(l)}\) and \(\mathbf{b}^{(l)}\) of layer \(l\):

\[\frac{\partial \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})}{\partial W_{i j}^{(l)}}=\left(\frac{\partial \mathbf{z}^{(l)}}{\partial W_{i j}^{(l)}}\right)^{\mathrm{T}} \frac{\partial \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})}{\partial \mathbf{z}^{(l)}} \]

\[\frac{\partial \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})}{\partial \mathbf{b}^{(l)}}=\left(\frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{b}^{(l)}}\right)^{\mathrm{T}} \frac{\partial \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})}{\partial \mathbf{z}^{(l)}} \]

So we only need to compute these three quantities:

\[\frac{\partial \mathbf{z}^{(l)}}{\partial W_{i j}^{(l)}}, \quad \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{b}^{(l)}}, \quad \frac{\partial \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})}{\partial \mathbf{z}^{(l)}}\]

  1. Compute \(\frac{\partial \mathbf{z}^{(l)}}{\partial W_{i j}^{(l)}}\): since \(W_{i j}^{(l)}\) only affects the \(i\)-th net input, this is a vector whose \(i\)-th entry is \(a_{j}^{(l-1)}\) and whose other entries are \(0\).
  2. Compute \(\frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{b}^{(l)}} = I_{m^{(l)}}\), the \(m^{(l)} \times m^{(l)}\) identity matrix.
  3. Compute the error term of the neurons in layer \(l\), \(\delta^{(l)}=\frac{\partial \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})}{\partial \mathbf{z}^{(l)}} \in \mathbb{R}^{m^{(l)}}\): \[\begin{aligned} \delta^{(l)} & \triangleq \frac{\partial \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})}{\partial \mathbf{z}^{(l)}} \\ &=\frac{\partial \mathbf{a}^{(l)}}{\partial \mathbf{z}^{(l)}} \cdot \frac{\partial \mathbf{z}^{(l+1)}}{\partial \mathbf{a}^{(l)}} \cdot \frac{\partial \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})}{\partial \mathbf{z}^{(l+1)}} \\ &=\operatorname{diag}\left(f_{l}^{\prime}\left(\mathbf{z}^{(l)}\right)\right) \cdot\left(W^{(l+1)}\right)^{\mathrm{T}} \cdot \delta^{(l+1)} \\ &=f_{l}^{\prime}\left(\mathbf{z}^{(l)}\right) \odot\left(\left(W^{(l+1)}\right)^{\mathrm{T}} \delta^{(l+1)}\right) \end{aligned} \]

Combining \(\delta^{(l)}\) with the two derivatives from steps 1 and 2 then gives:

\[\frac{\partial \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})}{\partial W^{(l)}}=\delta^{(l)}\left(\mathbf{a}^{(l-1)}\right)^{\mathrm{T}} \]

\[\frac{\partial \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})}{\partial \mathbf{b}^{(l)}}=\delta^{(l)} \]
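For concreteness, one backward step of this recursion might look as follows in NumPy. The function name `backward_step` and the choice of tanh for \(f_l\) are my own illustrative assumptions:

```python
import numpy as np

def backward_step(delta_next, W_next, z, a_prev,
                  f_prime=lambda z: 1 - np.tanh(z) ** 2):
    """One step of: delta^(l) = f_l'(z^(l)) ⊙ (W^(l+1)^T delta^(l+1))."""
    delta = f_prime(z) * (W_next.T @ delta_next)   # error term of layer l
    grad_W = np.outer(delta, a_prev)               # dL/dW^(l) = delta^(l) (a^(l-1))^T
    grad_b = delta                                 # dL/db^(l) = delta^(l)
    return delta, grad_W, grad_b
```

Note how the outer product reproduces the matrix gradient above: row \(i\), column \(j\) of `grad_W` is exactly \(\delta_i^{(l)} a_j^{(l-1)}\).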

Once the error term of every layer has been computed, the gradient of every layer's parameters follows immediately. Training a feedforward network with error backpropagation (BP) therefore breaks down into the following three steps (a runnable sketch follows the figure below):

  1. Forward pass: compute the net input \(\mathbf{z}^{(l)}\) and activation \(\mathbf{a}^{(l)}\) of every layer, up to the last one;
  2. Backward pass: compute the error term \(\delta^{(l)}\) of every layer;
  3. Compute the partial derivatives with respect to each layer's parameters and update the parameters.
Figure: the BP algorithm
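Putting the three steps together, here is a minimal NumPy sketch of one BP training iteration. The squared-error loss, the tanh activations, the learning rate, and all variable names are illustrative assumptions, not the only possible choices:

```python
import numpy as np

def train_step(x, y, weights, biases, lr=0.1):
    f = np.tanh
    f_prime = lambda z: 1 - np.tanh(z) ** 2
    # Step 1: forward pass, caching every z^(l) and a^(l)
    zs, activations = [], [x]
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = f(z)
        zs.append(z)
        activations.append(a)
    # Step 2: backward pass over the error terms delta^(l).
    # Output layer: for L = 1/2 ||y - a^(L)||^2 we have dL/da^(L) = a^(L) - y.
    delta = f_prime(zs[-1]) * (activations[-1] - y)
    deltas = [delta]
    for l in range(len(weights) - 2, -1, -1):
        delta = f_prime(zs[l]) * (weights[l + 1].T @ delta)
        deltas.insert(0, delta)
    # Step 3: gradients of each layer's parameters, then a gradient-descent update
    for l in range(len(weights)):
        weights[l] -= lr * np.outer(deltas[l], activations[l])  # delta^(l) (a^(l-1))^T
        biases[l] -= lr * deltas[l]                              # delta^(l)

# Usage: call train_step(x, y, weights, biases) repeatedly over the training set.
```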

Summary

You only need to remember one formula; take your time with the derivation above:

\[\delta^{(l)}=f_{l}^{\prime}\left(\mathbf{z}^{(l)}\right) \odot\left(W^{(l+1)}\right)^{\mathrm{T}} \delta^{(l+1)} \]

This formula is the killer, because the derivatives of sigmoid-type functions never exceed \(1\), and are usually much smaller!

\[\sigma^{\prime}(x)=\sigma(x)(1-\sigma(x)) \in(0,0.25] \]

\[\tanh ^{\prime}(x)=1-(\tanh (x))^{2} \in(0,1] \]

This causes the vanishing gradient problem: every backward step multiplies the error term by \(f_{l}^{\prime}(\mathbf{z}^{(l)})\), so after many layers it decays toward zero. One fix is ReLU, whose derivative is exactly \(1\) for positive inputs; it is that simple.
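To make the shrinkage concrete, here is a small numeric check. The depth of 20 layers and the unit net inputs are assumptions chosen only for the demo; in a real network the effect also depends on the weight matrices, which this toy product ignores:

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1 - sigmoid(z))
relu_prime = lambda z: (z > 0).astype(float)

z = np.ones(20)                    # pretend net inputs of 20 stacked layers
print(np.prod(sigmoid_prime(z)))   # ~7e-15: the gradient all but vanishes
print(np.prod(relu_prime(z)))      # 1.0: the gradient passes through unchanged
```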