Deep Learning Knowledge for Reinforcement Learning | Neural Networks: The MLP

0 Preface

An MLP, or Multi-Layer Perceptron, takes an input \(x\) and produces an output \(y = h_{W,b}(x)\). By adjusting its parameters \(W\) (the weights) and \(b\) (the biases), we fit the input-output relationship reflected by the existing training data \((x, y)\).

1 Starting from a Single Neuron

A single neuron with 3 inputs and 1 output is shown in the figure below:

[Figure 1: a single neuron with 3 inputs, 1 output, and a "+1" bias unit]

In the figure, "+1" denotes the bias \(b\). The neuron performs the following computation:

\[ h_{W,b}(x) = f(Wx + b) = f\!\left( \sum_{i=1}^{3} W_i x_i + b \right) \]

where \(W \in \mathbb{R}^{1\times 3}\) and \(x \in \mathbb{R}^{3\times 1}\). The function \(f(\cdot)\) in the formula above is the activation function. The three most common activation functions are:

  • sigmoid function: \( f(z) = \frac{1}{1 + e^{-z}} \)
  • tanh function: \( f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \)
  • rectified linear function (ReLU): \( f(z) = \max(0, z) \)

The curves of these three activation functions are shown below:

[Figure 2: curves of the sigmoid, tanh, and ReLU activation functions]

The derivative of the sigmoid function is \(f'(z) = f(z)\,(1 - f(z))\); the derivative of the tanh function is \(f'(z) = 1 - (f(z))^2\); the derivative of the rectified linear function is 0 for \(z < 0\) and 1 for \(z > 0\), and it is not differentiable at \(z = 0\).
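As a concrete reference, here is a minimal NumPy sketch of these three activation functions and their derivatives; the function names and the choice of returning 0 for the ReLU derivative at z = 0 are illustrative assumptions, not part of the original text.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # f'(z) = f(z)(1 - f(z))

def tanh(z):
    return np.tanh(z)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2      # f'(z) = 1 - f(z)^2

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    # f is not differentiable at z = 0; this sketch arbitrarily returns 0 there.
    return (z > 0).astype(float)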

2 The Neural Network Model

The defining feature of an MLP is that every neuron in one layer is connected to every neuron in the adjacent layers (such layers are called fully connected layers). The figure below shows a small MLP:

[Figure 3: a small MLP with an input layer, one hidden layer, and an output layer]

As shown above, the leftmost layer is the input layer, the rightmost is the output layer, and the layer in between is the hidden layer. This network has 3 inputs, 3 hidden neurons, and 1 output neuron. In general, an MLP has multiple hidden layers.

This network performs the following computation:

\[
\begin{aligned}
z^{(2)} &= W^{(1)} x + b^{(1)} \\
&= \begin{bmatrix} W_{11}^{(1)} & W_{12}^{(1)} & W_{13}^{(1)} \\ W_{21}^{(1)} & W_{22}^{(1)} & W_{23}^{(1)} \\ W_{31}^{(1)} & W_{32}^{(1)} & W_{33}^{(1)} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
+ \begin{bmatrix} b_1^{(1)} \\ b_2^{(1)} \\ b_3^{(1)} \end{bmatrix}
= \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}
\end{aligned}
\]

\[
\begin{aligned}
a^{(2)} &= f\left( z^{(2)} \right) \\
&= f\!\left( \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix} \right)
= \begin{bmatrix} f(z_1^{(2)}) \\ f(z_2^{(2)}) \\ f(z_3^{(2)}) \end{bmatrix}
= \begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)} \end{bmatrix}
\end{aligned}
\]

\[
\begin{aligned}
z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\
&= \begin{bmatrix} W_{11}^{(2)} & W_{12}^{(2)} & W_{13}^{(2)} \end{bmatrix}
\begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)} \end{bmatrix}
+ b_1^{(2)}
\end{aligned}
\]

\[ a^{(3)} = f\left( z^{(3)} \right) = h_{W,b}(x) \]

From this we can draw a general conclusion. In this article, the input layer is defined as layer 1. Suppose layer \(l+1\) has \(m\) neurons, layer \(l\) has \(n\) neurons, and the input is \(x = a^{(1)}\). Then layer \(l+1\) performs the following computation:

\[ z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)} \]

\[ a^{(l+1)} = f\left( z^{(l+1)} \right) \]

where \(W^{(l)} \in \mathbb{R}^{m\times n}\), \(a^{(l)} \in \mathbb{R}^{n\times 1}\), \(b^{(l)} \in \mathbb{R}^{m\times 1}\), \(z^{(l+1)} \in \mathbb{R}^{m\times 1}\), and \(a^{(l+1)} \in \mathbb{R}^{m\times 1}\). This computation is called forward propagation. The MLP is a feedforward neural network; its structure contains no cycles.
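To make the forward pass concrete, here is a minimal NumPy sketch for the 3-3-1 network above. The function name forward, the use of tanh as the activation, and the sample parameter values are illustrative assumptions rather than part of the original tutorial.

import numpy as np

def forward(x, params, f=np.tanh):
    # params is a list of (W, b) pairs, one per weight layer, where
    # W is in R^{m x n}, b is in R^{m x 1}, and the input a^(1) = x is in R^{n x 1}.
    a = x
    activations = [a]                 # stores a^(1), a^(2), ..., a^(n_l)
    zs = []                           # stores z^(2), ..., z^(n_l)
    for W, b in params:
        z = W @ a + b                 # z^(l+1) = W^(l) a^(l) + b^(l)
        a = f(z)                      # a^(l+1) = f(z^(l+1))
        zs.append(z)
        activations.append(a)
    return activations, zs

# Example: the 3-3-1 network from the figure, with small random parameters.
rng = np.random.default_rng(0)
params = [(rng.normal(0.0, 0.01, (3, 3)), rng.normal(0.0, 0.01, (3, 1))),
          (rng.normal(0.0, 0.01, (1, 3)), rng.normal(0.0, 0.01, (1, 1)))]
x = np.array([[0.5], [-1.0], [2.0]])
activations, zs = forward(x, params)
print(activations[-1])                # h_{W,b}(x)

Because the computation only ever flows from the input layer towards the output layer, the loop above reflects exactly the feedforward structure described in the paragraph.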

3 The Backpropagation Algorithm

For a single training example \((x, y)\), our goal is to adjust the network's \(W\) and \(b\) so that the network's output \(h_{W,b}(x)\) approaches \(y\). The discrepancy between \(h_{W,b}(x)\) and \(y\) can be measured with the following cost function:

\[ J(W,b; x, y) = \frac{1}{2} \left\| y - h_{W,b}(x) \right\|^2 \]

Let the training set be \(\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}\). The overall cost function over all training examples can then be computed as:

\[
J(W,b) = \left[ \frac{1}{m} \sum_{i=1}^{m} J(W,b; x^{(i)}, y^{(i)}) \right]
+ \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_{l+1}} \sum_{j=1}^{s_l} \left( W_{ij}^{(l)} \right)^2
\]

where \(n_l\) is the total number of layers, \(s_l\) is the number of neurons in layer \(l\), and \(\lambda\) controls the relative importance of the two terms. The second term is a regularization term, also called the weight decay term; it shrinks the magnitude of the weights, which helps prevent the network from overfitting.
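As a sketch of how this overall cost could be evaluated in code, the snippet below reuses the hypothetical forward() function from Section 2; the name total_cost and the list-of-column-vectors data layout are assumptions made for illustration.

def total_cost(params, xs, ys, lam):
    # Mean per-example squared-error cost plus the weight-decay term.
    m = len(xs)
    data_term = 0.0
    for x, y in zip(xs, ys):
        activations, _ = forward(x, params)
        h = activations[-1]                       # h_{W,b}(x)
        data_term += 0.5 * np.sum((y - h) ** 2)   # J(W,b; x, y)
    decay_term = 0.5 * lam * sum(np.sum(W ** 2) for W, _ in params)
    return data_term / m + decay_term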

The partial derivatives of \(J(W,b)\) with respect to \(W\) and \(b\) are:

\[
\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b)
= \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x^{(i)}, y^{(i)}) \right] + \lambda W_{ij}^{(l)}
\]

\[
\frac{\partial}{\partial b_{i}^{(l)}} J(W,b)
= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x^{(i)}, y^{(i)})
\]

With these partial derivatives, we use batch gradient descent to update \(W\) and \(b\) so as to reduce \(J(W,b)\):

\[ W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) \]

\[ b_{i}^{(l)} = b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W,b) \]

where \(\alpha\) is the learning rate. Before running gradient descent, we initialize \(W\) and \(b\) with small, distinct random values close to zero, drawn from a normal distribution \(\mathcal{N}(0, \varepsilon^2)\) (typically \(\varepsilon = 0.01\)).
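A minimal sketch of this initialization and of one batch gradient-descent update might look as follows; init_params and gradient_descent_step are hypothetical helper names, and the gradients are assumed to have been computed already (for example by the backpropagation procedure described next).

def init_params(layer_sizes, eps=0.01, seed=0):
    # Draw every entry of W and b from N(0, eps^2); e.g. layer_sizes = [3, 3, 1].
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, eps, (m, n)), rng.normal(0.0, eps, (m, 1)))
            for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def gradient_descent_step(params, grads, alpha):
    # One batch update: W <- W - alpha * dJ/dW, b <- b - alpha * dJ/db.
    return [(W - alpha * dW, b - alpha * db)
            for (W, b), (dW, db) in zip(params, grads)]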

Next, we use the backpropagation algorithm to compute the per-example partial derivatives \(\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x,y)\) and \(\frac{\partial}{\partial b_{i}^{(l)}} J(W,b;x,y)\). Below, the algorithm is illustrated on the network from the previous section.

Define "\(\bullet\)" to be element-wise multiplication of matrices (equivalent to the ".*" operation in MATLAB). The backpropagation algorithm for this network can then be described as follows:

  • 1) Perform a forward-propagation pass (as described in Section 2).
  • 2) Compute the error term of the output layer:

\[
\begin{aligned}
\delta^{(3)} &= -\left( y - a^{(3)} \right) \bullet f'(z^{(3)}) \\
&= -\left( y_1 - a_1^{(3)} \right) \cdot f'(z_1^{(3)}) \\
&= \frac{\partial}{\partial z_1^{(3)}} \, \frac{1}{2} \left\| y - h_{W,b}(x) \right\|^2
\end{aligned}
\]

  • 3) Compute the error terms of the hidden layer:

\[
\begin{aligned}
\delta^{(2)} &= \left( \left( W^{(2)} \right)^T \delta^{(3)} \right) \bullet f'(z^{(2)}) \\
&= \left( \begin{bmatrix} W_{11}^{(2)} \\ W_{12}^{(2)} \\ W_{13}^{(2)} \end{bmatrix}
\begin{bmatrix} \delta_1^{(3)} \end{bmatrix} \right)
\bullet f'\!\left( \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix} \right)
= \begin{bmatrix} W_{11}^{(2)} \delta_1^{(3)} f'(z_1^{(2)}) \\ W_{12}^{(2)} \delta_1^{(3)} f'(z_2^{(2)}) \\ W_{13}^{(2)} \delta_1^{(3)} f'(z_3^{(2)}) \end{bmatrix}
\end{aligned}
\]

This amounts to distributing the error \(\delta^{(3)}\), weighted by \(W^{(2)}\), back into an error term \(\delta^{(2)}\) whose dimension matches the input of the layer that produced \(\delta^{(3)}\); \(\delta^{(2)}\) thus tells us how much responsibility each hidden-layer neuron bears for the error in the hidden layer's output.

  • 4) Use the \(\delta\) terms to compute each \(\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x,y)\) and \(\frac{\partial}{\partial b_{i}^{(l)}} J(W,b;x,y)\) (for brevity, \(J\) denotes \(J(W,b;x,y)\) below):

\[
\begin{aligned}
\nabla_{W^{(2)}} J &= \delta^{(3)} \left( a^{(2)} \right)^T \\
&= \begin{bmatrix} \delta_1^{(3)} \end{bmatrix}
\begin{bmatrix} a_1^{(2)} & a_2^{(2)} & a_3^{(2)} \end{bmatrix}
= \begin{bmatrix} \frac{\partial}{\partial W_{11}^{(2)}} J & \frac{\partial}{\partial W_{12}^{(2)}} J & \frac{\partial}{\partial W_{13}^{(2)}} J \end{bmatrix}
\end{aligned}
\]

\[ \nabla_{b^{(2)}} J = \delta^{(3)} = \frac{\partial}{\partial b_1^{(2)}} J \]

\[
\begin{aligned}
\nabla_{W^{(1)}} J &= \delta^{(2)} \left( a^{(1)} \right)^T \\
&= \begin{bmatrix} \delta_1^{(2)} \\ \delta_2^{(2)} \\ \delta_3^{(2)} \end{bmatrix}
\begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}
= \begin{bmatrix}
\frac{\partial}{\partial W_{11}^{(1)}} J & \frac{\partial}{\partial W_{12}^{(1)}} J & \frac{\partial}{\partial W_{13}^{(1)}} J \\
\frac{\partial}{\partial W_{21}^{(1)}} J & \frac{\partial}{\partial W_{22}^{(1)}} J & \frac{\partial}{\partial W_{23}^{(1)}} J \\
\frac{\partial}{\partial W_{31}^{(1)}} J & \frac{\partial}{\partial W_{32}^{(1)}} J & \frac{\partial}{\partial W_{33}^{(1)}} J
\end{bmatrix}
\end{aligned}
\]

\[
\nabla_{b^{(1)}} J = \delta^{(2)}
= \begin{bmatrix} \frac{\partial}{\partial b_1^{(1)}} J \\ \frac{\partial}{\partial b_2^{(1)}} J \\ \frac{\partial}{\partial b_3^{(1)}} J \end{bmatrix}
\]

From this we can again draw a general conclusion. Suppose layer \(l+2\) has \(k\) neurons, layer \(l+1\) has \(m\) neurons, layer \(l\) has \(n\) neurons, and the input is \(x = a^{(1)}\). Then the error term of layer \(l+1\) (with \(l+1\) at least 2) is:

\[ \delta^{(l+1)} = \left( \left( W^{(l+1)} \right)^T \delta^{(l+2)} \right) \bullet f'(z^{(l+1)}) \]

where \(W^{(l+1)} \in \mathbb{R}^{k\times m}\), \(\delta^{(l+2)} \in \mathbb{R}^{k\times 1}\), \(z^{(l+1)} \in \mathbb{R}^{m\times 1}\), and \(\delta^{(l+1)} \in \mathbb{R}^{m\times 1}\). In particular, when layer \(l+1\) is the output layer, the formula above is not used. Writing \(l+1\) as \(n_l\), the error term of the output layer is:

\[ \delta^{(n_l)} = -\left( y - a^{(n_l)} \right) \bullet f'(z^{(n_l)}) \]

The partial derivatives are then:

\[ \nabla_{W^{(l)}} J(W,b; x, y) = \delta^{(l+1)} \left( a^{(l)} \right)^T \]

\[ \nabla_{b^{(l)}} J(W,b; x, y) = \delta^{(l+1)} \]

where \(a^{(l)} \in \mathbb{R}^{n\times 1}\), \(\nabla_{W^{(l)}} J(W,b;x,y) \in \mathbb{R}^{m\times n}\), and \(\nabla_{b^{(l)}} J(W,b;x,y) \in \mathbb{R}^{m\times 1}\).
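These general formulas map almost line-for-line onto code. Below is a minimal NumPy sketch of backpropagation for a single training example; it reuses the hypothetical forward() function from Section 2, assumes tanh activations, and returns only the per-example gradients, so the weight-decay term and the averaging over the training set still have to be added before the gradient-descent update.

def backprop(x, y, params, f=np.tanh, f_prime=lambda z: 1.0 - np.tanh(z) ** 2):
    # Returns [(dJ/dW^(1), dJ/db^(1)), (dJ/dW^(2), dJ/db^(2)), ...] for J(W,b;x,y).
    activations, zs = forward(x, params, f)       # step 1: forward propagation
    grads = [None] * len(params)

    # Step 2: output-layer error term, delta^(n_l) = -(y - a^(n_l)) .* f'(z^(n_l))
    delta = -(y - activations[-1]) * f_prime(zs[-1])

    for l in reversed(range(len(params))):
        # Step 4: nabla_W J = delta^(l+1) (a^(l))^T and nabla_b J = delta^(l+1)
        grads[l] = (delta @ activations[l].T, delta)
        if l > 0:
            # Step 3: delta^(l+1) = ((W^(l+1))^T delta^(l+2)) .* f'(z^(l+1))
            W_next, _ = params[l]
            delta = (W_next.T @ delta) * f_prime(zs[l - 1])
    return grads

Averaging these per-example gradients over the training set, adding \(\lambda W^{(l)}\) to the weight gradients, and applying the gradient-descent update from above completes one training iteration.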

4 Closing Remarks

Criticism and corrections are welcome in the comments section, so that we can all improve together. Thank you!

5 References

The content of this article is based on the following tutorial:

[1] Unsupervised Feature Learning and Deep Learning Tutorial (stanford.edu)

The cover image also comes from this tutorial.
