Deep Learning with Python, Chapter 6, Section 6.2: Understanding recurrent neural networks (RNNs)

Settle down, work steadily, and success will come.

6.2 Understanding recurrent neural networks (RNNs)

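A major characteristic of all the neural networks you have seen so far, such as densely connected networks and convnets, is that they have no memory. Each input is processed independently, with no state kept between inputs. To process a sequence or a time series with such a network, you have to show it the entire sequence at once, turning it into a single data point. For instance, in the IMDB example, an entire movie review was transformed into a single vector and processed in one go. Such networks are called feedforward networks.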

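In contrast, as you read this sentence, you process it word by word while keeping a memory of what came before, which gives you a good representation of the meaning the sentence conveys. Biological intelligence processes information incrementally while maintaining an internal state for what it is processing, built from past information and updated as new information comes in.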

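A recurrent neural network (RNN) adopts the same principle, albeit in an extremely simplified version. It processes a sequence by iterating through its elements and maintaining a state that contains information about what it has seen so far. In effect, an RNN is a neural network with an internal loop (see Figure 6.9). The state exists only within one sequence: the RNN resets it between two different, unrelated sequences. You can therefore still treat one sequence as a single data point, a single input to the network. What changes is that this data point is no longer processed in a single step; instead, the network loops internally over the sequence elements.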

Figure 6.9 A recurrent neural network (RNN)

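To make the notions of loop and state clear, let's implement the forward pass of a toy RNN in NumPy. This RNN takes as input a sequence of vectors, encoded as a 2D tensor of shape (timesteps, input_features). It loops over the timesteps, and at each timestep t it combines its current state with the input at t (of shape (input_features,)) to produce the output at t. The state for the next step is then set to this output. For the first timestep there is no previous output, so there is no current state. The state is therefore initialized to an all-zero vector, called the initial state of the network.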

In pseudocode, the RNN looks like this:

#Listing 6.19 Pseudocode RNN
# The state at t
state_t = 0
# Iterates over sequence elements
for input_t in input_sequence:
    output_t = f(input_t, state_t)
    # The previous output becomes the state for the next iteration.
    state_t = output_t

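You can now flesh out the function f: the transformation of the input and state into an output is parameterized by two matrices, W and U, and a bias vector. It is similar to the transformation performed by a densely connected layer in a feedforward network.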

#Listing 6.20 More detailed pseudocode for the RNN
state_t = 0
for input_t in input_sequence:
    output_t = activation(dot(W, input_t) + dot(U, state_t) + b)
    state_t = output_t

To make these notions completely unambiguous, here is a naive NumPy implementation of the forward pass of the simple RNN.

#Listing 6.21 Numpy implementation of a simple RNN
import numpy as np
# Number of timesteps in the input sequence
timesteps = 100
# Dimensionality of the input feature space
input_features = 32
# Dimensionality of the output feature space
output_features = 64
# Input data: random noise for the sake of the example
inputs = np.random.random((timesteps, input_features))
# Initial state: an all-zero vector
state_t = np.zeros((output_features,))
# Creates random weight matrices
W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))
successive_outputs = []
# input_t is a vector of shape (input_features,).
for input_t in inputs:
    # Combines the input with the current state (the previous output)
    # to obtain the current output
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    # Stores this output in a list
    successive_outputs.append(output_t)
    # Updates the state of the network for the next timestep
    state_t = output_t
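# The final output is a 2D tensor of shape (timesteps, output_features).
# np.stack adds the timestep axis; np.concatenate would flatten the
# outputs into a single 1D vector instead.
final_output_sequence = np.stack(successive_outputs, axis=0)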

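Easy enough: an RNN is just a for loop that reuses quantities computed during the previous iteration of the loop, nothing more. Of course, you could build many different kinds of RNN. An RNN is characterized by its step function, the function applied at each timestep, such as the following one (see Figure 6.10):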

output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)

Figure 6.10 A simple RNN, unrolled over time

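Note: In this example, the final output is a 2D tensor of shape (timesteps, output_features), where each timestep holds the output of the loop at time t. When an input sequence is processed, the output at timestep t contains information about timesteps 0 to t of the input. For this reason, in many cases you don't need the full sequence of outputs; you only need the last output (output_t at the end of the loop), because it already contains information about the entire sequence.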
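As a rough illustration of the note above, the sketch below reruns the toy loop with small sizes and compares the stacked sequence of outputs with the last output alone. The sizes and the random seed are assumptions chosen for the example, not values from the book.

```python
import numpy as np

# Hypothetical small sizes and a fixed seed, chosen only for this sketch.
rng = np.random.default_rng(0)
timesteps, input_features, output_features = 5, 3, 4

inputs = rng.random((timesteps, input_features))
W = rng.random((output_features, input_features))
U = rng.random((output_features, output_features))
b = rng.random((output_features,))

state_t = np.zeros((output_features,))
successive_outputs = []
for input_t in inputs:
    output_t = np.tanh(W @ input_t + U @ state_t + b)
    successive_outputs.append(output_t)
    state_t = output_t

# Full sequence of outputs: one row per timestep.
all_outputs = np.stack(successive_outputs, axis=0)
# Last output only: the final row, identical to the final state.
last_output = all_outputs[-1]

print(all_outputs.shape)  # (5, 4)
print(last_output.shape)  # (4,)
```

The last row of the stacked tensor is the same vector that the loop leaves in state_t, which is why keeping only the final output is often enough.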

6.2.1 A recurrent layer in Keras

The process you just implemented in NumPy corresponds to an actual Keras layer, the SimpleRNN layer:

from keras.layers import SimpleRNN

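There is one minor difference: like all other Keras layers, SimpleRNN processes batches of sequences, not a single sequence as in the NumPy example. This means that the input to a SimpleRNN layer has shape (batch_size, timesteps, input_features) rather than (timesteps, input_features).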

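Like all recurrent layers in Keras, SimpleRNN can be run in two modes. It can return the full sequence of outputs for every timestep, a 3D tensor of shape (batch_size, timesteps, output_features), or only the last output for each input sequence, a 2D tensor of shape (batch_size, output_features). The two modes are controlled by the return_sequences argument. Here is a simple SimpleRNN example that returns only the output at the last timestep.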

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32))
model.summary()

________________________________________________________________
Layer (type)                  Output Shape        Param #
================================================================
embedding_22 (Embedding)      (None, None, 32)    320000
________________________________________________________________
simplernn_10 (SimpleRNN)      (None, 32)          2080
================================================================
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0
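The parameter counts in the summary can be checked by hand. The sketch below assumes the SimpleRNN layer holds exactly one input matrix W, one recurrent matrix U, and one bias vector, as in the NumPy listing; the helper name simple_rnn_param_count is made up for this example.

```python
def simple_rnn_param_count(input_dim, units):
    """Count the weights of W (units x input_dim), U (units x units), and b (units)."""
    return units * input_dim + units * units + units

# Embedding(10000, 32): one 32-dimensional vector per token.
embedding_params = 10000 * 32
# SimpleRNN(32) fed with 32-dimensional embeddings.
rnn_params = simple_rnn_param_count(input_dim=32, units=32)

print(embedding_params)               # 320000
print(rnn_params)                     # 2080
print(embedding_params + rnn_params)  # 322080
```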

The following example returns the full sequence of outputs.

model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.summary()

________________________________________________________________
Layer (type)                Output Shape        Param #
================================================================
embedding_23 (Embedding)    (None, None, 32)    320000
________________________________________________________________
simplernn_11 (SimpleRNN)    (None, None, 32)    2080
================================================================
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0
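The two output shapes above can be mimicked in NumPy. This is a minimal sketch of a batched forward pass, not the Keras implementation: the function simple_rnn_forward, the row-vector weight layout, and the small random weights are all assumptions made for the example.

```python
import numpy as np

def simple_rnn_forward(inputs, W, U, b, return_sequences=False):
    """Toy batched RNN. inputs has shape (batch_size, timesteps, input_features)."""
    batch_size, timesteps, _ = inputs.shape
    units = U.shape[0]
    state = np.zeros((batch_size, units))
    outputs = []
    for t in range(timesteps):
        # Row-vector convention: each row of the batch is one sample.
        state = np.tanh(inputs[:, t, :] @ W + state @ U + b)
        outputs.append(state)
    if return_sequences:
        # Stack along a new timestep axis: (batch_size, timesteps, units).
        return np.stack(outputs, axis=1)
    # Only the last output: (batch_size, units).
    return state

rng = np.random.default_rng(0)
batch_size, timesteps, input_features, units = 4, 10, 32, 32
x = rng.normal(size=(batch_size, timesteps, input_features))
W = rng.normal(scale=0.1, size=(input_features, units))
U = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

full = simple_rnn_forward(x, W, U, b, return_sequences=True)
last = simple_rnn_forward(x, W, U, b)
print(full.shape)  # (4, 10, 32)
print(last.shape)  # (4, 32)
```

In the Keras summaries the batch and timestep dimensions appear as None because they are not fixed when the model is built.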

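It is sometimes useful to stack several recurrent layers one after another to increase the representational power of a network. In that setup, every intermediate layer must return its full sequence of outputs: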

>>> model = Sequential()
>>> model.add(Embedding(10000, 32))
>>> model.add(SimpleRNN(32, return_sequences=True))
>>> model.add(SimpleRNN(32, return_sequences=True))
>>> model.add(SimpleRNN(32, return_sequences=True))
>>> # The last layer returns only the last output
>>> model.add(SimpleRNN(32))
>>> model.summary()

________________________________________________________________
Layer (type)             Output Shape      Param #
================================================================
embedding_24 (Embedding)  (None, None, 32) 320000
________________________________________________________________
simplernn_12 (SimpleRNN)  (None, None, 32) 2080
________________________________________________________________
simplernn_13 (SimpleRNN)  (None, None, 32) 2080
________________________________________________________________
simplernn_14 (SimpleRNN)  (None, None, 32) 2080
________________________________________________________________
simplernn_15 (SimpleRNN)  (None, 32)       2080
================================================================
Total params: 328,320
Trainable params: 328,320
Non-trainable params: 0
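Why the intermediate layers must return full sequences can be seen from the shapes alone. The sketch below stacks a toy NumPy layer four times; rnn_layer and random_weights are hypothetical helpers written for this example, and the random input stands in for the output of the Embedding layer.

```python
import numpy as np

def rnn_layer(x, W, U, b, return_sequences):
    """Toy recurrent layer. x has shape (batch_size, timesteps, features)."""
    state = np.zeros((x.shape[0], U.shape[0]))
    outputs = []
    for t in range(x.shape[1]):
        state = np.tanh(x[:, t, :] @ W + state @ U + b)
        outputs.append(state)
    return np.stack(outputs, axis=1) if return_sequences else state

rng = np.random.default_rng(0)
units = 32

def random_weights():
    """Return (W, U, b) for a layer with 32 input features and 32 units."""
    return (rng.normal(scale=0.1, size=(units, units)),
            rng.normal(scale=0.1, size=(units, units)),
            np.zeros(units))

# Stand-in for embedded input: (batch_size, timesteps, features).
h = rng.normal(size=(4, 10, units))
for _ in range(3):
    # Intermediate layers keep the timestep axis, so the next layer
    # still has a sequence to iterate over.
    h = rnn_layer(h, *random_weights(), return_sequences=True)

# The last layer collapses the sequence to one vector per sample.
out = rnn_layer(h, *random_weights(), return_sequences=False)
print(h.shape)    # (4, 10, 32)
print(out.shape)  # (4, 32)
```

A layer that returned only its last output would hand a 2D tensor to the next recurrent layer, which would then have no timesteps left to loop over.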

Now let's apply such a model to the IMDB movie-review classification problem. First, preprocess the data.

#Listing 6.22 Preparing the IMDB data
from keras.datasets import imdb
from keras.preprocessing import sequence
# Number of words to consider as features
max_features = 10000
# Cuts off texts after this many words (among
# the max_features most common words)
maxlen = 500
batch_size = 32
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(
    num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
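To make the padding step concrete, here is a NumPy sketch of what padding a batch of sequences to a fixed length does. It assumes the Keras defaults for pad_sequences (padding='pre' and truncating='pre'), under which short sequences get zeros on the left and long ones keep their last maxlen items. The function pad_to_length is a stand-in written for this example, not the Keras implementation.

```python
import numpy as np

def pad_to_length(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays fully padded
        trimmed = seq[-maxlen:]
        padded[row, -len(trimmed):] = trimmed
    return padded

# Two made-up "reviews" encoded as word indices.
reviews = [[5, 8, 2], [7, 1, 4, 9, 3, 6]]
print(pad_to_length(reviews, maxlen=4))
# [[0 5 8 2]
#  [4 9 3 6]]
```

The result is a single 2D integer tensor of shape (samples, maxlen), which is the form the Embedding layer expects.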

Next, train a simple recurrent network using an Embedding layer and a SimpleRNN layer.

#Listing 6.23 Training the model with Embedding and SimpleRNN layers
from keras.layers import Dense
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)

Now plot the training and validation loss and accuracy (see Figures 6.11 and 6.12).

#Listing 6.24 Plotting results
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

Figure 6.11 Training and validation loss on IMDB with SimpleRNN

Figure 6.12 Training and validation accuracy on IMDB with SimpleRNN

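As a reminder, the approach implemented in Chapter 3 reached a test accuracy of 88%. Unfortunately, this simple recurrent network does not do as well as that baseline (only 85% validation accuracy). Part of the problem is that each input is truncated to 500 words rather than read as a full sequence, so the RNN sees less information. The other part is that SimpleRNN is not good at processing long sequences, such as text.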

This is where more advanced recurrent layers come in.

6.2.2 Understanding the LSTM and GRU layers

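SimpleRNN is not the only recurrent layer available in Keras; there are two others, LSTM and GRU. In practice you will almost always use one of these two, because SimpleRNN has a major issue: although in theory it should retain at time t information about inputs seen many timesteps before, in practice such long-term dependencies are impossible to learn. This is due to the vanishing gradient problem, an effect similar to what happens in non-recurrent networks that are many layers deep: as layers are added, the network eventually becomes untrainable. The theoretical reasons for this effect were studied by Hochreiter, Schmidhuber, and Bengio in the early 1990s. The LSTM and GRU layers are designed to solve this problem.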
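The shrinking effect can be illustrated numerically. For a tanh RNN, the derivative of the state at step t with respect to the state at step t-1 is diag(1 - state_t**2) @ U, and the influence of an early state on a late one is the product of these matrices. The sketch below tracks the norm of that product. The recurrent matrix is rescaled so that its largest singular value is 0.5, an assumption chosen to make the decay easy to see; it is not a claim about trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
units, timesteps = 32, 50

# Hypothetical recurrent weights, rescaled to spectral norm 0.5.
U = rng.normal(size=(units, units))
U *= 0.5 / np.linalg.norm(U, 2)

state = np.zeros(units)
jacobian_product = np.eye(units)  # d state_t / d state_0, starts as identity
norms = []
for t in range(timesteps):
    # Random vector standing in for the input contribution dot(W, input_t) + b.
    state = np.tanh(U @ state + rng.normal(size=units))
    # Jacobian of this step: diag(1 - state_t**2) @ U.
    step_jacobian = (1.0 - state ** 2)[:, None] * U
    jacobian_product = step_jacobian @ jacobian_product
    norms.append(np.linalg.norm(jacobian_product, 2))

# Norm of the accumulated Jacobian after 1, 10, and 50 steps.
print(norms[0], norms[9], norms[-1])
```

Because each factor has norm at most 0.5 here, the product shrinks at least geometrically, so a gradient flowing back 50 steps carries almost no signal. With larger recurrent weights the same product can grow instead, which is the exploding-gradient counterpart.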

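The Long Short-Term Memory (LSTM) algorithm was developed by Hochreiter and Schmidhuber in 1997. It is a variant of the SimpleRNN layer that adds a way to carry information across many timesteps. In essence, LSTM saves information for later use, which prevents older signals from gradually vanishing during processing.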

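To understand this in detail, let's start from the SimpleRNN cell (see Figure 6.13). Because there will be many weight matrices, the W and U matrices in the output expression are indexed with the letter o (Wo and Uo).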

Figure 6.13 The starting point of an LSTM layer: a SimpleRNN

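Next, add to this picture an additional data flow that carries information across timesteps. Call its values at different timesteps Ct, where C stands for carry. This information affects the cell as follows: it is combined with the input connection and the recurrent connection, and it affects the state sent to the next timestep. Conceptually, the carry data flow modulates the next output and the next state (see Figure 6.14). Simple so far.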

Figure 6.14 Going from a SimpleRNN to an LSTM: adding a carry track

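The next value of the carry data flow is computed slightly differently. It involves three distinct transformations, each of which has the same form as a SimpleRNN cell: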

y = activation(dot(state_t, U) + dot(input_t, W) + b)

But each of the three transformations has its own weight matrices, indexed with the letters i, f, and k. This gives the following:

#Listing 6.25 Pseudocode details of the LSTM architecture (1/2)
output_t = activation(dot(state_t, Uo) + dot(input_t, Wo) + dot(C_t, Vo) + bo)
i_t = activation(dot(state_t, Ui) + dot(input_t, Wi) + bi)
f_t = activation(dot(state_t, Uf) + dot(input_t, Wf) + bf)
k_t = activation(dot(state_t, Uk) + dot(input_t, Wk) + bk)

The new carry state (the next c_t) is obtained by combining i_t, f_t, and k_t.

#Listing 6.26 Pseudocode details of the LSTM architecture (2/2)
c_t+1 = i_t * k_t + c_t * f_t

Add this to the picture, as shown in Figure 6.15, and you have an LSTM. Not so complicated.

Figure 6.15 Anatomy of an LSTM

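These operations can be interpreted as follows: multiplying c_t by f_t is a way to forget irrelevant information in the carry data flow, while i_t and k_t provide information about the present and update the carry track. At the end of the day, though, these interpretations don't mean much, because what the operations actually do is determined by the weights that parameterize them, and those weights are learned over many rounds of training. The specification of an RNN cell determines the hypothesis space of the model, but it does not determine what the cell does; that is up to the cell weights. The same cell with different weights can do very different things. So the combination of operations that makes up an RNN cell is better interpreted as a set of constraints than as a design in an engineering sense.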
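The pseudocode in Listings 6.25 and 6.26 can be turned into a small NumPy sketch of a single timestep. The listings only say "activation", so the choices here are assumptions: a sigmoid for i_t and f_t (so that they act as soft gates between 0 and 1) and tanh for k_t and output_t. The function lstm_step, the parameter dictionary p, and the sizes are likewise made up for illustration. This follows the simplified pseudocode above, not the exact Keras LSTM implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(input_t, state_t, c_t, p):
    """One timestep of the simplified LSTM pseudocode (row-vector convention)."""
    # Three SimpleRNN-like transformations, each with its own weights.
    i_t = sigmoid(state_t @ p["Ui"] + input_t @ p["Wi"] + p["bi"])
    f_t = sigmoid(state_t @ p["Uf"] + input_t @ p["Wf"] + p["bf"])
    k_t = np.tanh(state_t @ p["Uk"] + input_t @ p["Wk"] + p["bk"])
    # The output also reads the current carry through Vo.
    output_t = np.tanh(state_t @ p["Uo"] + input_t @ p["Wo"]
                       + c_t @ p["Vo"] + p["bo"])
    # New carry: forget part of the old carry, add new information.
    c_next = i_t * k_t + c_t * f_t
    return output_t, c_next

rng = np.random.default_rng(0)
input_features, units, timesteps = 8, 16, 5

# Random parameters for the four transformations o, i, f, and k.
p = {}
for gate in "oifk":
    p["W" + gate] = rng.normal(scale=0.1, size=(input_features, units))
    p["U" + gate] = rng.normal(scale=0.1, size=(units, units))
    p["b" + gate] = np.zeros(units)
p["Vo"] = rng.normal(scale=0.1, size=(units, units))

state_t = np.zeros(units)
c_t = np.zeros(units)
for input_t in rng.normal(size=(timesteps, input_features)):
    # As in the SimpleRNN, the output becomes the next state.
    state_t, c_t = lstm_step(input_t, state_t, c_t, p)

print(state_t.shape, c_t.shape)  # (16,) (16,)
```

With f_t close to 1 and i_t close to 0, the carry passes through a step almost unchanged, which is how information can survive across many timesteps.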

6.2.3 A concrete LSTM example in Keras

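Now let's train a model with an LSTM layer on the IMDB dataset (see Figures 6.16 and 6.17). The network is similar to the earlier SimpleRNN one. You only specify the output dimensionality of the LSTM layer and leave every other argument at its default value.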

#Listing 6.27 Using the LSTM layer in Keras
from keras.layers import LSTM
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)

Figure 6.16 Training and validation loss on IMDB with LSTM

Figure 6.17 Training and validation accuracy on IMDB with LSTM

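As the curves show, the LSTM model reaches 89% validation accuracy. Not bad: it is better than the SimpleRNN network (mainly because LSTM suffers much less from the vanishing gradient problem), and also better than the fully connected approach from Chapter 3, even though it uses less data than that approach did.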

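But why isn't this result better still? One reason is that no hyperparameters were tuned, such as the embedding dimensionality or the LSTM output dimensionality. Another is the lack of regularization. But honestly, the main reason is that analyzing the long-term structure of the reviews, which is what LSTM is good at, does not help much with a sentiment-analysis problem. Such a problem is solved well by looking at which words occur in each review and how often, which is what the first fully connected approach did.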
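"Which words occur and how often" can be made concrete with a count vector per review. The sketch below is one simple way to build such a representation; the function count_vectorize and the tiny vocabulary are assumptions for illustration and may differ from the exact encoding used in Chapter 3.

```python
import numpy as np

def count_vectorize(sequences, dimension):
    """Turn lists of word indices into rows of per-word counts."""
    counts = np.zeros((len(sequences), dimension), dtype="float32")
    for row, seq in enumerate(sequences):
        for word_index in seq:
            counts[row, word_index] += 1.0
    return counts

# Two made-up reviews over a vocabulary of five words.
reviews = [[1, 3, 3], [0, 2]]
print(count_vectorize(reviews, dimension=5))
# [[0. 1. 0. 2. 0.]
#  [1. 0. 1. 0. 0.]]
```

This representation discards word order entirely, which is the opposite of what a recurrent layer exploits.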

6.2.4 Wrapping up

In this section, you learned the following:

What RNNs are and how they work
What LSTM is, and why it works better on long sequences than a naive RNN
How to use Keras RNN layers to process sequence data

To be continued...

Enjoy!

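I started translating this book series because it explains deep learning in an accessible way. It offers not only worked examples but also the author's deep understanding of deep learning concepts and principles, drawn from years of practice. One last, less important point: François Chollet is the author of Keras.
Disclaimer: this material is for personal study, exchange, and research only; any other use is prohibited. If you like it, please buy the original English edition.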

侠天 focuses on big data, machine learning, and mathematics, and shares related technical articles on a personal public account: bigdata_ny.

If you find anything wrong with this article, please contact me.