Deep Learning with Python, Chapter 6, Section 6.1: Deep Learning for Text Processing

6.1 Deep learning for text processing

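Text is one of the most widespread forms of sequence data. It can be understood as a sequence of characters or a sequence of words, but it is most common to work at the level of words. The deep-learning sequence-processing models introduced in the following sections can be applied to document classification, sentiment analysis, author identification, and question answering (QA) in a constrained context. Keep in mind that none of these models truly understands text in a human sense; they only map the statistical structure of written language. Deep learning for natural-language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels.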

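Like all other neural networks, deep-learning models don't take raw text as input: they only work with numeric tensors. Vectorizing text is the process of transforming text into numeric tensors. It can be done in several ways:

Segment text into words, and transform each word into a vector.
Segment text into characters, and transform each character into a vector.
Extract n-grams of words or characters, and transform each n-gram into a vector. N-grams are groups of multiple consecutive words or characters.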

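The different units into which you can break down text (words, characters, or n-grams) are called tokens, and breaking text into such tokens is called tokenization. Every text-vectorization process consists of applying some tokenization scheme and then associating numeric vectors with the generated tokens. These vectors, packed into sequence tensors, are fed into the neural network. There are multiple ways to associate a vector with a token; this section presents two of them: one-hot encoding and word embeddings.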
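As a small illustration of the two steps just described (tokenize, then map tokens to numbers), here is a minimal sketch in plain Python. The variable names are chosen for this example only, and the whitespace split is deliberately naive: punctuation stays attached to the words.

```python
# Minimal sketch: two ways to tokenize the same text.
text = "The cat sat on the mat."

# Word-level tokens: naive whitespace split (punctuation stays attached).
word_tokens = text.split()

# Character-level tokens: every character, spaces included.
char_tokens = list(text)

# Map each distinct word token to an integer index, starting at 1.
vocabulary = {}
for token in word_tokens:
    if token not in vocabulary:
        vocabulary[token] = len(vocabulary) + 1

# The text as a sequence of integer indices, ready to be turned into vectors.
encoded = [vocabulary[token] for token in word_tokens]
print(word_tokens)
print(encoded)
```

Note that "The" and "the" end up as different tokens here; real tokenizers usually normalize case first.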

Figure 6.1 The text-vectorization process

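Understanding n-grams and bag-of-words

Word n-grams are groups of N (or fewer) consecutive words that you can extract from a sentence. The same concept can also be applied to characters instead of words.

Here's a simple example. The sentence "The cat sat on the mat" can be decomposed into the following set of 2-grams:

{"The", "The cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the", "the mat", "mat"}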

It can also be decomposed into the following set of 3-grams:

{"The", "The cat", "cat", "cat sat", "The cat sat", "sat", "sat on", "on", "cat sat on", "on the", "the", "sat on the", "the mat", "mat", "on the mat"}

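Such sets are called a bag-of-2-grams and a bag-of-3-grams, respectively. The term bag refers to the fact that you're dealing with a set of tokens rather than a list or sequence: the tokens have no specific order. This family of tokenization methods is called bag-of-words.

Because bag-of-words isn't an order-preserving tokenization method, the general structure of the sentence is lost. It therefore tends to be used in shallow language-processing models rather than in deep-learning models. Extracting n-grams is a form of feature engineering, and deep learning replaces this kind of rigid, hand-crafted feature engineering with hierarchical feature learning. The one-dimensional convnets and recurrent neural networks introduced later in this chapter can learn representations for groups of words and characters directly from continuous word or character sequences. For this reason, this book doesn't cover n-grams any further. But keep in mind that they're a powerful, hard-to-avoid feature-engineering tool when using lightweight, shallow text-processing models such as logistic regression and random forests.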
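The decomposition above can be reproduced with a few lines of Python. This is only a sketch: the function name bag_of_ngrams is made up for this example, and it follows the convention used in the sets above, where a bag of N-grams also contains all shorter n-grams.

```python
def bag_of_ngrams(sentence, max_n):
    """Return the set of all word n-grams of length 1..max_n."""
    words = sentence.split()
    bag = set()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the sentence.
        for start in range(len(words) - n + 1):
            bag.add(" ".join(words[start:start + n]))
    return bag

bag2 = bag_of_ngrams("The cat sat on the mat", 2)
bag3 = bag_of_ngrams("The cat sat on the mat", 3)
print(sorted(bag2))
print(len(bag2), len(bag3))
```

Because the result is a Python set, the order of the tokens is not preserved, which is exactly the "bag" property described above.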

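6.1.1 One-hot encoding of words and characters

One-hot encoding is the most common, most basic way to turn a token into a vector. You saw it in action in the IMDB and Reuters examples in chapter 3. It consists of associating a unique integer index with every word and then turning this index i into a binary vector of size N (the size of the vocabulary): the vector is all zeros except for the i-th entry, which is 1.

Of course, one-hot encoding can be done at the character level as well. To make the difference clear, listings 6.1 and 6.2 show one-hot encoding for words and for characters, respectively.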

#Listing 6.1 Word-level one-hot encoding
import numpy as np

# Initial data: one entry per sample (in this example, a sample is a
# sentence, but it could be an entire document)
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Builds an index of all tokens in the data
token_index = {}
for sample in samples:
    # Tokenizes the samples via the split method. In real life, you'd also
    # strip punctuation and special characters from the samples.
    for word in sample.split():
        if word not in token_index:
            # Assigns a unique index to each unique word. Note that you
            # don't attribute index 0 to anything.
            token_index[word] = len(token_index) + 1

# Vectorizes the samples. You'll only consider the first max_length
# words in each sample.
max_length = 10

# This is where you store the results.
results = np.zeros(shape=(len(samples),
                          max_length,
                          max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.

#Listing 6.2 Character-level one-hot encoding
import string
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# All printable ASCII characters
characters = string.printable
# Maps each character to a unique index (the character is the key, so that
# token_index.get(character) returns an integer); index 0 is left unused.
token_index = dict(zip(characters, range(1, len(characters) + 1)))
max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    # Only the first max_length characters of each sample are encoded.
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.

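Keras has built-in utilities for one-hot encoding text. You should use them, because they take care of a number of useful features, such as stripping specified characters from strings and only taking into account the N most common words in your dataset (a common restriction, to avoid dealing with very large input vector spaces).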

#Listing 6.3 Using Keras for word-level one-hot encoding
from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# Creates a tokenizer, configured to only take into account the 1,000 most common words
tokenizer = Tokenizer(num_words=1000)
# Builds the word index
tokenizer.fit_on_texts(samples)
# Turns strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)
# You could also directly get the one-hot binary representations.
# Vectorization modes other than one-hot encoding are supported by this tokenizer.
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
# How you can recover the word index that was computed
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
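To make concrete the kind of work such a utility takes off your hands (stripping characters and keeping only the most common words), here is a rough pure-Python sketch. The helper build_word_index is hypothetical and only approximates the idea; it is not the Keras implementation and its details may differ from what the Tokenizer actually does.

```python
import string
from collections import Counter

def build_word_index(texts, num_words=None):
    """Lowercase, strip ASCII punctuation, and rank words by frequency.

    Hypothetical helper: it only approximates what a library tokenizer does.
    """
    strip_punctuation = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_punctuation).split())
    # Index 1 goes to the most frequent word; index 0 is left unused.
    ranked = [word for word, _ in counts.most_common(num_words)]
    return {word: rank for rank, word in enumerate(ranked, start=1)}

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
toy_word_index = build_word_index(samples, num_words=1000)
print(toy_word_index)
```

With this normalization, "The" and "the" count as the same word, and "mat." loses its trailing period, which is why the vocabulary is smaller than with a bare split.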

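A variant of one-hot encoding is the so-called one-hot hashing trick, which you can use when the vocabulary is too large to handle explicitly. Instead of assigning an index to each word and keeping those indices in a dictionary, you hash words into vectors of fixed size, typically with a very lightweight hashing function. The main advantages of this method are that it saves memory, because no explicit word index has to be maintained, and that it allows online encoding of the data. Its one drawback is that it's susceptible to hash collisions: two different words may end up with the same hash, and a machine-learning model looking at these hashes can't tell the words apart. The likelihood of hash collisions decreases as the dimensionality of the hashing space grows relative to the number of unique tokens being hashed.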

#Listing 6.4 Word-level one-hot encoding with hashing trick
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# Stores the words as vectors of size 1,000. If you have close to 1,000 words
# (or more), you'll see many hash collisions, which will decrease the accuracy
# of this encoding method.
dimensionality = 1000
max_length = 10
results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hashes the word into a random integer index between 0 and 1,000
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
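One practical caveat with the listing above: in Python 3 the built-in hash() of a string is salted per process (unless PYTHONHASHSEED is fixed), so the indices it produces change from one run to the next. The sketch below swaps in hashlib.md5 to get stable indices and counts collisions; this substitution, and the helper names hashed_index and count_collisions, are assumptions made for illustration rather than part of the original listing.

```python
import hashlib

def hashed_index(word, dimensionality):
    """Map a word to a stable index in [0, dimensionality) using MD5."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % dimensionality

def count_collisions(words, dimensionality):
    """Number of distinct words that share an index with another word."""
    distinct = set(words)
    used = {hashed_index(w, dimensionality) for w in distinct}
    return len(distinct) - len(used)

words = ["The", "cat", "sat", "on", "the", "mat."]
print(count_collisions(words, 1000))
# 6 distinct words squeezed into 4 slots: at least 2 must collide.
print(count_collisions(words, 4))
```

Shrinking the hashing space forces collisions, which illustrates why the dimensionality should be much larger than the number of unique tokens.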

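6.1.2 Word embeddings

Another popular and powerful way to associate a vector with a word is to use dense word vectors, also called word embeddings. Whereas the vectors obtained through one-hot encoding are binary, sparse (mostly made of zeros), and very high-dimensional (the same dimensionality as the number of words in the vocabulary), word embeddings are low-dimensional floating-point vectors (that is, dense vectors); see figure 6.2. Unlike one-hot vectors, word embeddings are learned from data. It's common to see word embeddings that are 256-dimensional, 512-dimensional, or 1,024-dimensional. One-hot encoding, on the other hand, generally leads to vectors that are 20,000-dimensional or greater (for a vocabulary of 20,000 tokens, in this case). So word embeddings pack more information into far fewer dimensions.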

Figure 6.2 Vectors obtained from one-hot encoding compared with word embeddings

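There are two ways to obtain word embeddings:

Learn word embeddings jointly with the main task you care about (such as document classification or sentiment prediction). In this setup, you start with random word vectors and then learn the word vectors in the same way you learn the weights of a neural network.
Load pretrained word embeddings into your model. Pretrained word embeddings are typically computed using a different machine-learning task than the one you're trying to solve.

Let's look at both.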

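Learning word embeddings with the Embedding layer

The simplest way to associate a dense vector with a word is to choose the vector at random. The problem with this approach is that the resulting embedding space has no structure: for instance, the words accurate and exact may end up with completely different embeddings, even though they're interchangeable in most sentences. It's difficult for a deep neural network to make sense of such a noisy, unstructured embedding space.

To get a bit more abstract, the geometric relationships between word vectors should reflect the semantic relationships between the words. Word embeddings are meant to map human language into a geometric space. For instance, you would expect synonyms to be embedded into similar word vectors; more generally, you would expect the geometric distance (such as the L2 distance) between any two word vectors to relate to the semantic distance between the associated words. In addition to distance, you may want specific directions in the embedding space to be meaningful. To make both points clearer, let's look at a concrete example.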

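Figure 6.3 A toy example of a word-embedding space

In figure 6.3, four words are embedded on a 2D plane: cat, dog, wolf, and tiger. With the vector representations chosen here, some semantic relationships between these words can be encoded as geometric transformations. For instance, the same vector takes you from cat to tiger and from dog to wolf: this vector could be interpreted as the "from pet to wild animal" vector. Similarly, another vector takes you from dog to cat and from wolf to tiger, and could be interpreted as a "from canine to feline" vector.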
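The relationships described for figure 6.3 can be checked numerically. In the sketch below, the 2D coordinates are hypothetical values picked only to mimic the layout described above; they do not come from a trained model.

```python
# Hypothetical 2D coordinates, chosen only to mirror the figure's layout.
embedding = {
    "cat":   (1.0, 1.0),
    "dog":   (0.0, 1.0),
    "tiger": (1.0, 3.0),
    "wolf":  (0.0, 3.0),
}

def offset(src, dst):
    """Vector that takes you from word `src` to word `dst`."""
    (x1, y1), (x2, y2) = embedding[src], embedding[dst]
    return (x2 - x1, y2 - y1)

# "From pet to wild animal": the same offset for both pairs.
pet_to_wild = offset("cat", "tiger")
print(pet_to_wild, offset("dog", "wolf"))

# "From canine to feline": again the same offset for both pairs.
canine_to_feline = offset("dog", "cat")
print(canine_to_feline, offset("wolf", "tiger"))
```

In a learned embedding space such offsets are only approximately equal, but the idea is the same: a direction in the space carries a meaning.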

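In real-world word-embedding spaces, common examples of meaningful geometric transformations are "gender" vectors and "plural" vectors. For instance, by adding a "female" vector to the vector "king", you obtain the vector "queen". By adding a "plural" vector to "king", you obtain "kings".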

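Is there some ideal word-embedding space that would perfectly map human language and could be used for any natural-language-processing task? Possibly, but nothing of the sort has been computed yet. Also, there is no such thing as a single human language: there are many different languages, and they aren't isomorphic, because a language is the reflection of a specific culture and a specific context. More pragmatically, what makes a good word-embedding space depends heavily on your task: the perfect word-embedding space for an English-language movie-review sentiment-analysis model may look different from the perfect embedding space for an English-language document-classification model, because the importance of certain semantic relationships varies from task to task.

It's thus reasonable to learn a new embedding space with every new task. Fortunately, backpropagation makes this easy, and Keras makes it even easier. It's about learning the weights of a layer: the Embedding layer.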

#Listing 6.5 Instantiating an Embedding layer
from keras.layers import Embedding

# The Embedding layer takes at least two arguments: the number of possible
# tokens (here, 1,000: 1 + maximum word index) and the dimensionality of the
# embeddings (here, 64).
embedding_layer = Embedding(1000, 64)

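The Embedding layer maps integer indices (which stand for specific words) to dense vectors. It takes integers as input, looks them up in an internal dictionary, and returns the associated vectors. It's effectively an efficient dictionary lookup (see figure 6.4).

Figure 6.4 The Embedding layer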

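The Embedding layer takes as input a 2D tensor of integers, of shape (samples, sequence_length), where each entry is a sequence of integers. It can embed sequences of variable lengths: for instance, you could feed it batches with shape (32, 10) (a batch of 32 sequences of length 10) or (64, 15) (a batch of 64 sequences of length 15). All sequences in a batch must have the same length, though, because they need to be packed into a single tensor. So sequences that are shorter than the others should be padded with zeros, and sequences that are longer should be truncated.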
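The padding and truncation rule can be sketched in a few lines of plain Python. The helper pad_or_truncate and its arguments are made up for this illustration; it is not the Keras implementation. Its defaults mimic what we understand to be the defaults of Keras's pad_sequences (padding='pre', truncating='pre'), which is worth double-checking against your Keras version, because with those defaults a long sequence keeps its last maxlen items rather than its first.

```python
def pad_or_truncate(sequence, maxlen, padding="pre", truncating="pre"):
    """Return a list of exactly `maxlen` integers, using 0 as padding."""
    sequence = list(sequence)
    if len(sequence) > maxlen:
        if truncating == "pre":
            # Drop items from the start, keeping the last maxlen items.
            sequence = sequence[len(sequence) - maxlen:]
        else:
            # Drop items from the end, keeping the first maxlen items.
            sequence = sequence[:maxlen]
    missing = maxlen - len(sequence)
    if padding == "pre":
        return [0] * missing + sequence
    return sequence + [0] * missing

batch = [[5, 8], [3, 1, 4, 1, 5, 9]]
padded = [pad_or_truncate(seq, maxlen=4) for seq in batch]
print(padded)
```

After this step every sequence has the same length, so the batch can be packed into a single 2D integer tensor.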

The Embedding layer returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality). Such a 3D tensor can then be processed by an RNN layer or a 1D convolution layer.
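The lookup and the resulting shapes can be imitated with NumPy indexing. This is a sketch of the idea only, not how Keras implements the layer; the sizes, the random weights, and the variable names are assumptions made for the example.

```python
import numpy as np

vocabulary_size, embedding_dim = 10, 4
rng = np.random.default_rng(0)
# Stand-in for the layer's weights: one row of floats per token index.
weights = rng.normal(size=(vocabulary_size, embedding_dim)).astype("float32")

# A batch of 2 sequences of length 3 (a 2D integer tensor).
batch = np.array([[1, 2, 3],
                  [4, 0, 9]])

# The lookup is plain integer indexing into the weight matrix.
embedded = weights[batch]
print(embedded.shape)

# Same result as multiplying one-hot vectors by the weight matrix.
one_hot = np.eye(vocabulary_size, dtype="float32")[batch]
via_matmul = one_hot @ weights
print(np.allclose(embedded, via_matmul))
```

The second computation shows why an embedding can be seen as a dense, learned replacement for one-hot vectors: indexing row i is equivalent to multiplying the one-hot vector for i by the weight matrix, without ever building the sparse vector.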

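When you instantiate an Embedding layer, its weights (its internal dictionary of token vectors) are initially random, just as with any other layer. During training, these word vectors are gradually adjusted via backpropagation, structuring the space into something the downstream model can exploit. Once fully trained, the embedding space shows a lot of structure, a kind of structure specialized for the specific problem for which the model was trained.

Let's apply this idea to the IMDB movie-review sentiment-prediction task that you're already familiar with. First, prepare the data. You'll restrict the movie reviews to the top 10,000 most common words and cut off the reviews after only 20 words. The network will learn 8-dimensional embeddings, turn the input integer sequences (2D integer tensor) into embedded sequences (3D float tensor), flatten the tensor to 2D, and train a single Dense layer on top for classification.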

#Listing 6.6 Loading the IMDB data for use with an Embedding layer
from keras.datasets import imdb
from keras import preprocessing

# Number of words to consider as features
max_features = 10000
# Cuts off the text after this number of words (among the max_features most common words)
maxlen = 20
# Loads the data as lists of integers
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
# Turns the lists of integers into a 2D integer tensor of shape (samples, maxlen)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

#Listing 6.7 Using an Embedding layer and classifier on the IMDB data
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
# Specifies the maximum input length to the Embedding layer so you can later
# flatten the embedded inputs. After the Embedding layer, the activations
# have shape (samples, maxlen, 8).
model.add(Embedding(10000, 8, input_length=maxlen))
# Flattens the 3D tensor of embeddings into a 2D tensor of shape (samples, maxlen * 8)
model.add(Flatten())
# Adds the classifier on top
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

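You get a validation accuracy of about 76%, which is pretty good considering that you're only looking at 20 words in every review. But note that merely flattening the embedded sequences and training a single Dense layer on top leads to a model that treats each word in the input sequence separately, without considering inter-word relationships and sentence structure (for example, this model would likely treat both "this movie is a bomb" and "this movie is the bomb" as negative reviews). It's much better to add recurrent layers or 1D convolutional layers on top of the embedded sequences to learn features that take each sequence into account as a whole. That's what the next few sections focus on.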
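It can help to see how small the classifier in listing 6.7 is compared with its embedding table. The arithmetic below derives the parameter counts from the layer sizes used in that listing (10,000 words, 8 dimensions, 20 words per review, one sigmoid unit); the variable names are only for this worked example.

```python
vocabulary_size = 10000   # rows in the embedding matrix
embedding_dim = 8         # floats per word vector
maxlen = 20               # words kept per review

# One embedding_dim-sized vector per vocabulary entry.
embedding_params = vocabulary_size * embedding_dim
# Flatten turns (maxlen, embedding_dim) into a single vector.
flattened_size = maxlen * embedding_dim
# One weight per flattened input plus one bias for the single sigmoid unit.
dense_params = flattened_size * 1 + 1
total_params = embedding_params + dense_params
print(embedding_params, flattened_size, dense_params, total_params)
```

Almost all of the parameters sit in the embedding table; the Dense layer only learns one weight per position and embedding dimension, which is why it cannot model interactions between words.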

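Using pretrained word embeddings

Sometimes you have so little training data available that you can't use your data alone to learn an appropriate task-specific embedding of your vocabulary. What do you do then?

Instead of learning word embeddings jointly with the problem you want to solve, you can load embedding vectors from a precomputed embedding space that is highly structured and exhibits useful properties, one that captures generic aspects of language structure. The rationale behind using pretrained word embeddings in natural-language processing is much the same as for using pretrained convnets in image classification: you don't have enough data available to learn truly powerful features on your own, but you expect the features you need to be fairly generic, that is, common visual features or semantic features.

Such word embeddings are generally computed using word co-occurrence statistics, with a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s, but it only started to take off in research and industry applications after the release of one of the most famous word-embedding schemes: the Word2vec algorithm, developed by Tomas Mikolov at Google in 2013. Word2vec dimensions capture specific semantic properties.

There are various precomputed databases of word embeddings that you can download and use in a Keras Embedding layer. Word2vec is one of them. Another popular one is Global Vectors for Word Representation (GloVe), developed by Stanford researchers in 2014. This embedding technique is based on factorizing a matrix of word co-occurrence statistics. Its developers have made available precomputed embeddings for millions of English tokens.

Let's look at how you can get started using GloVe embeddings in a Keras model. The same method is valid for Word2vec embeddings or any other word-embedding database.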

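6.1.3 From raw text to word embeddings

You'll use a model similar to the one above, but this time with pretrained word embeddings. And instead of using the pretokenized IMDB data packaged in Keras, you'll start from scratch by downloading the original text data.

Downloading the IMDB data as raw text

First, head to http://mng.bz/0tIo and download the raw IMDB dataset. Uncompress it.

Next, collect the individual training reviews into a list of strings, one string per review. You'll also collect the review labels into a labels list.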

#Listing 6.8 Processing the labels of the raw IMDB data
import os
imdb_dir = '/Users/fchollet/Downloads/aclImdb'
train_dir = os.path.join(imdb_dir, 'train')
labels = []
texts = []
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname[-4:] == '.txt':
            f = open(os.path.join(dir_name, fname))
            texts.append(f.read())
            f.close()
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

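Tokenizing the data

Let's vectorize the text and prepare a training and validation split. Because pretrained word embeddings are meant to be particularly useful when little training data is available, add the following twist: restrict the training data to the first 200 samples. So you'll learn to classify movie reviews after looking at just 200 examples.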

#Listing 6.9 Tokenizing the text of the raw IMDB data
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

# Cuts off reviews after 100 words
maxlen = 100
# Trains on 200 samples
training_samples = 200
# Validates on 10,000 samples
validation_samples = 10000
# Considers only the top 10,000 words in the dataset
max_words = 10000

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# Splits the data into a training set and a validation set, but first shuffles
# the data, because you're starting with data in which samples are ordered
# (all negative first, then all positive)
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

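Downloading the GloVe word embeddings

Go to https://nlp.stanford.edu/projects/glove and download the precomputed embeddings from 2014 English Wikipedia. It's an 822 MB zip file called glove.6B.zip, containing 100-dimensional embedding vectors for 400,000 words.

Preprocessing the embeddings

Let's parse the unzipped file (a .txt file) to build an index that maps words to their vector representation.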

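#Listing 6.10 Parsing the GloVe word-embeddings file
glove_dir = '/Users/fchollet/Downloads/glove.6B'
embeddings_index = {}
# Each line holds a token followed by its vector coefficients, separated by
# spaces. The file is opened as UTF-8 so that parsing doesn't depend on the
# platform's default encoding.
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))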
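To make the file layout concrete, the sketch below runs the same parsing logic on a tiny in-memory stand-in for the file. The two lines of "vectors" are invented for the example (real GloVe vectors in this file have 100 coefficients), and toy_index is a name used only here.

```python
import io

# Made-up vectors in the same layout: a token, then space-separated floats.
fake_glove_file = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "cat 0.5 0.4 -0.1\n"
)

toy_index = {}
for line in fake_glove_file:
    values = line.split()
    word = values[0]
    # Everything after the first field is a coefficient of the word vector.
    toy_index[word] = [float(v) for v in values[1:]]

print(len(toy_index), toy_index["cat"])
```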

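Next, build an embedding matrix that you can load into an Embedding layer. It must be a matrix of shape (max_words, embedding_dim), where each entry i contains the embedding_dim-dimensional vector for the word of index i in the reference word index (built during tokenization). Note that index 0 isn't supposed to stand for any word or token; it's a placeholder.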

#Listing 6.11 Preparing the GloVe word-embeddings matrix
embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        # Words not found in the embedding index will be all zeros.
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
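Since words missing from the pretrained index end up as all-zero rows, one sanity check you might run at this point is how much of the kept vocabulary actually has a pretrained vector. The helper embedding_coverage below is hypothetical and is shown on toy dictionaries; with the real data you would pass word_index, embeddings_index, and max_words instead.

```python
def embedding_coverage(word_index, embeddings_index, max_words):
    """Fraction of the kept vocabulary that has a pretrained vector."""
    # Only words whose index is below max_words get a row in the matrix.
    kept = [w for w, i in word_index.items() if i < max_words]
    if not kept:
        return 0.0
    found = sum(1 for w in kept if w in embeddings_index)
    return found / len(kept)

toy_word_index = {"the": 1, "cat": 2, "zzyzx": 3, "mat": 4}
toy_embeddings = {"the": [0.1], "cat": [0.2], "mat": [0.3]}
print(embedding_coverage(toy_word_index, toy_embeddings, max_words=4))
```

A low coverage would mean that many rows of the embedding matrix stay at zero, so the frozen Embedding layer would carry little information for those words.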

Defining a model

You'll use the same model architecture as before.

#Listing 6.12 Model definition
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

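Loading the GloVe embeddings in the model

The Embedding layer has a single weight matrix: a 2D float matrix where each entry i is the word vector associated with index i. Load the GloVe matrix you prepared into the Embedding layer, the first layer in the model.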

#Listing 6.13 Loading pretrained word embeddings into the Embedding layer 
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

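Additionally, you freeze the Embedding layer by setting its trainable attribute to False. When parts of a model are pretrained (like the Embedding layer) and parts are randomly initialized (like the classifier), the pretrained parts shouldn't be updated during training, to avoid forgetting what they already know. The large gradient updates triggered by the randomly initialized layers would be disruptive to the already-learned features.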

Training and evaluating the model

Compile and train the model.

#Listing 6.14 Training and evaluation
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')

Now, plot the model's performance over time (see figures 6.5 and 6.6).

#Listing 6.15 Plotting the results
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

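Figure 6.5 Training and validation loss when using pretrained word embeddings

Figure 6.6 Training and validation accuracy when using pretrained word embeddings

The model starts overfitting very early, which is common when the training set is this small. Validation accuracy has high variance for the same reason, though it does get into the 50% range.

Note that your results may differ: because you have so few training samples, performance depends heavily on exactly which 200 samples are chosen, and they are chosen at random.

You can also train the same model without loading the pretrained word embeddings and without freezing the Embedding layer. The training set is the same 200 samples as before (see figures 6.7 and 6.8).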

#Listing 6.16 Training the same model without pretrained word embeddings
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))

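Figure 6.7 Training and validation loss without using pretrained word embeddings

Figure 6.8 Training and validation accuracy without using pretrained word embeddings

This time, validation accuracy stalls at around 50%. So with this few training samples, pretrained word embeddings outperform jointly learned embeddings.

Finally, let's evaluate the model on the test data. First, you need to tokenize the test data.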

#Listing 6.17 Tokenizing the data of the test set
test_dir = os.path.join(imdb_dir, 'test')
labels = []
texts = []
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname[-4:] == '.txt':
            f = open(os.path.join(dir_name, fname))
            texts.append(f.read())
            f.close()
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)
sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)

Next, load and evaluate the first model.

#Listing 6.18 Evaluating the model on the test set
model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)

You get a test accuracy of 56%.

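6.1.4 Wrapping up

Now you're able to do the following:

Tokenize raw text
Use the Embedding layer in a Keras model to learn task-specific word embeddings
Use pretrained word embeddings to get a boost on natural-language-processing problems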

To be continued...

Enjoy!

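I started translating this series because the book explains deep learning in a clear, approachable way. It offers not only worked examples but also the deep understanding of deep-learning concepts and principles that the author has built up over years of practice. One last, minor point: François Chollet is the author of Keras.
Disclaimer: this material is for personal study, discussion, and research only; any other use is prohibited. If you like it, please buy the original English edition.

侠天 focuses on big data, machine learning, and mathematics, and shares related technical articles through the personal WeChat public account bigdata_ny.

If you find anything wrong with this article, please contact me.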