
Machine Translation with an Attention-Based Encoder-Decoder Framework

Recurrent Neural Networks


Why not a multilayer perceptron (fully connected feed-forward network)?

  • An MLP cannot handle variable-length sequences
    An MLP only works when the input and output have fixed lengths. Every example could be padded to the maximum length so that an MLP can be trained on it, but that still does not make for a good model.
  • An MLP cannot share parameters
    If the model kept a separate set of parameters for every time step, it could neither generalize to sequence lengths never seen during training nor share statistical strength across different sequence lengths and positions in time. For example, if the word "Harry" appears at two different positions in a sentence, an MLP has to learn about each occurrence separately.

Why recurrent neural networks?

  • Parameters are shared across time steps
  • No need to learn a separate rule for every position (see the sketch after this list)
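
As a quick illustration of the parameter-sharing point, here is a minimal sketch (a toy of my own, not part of the translation model below): a single keras.layers.GRUCell is applied at every time step, so the same weights are reused no matter how long the sequence is or where a word appears.

import tensorflow as tf
from tensorflow import keras

# one cell = one set of weights, shared by every time step
cell = keras.layers.GRUCell(units=8)

batch_size, seq_len, feature_dim = 2, 5, 4
xs = tf.random.normal((batch_size, seq_len, feature_dim))
state = [tf.zeros((batch_size, 8))]      # initial hidden state

for t in range(seq_len):                 # the same cell (same parameters) at every step
    output, state = cell(xs[:, t, :], state)

print(output.shape)                      # (2, 8)
print(len(cell.trainable_weights))       # weight count does not grow with seq_len

The number of trainable weights stays the same whether the loop runs for 5 steps or 500, which is exactly what an MLP over a flattened, padded input cannot offer.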

The Seq2Seq Model

  A Seq2Seq model maps a variable-length input sequence to a variable-length output sequence. This comes up in many applications, such as speech recognition, machine translation, and question answering, where the input and output sequences in the training set usually have different lengths (although their lengths may be correlated).

The Encoder-Decoder Architecture

Basic principle and components

  The simplest RNN architecture for mapping a variable-length sequence to another variable-length sequence is the Encoder-Decoder architecture. The basic idea: at each time step the encoder receives the feature vector of one word (one-hot or embedding) and produces a context vector C (usually the hidden state at the final time step, or a simple function of it) that summarizes the information in the input sentence. At each time step the decoder makes its prediction from the context vector C, the previous time step's output, and the previous time step's hidden state. The figure below shows a schematic of the Encoder-Decoder architecture; the context vector is usually a fixed-length vector.

(figure: Encoder-Decoder architecture)

The Encoder

  The encoder itself is an RNN; its recurrent unit can be a GRU or an LSTM, and it can be unidirectional or bidirectional.
For an input sequence $\{x_1, x_2, \dots, x_{T_x}\}$ of length $T_x$, at time step $t$ the encoder takes the feature vector of word $x_t$ together with the previous hidden state $h_{t-1}$ and produces the new hidden state $h_t$. The encoder's job is to map a variable-length sequence to a fixed-size context vector $C$ that summarizes the meaning of the input sequence.
Where:
hidden state at time step $t$:    $h_t = f(x_t, h_{t-1})$
context vector $C$:    $C = q(h_1, h_2, h_3, \dots, h_{T_x})$
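
A minimal sketch of this recurrence with a GRU standing in for $f$ (shapes are arbitrary illustration values of my own; the real Encoder class appears in the code further down). The simplest choice of $q$ takes the final hidden state as the context vector $C$:

import tensorflow as tf
from tensorflow import keras

T_x, batch_size, embed_dim, hidden_dim = 6, 2, 16, 32
x = tf.random.normal((batch_size, T_x, embed_dim))      # embedded input words x_1 .. x_{T_x}

gru = keras.layers.GRU(hidden_dim, return_sequences=True, return_state=True)
all_h, last_h = gru(x)       # all_h: h_1 .. h_{T_x}, last_h: h_{T_x}

C = last_h                   # simplest q: C = h_{T_x}
print(all_h.shape, C.shape)  # (2, 6, 32) (2, 32)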

The Decoder

  The decoder is also an RNN. It works as follows: at output time step $t'$, the decoder takes the previous hidden state $h_{t'-1}$ and the context vector $C$ as input and outputs the conditional probability of $y_{t'}$, i.e. $P(y_{t'} \mid y_1, y_2, \dots, y_{t'-1}, C)$.
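
A minimal sketch of a single decoder step under these assumptions (toy shapes of my own; here $C$ is simply concatenated with the embedded previous output $y_{t'-1}$, which mirrors what the attention-based Decoder below does with its weighted context vector):

import tensorflow as tf
from tensorflow import keras

batch_size, hidden_dim, embed_dim, vocab_size = 2, 32, 16, 100

C = tf.random.normal((batch_size, hidden_dim))        # context vector from the encoder
y_prev = tf.random.normal((batch_size, embed_dim))    # embedding of y_{t'-1}
h_prev = [tf.zeros((batch_size, hidden_dim))]         # h_{t'-1}

cell = keras.layers.GRUCell(hidden_dim)
fc = keras.layers.Dense(vocab_size)

step_input = tf.concat([C, y_prev], axis=-1)          # feed C and y_{t'-1} together
out, h = cell(step_input, h_prev)                     # new hidden state h_{t'}
probs = tf.nn.softmax(fc(out), axis=-1)               # P(y_{t'} | y_1..y_{t'-1}, C)
print(probs.shape)                                    # (2, 100)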

Training the Model

  By maximum likelihood estimation, training is an optimization problem whose objective is
  $P(y_1, y_2, \dots, y_{T'} \mid x_1, x_2, \dots, x_{T_x})$
  $= \prod_{t'=1}^{T'} P(y_{t'} \mid y_1, y_2, \dots, y_{t'-1}, x_1, x_2, \dots, x_{T_x})$
  $= \prod_{t'=1}^{T'} P(y_{t'} \mid y_1, y_2, \dots, y_{t'-1}, C)$
  The loss for the output sequence is therefore the negative log-likelihood:
  $-\log P(y_1, y_2, \dots, y_{T'} \mid x_1, x_2, \dots, x_{T_x})$
  $= -\sum_{t'=1}^{T'} \log P(y_{t'} \mid y_1, y_2, \dots, y_{t'-1}, C)$
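
Concretely, with teacher forcing the ground-truth word $y_{t'-1}$ is fed in at each step and the cross-entropy of the predicted distribution against $y_{t'}$ is accumulated, which is exactly the sum above. A toy sketch of that accumulation (hypothetical shapes of my own; the real version is the train_step function later on):

import tensorflow as tf
from tensorflow import keras

batch_size, T_out, vocab_size = 2, 4, 100
# logits[:, t, :] stands in for the decoder output at step t+1 given y_1 .. y_t
logits = tf.random.normal((batch_size, T_out, vocab_size))
targets = tf.random.uniform((batch_size, T_out), maxval=vocab_size, dtype=tf.int32)

sce = keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

nll = 0.0
for t in range(T_out):                           # sum_{t'} -log P(y_{t'} | ..., C)
    nll += sce(targets[:, t], logits[:, t, :])   # per-example -log P at this step
print(nll.shape)                                 # (2,): one NLL per sentence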

import tensorflow as tf
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import sklearn
import pandas as pd
import os
import sys
import time
from tensorflow import keras
# 1. preprocessing data
# 2. build model
# 2.1 encoder
# 2.2 attention
# 2.3 decoder
# 2.4 loss & optimizer
# 2.5 train
# 3. evaluation
# 3.1 given sentence, return translated results
# 3.2 visualize results(attention)
en_spa_file_path = './data_spa_en/spa.txt'

import unicodedata


def unicode_to_ascii(s):
    '''Spanish uses special characters encoded in Unicode; convert them to
    ASCII so the character vocabulary stays small (128 or 256).'''
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


en_sentence = 'Then what?'
sp_sentence = '¿Entonces qué?'
print(unicode_to_ascii(en_sentence))
print(unicode_to_ascii(sp_sentence))
Then what?
¿Entonces que?
import re


def preprocess_sentence(s):
    s = unicode_to_ascii(s.lower().strip())
    # put a space before and after each punctuation mark
    s = re.sub(r"([?.!,¿])", r" \1 ", s)
    # collapse runs of spaces into a single space
    s = re.sub(r'[" "]+', " ", s)
    # replace everything except letters and the punctuation above with a space
    s = re.sub(r'[^a-zA-Z?.!,¿]', " ", s)
    # strip leading and trailing spaces
    s = s.rstrip().strip()
    s = '<start> ' + s + ' <end>'
    return s


print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_sentence))
<start> then what ? <end>
<start> ¿ entonces que ? <end>
def parse_data(filename):
    lines = open(filename, encoding='UTF-8').read().strip().split('\n')
    sentence_pairs = [line.split('\t') for line in lines]
    preprocessed_sentence_pairs = [(preprocess_sentence(en),
                                    preprocess_sentence(sp))
                                   for [en, sp] in sentence_pairs]
    return zip(*preprocessed_sentence_pairs)


en_dataset, sp_dataset = parse_data(en_spa_file_path)
print(en_dataset[-1])
print(sp_dataset[-1])
<start> if you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>
<start> si quieres sonar como un hablante nativo , debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un musico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado . <end>
def tokenizer(lang):
    lang_tokenizer = keras.preprocessing.text.Tokenizer(num_words=None,
                                                        filters='',
                                                        split=' ')
    lang_tokenizer.fit_on_texts(lang)
    tensor = lang_tokenizer.texts_to_sequences(lang)
    tensor = keras.preprocessing.sequence.pad_sequences(tensor, padding='post')
    return tensor, lang_tokenizer


input_tensor, input_tokenizer = tokenizer(sp_dataset[0:30000])
print(input_tensor.shape)
output_tensor, output_tokenizer = tokenizer(en_dataset[0:30000])


def max_length(tensor):
    return max(len(t) for t in tensor)


max_length_input = max_length(input_tensor)
max_length_output = max_length(output_tensor)
print(max_length_input, max_length_output)
(30000, 16)
16 11
from sklearn.model_selection import train_test_split
input_train, input_eval, output_train, output_eval = train_test_split(
    input_tensor, output_tensor, test_size=0.2)
len(input_train), len(input_eval), len(output_train), len(output_eval)
(24000, 6000, 24000, 6000)
def convert(example, tokenizer):
    for t in example:
        if t != 0:
            print('%d --> %s' % (t, tokenizer.index_word[t]))


convert(input_train[0], input_tokenizer)
print()
convert(output_train[0], output_tokenizer)
1 --> <start>
26 --> yo
160 --> tenia
239 --> anos
20 --> en
3 --> .
2 --> <end>

1 --> <start>
4 --> i
26 --> was
33 --> in
3 --> .
2 --> <end>
def make_dataset(input_tensor, output_tensor, batch_size, epochs, shuffle):
    dataset = tf.data.Dataset.from_tensor_slices((input_tensor, output_tensor))
    if shuffle:
        dataset = dataset.shuffle(30000)
    dataset = dataset.repeat(epochs).batch(batch_size, drop_remainder=True)
    return dataset


batch_size = 64
epochs = 20
train_dataset = make_dataset(input_train, output_train, batch_size, epochs,
                             True)
eval_dataset = make_dataset(input_eval, output_eval, batch_size, 1, False)
for x, y in train_dataset.take(1):
    print(x.shape)
    print(y.shape)
    print(x)
    print(y)
(64, 16)
(64, 11)
tf.Tensor(
[[   1 7824   13 ...    0    0    0]
 [   1    6   11 ...    0    0    0]
 [   1    6   14 ...    0    0    0]
 ...
 [   1  137  497 ...    0    0    0]
 [   1   12  597 ...    0    0    0]
 [   1   16    7 ...    0    0    0]], shape=(64, 16), dtype=int32)
tf.Tensor(
[[   1  116  126   13  465    3    2    0    0    0    0]
 [   1   32 2077   20    7    2    0    0    0    0    0]
 [   1   32 1779    8   10    7    2    0    0    0    0]
 [   1    8    5 3258    7    2    0    0    0    0    0]
 [   1  199  140  657   44    3    2    0    0    0    0]
 [   1    4   18   85  473    3    2    0    0    0    0]
 [   1   28  233   33 1853    3    2    0    0    0    0]
 [   1    4   25 2415   68    3    2    0    0    0    0]
 [   1   14   42    9   69  134    3    2    0    0    0]
 [   1   16  262    6    3    2    0    0    0    0    0]
 [   1   14   11    9  443  159    3    2    0    0    0]
 [   1   21  165  919    8    3    2    0    0    0    0]
 [   1    4 1250 1111    3    2    0    0    0    0    0]
 [   1   56  185   13  201    3    2    0    0    0    0]
 [   1  992    8    9 4415    3    2    0    0    0    0]
 [   1    5 1360  596  265    3    2    0    0    0    0]
 [   1    6   23   35   17    3    2    0    0    0    0]
 [   1    4  135 1773    3    2    0    0    0    0    0]
 [   1    5  825    9  578    3    2    0    0    0    0]
 [   1    4   62  884  376    3    2    0    0    0    0]
 [   1   30   12  456   31  837    3    2    0    0    0]
 [   1   46   11  279   15  544    3    2    0    0    0]
 [   1   71    8   31  168    7    2    0    0    0    0]
 [   1    4   18  537    4  169   73    3    2    0    0]
 [   1   25    6 2944   20    7    2    0    0    0    0]
 [   1    4   29 2329   59   97    3    2    0    0    0]
 [   1    4   29   66  493    3    2    0    0    0    0]
 [   1   10   11   69   15   40   89    3    2    0    0]
 [   1   14   87   12   72  486    3    2    0    0    0]
 [   1   32   11   13 1108    7    2    0    0    0    0]
 [   1    6   92  348    3    2    0    0    0    0    0]
 [   1  441   16   25  149    3    2    0    0    0    0]
 [   1   22    6   35 1856    7    2    0    0    0    0]
 [   1    4   43  126   67    3    2    0    0    0    0]
 [   1    5   26  624   50    3    2    0    0    0    0]
 [   1    5  905   54   73    3    2    0    0    0    0]
 [   1   22    6  103   63  383    7    2    0    0    0]
 [   1   42    6  202  242    7    2    0    0    0    0]
 [   1   27   11    9  421 2264    3    2    0    0    0]
 [   1  271    6   35    9  104    7    2    0    0    0]
 [   1    5    8 1046    3    2    0    0    0    0    0]
 [   1   24   31  344  438    7    2    0    0    0    0]
 [   1   32  271    5   47    7    2    0    0    0    0]
 [   1   60 1206    6    7    2    0    0    0    0    0]
 [   1    4  472  417    9 1227    3    2    0    0    0]
 [   1    6   25   12  302   17    3    2    0    0    0]
 [   1    4   25   12   64  197   10    3    2    0    0]
 [   1    4   65  105   21 2271    3    2    0    0    0]
 [   1   16   65  160 4279    3    2    0    0    0    0]
 [   1   32  478  717    7    2    0    0    0    0    0]
 [   1  755  496    3    2    0    0    0    0    0    0]
 [   1    6   24   85  273    3    2    0    0    0    0]
 [   1    5  872  319    3    2    0    0    0    0    0]
 [   1    9  728    8    9  728    3    2    0    0    0]
 [   1    5  411 1522    3    2    0    0    0    0    0]
 [   1   28   42   10    3    2    0    0    0    0    0]
 [   1   25    4  125   10   44    7    2    0    0    0]
 [   1   28   23  649    3    2    0    0    0    0    0]
 [   1   13  850   92  293    3    2    0    0    0    0]
 [   1    4   30   12  510   43    3    2    0    0    0]
 [   1    4   18   34 2363    3    2    0    0    0    0]
 [   1  379 1020    3    2    0    0    0    0    0    0]
 [   1   10   11   21  205    3    2    0    0    0    0]
 [   1   19    8   31  385    3    2    0    0    0    0]], shape=(64, 11), dtype=int32)
# model definition hyperparameters
embedding_units = 256  # word-vector dimension: each embedding has shape (embedding_units,)
units = 1024  # number of hidden units: a hidden state at one time step has shape (units,)
input_vocab_size = len(input_tokenizer.word_index) + 1  # vocabulary size, used for the embedding matrix
output_vocab_size = len(output_tokenizer.word_index) + 1  # same as above
class Encoder(keras.Model):
    def __init__(self, vocab_size, embedding_units, encoding_units,
                 batch_size):
        super(Encoder, self).__init__()  # call the parent constructor
        self.batch_size = batch_size  # store as instance attributes (same below)
        self.encoding_units = encoding_units
        self.embedding = keras.layers.Embedding(vocab_size, embedding_units)
        self.gru = keras.layers.GRU(self.encoding_units,
                                    return_sequences=True,
                                    return_state=True,
                                    recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):  # hidden is the initial hidden state
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

    def initialize_hidden_state(self):
        '''Create an all-zero hidden state to pass to call().'''
        return tf.zeros((self.batch_size, self.encoding_units))


encoder = Encoder(input_vocab_size, embedding_units, units, batch_size)
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(x, sample_hidden)

print("sample_output_shape", sample_output.shape)
print("sample_hidden_shape", sample_hidden.shape)
sample_output_shape (64, 16, 1024)
sample_hidden_shape (64, 1024)
class BahdananAttention(keras.Model):
    def __init__(self, units):
        super(BahdananAttention, self).__init__()
        self.W1 = keras.layers.Dense(units)
        self.W2 = keras.layers.Dense(units)
        self.V = keras.layers.Dense(1)

    def call(self, decoder_hidden, encoder_outputs):
        """decoder_hidden: the decoder's hidden state at one time step;
        encoder_outputs: the encoder's output at every time step."""
        # decoder_hidden.shape: (batch_size, units)
        # encoder_outputs.shape: (batch_size, length, units)
        # Broadcasting applies when, aligned from the trailing axes,
        # the axis lengths are equal or one of them is 1.
        decoder_hidden_with_time_axis = tf.expand_dims(decoder_hidden, axis=1)

        # before: (batch_size, length, units)
        # after: (batch_size, length, 1)
        score = self.V(
            tf.nn.tanh(self.W1(encoder_outputs) +
                       self.W2(decoder_hidden_with_time_axis)))
        # shape: (batch_size, length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector.shape: (batch_size, length, units)
        context_vector = attention_weights * encoder_outputs
        # context_vector.shape: (batch_size, units)
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights
attention_model = BahdananAttention(units=10)
attention_results, attention_weights = attention_model(sample_hidden,
                                                       sample_output)
print("attention_results.shape:", attention_results.shape)
print("attention_weights.shape:", attention_weights.shape)
attention_results.shape: (64, 1024)
attention_weights.shape: (64, 16, 1)
class Decoder(keras.Model):
    def __init__(self, vocab_size, embedding_units, decoding_units,
                 batch_size):
        super(Decoder, self).__init__()
        self.batch_size = batch_size
        self.decoding_units = decoding_units
        self.embedding = keras.layers.Embedding(vocab_size, embedding_units)
        self.gru = keras.layers.GRU(self.decoding_units,
                                    return_sequences=True,
                                    return_state=True,
                                    recurrent_initializer='glorot_uniform')
        self.fc = keras.layers.Dense(vocab_size)
        self.attention = BahdananAttention(self.decoding_units)

    def call(self, x, hidden, encoding_outputs):
        '''
        x: the decoder input at the current step
        hidden: the hidden state from the previous step
        encoding_outputs: the encoder output at every time step, from which
        the attention layer builds the weighted context vector
        '''
        # context_vector.shape: (batch_size, units)
        context_vector, attention_weights = self.attention(
            hidden, encoding_outputs)
        # before embedding: x.shape: (batch_size, 1)
        # after embedding: x.shape: (batch_size, 1, embedding_units)
        x = self.embedding(x)
        combined_x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # output.shape: (batch_size, 1, decoding_units)
        # state.shape: (batch_size, decoding_units)
        output, state = self.gru(combined_x)

        # output.shape: (batch_size, decoding_units)
        output = tf.reshape(output, (-1, output.shape[2]))

        # output.shape: (batch_size, vocab_size)
        output = self.fc(output)
        return output, state, attention_weights


decoder = Decoder(output_vocab_size, embedding_units, units, batch_size)
outputs = decoder(tf.random.uniform((batch_size, 1)), sample_hidden,
                  sample_output)
decoder_output, decoder_hidden, decoder_aw = outputs
print(decoder_output.shape)
print(decoder_hidden.shape)
print(decoder_aw.shape)
(64, 4935)
(64, 1024)
(64, 16, 1)
optimizer = keras.optimizers.Adam()

loss_object = keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                         reduction='none')
# drop the loss contributed by padding positions before aggregating


# per-step loss function
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))  # 0 for padding, 1 otherwise
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)
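
A quick sanity check with toy values of my own (not from the notebook): a padded target id of 0 is masked out and contributes nothing, so with uniform logits over a 3-word vocabulary only the real token adds loss.

# real: one padded position (id 0) and one real token (id 2)
real = tf.constant([0, 2])
# all-zero logits give a uniform distribution over a 3-word vocabulary
pred = tf.zeros((2, 3))

print(loss_function(real, pred).numpy())  # only the second position contributes,
                                          # so the mean is about log(3) / 2 ≈ 0.55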
# accumulate the loss over all decoding steps and apply one gradient update
@tf.function
def train_step(inp, targ, encoding_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        encoding_outputs, encoding_hidden = encoder(inp, encoding_hidden)

        decoding_hidden = encoding_hidden
        # e.g. <start> I am here <end>
        # 1. <start> -> I
        # 2. I -> am
        # 3. am -> here
        # 4. here -> <end>

        for t in range(0, targ.shape[1] - 1):  # predict each word from the previous one (teacher forcing)
            # slicing yields a vector of shape (batch_size,); expand it to a (batch_size, 1) matrix
            decoding_input = tf.expand_dims(targ[:, t], 1)
            predictions, decoding_hidden, _ = decoder(decoding_input,
                                                      decoding_hidden,
                                                      encoding_outputs)
            loss += loss_function(targ[:, t + 1], predictions)
    batch_loss = loss / int(targ.shape[0])
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss
epochs = 10
steps_per_epoch = len(input_tensor) // batch_size

for epoch in range(epochs):
    start = time.time()

    encoding_hidden = encoder.initialize_hidden_state()
    total_loss = 0

    for (batch, (inp, targ)) in enumerate(train_dataset.take(steps_per_epoch)):
        # take steps_per_epoch batches from the training set
        batch_loss = train_step(inp, targ, encoding_hidden)
        total_loss += batch_loss

        if batch % 100 == 0:
            # print every 100 batches: Batch is 0-indexed, Epoch is 1-indexed
            print('Epoch {} Batch {} Loss {:.4f}'.format(
                epoch + 1, batch, batch_loss.numpy()))
    # print once per epoch: dividing by steps_per_epoch gives the average batch loss
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / steps_per_epoch))
    print('Time take 1 epoch {} sec\n'.format(time.time() - start))
Epoch 1 Batch 0 Loss 0.8035
Epoch 1 Batch 100 Loss 0.3535
Epoch 1 Batch 200 Loss 0.3359
Epoch 1 Batch 300 Loss 0.2851
Epoch 1 Batch 400 Loss 0.2671
Epoch 1 Loss 0.3293
Time take 1 epoch 2116.8650002479553 sec

Epoch 2 Batch 0 Loss 0.2466
Epoch 2 Batch 100 Loss 0.2326
Epoch 2 Batch 200 Loss 0.2236
Epoch 2 Batch 300 Loss 0.2260
Epoch 2 Batch 400 Loss 0.1522
Epoch 2 Loss 0.2067
Time take 1 epoch 2149.2449696063995 sec

Epoch 3 Batch 0 Loss 0.1331
Epoch 3 Batch 100 Loss 0.1548
Epoch 3 Batch 200 Loss 0.1314
Epoch 3 Batch 300 Loss 0.1256
Epoch 3 Batch 400 Loss 0.0868
Epoch 3 Loss 0.1272
Time take 1 epoch 2136.0347578525543 sec

Epoch 4 Batch 0 Loss 0.0693
Epoch 4 Batch 100 Loss 0.0904
Epoch 4 Batch 200 Loss 0.0876
Epoch 4 Batch 300 Loss 0.0791
Epoch 4 Batch 400 Loss 0.0469
Epoch 4 Loss 0.0776
Time take 1 epoch 2133.8691380023956 sec

Epoch 5 Batch 0 Loss 0.0511
Epoch 5 Batch 100 Loss 0.0523
Epoch 5 Batch 200 Loss 0.0537
Epoch 5 Batch 300 Loss 0.0500
Epoch 5 Batch 400 Loss 0.0347
Epoch 5 Loss 0.0483
Time take 1 epoch 2115.8476724624634 sec

Epoch 6 Batch 0 Loss 0.0240
Epoch 6 Batch 100 Loss 0.0340
Epoch 6 Batch 200 Loss 0.0424
Epoch 6 Batch 300 Loss 0.0272
Epoch 6 Batch 400 Loss 0.0157
Epoch 6 Loss 0.0319
Time take 1 epoch 2182.366710424423 sec

Epoch 7 Batch 0 Loss 0.0208
Epoch 7 Batch 100 Loss 0.0224
Epoch 7 Batch 200 Loss 0.0275
Epoch 7 Batch 300 Loss 0.0247
Epoch 7 Batch 400 Loss 0.0153
Epoch 7 Loss 0.0224
Time take 1 epoch 2116.347582578659 sec

Epoch 8 Batch 0 Loss 0.0180
Epoch 8 Batch 100 Loss 0.0161
Epoch 8 Batch 200 Loss 0.0209
Epoch 8 Batch 300 Loss 0.0178
Epoch 8 Batch 400 Loss 0.0154
Epoch 8 Loss 0.0170
Time take 1 epoch 2139.7361178398132 sec

Epoch 9 Batch 0 Loss 0.0099
Epoch 9 Batch 100 Loss 0.0096
Epoch 9 Batch 200 Loss 0.0128
Epoch 9 Batch 300 Loss 0.0173
Epoch 9 Batch 400 Loss 0.0094
Epoch 9 Loss 0.0136
Time take 1 epoch 2131.8980412483215 sec

Epoch 10 Batch 0 Loss 0.0096
Epoch 10 Batch 100 Loss 0.0076
Epoch 10 Batch 200 Loss 0.0135
Epoch 10 Batch 300 Loss 0.0100
Epoch 10 Batch 400 Loss 0.0089
Epoch 10 Loss 0.0123
Time take 1 epoch 2125.301104068756 sec
def evaluate(input_sentence):
    attention_matrix = np.zeros((max_length_output, max_length_input))
    input_sentence = preprocess_sentence(input_sentence)
    inputs = [
        input_tokenizer.word_index[token]
        for token in input_sentence.split(' ')
    ]
    inputs = keras.preprocessing.sequence.pad_sequences(
        [inputs], maxlen=max_length_input, padding='post')
    inputs = tf.convert_to_tensor(inputs)

    results = ' '
    # encoding_hidden = encoder.initialize_hidden_state()
    encoding_hidden = tf.zeros((1, units))

    encoding_outputs, encoding_hidden = encoder(inputs, encoding_hidden)
    decoding_hidden = encoding_hidden

    # decoding_input.shape: (1, 1)
    decoding_input = tf.expand_dims([output_tokenizer.word_index['<start>']],
                                    0)
    for t in range(max_length_output):
        predictions, decoding_hidden, attention_weights = decoder(
            decoding_input, decoding_hidden, encoding_outputs)
        # attention_weights.shape: (batch_size, input_length, 1), here (1, 16, 1)
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_matrix[t] = attention_weights.numpy()

        # predictions.shape: (batch_size, vocab_size), here (1, 4935); indexing
        # element 0 gives a vector of shape (4935,), and tf.argmax returns the
        # index of the largest logit, which is the predicted word index
        predicted_id = tf.argmax(predictions[0]).numpy()
        results += output_tokenizer.index_word[predicted_id] + " "

        if output_tokenizer.index_word[predicted_id] == '<end>':
            return results, input_sentence, attention_matrix
        decoding_input = tf.expand_dims([predicted_id], 0)
    return results, input_sentence, attention_matrix


def plot_attention(attention_matrix, input_sentence, predicted_sentence):
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention_matrix, cmap='viridis')
    font_dict = {'fontsize': 14}
    ax.set_xticklabels([' '] + input_sentence, fontdict=font_dict, rotation=90)
    ax.set_yticklabels([' '] + predicted_sentence, fontdict=font_dict)
    plt.show()


def translate(input_sentence):
    results, input_sentence, attention_matrix = evaluate(input_sentence)
    print("Input: %s" % (input_sentence))
    print("Predicted translation: %s" % (results))
    # not every entry of attention_matrix is filled in; slice it so the plot
    # only covers the actual output and input words
    attention_matrix = attention_matrix[:len(results.split(' ')), :len(
        input_sentence.split(' '))]
    plot_attention(attention_matrix, input_sentence.split(' '),
                   results.split(' '))
translate(u'Hace mucho frío aquí.')
Input: <start> hace mucho frio aqui . <end>
Predicted translation:  it s very cold here . <end> 

(figure: attention heatmap for the translation above)

translate(u'Esta es mi vida.')
Input: <start> esta es mi vida . <end>
Predicted translation:  this is my life . <end> 

(figure: attention heatmap for the translation above)

translate(u'¿Todavía estás en casa?')
Input: <start> ¿ todavia estas en casa ? <end>
Predicted translation:  are you still at home ? <end> 

(figure: attention heatmap for the translation above)
