This project is part of the Deep Learning Foundation Nanodegree taught by Udacity. The source code for running this project is available in my repository on GitHub.

In this project, I’m going to take a peek into the realm of neural machine translation. I’ll train a sequence-to-sequence model on a dataset of English and French sentences so that it can translate new sentences from English to French.

Step 1: Get the Data

Since learning to translate all of English into French would take a long time to train, I use only a small portion of the corpus.

import numpy as np
import tensorflow as tf

import helper
import problem_unittests as tests

source_path = 'data/small_vocab_en'
target_path = 'data/small_vocab_fr'
source_text = helper.load_data(source_path)
target_text = helper.load_data(target_path)

Step 2: Explore the Data

With view_sentence_range I can view different parts of the data.

view_sentence_range = (0, 10)

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in source_text.split()})))

sentences = source_text.split('\n')
word_counts = [len(sentence.split()) for sentence in sentences]
print('Number of sentences: {}'.format(len(sentences)))
print('Average number of words in a sentence: {}'.format(np.average(word_counts)))

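print('English sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(source_text.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))
print('French sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(target_text.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))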

Expected outcome:
View different parts of the data

Step 3: Implement Preprocessing Function

In the function text_to_ids(), I’ll turn source_text and target_text from words to ids. However, I need to add the <EOS> word id at the end of each sentence in target_text. This will help the neural network predict when the sentence should end.

I can get the <EOS> word id by doing:

target_vocab_to_int['<EOS>']

I can get other word ids using source_vocab_to_int and target_vocab_to_int.

def text_to_ids(source_text, target_text, source_vocab_to_int, target_vocab_to_int):
    source_ids = [[source_vocab_to_int[word] for word in line.split()] for line in source_text.split('\n')]
    target_ids = [[target_vocab_to_int[word] for word in line.split()] for line in target_text.split('\n')]
    eos = target_vocab_to_int['<EOS>']
    target_ids = [line + [eos] for line in target_ids]
    return source_ids, target_ids
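
As a quick sanity check, the sketch below runs the same conversion on a two-sentence toy corpus. The vocabularies are made up for illustration; the real ones are built by the project's helper module.

```python
def text_to_ids(source_text, target_text, source_vocab_to_int, target_vocab_to_int):
    """Map each sentence to a list of word ids and append <EOS> to every target sentence."""
    source_ids = [[source_vocab_to_int[word] for word in line.split()]
                  for line in source_text.split('\n')]
    eos = target_vocab_to_int['<EOS>']
    target_ids = [[target_vocab_to_int[word] for word in line.split()] + [eos]
                  for line in target_text.split('\n')]
    return source_ids, target_ids


# Hypothetical toy vocabularies, not the ones produced by helper.py.
toy_source_vocab = {'new': 0, 'jersey': 1, 'is': 2, 'cold': 3}
toy_target_vocab = {'<EOS>': 0, 'new': 1, 'jersey': 2, 'est': 3, 'froid': 4}

toy_source_ids, toy_target_ids = text_to_ids(
    'new jersey is cold\nnew jersey',
    'new jersey est froid\nnew jersey',
    toy_source_vocab, toy_target_vocab)

print(toy_source_ids)  # [[0, 1, 2, 3], [0, 1]]
print(toy_target_ids)  # [[1, 2, 3, 4, 0], [1, 2, 0]]
```

Only the target sentences get the extra id, since only the decoder has to learn where a sentence stops.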


Step 4: Build the Neural Network

I’ll build the components of a Sequence-to-Sequence model by implementing the following functions:

  • model_inputs.
  • process_decoder_input.
  • encoding_layer.
  • decoding_layer_train.
  • decoding_layer_infer.
  • decoding_layer.
  • seq2seq_model.


Implement the model_inputs() function to create TF Placeholders for the Neural Network.

  • Input text placeholder named “input” using the TF Placeholder name parameter with rank 2
  • Targets placeholder with rank 2
  • Learning rate placeholder with rank 0
  • Keep probability placeholder named “keep_prob” using the TF Placeholder name parameter with rank 0
  • Target sequence length placeholder named “target_sequence_length” with rank 1
  • Max target sequence length tensor named “max_target_len” getting its value from applying tf.reduce_max on the target_sequence_length placeholder. Rank 0
  • Source sequence length placeholder named “source_sequence_length” with rank 1

Return the placeholders in the following tuple (input, targets, learning rate, keep probability, target sequence length, max target sequence length, source sequence length):

def model_inputs():
    input = tf.placeholder(tf.int32, [None, None], name="input")
    targets = tf.placeholder(tf.int32, [None, None], name="targets")

    learning_rate = tf.placeholder(tf.float32, name="learning_rate")
    keep_prob = tf.placeholder(tf.float32, name="keep_prob")

    target_sequence_length = tf.placeholder(tf.int32, [None], name="target_sequence_length")
    max_target_sequence_length = tf.reduce_max(target_sequence_length, name="max_target_len")
    source_sequence_length = tf.placeholder(tf.int32, [None], name="source_sequence_length")

    return (input, targets, learning_rate, keep_prob, target_sequence_length, max_target_sequence_length, source_sequence_length)


Process Decoder Input

Implement process_decoder_input by removing the last word id from each sequence in target_data and concatenating the <GO> id to the beginning of each sequence.

def process_decoder_input(target_data, target_vocab_to_int, batch_size):
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    dec_input = tf.concat([tf.fill([batch_size, 1], target_vocab_to_int['<GO>']), ending], 1)
    return dec_input
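
To make the shift easier to picture, here is a minimal plain-Python sketch of the same idea on nested lists instead of tensors. The function name and the id values are hypothetical; only the drop-last, prepend-<GO> logic mirrors the function above.

```python
def process_decoder_input_lists(target_batch, go_id):
    """Drop the last id of every sequence and prepend the <GO> id."""
    return [[go_id] + sequence[:-1] for sequence in target_batch]


# Hypothetical ids chosen for the example.
PAD_ID, GO_ID, EOS_ID = 0, 1, 2

padded_targets = [[10, 11, 12, EOS_ID],
                  [20, 21, EOS_ID, PAD_ID]]

decoder_input = process_decoder_input_lists(padded_targets, GO_ID)
print(decoder_input)  # [[1, 10, 11, 12], [1, 20, 21, 2]]
```

Each row keeps its length, and the decoder input at step t is the target word from step t-1, which is what the training decoder is fed.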



Implement encoding_layer() to create an Encoder RNN layer:

def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob,
                   source_sequence_length, source_vocab_size,
                   encoding_embedding_size):
    # Embed the source word ids, then run them through a stacked LSTM
    enc_inputs = tf.contrib.layers.embed_sequence(rnn_inputs, source_vocab_size, encoding_embedding_size)
    cell = tf.contrib.rnn.MultiRNNCell([ tf.contrib.rnn.LSTMCell(rnn_size) for _ in range(num_layers) ])
    enc_output, enc_state = tf.nn.dynamic_rnn(cell, enc_inputs, sequence_length=source_sequence_length, dtype=tf.float32)
    return enc_output, enc_state

tests.test_encoding_layer(encoding_layer)

Decoding – Training

Create a training decoding layer:

def decoding_layer_train(encoder_state, dec_cell, dec_embed_input, 
                         target_sequence_length, max_summary_length, 
                         output_layer, keep_prob):
    helper = tf.contrib.seq2seq.TrainingHelper(dec_embed_input, target_sequence_length)
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, helper, encoder_state, output_layer=output_layer)
    dec_outputs, dec_state = tf.contrib.seq2seq.dynamic_decode(decoder, impute_finished=True, maximum_iterations=max_summary_length)
    return dec_outputs


Decoding – Inference

Create inference decoder:

def decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id,
                         end_of_sequence_id, max_target_sequence_length,
                         vocab_size, output_layer, batch_size, keep_prob):
    start_tokens = tf.tile(tf.constant([start_of_sequence_id], dtype=tf.int32), [batch_size], name='start_tokens')
    helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(dec_embeddings, 
        start_tokens, end_of_sequence_id)
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, helper, encoder_state, output_layer=output_layer)
    dec_outputs, dec_state = tf.contrib.seq2seq.dynamic_decode(decoder,impute_finished=True, maximum_iterations=max_target_sequence_length)
    return dec_outputs
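
The difference from the training decoder is that, at inference time, there is no target sentence to feed in, so each predicted word becomes the next input. The sketch below shows that greedy loop in plain Python. It is a simplification under my own assumptions: step_fn and toy_step are hypothetical stand-ins for the decoder cell plus output layer, and unlike the TensorFlow decoder this version leaves the <EOS> token out of the result.

```python
def greedy_decode(step_fn, initial_state, go_id, eos_id, max_iterations):
    """Feed back the highest-scoring token until <EOS> or the step limit is hit."""
    token, state, output = go_id, initial_state, []
    for _ in range(max_iterations):
        scores, state = step_fn(token, state)
        # argmax over the vocabulary scores
        token = max(range(len(scores)), key=scores.__getitem__)
        if token == eos_id:
            break
        output.append(token)
    return output


def toy_step(token, state):
    """Hypothetical decoder step over a 5-word vocabulary: always prefers token + 1."""
    scores = [0.0] * 5
    scores[(token + 1) % 5] = 1.0
    return scores, state


decoded = greedy_decode(toy_step, initial_state=None, go_id=0, eos_id=4, max_iterations=10)
print(decoded)  # [1, 2, 3]
```

The max_iterations argument plays the same role as maximum_iterations above: it stops a decoder that never emits <EOS>.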


Build the Decoding Layer

I implement decoding_layer() to create a Decoder RNN layer.

  • Embed the target sequences
  • Construct the decoder LSTM cell (just like the encoder cell above)
  • Create an output layer to map the outputs of the decoder to the elements of the vocabulary
  • Use the decoding_layer_train(encoder_state, dec_cell, dec_embed_input, target_sequence_length, max_target_sequence_length, output_layer, keep_prob) function to get the training logits
  • Use the decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id, end_of_sequence_id, max_target_sequence_length, vocab_size, output_layer, batch_size, keep_prob) function to get the inference logits

from tensorflow.python.layers import core as layers_core

def decoding_layer(dec_input, encoder_state,
                   target_sequence_length, max_target_sequence_length,
                   rnn_size, num_layers, target_vocab_to_int, target_vocab_size,
                   batch_size, keep_prob, decoding_embedding_size):
    dec_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)

    dec_cell = tf.contrib.rnn.MultiRNNCell([ tf.contrib.rnn.LSTMCell(rnn_size) for _ in range(num_layers) ])
    output_layer = layers_core.Dense(target_vocab_size, kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))
    with tf.variable_scope("decoding") as decoding_scope:
        dec_outputs_train = decoding_layer_train(encoder_state, dec_cell, dec_embed_input, target_sequence_length, max_target_sequence_length, output_layer, keep_prob)

    start_of_sequence_id = target_vocab_to_int["<GO>"]
    end_of_sequence_id = target_vocab_to_int["<EOS>"]
    with tf.variable_scope("decoding", reuse=True) as decoding_scope:
        dec_outputs_infer = decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id, end_of_sequence_id, max_target_sequence_length, target_vocab_size, output_layer, batch_size, keep_prob)

    return dec_outputs_train, dec_outputs_infer


Step 5: Build the Neural Network

I apply the functions implemented above to:

  • Apply embedding to the input data for the encoder
  • Encode the input using the encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob, source_sequence_length, source_vocab_size, encoding_embedding_size) function
  • Process target data using the process_decoder_input(target_data, target_vocab_to_int, batch_size) function
  • Apply embedding to the target data for the decoder
  • Decode the encoded input using the decoding_layer(dec_input, enc_state, target_sequence_length, max_target_sentence_length, rnn_size, num_layers, target_vocab_to_int, target_vocab_size, batch_size, keep_prob, dec_embedding_size) function

def seq2seq_model(input_data, target_data, keep_prob, batch_size,
                  source_sequence_length, target_sequence_length,
                  source_vocab_size, target_vocab_size,
                  enc_embedding_size, dec_embedding_size,
                  rnn_size, num_layers, target_vocab_to_int):
    enc_output, enc_state = encoding_layer(input_data, rnn_size, num_layers, keep_prob, source_sequence_length, source_vocab_size, enc_embedding_size)
    dec_input = process_decoder_input(target_data, target_vocab_to_int, batch_size) 
    dec_outputs_train, dec_outputs_infer = decoding_layer(dec_input, enc_state, target_sequence_length, tf.reduce_max(target_sequence_length), rnn_size, num_layers, target_vocab_to_int, target_vocab_size, batch_size, keep_prob, dec_embedding_size)
    return dec_outputs_train, dec_outputs_infer


Step 6: Neural Network Training

Tune the following parameters:

epochs = 3
batch_size = 128
rnn_size = 512
num_layers = 2
encoding_embedding_size = 256
decoding_embedding_size = 256
learning_rate = 0.001
keep_probability = 0.9
display_step = 1

Step 7: Build the Graph

I build the graph using the neural network implemented above.

save_path = 'checkpoints/dev'
(source_int_text, target_int_text), (source_vocab_to_int, target_vocab_to_int), _ = helper.load_preprocess()
max_target_sentence_length = max([len(sentence) for sentence in source_int_text])

train_graph = tf.Graph()
with train_graph.as_default():
    input_data, targets, lr, keep_prob, target_sequence_length, max_target_sequence_length, source_sequence_length = model_inputs()

    input_shape = tf.shape(input_data)

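    train_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                   targets,
                                                   keep_prob,
                                                   batch_size,
                                                   source_sequence_length,
                                                   target_sequence_length,
                                                   len(source_vocab_to_int),
                                                   len(target_vocab_to_int),
                                                   encoding_embedding_size,
                                                   decoding_embedding_size,
                                                   rnn_size,
                                                   num_layers,
                                                   target_vocab_to_int)
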
    training_logits = tf.identity(train_logits.rnn_output, name='logits')
    inference_logits = tf.identity(inference_logits.sample_id, name='predictions')

    masks = tf.sequence_mask(target_sequence_length, max_target_sequence_length, dtype=tf.float32, name='masks')

    with tf.name_scope("optimization"):
        # Cross-entropy over the target words, with padding masked out
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)

        optimizer = tf.train.AdamOptimizer(lr)

        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)
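
Two pieces of this graph are easy to misread: the mask that keeps padding out of the loss, and the clipping applied to the gradients. The plain-Python sketch below shows roughly what each one does on small lists. The function names and numbers are mine, chosen for illustration, and the loss here is a simplified masked mean rather than the exact TensorFlow computation.

```python
def sequence_mask(lengths, max_len):
    """1.0 for real tokens, 0.0 for padding positions."""
    return [[1.0 if t < n else 0.0 for t in range(max_len)] for n in lengths]


def masked_mean_loss(token_losses, mask):
    """Average the per-token losses over the unmasked positions only."""
    total = sum(loss * m
                for loss_row, mask_row in zip(token_losses, mask)
                for loss, m in zip(loss_row, mask_row))
    return total / sum(sum(row) for row in mask)


def clip_by_value(values, low, high):
    """Clamp every value into the range [low, high]."""
    return [min(max(v, low), high) for v in values]


mask = sequence_mask([3, 1], max_len=3)
print(mask)  # [[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]

# Hypothetical per-token losses; the 9.0 values sit on padding and are ignored.
loss = masked_mean_loss([[0.5, 1.0, 1.5], [2.0, 9.0, 9.0]], mask)
print(loss)  # 1.25

clipped = clip_by_value([-3.0, 0.2, 5.0], -1.0, 1.0)
print(clipped)  # [-1.0, 0.2, 1.0]
```

Clipping each gradient value to [-1, 1] limits how far a single noisy batch can move the weights.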

Batch and pad the source and target sequences.

def pad_sentence_batch(sentence_batch, pad_int):
    """Pad sentences with <PAD> so that each sentence of a batch has the same length"""
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [pad_int] * (max_sentence - len(sentence)) for sentence in sentence_batch]

def get_batches(sources, targets, batch_size, source_pad_int, target_pad_int):
    for batch_i in range(0, len(sources)//batch_size):
        start_i = batch_i * batch_size

        sources_batch = sources[start_i:start_i + batch_size]
        targets_batch = targets[start_i:start_i + batch_size]

        pad_sources_batch = np.array(pad_sentence_batch(sources_batch, source_pad_int))
        pad_targets_batch = np.array(pad_sentence_batch(targets_batch, target_pad_int))

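        # Need the lengths for the _lengths parameters
        pad_targets_lengths = []
        for target in pad_targets_batch:
            pad_targets_lengths.append(len(target))

        pad_source_lengths = []
        for source in pad_sources_batch:
            pad_source_lengths.append(len(source))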

        yield pad_sources_batch, pad_targets_batch, pad_source_lengths, pad_targets_lengths
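
Here is a small, self-contained sketch of what the padding step does to one batch. The pad id and the sentences are made up for the example.

```python
def pad_sentence_batch(sentence_batch, pad_int):
    """Right-pad every sentence to the length of the longest one in the batch."""
    max_sentence = max(len(sentence) for sentence in sentence_batch)
    return [sentence + [pad_int] * (max_sentence - len(sentence))
            for sentence in sentence_batch]


PAD_ID = 0  # hypothetical id for the <PAD> token

batch = [[5, 6, 7], [8], [9, 10]]
padded = pad_sentence_batch(batch, PAD_ID)

print(padded)                    # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print([len(s) for s in batch])   # [3, 1, 2]
print([len(s) for s in padded])  # [3, 3, 3]
```

One thing worth noticing: if the lengths are measured on the padded rows, as the loops over pad_sources_batch and pad_targets_batch suggest, every sequence in a batch reports the batch maximum rather than its own length.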

Step 8: Train

I train the neural network on the preprocessed data.

def get_accuracy(target, logits):
    """Calculate accuracy"""
    # Pad the shorter of the two arrays so they can be compared element-wise
    max_seq = max(target.shape[1], logits.shape[1])
    if max_seq - target.shape[1]:
        target = np.pad(
            target,
            [(0,0),(0,max_seq - target.shape[1])],
            'constant')
    if max_seq - logits.shape[1]:
        logits = np.pad(
            logits,
            [(0,0),(0,max_seq - logits.shape[1])],
            'constant')

    return np.mean(np.equal(target, logits))

train_source = source_int_text[batch_size:]
train_target = target_int_text[batch_size:]
valid_source = source_int_text[:batch_size]
valid_target = target_int_text[:batch_size]
(valid_sources_batch, valid_targets_batch, valid_sources_lengths, valid_targets_lengths) = next(get_batches(valid_source, valid_target, batch_size, source_vocab_to_int['<PAD>'], target_vocab_to_int['<PAD>']))
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(epochs):
        for batch_i, (source_batch, target_batch, sources_lengths, targets_lengths) in enumerate(
                get_batches(train_source, train_target, batch_size,
                            source_vocab_to_int['<PAD>'],
                            target_vocab_to_int['<PAD>'])):

            _, loss = sess.run(
                [train_op, cost],
                {input_data: source_batch,
                 targets: target_batch,
                 lr: learning_rate,
                 target_sequence_length: targets_lengths,
                 source_sequence_length: sources_lengths,
                 keep_prob: keep_probability})

            if batch_i % display_step == 0 and batch_i > 0:

                batch_train_logits = sess.run(
                    inference_logits,
                    {input_data: source_batch,
                     source_sequence_length: sources_lengths,
                     target_sequence_length: targets_lengths,
                     keep_prob: 1.0})

                batch_valid_logits = sess.run(
                    inference_logits,
                    {input_data: valid_sources_batch,
                     source_sequence_length: valid_sources_lengths,
                     target_sequence_length: valid_targets_lengths,
                     keep_prob: 1.0})

                train_acc = get_accuracy(target_batch, batch_train_logits)
                valid_acc = get_accuracy(valid_targets_batch, batch_valid_logits)

                print('Epoch {:>3} Batch {:>4}/{} - Train Accuracy: {:>6.4f}, Validation Accuracy: {:>6.4f}, Loss: {:>6.4f}'
                      .format(epoch_i, batch_i, len(source_int_text) // batch_size, train_acc, valid_acc, loss))

    # Save the trained model
    saver = tf.train.Saver()
    saver.save(sess, save_path)
    print('Model Trained and Saved')
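
To see what the accuracy printed during training actually measures, here is a plain-Python sketch of the same comparison that get_accuracy performs, using nested lists instead of NumPy arrays. The helper names and the ids are hypothetical.

```python
def pad_to(rows, width, pad_value=0):
    """Right-pad every row with pad_value up to the given width."""
    return [row + [pad_value] * (width - len(row)) for row in rows]


def token_accuracy(target, predictions):
    """Fraction of positions where the predicted id equals the target id."""
    width = max(len(target[0]), len(predictions[0]))
    target = pad_to(target, width)
    predictions = pad_to(predictions, width)
    matches = [t == p
               for target_row, pred_row in zip(target, predictions)
               for t, p in zip(target_row, pred_row)]
    return sum(matches) / len(matches)


toy_target = [[4, 5, 6, 0], [7, 8, 0, 0]]
toy_predictions = [[4, 5, 9], [7, 8, 0]]

accuracy = token_accuracy(toy_target, toy_predictions)
print(accuracy)  # 0.875
```

Padding positions are compared too, so matching zeros count as correct. That makes the number a rough progress signal during training, not a measure of translation quality.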

Step 9: Sentence to Sequence

To feed a sentence into the model for translation, I first need to preprocess it.

  • Convert the sentence to lowercase
  • Convert words into ids using vocab_to_int
    • Convert words not in the vocabulary to the <UNK> word id

def sentence_to_seq(sentence, vocab_to_int):
    return [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in sentence.lower().split()]
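
A short example of the fallback in action, using a tiny made-up vocabulary (the ids are hypothetical):

```python
def sentence_to_seq(sentence, vocab_to_int):
    """Lowercase the sentence and map each word to its id, using <UNK> for unknown words."""
    unk_id = vocab_to_int['<UNK>']
    return [vocab_to_int.get(word, unk_id) for word in sentence.lower().split()]


# Hypothetical toy vocabulary; 'purple' is deliberately missing.
toy_vocab = {'<UNK>': 2, 'he': 10, 'saw': 11, 'a': 12, 'truck': 13, '.': 14}

seq = sentence_to_seq('He saw a purple truck .', toy_vocab)
print(seq)  # [10, 11, 12, 2, 13, 14]
```

The capitalised 'He' still maps to 10 because of the lowercasing, while 'purple' falls back to the <UNK> id.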


Step 10: Translate

This will translate translate_sentence from English to French.

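translate_sentence = 'he saw a old yellow truck .'
translate_sentence = sentence_to_seq(translate_sentence, source_vocab_to_int)

# Assumption: the third value returned by helper.load_preprocess() is the
# (source_int_to_vocab, target_int_to_vocab) pair, and the checkpoint to
# restore is the one written to save_path during training.
_, _, (source_int_to_vocab, target_int_to_vocab) = helper.load_preprocess()
load_path = save_path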

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    loader = tf.train.import_meta_graph(load_path + '.meta')
    loader.restore(sess, load_path)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    target_sequence_length = loaded_graph.get_tensor_by_name('target_sequence_length:0')
    source_sequence_length = loaded_graph.get_tensor_by_name('source_sequence_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')

    translate_logits = sess.run(logits, {input_data: [translate_sentence]*batch_size,
                                         target_sequence_length: [len(translate_sentence)*2]*batch_size,
                                         source_sequence_length: [len(translate_sentence)]*batch_size,
                                         keep_prob: 1.0})[0]

print('  Word Ids:      {}'.format([i for i in translate_sentence]))
print('  English Words: {}'.format([source_int_to_vocab[i] for i in translate_sentence]))
print('  Word Ids:      {}'.format([i for i in translate_logits]))
print('  French Words: {}'.format(" ".join([target_int_to_vocab[i] for i in translate_logits])))

Expected outcome:
Translate English to French

Imperfect Translation

You might notice that some sentences translate better than others. Since the dataset I’m using only has a vocabulary of 227 English words, out of the thousands in everyday use, I’m only going to see good results for sentences built from those words.

Running the project

The files needed to run this project are available on my GitHub, along with additional information on preparing the environment to run it.

Listening: Manu Chao – Clandestino