Character level language model - Dinosaurus Island

Welcome to Dinosaurus Island! 65 million years ago, dinosaurs existed, and in this assignment they are back. You are in charge of a special task. Leading biology researchers are creating new breeds of dinosaurs and bringing them to life on earth, and your job is to give names to these dinosaurs. If a dinosaur does not like its name, it might go berserk, so choose wisely!

Luckily, you have learned some deep learning, and you will use it to save the day. Your assistant has collected a list of all the dinosaur names they could find and compiled them into this dataset. (Feel free to take a look by clicking the previous link.) To create new dinosaur names, you will build a character-level language model. Your algorithm will learn the different name patterns and randomly generate new names. Hopefully this algorithm will keep you and your team safe from the dinosaurs' wrath!

By completing this assignment you will learn:

  • How to store text data for processing using an RNN
  • How to synthesize data by sampling a prediction at each time step and passing it to the next RNN cell
  • How to build a character-level text generation recurrent neural network
  • Why clipping the gradients is important

We will begin by loading in some functions that we have provided for you in rnn_utils. Specifically, you have access to functions such as rnn_forward and rnn_backward which are equivalent to those you've implemented in the previous assignment.

Updates

If you were working on the notebook before this update...
  • The current notebook is version "3b".
  • You can find your original work saved in the notebook with the previous version name ("v3a")
  • To view the file directory, go to the menu "File->Open", and this will open a new tab that shows the file directory.
List of updates 3b
  • removed redundant numpy import
  • clip
    • changed test code to use variable name 'mvalue' rather than 'maxvalue' and deleted it from the namespace to avoid confusion.
  • optimize
    • removed redundant description of clip function to discourage use of 'maxvalue', which is not an argument to optimize
  • model
    • added 'verbose' mode to print X, Y to aid in creating that code.
    • wordsmithed instructions to prevent confusion
      • 2000 examples vs 100, 7 displayed vs 10
      • no randomization of order
  • sample
    • removed comments regarding potential different sample outputs to reduce confusion.
import numpy as np
from utils import *
import random
import pprint

1 - Problem Statement

1.1 - Dataset and Preprocessing

Run the following cell to read the dataset of dinosaur names, create a list of unique characters (such as a-z), and compute the dataset and vocabulary size.

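# Read the whole file and lowercase it so that 'A' and 'a' count as one character
with open('dinos.txt', 'r') as f:
    data = f.read().lower()
chars = list(set(data))  # unique characters in the dataset
data_size, vocab_size = len(data), len(chars)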
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))
There are 19909 total characters and 27 unique characters in your data.
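
To see how these counts arise without needing dinos.txt, here is a minimal sketch on a three-name stand-in string. It assumes the file stores one name per line, so the newline character counts toward both the total and the vocabulary; the `sample` string is illustrative and is not the real dataset. Sorting the set is a choice made here for reproducibility, since the iteration order of a set of strings can differ between Python runs.

```python
# Hypothetical stand-in for dinos.txt: one name per line (assumed format).
sample = "Aachenosaurus\nAardonyx\nAbelisaurus\n"

data = sample.lower()

# set() keeps one copy of each character; sorting gives a stable order.
chars = sorted(set(data))
data_size, vocab_size = len(data), len(chars)

print('There are %d total characters and %d unique characters in the sample.'
      % (data_size, vocab_size))
print('First character in sorted order: %r' % chars[0])
```

The newline sorts ahead of every letter, so it appears first in `chars`. Under the same one-name-per-line assumption, a vocabulary of 27 in the full dataset would correspond to the 26 letters a-z plus the newline.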