Assignment 4: Word Embeddings
Welcome to the fourth (and last) programming assignment of Course 2!
In this assignment, you will practice how to compute word embeddings and use them for sentiment analysis.
- To implement sentiment analysis, you can go beyond counting the number of positive and negative words.
- Instead, you can represent each word numerically with a vector.
- The vector can then capture syntactic (i.e., part-of-speech) and semantic (i.e., meaning) structure.
In this assignment, you will explore a classic way of generating word embeddings or representations.
- You will implement a famous model called the continuous bag of words (CBOW) model.
By completing this assignment you will:
- Train word vectors from scratch.
- Learn how to create batches of data.
- Understand how backpropagation works.
- Plot and visualize your learned word vectors.
Knowing how to train these models will give you a better understanding of word vectors, which are building blocks to many applications in natural language processing.
1. The Continuous Bag of Words Model
Let's take a look at the following sentence:
'I am happy because I am learning'.
- In continuous bag of words (CBOW) modeling, we try to predict the center word given a few context words (the words around the center word).
- For example, if you were to choose a context half-size of, say, $C = 2$, then you would try to predict the word happy given a context of 2 words before and 2 words after the center word:
words before: [I, am]
words after: [because, I]
- In other words: the context is [I, am, because, I] and the center word to predict is happy.
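To make this concrete, here is a minimal sketch of how such (context words, center word) pairs can be generated with a sliding window. The helper name get_windows and its arguments are illustrative assumptions, not part of the assignment's utilities.
# Illustrative helper (not in utils2): yield (context_words, center_word)
# pairs, using C words on each side of the center word
def get_windows(words, C):
    for i in range(C, len(words) - C):
        center_word = words[i]
        context_words = words[i - C:i] + words[i + 1:i + C + 1]
        yield context_words, center_word

for context, center in get_windows(['i', 'am', 'happy', 'because', 'i', 'am', 'learning'], 2):
    print(context, '->', center)
# ['i', 'am', 'because', 'i'] -> happy
# ['am', 'happy', 'i', 'am'] -> because
# ['happy', 'because', 'am', 'learning'] -> i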
The structure of your model will look like this:
Where $X$ is the average of all the one-hot vectors of the context words.
Once you have encoded all the context words, you can use $X$ as the input to your model.
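As a concrete illustration, here is a minimal sketch of building this averaged one-hot input; the toy vocabulary and the word2Ind mapping below are illustrative stand-ins for the dictionaries you will build later in the assignment.
# Illustrative sketch: average the one-hot column vectors of the context words
import numpy as np

vocab = sorted({'i', 'am', 'happy', 'because', 'learning'})
word2Ind = {word: i for i, word in enumerate(vocab)}
V = len(vocab)

def context_to_input(context_words, word2Ind, V):
    x = np.zeros((V, 1))                      # one column vector of length V
    for word in context_words:
        x[word2Ind[word]] += 1                # sum the one-hot vectors
    return x / len(context_words)             # then average them

X = context_to_input(['i', 'am', 'because', 'i'], word2Ind, V)
print(X.T)  # [[0.25 0.25 0.   0.5  0.  ]] with vocab sorted alphabetically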
The architecture you will be implementing is as follows:
\begin{align}
h &= W_1 \ X + b_1 \tag{1} \\
a &= \mathrm{ReLU}(h) \tag{2} \\
z &= W_2 \ a + b_2 \tag{3} \\
\hat{y} &= \mathrm{softmax}(z) \tag{4}
\end{align}
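As a preview of what you will implement, here is a minimal NumPy sketch of this forward pass. The toy sizes V and N and the random weights are assumptions for illustration only; in the assignment you will initialize and train $W_1$, $W_2$, $b_1$, and $b_2$ yourself.
# Illustrative sketch of the forward pass in equations (1)-(4)
import numpy as np

def relu(h):
    return np.maximum(0, h)

def softmax(z):
    e = np.exp(z - np.max(z))                 # shift for numerical stability
    return e / e.sum(axis=0)

V, N = 5, 3                                   # toy vocabulary and embedding sizes
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((N, V)), rng.standard_normal((N, 1))
W2, b2 = rng.standard_normal((V, N)), rng.standard_normal((V, 1))

X = np.full((V, 1), 1 / V)                    # dummy averaged one-hot input

h = W1 @ X + b1                               # equation (1)
a = relu(h)                                   # equation (2)
z = W2 @ a + b2                               # equation (3)
y_hat = softmax(z)                            # equation (4)
print(y_hat.sum())                            # sums to 1: a probability distribution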
# Import Python libraries and helper functions (in utils2)
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
from collections import Counter
from utils2 import sigmoid, get_batches, compute_pca, get_dict
# Add the local directory to the NLTK data path so the tokenizer data can be found
nltk.data.path.append('.')
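If you run this outside the course environment and the sentence tokenizer data is not already present in the local directory, you may need to download it first; a one-line sketch:
nltk.download('punkt')  # only needed if the Punkt tokenizer data is missing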