
Assignment 3: Hello Vectors

Welcome to this week's programming assignment on exploring word vectors.
In natural language processing, we represent each word as a vector consisting of numbers.
The vector encodes the meaning of the word. These numbers (or weights) for each word are learned using various machine
learning models, which we will explore in more detail later in this specialization. Rather than make you code the
machine learning models from scratch, we will show you how to use them. In the real world, you can always load the
trained word vectors, and you will almost never have to train them from scratch. In this assignment, you will:

  • Predict analogies between words.
  • Use PCA to reduce the dimensionality of the word embeddings and plot them in two dimensions.
  • Compare word embeddings by using a similarity measure (the cosine similarity); see the short sketch after this list.
  • Understand how these vector space models work.
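Before starting, here is a minimal sketch of the cosine similarity measure mentioned in the list above. The toy vectors and the helper name cosine_similarity are illustrative only; they are not the graded functions of this assignment.

import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v: u.v / (||u|| ||v||)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-dimensional "embeddings" (the real word vectors used here have 300 dimensions).
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.82, 0.15])
oil = np.array([0.1, 0.2, 0.95])

print(cosine_similarity(king, queen))  # close to 1.0 -> similar meanings
print(cosine_similarity(king, oil))    # noticeably smaller -> less related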

1.0 Predict the Countries from Capitals

In the lectures, we illustrated word analogies
by finding the capital of a country from the country.
In this part of the assignment, we have flipped the problem around: you are asked to predict the countries
that correspond to some capitals.
Imagine you are playing trivia against a second grader who just took a geography test and knows all the capitals by heart.
Thanks to NLP, you will be able to answer the questions properly. In other words, you will write a program that gives
you the country given its capital (a sketch of the underlying vector arithmetic follows this paragraph).
That way you are pretty sure you will win the trivia game. We will start by exploring the data set.
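The prediction relies on the same vector arithmetic as the analogy examples from the lectures: the vector for the unknown country should sit near city2 - city1 + country1. Below is a minimal sketch of that idea, assuming word_embeddings is a dictionary mapping words to NumPy arrays (as built later in this assignment); the function name, the brute-force search over the whole dictionary, and the example words are illustrative, not the graded implementation.

import numpy as np

def predict_country(city1, country1, city2, word_embeddings):
    """Sketch of the analogy 'city1 is to country1 as city2 is to ?'.

    Forms the vector city2 - city1 + country1 and returns the word whose
    embedding has the highest cosine similarity to it.
    """
    target = word_embeddings[city2] - word_embeddings[city1] + word_embeddings[country1]
    best_word, best_score = None, -1.0
    for word, vec in word_embeddings.items():
        if word in (city1, country1, city2):
            continue  # skip the input words themselves
        score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

# Hypothetical usage: predict_country('Athens', 'Greece', 'Baghdad', word_embeddings)
# should return 'Iraq' together with a high similarity score.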


1.1 Importing the data

As usual, you start by importing some essential Python libraries and then load the dataset.
The dataset will be loaded as a Pandas DataFrame,
a very common data structure in data science.
This may take a few minutes because of the large size of the data.

# Run this cell to import packages.
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from utils import get_vectors
data = pd.read_csv('capitals.txt', delimiter=' ')
data.columns = ['city1', 'country1', 'city2', 'country2']

# print first five elements in the DataFrame
data.head(5)

To Run This Code On Your Own Machine:

Note that because the original Google News word embedding dataset is about 3.64 gigabytes,
the workspace is not able to handle the full file. So we've downloaded the full dataset,
extracted a sample of the words that we're going to analyze in this assignment, and saved
it in a pickle file called word_embeddings_capitals.p

If you want to download the full dataset on your own and choose your own set of word embeddings,
please see the instructions and some helper code.

  • Download the dataset from this page.
  • Search the page for 'GoogleNews-vectors-negative300.bin.gz' and click the link to download it.

Copy-paste the code below and run it on your local machine after downloading
the dataset to the same directory as the notebook.

import pickle
import nltk
from gensim.models import KeyedVectors

# Load the full Google News embeddings (about 3.64 GB).
embeddings = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

# Collect the set of words to keep: every token that appears in capitals.txt ...
f = open('capitals.txt', 'r').read()
set_words = set(nltk.word_tokenize(f))

# ... plus a few extra words used elsewhere in the assignment.
select_words = words = ['king', 'queen', 'oil', 'gas', 'happy', 'sad', 'city', 'town',
                        'village', 'country', 'continent', 'petroleum', 'joyful']
for w in select_words:
    set_words.add(w)

def get_word_embeddings(embeddings):
    word_embeddings = {}
    for word in embeddings.vocab:  # .vocab is the vocabulary attribute in gensim < 4.0
        if word in set_words:
            word_embeddings[word] = embeddings[word]
    return word_embeddings

# Testing your function
word_embeddings = get_word_embeddings(embeddings)
print(len(word_embeddings))

# Save the subset so the notebook can load it without the full 3.64 GB file.
pickle.dump(word_embeddings, open("word_embeddings_subset.p", "wb"))
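
Once the subset has been saved, you can load it back and inspect it. A minimal sketch, assuming the word_embeddings_subset.p file name used by the helper code above (the workspace ships an equivalent pre-built pickle):

import pickle

# Load the subset of embeddings saved above.
word_embeddings = pickle.load(open("word_embeddings_subset.p", "rb"))

print(len(word_embeddings))             # number of words kept in the subset
print(word_embeddings['Greece'].shape)  # 'Greece' appears in capitals.txt; each vector has 300 dimensions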