Word Embeddings: Ungraded Practice Notebook

In this ungraded notebook, you'll practice each of the individual techniques covered in the lecture. Working through small examples will prepare you for the graded assignment, where you will combine these techniques in more advanced ways to create word embeddings from a real-life corpus.

This notebook has two main parts: data preparation and the continuous bag-of-words (CBOW) model.

To get started, import and initialize all the libraries you will need.

import sys
!{sys.executable} -m pip install emoji

Collecting emoji
  Downloading emoji-0.6.0.tar.gz (51 kB)
     |████████████████████████████████| 51 kB 14.3 MB/s eta 0:00:01
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-0.6.0-py3-none-any.whl size=49715 sha256=deb828c181c105d3b3f6a24006c822c4798a2fc6aac773c3451f38f37c2ca14d
  Stored in directory: /home/jovyan/.cache/pip/wheels/4e/bf/6b/2e22b3708d14bf6384f862db539b044d6931bd6b14ad3c9adc
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-0.6.0

import re
import nltk
from nltk.tokenize import word_tokenize
import emoji
import numpy as np

from utils2 import get_dict

nltk.download('punkt')  # download pre-trained Punkt tokenizer for English

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

True

Data preparation

In the data preparation phase, starting with a corpus of text, you will:

Clean and tokenize the corpus.
Extract the pairs of context words and center words that will make up the training data set for the CBOW model. In each pair, the context words are the features fed into the model, and the center word is the target value that the model learns to predict.
Create simple vector representations of the context words (features) and center words (targets) that can be used by the neural network of the CBOW model.
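The three steps above can be sketched end to end on a toy sentence. This is a minimal illustration under stated assumptions, not the notebook's own implementation: the helper names (`tokenize`, `get_windows`, `one_hot`, `context_vector`) are hypothetical, a regular expression stands in for the `nltk` tokenizer so the sketch runs without any downloads, and averaging the one-hot vectors of the context words is just one simple choice of feature representation.

```python
import re


def tokenize(corpus):
    # Assumption: a regex stands in for nltk's word_tokenize here.
    # Interrupting punctuation is replaced by a period, then only
    # alphabetic tokens and periods are kept, all lowercased.
    corpus = re.sub(r"[,!?;-]+", ".", corpus)
    return [tok.lower() for tok in re.findall(r"[A-Za-z]+|\.", corpus)]


def get_windows(words, half_window):
    # Yield (context_words, center_word) for every position that has
    # half_window words available on both sides.
    for i in range(half_window, len(words) - half_window):
        context = words[i - half_window:i] + words[i + 1:i + half_window + 1]
        yield context, words[i]


def one_hot(word, word2ind):
    # Target representation: 1.0 at the word's index, 0.0 elsewhere.
    vec = [0.0] * len(word2ind)
    vec[word2ind[word]] = 1.0
    return vec


def context_vector(context, word2ind):
    # Feature representation (an assumption): the average of the
    # one-hot vectors of the context words.
    vec = [0.0] * len(word2ind)
    for word in context:
        vec[word2ind[word]] += 1.0 / len(context)
    return vec


tokens = tokenize("I am happy because I am learning")

# Hypothetical vocabulary mapping: each unique word gets an index
# according to its position in the sorted vocabulary.
word2ind = {word: i for i, word in enumerate(sorted(set(tokens)))}

pairs = list(get_windows(tokens, half_window=2))
for context, center in pairs:
    print(context, "->", center)
    print("  features:", context_vector(context, word2ind))
    print("  target:  ", one_hot(center, word2ind))
```

With a half-window of 2, each training example uses four context words, so the seven-word sentence yields three examples. Plain Python lists keep the sketch dependency-free; in practice the same vectors would typically be stored as NumPy arrays.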