Language Models: Auto-Complete

In this assignment, you will build an auto-complete system. Auto-complete systems are something you may see every day:

When you search for something on Google, you often get suggestions to help you complete your query.
When you are writing an email, you get suggestions telling you possible endings to your sentence.

By the end of this assignment, you will have developed a prototype of such a system.

Outline

1 Load and Preprocess Data
1.1 Load the data
1.2 Pre-process the data
2 Develop n-gram based language models
- Exercise 08
- Exercise 09
3 Perplexity
- Exercise 10
4 Build an auto-complete system
- Exercise 11

A key building block for an auto-complete system is a language model.
A language model assigns a probability to a sequence of words, such that more "likely" sequences receive higher scores. For example,

"I have a pen"
is expected to have a higher probability than
"I am a pen"
since the first one seems to be a more natural sentence in the real world.

You can take advantage of this probability calculation to develop an auto-complete system.
Suppose the user typed

"I eat scrambled"
Then you can find a word x such that "I eat scrambled x" receives the highest probability. If x = "eggs", the sentence would be
"I eat scrambled eggs"
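
The idea of picking the word x that gives the most probable sentence can be sketched with a toy bigram model. This is only an illustration under stated assumptions: the corpus below is invented, the helper name suggest_next is hypothetical, and only the single previous word is used as context. Because the typed prefix is fixed, choosing the most frequent follower of the last word is the same as maximizing the bigram probability of the completed sentence.

```python
from collections import Counter

# Toy corpus, invented purely for illustration (not the assignment's data).
corpus = [
    ["i", "eat", "scrambled", "eggs"],
    ["i", "eat", "scrambled", "eggs"],
    ["i", "eat", "scrambled", "tofu"],
    ["i", "eat", "fried", "rice"],
]

# Count how often each word follows each previous word (bigram counts).
bigram_counts = Counter()
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[(prev, word)] += 1


def suggest_next(previous_word):
    """Return the most frequent follower of previous_word, or None if unseen."""
    candidates = {w: c for (p, w), c in bigram_counts.items() if p == previous_word}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


typed = ["i", "eat", "scrambled"]
print(suggest_next(typed[-1]))  # prints: eggs
```

The assignment itself builds a more general version with longer contexts and smoothing; this sketch only shows the selection step.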

While a variety of language models have been developed, this assignment uses N-grams, a simple but powerful method for language modeling.

N-grams are also used in machine translation and speech recognition.

Here are the steps of this assignment:

1. Load and preprocess data
- Load and tokenize data.
- Split the sentences into train and test sets.
- Replace low-frequency words with an unknown marker <unk>.
2. Develop N-gram based language models
- Compute the count of n-grams from a given data set.
- Estimate the conditional probability of a next word with k-smoothing.
3. Evaluate the N-gram models by computing the perplexity score.
4. Use your own model to suggest an upcoming word given your sentence.
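
As a preview of the n-gram counting and k-smoothing steps listed above, here is a minimal sketch. It is not the graded solution: the function names, the <s> and <e> padding tokens, and the toy sentences are assumptions made for illustration. The smoothed estimate used is (count of the context followed by the word + k) / (count of the context + k * vocabulary size).

```python
from collections import Counter


def count_n_grams(sentences, n, start_token="<s>", end_token="<e>"):
    """Count every n-gram (as a tuple) in a list of tokenized sentences.

    Assumption: each sentence is padded with n start tokens and one end token.
    """
    counts = Counter()
    for sentence in sentences:
        padded = [start_token] * n + list(sentence) + [end_token]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts


def estimate_probability(word, previous_n_gram, n_gram_counts,
                         n_plus1_gram_counts, vocabulary_size, k=1.0):
    """Estimate P(word | previous_n_gram) with add-k smoothing."""
    previous_n_gram = tuple(previous_n_gram)
    numerator = n_plus1_gram_counts.get(previous_n_gram + (word,), 0) + k
    denominator = n_gram_counts.get(previous_n_gram, 0) + k * vocabulary_size
    return numerator / denominator


# Toy data, invented for illustration.
sentences = [["i", "like", "a", "cat"],
             ["this", "dog", "is", "like", "a", "cat"]]
vocabulary_size = len({w for s in sentences for w in s})  # 7 distinct words

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

p = estimate_probability("cat", ["a"], unigram_counts, bigram_counts,
                         vocabulary_size, k=1.0)
print(round(p, 4))  # (2 + 1) / (2 + 1 * 7) = 0.3333
```

With k = 0 an unseen word would get probability zero (or a division by zero for an unseen context); adding k to every count avoids both cases.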

import math
import random
import numpy as np
import pandas as pd
import nltk
nltk.data.path.append('.')

Part 1: Load and Preprocess Data

Part 1.1: Load the data

You will use Twitter data.
Load the data and view the first few sentences by running the next cell.

Notice that data is one long string that contains many tweets.
Observe that there is a line break "\n" between tweets.
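
Because the tweets are separated by "\n", the string can be split into one entry per tweet. The snippet below is a small sketch on an invented stand-in string, not the real data set; the variable name tweets is an assumption.

```python
# Toy stand-in for the real data string; these tweets are invented.
data = "I have a pen.\nI eat scrambled eggs!\n\nLanguage models are fun\n"

# Split on the line break, strip surrounding whitespace, and drop empty lines.
tweets = [line.strip() for line in data.split("\n")]
tweets = [line for line in tweets if line]

print(len(tweets))  # prints: 3
print(tweets[0])    # prints: I have a pen.
```

Dropping empty entries matters because a trailing line break or a blank line would otherwise produce empty "tweets".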