Building the language model
Count matrix
To calculate the n-gram probability, you will need to count frequencies of n-grams and n-gram prefixes in the training dataset. In some of the code assignment exercises, you will store the n-gram frequencies in a dictionary.
In other parts of the assignment, you will build a count matrix that keeps counts of (n-1)-gram prefix followed by all possible last words in the vocabulary.
The following code shows how to check, retrieve and update counts of n-grams in the word count dictionary.
# manipulate n_gram count dictionary
n_gram_counts = {
('i', 'am', 'happy'): 2,
('am', 'happy', 'because'): 1}
# get count for an n-gram tuple
print(f"count of n-gram {('i', 'am', 'happy')}: {n_gram_counts[('i', 'am', 'happy')]}")
# check if n-gram is present in the dictionary
if ('i', 'am', 'learning') in n_gram_counts:
print(f"n-gram {('i', 'am', 'learning')} found")
else:
print(f"n-gram {('i', 'am', 'learning')} missing")
# update the count in the word count dictionary
n_gram_counts[('i', 'am', 'learning')] = 1
if ('i', 'am', 'learning') in n_gram_counts:
print(f"n-gram {('i', 'am', 'learning')} found")
else:
print(f"n-gram {('i', 'am', 'learning')} missing")
count of n-gram ('i', 'am', 'happy'): 2
n-gram ('i', 'am', 'learning') missing
n-gram ('i', 'am', 'learning') found
The next code snippet shows how to merge two tuples in Python. That will be handy when creating the n-gram from the prefix and the last word.
# concatenate tuple for prefix and tuple with the last word to create the n_gram
prefix = ('i', 'am', 'happy')
word = 'because'
# note here the syntax for creating a tuple for a single word
n_gram = prefix + (word,)
print(n_gram)
('i', 'am', 'happy', 'because')
In the lecture, you've seen that the count matrix could be made in a single pass through the corpus. Here is one approach to do that.