Assignment 2: Parts-of-Speech Tagging (POS)

Welcome to the second assignment of Course 2 in the Natural Language Processing specialization. This assignment will develop your skills in part-of-speech (POS) tagging, the process of assigning a part-of-speech tag (noun, verb, adjective, ...) to each word in an input text. Tagging is difficult because some words can represent more than one part of speech in different contexts; such words are ambiguous. Consider the word 'well' in the following examples:

  • The whole team played well. [adverb]
  • You are doing well for yourself. [adjective]
  • Well, this assignment took me forever to complete. [interjection]
  • The well is dry. [noun]
  • Tears were beginning to well in her eyes. [verb]

Distinguishing the part of speech of each word in a sentence helps you better understand the sentence's meaning. This is critically important in search queries: identifying the proper noun, the organization, the stock symbol, or anything similar would greatly improve everything from speech recognition to search. By completing this assignment, you will:

  • Learn how parts-of-speech tagging works
  • Compute the transition matrix A in a Hidden Markov Model
  • Compute the emission matrix B in a Hidden Markov Model
  • Implement the Viterbi algorithm
  • Compute the accuracy of your own model
# Importing packages and loading in the data set 
from utils_pos import get_word_tag, preprocess  
import pandas as pd
from collections import defaultdict
import math
import numpy as np

Part 0: Data Sources

This assignment will use two tagged data sets collected from the Wall Street Journal (WSJ).

Here is an example 'tag set', or part-of-speech designation, describing each two- or three-letter tag and its meaning.

  • One data set (WSJ-2_21.pos) will be used for training.
  • The other (WSJ-24.pos) for testing.
  • The tagged training data has been preprocessed to form a vocabulary (hmm_vocab.txt).
  • The words in the vocabulary are words from the training set that were used two or more times.
  • The vocabulary is augmented with a set of 'unknown word tokens', described below.
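The vocabulary-building step above can be sketched as follows. This is a minimal illustration, not the course's actual preprocessing code: it assumes each line of the training file holds a tab-separated word/tag pair (as in the WSJ .pos files) and keeps only words seen at least twice.

```python
from collections import Counter

def build_vocab(lines, min_count=2):
    """Count word occurrences in 'word\ttag' lines and keep frequent words."""
    counts = Counter()
    for line in lines:
        if not line.split():          # a blank line marks a sentence boundary
            continue
        word, tag = line.split()
        counts[word] += 1
    # keep words used min_count or more times, as the assignment's vocabulary does
    return sorted(w for w, c in counts.items() if c >= min_count)

# toy stand-in for the contents of WSJ-2_21.pos
lines = ["In\tIN", "an\tDT", "an\tDT", "Oct.\tNNP", "an\tDT", "In\tIN"]
vocab = build_vocab(lines)            # ['In', 'an']  ('Oct.' appears only once)
```

In the assignment itself, this filtered word list would then be augmented with the unknown-word tokens before being written to hmm_vocab.txt.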

The training set will be used to create the emission, transition, and tag counts.

The test set (WSJ-24.pos) is read in to create y.

  • This contains both the test text and the true tag.
  • The test set has also been preprocessed to remove the tags to form test_words.txt.
  • This is read in and further processed to identify the end of sentences and handle words not in the vocabulary using functions provided in utils_pos.py.
  • This forms the list prep, the preprocessed text used to test our POS taggers.
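A rough sketch of what that preprocessing step does is shown below. The marker tokens '--n--' (end of sentence) and '--unk--' (out-of-vocabulary word) are assumptions here; the actual markers and logic live in the provided utils_pos.py.

```python
def preprocess_words(raw_words, vocab):
    """Mark sentence boundaries and replace out-of-vocabulary words."""
    prep = []
    for w in raw_words:
        w = w.strip()
        if not w:                 # an empty line marks the end of a sentence
            prep.append("--n--")
        elif w not in vocab:      # word is not in the vocabulary
            prep.append("--unk--")
        else:
            prep.append(w)
    return prep

vocab = {"The", "well", "is", "dry"}
prep = preprocess_words(["The", "well", "is", "dry", "", "florp"], vocab)
# -> ['The', 'well', 'is', 'dry', '--n--', '--unk--']
```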

A POS tagger will necessarily encounter words that are not in its datasets.

  • To improve accuracy, these words are further analyzed during preprocessing to extract available hints as to their appropriate tag.
  • For example, the suffix 'ize' is a hint that the word is a verb, as in 'final-ize' or 'character-ize'.
  • A set of unknown tokens, such as '--unk-verb--' or '--unk-noun--', will replace the unknown words in both the training and test corpora and will appear in the emission, transition, and tag data structures.
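The suffix-based classification can be sketched like this. The specific suffix lists below are illustrative guesses, not the ones in utils_pos.py, but the idea is the same: use a morphological hint to pick a more informative unknown token.

```python
def assign_unk(word):
    """Map an out-of-vocabulary word to a suffix-based unknown token."""
    verb_suffixes = ("ate", "ify", "ise", "ize")
    noun_suffixes = ("action", "age", "ance", "ism", "ment", "ness", "ship", "tion")
    adj_suffixes  = ("able", "ful", "ible", "ish", "ive", "less", "ly", "ous")
    if word.endswith(verb_suffixes):      # e.g. 'finalize', 'characterize'
        return "--unk-verb--"
    if word.endswith(noun_suffixes):      # e.g. 'happiness'
        return "--unk-noun--"
    if word.endswith(adj_suffixes):       # e.g. 'laughable'
        return "--unk-adj--"
    return "--unk--"                      # no hint available
```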

Implementation note:

  • For Python 3.6 and beyond, dictionaries retain insertion order.
  • Furthermore, their hash-based lookup makes them suitable for rapid membership tests.
    • If di is a dictionary, key in di returns True if di has the key key, else False.

The dictionary vocab will utilize these features.
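A quick illustration of both properties, using a toy word list (the real vocab is built from hmm_vocab.txt):

```python
# Build a small vocabulary dictionary mapping each word to an index.
vocab = {}
for i, word in enumerate(["the", "well", "is", "dry"]):
    vocab[word] = i              # Python 3.6+ preserves this insertion order

print("well" in vocab)           # hash-based membership test -> True
print("cat" in vocab)            # -> False
```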