Parts-of-Speech Tagging - First Steps: Working with text files, Creating a Vocabulary and Handling Unknown Words

In this lecture notebook you will create a vocabulary from a tagged dataset and learn how to deal with words that are not present in this vocabulary when working with other text sources. Aside from this you will also learn how to:

read text files
work with defaultdict
work with string data

import string
from collections import defaultdict

Read Text Data

A tagged dataset taken from the Wall Street Journal is provided in the file WSJ_02-21.pos.

To read this file you can use Python's context manager by using the with keyword and specifying the name of the file you wish to read. To actually save the contents of the file into memory you will need to use the readlines() method and store its return value in a variable.

Python's context managers are great because you don't need to explicitly close the connection to the file, this is done under the hood: