Assignment 3 - Named Entity Recognition (NER)

Welcome to the third programming assignment of Course 3. In this assignment, you will learn to build more complicated models with Trax. By completing this assignment, you will be able to:

  • Design the architecture of a neural network, train it, and test it.
  • Process features and represent them
  • Understand word padding
  • Implement LSTMs
  • Test with your own sentence

Introduction

We start by defining named entity recognition (NER). NER is a subtask of information extraction that locates and classifies named entities in text. Named entities can be organizations, persons, locations, times, and so on.

For example:

[Image: an example sentence with its named-entity labels]

Is labeled as follows:

  • French: geopolitical entity
  • Morocco: geographic entity
  • Christmas: time indicator

Everything else, labeled with an O, is not considered a named entity. In this assignment, you will train a named entity recognition system that can be trained in a few seconds (on a GPU) and reaches around 75% accuracy. You will then load a version of the same model that was trained for a longer period of time, and evaluate it to get about 96% accuracy. Finally, you will test your named entity recognition system on your own sentence.

#!pip -q install trax==1.3.1

import trax 
from trax import layers as tl
import os 
import numpy as np
import pandas as pd


from utils import get_params, get_vocab
import random as rnd

# set random seeds to make this notebook easier to replicate
trax.supervised.trainer_lib.init_random_number_generators(33)
DeviceArray([ 0, 33], dtype=uint32)
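To illustrate why a fixed seed makes results easier to replicate, here is a minimal sketch. It uses Python's built-in random module (imported above as rnd) rather than the Trax generator seeded in the cell above, and the helper shuffled_indices is a hypothetical name introduced only for this illustration.

```python
import random as rnd

# Assumption: this seeds Python's built-in generator only; it is separate
# from the Trax/JAX generator initialized in the notebook cell above.
def shuffled_indices(seed, n=10):
    """Return the indices 0..n-1 in the order produced after seeding."""
    rnd.seed(seed)
    indices = list(range(n))
    rnd.shuffle(indices)
    return indices

first = shuffled_indices(33)
second = shuffled_indices(33)

# The same seed gives the same shuffle, so a rerun sees the same order.
print(first == second)
```

The same idea applies to any step that relies on randomness, such as shuffling training examples or initializing weights.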

Part 1: Exploring the data

We will be using a dataset from Kaggle, which we will preprocess for you. The original data consists of four columns: the sentence number, the word, the part of speech of the word, and the tag. A few tags you might expect to see are:

  • geo: geographical entity
  • org: organization
  • per: person
  • gpe: geopolitical entity
  • tim: time indicator
  • art: artifact
  • eve: event
  • nat: natural phenomenon
  • O: filler word
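The labels in the dataset attach these abbreviations to a position prefix, for example B-geo. As a rough sketch, the mapping from a label to a readable description could look like the following. The helper describe_tag is hypothetical, and the reading of the prefixes (B for the first word of an entity, I for a continuation word) is an assumption based on the common B-/I- tagging convention.

```python
# Tag abbreviations copied from the list above.
TAG_MEANINGS = {
    "geo": "geographical entity",
    "org": "organization",
    "per": "person",
    "gpe": "geopolitical entity",
    "tim": "time indicator",
    "art": "artifact",
    "eve": "event",
    "nat": "natural phenomenon",
    "O": "filler word",
}

# Assumption: B marks the first word of an entity, I a continuation word.
POSITIONS = {"B": "first word", "I": "continuation word"}

def describe_tag(tag):
    """Turn a label such as 'B-geo' into a readable description."""
    if tag == "O":
        return TAG_MEANINGS["O"]
    prefix, _, entity_type = tag.partition("-")
    meaning = TAG_MEANINGS.get(entity_type, "unknown entity type")
    position = POSITIONS.get(prefix, "unknown position")
    return f"{meaning} ({position})"

for tag in ["O", "B-geo", "I-geo", "B-gpe"]:
    print(tag, "->", describe_tag(tag))
```

Unknown prefixes or abbreviations fall back to an "unknown" description instead of raising an error, which is convenient when exploring data by hand.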

# display original Kaggle data
data = pd.read_csv("ner_dataset.csv", encoding="ISO-8859-1")
# read only the first sentence and its labels; `with` closes each file afterwards
with open('data/small/train/sentences.txt', 'r') as f:
    train_sents = f.readline()
with open('data/small/train/labels.txt', 'r') as f:
    train_labels = f.readline()
print('SENTENCE:', train_sents)
print('SENTENCE LABEL:', train_labels)
print('ORIGINAL DATA:\n', data.head(5))
# free the memory; the data is loaded again later
del data, train_sents, train_labels
SENTENCE: Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .

SENTENCE LABEL: O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O

ORIGINAL DATA:
     Sentence #           Word  POS Tag
0  Sentence: 1      Thousands  NNS   O
1          NaN             of   IN   O
2          NaN  demonstrators  NNS   O
3          NaN           have  VBP   O
4          NaN        marched  VBN   O
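The original data stores one word per row, with the sentence number filled in only on the first row of each sentence, while the sentence and label files hold one sentence per line. The sketch below shows one way such rows could be regrouped and the named entities read off. It is not necessarily how the course files were produced. The helpers group_rows and named_entities are hypothetical names, the rows are the first seven words of the sentence shown above, and the POS column is left out because it plays no role here.

```python
# Rows in the layout of the original data: (sentence marker, word, tag).
# The marker is None (NaN in the DataFrame) except on a sentence's first row.
rows = [
    ("Sentence: 1", "Thousands", "O"),
    (None, "of", "O"),
    (None, "demonstrators", "O"),
    (None, "have", "O"),
    (None, "marched", "O"),
    (None, "through", "O"),
    (None, "London", "B-geo"),
]

def group_rows(rows):
    """Regroup word-per-row data into (sentence, labels) string pairs."""
    grouped = []
    for marker, word, tag in rows:
        if marker is not None or not grouped:
            grouped.append(([], []))  # a new sentence starts here
        words, tags = grouped[-1]
        words.append(word)
        tags.append(tag)
    return [(" ".join(w), " ".join(t)) for w, t in grouped]

def named_entities(sentence, labels):
    """Pair each word with its tag and keep only the words not tagged O."""
    words, tags = sentence.split(), labels.split()
    assert len(words) == len(tags), "each word needs exactly one tag"
    return [(w, t) for w, t in zip(words, tags) if t != "O"]

for sentence, labels in group_rows(rows):
    print("SENTENCE:", sentence)
    print("SENTENCE LABEL:", labels)
    print("ENTITIES:", named_entities(sentence, labels))
```

Because each word has exactly one label, splitting both lines on whitespace keeps words and tags aligned by position, which is what the length check guards.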