Assignment 4: Question duplicates

Welcome to the fourth assignment of course 3. In this assignment you will explore Siamese networks applied to natural language processing. You will further explore the fundamentals of Trax and you will be able to implement a more complicated structure using it. By completing this assignment, you will learn how to implement models with different architectures.

Outline

Overview
Part 1: Importing the Data
Part 2: Defining the Siamese model
- 2.1 Understanding Siamese Network
  - Exercise 02
- 2.2 Hard Negative Mining
  - Exercise 03
Part 3: Training
- 3.1 Training the model
  - Exercise 04
Part 4: Evaluation
- 4.1 Evaluating your siamese network
- 4.2 Classify
  - Exercise 05
Part 5: Testing with your own questions
- Exercise 06
On Siamese networks

Overview

In this assignment, concretely you will:

Learn about Siamese networks
Understand how the triplet loss works
Understand how to evaluate accuracy
Use cosine similarity between the model's outputted vectors
Use the data generator to get batches of questions
Predict using your own model

By now, you are familiar with Trax and know how to use classes to define your model. We will start this homework by asking you to preprocess the data the same way you did in the previous assignments. After processing the data, you will build a classifier that identifies whether two questions are the same or not.

You will first process the data and then pad it, much as you did in the previous assignment. Your model will take in the two question embeddings, run them through an LSTM, and then compare the outputs of the two subnetworks using cosine similarity. Before taking a deep dive into the model, start by importing the data set.
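To make the comparison step concrete, here is a minimal sketch of cosine similarity in plain NumPy. It is not the assignment's implementation: the function name `cosine_similarity` and the three vectors are made up for illustration and merely stand in for the outputs of the two subnetworks.

```python
import numpy as np


def cosine_similarity(v1, v2):
    """Return the cosine of the angle between two 1-D vectors."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom == 0.0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(v1, v2) / denom)


# Hypothetical stand-ins for the two subnetwork outputs (not real model output).
q1_vec = np.array([1.0, 2.0, 3.0])
q2_vec = np.array([2.0, 4.0, 6.0])    # points in the same direction as q1_vec
q3_vec = np.array([-3.0, 0.0, 1.0])   # orthogonal to q1_vec

print(cosine_similarity(q1_vec, q2_vec))  # close to 1: same direction
print(cosine_similarity(q1_vec, q3_vec))  # close to 0: orthogonal
```

Cosine similarity depends only on the direction of the vectors, not their length, so a value near 1 suggests the two questions were mapped to similar representations.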

Part 1: Importing the Data

1.1 Loading in the data

You will be using the Quora Question Pairs dataset to build a model that can identify similar questions. This is a useful task because you don't want several versions of the same question posted. When teaching, I often end up answering similar questions on Piazza or on other community forums. This data set has been labeled for you. Run the cell below to import some of the packages you will be using.

import os
import nltk
import trax
from trax import layers as tl
from trax.supervised import training
from trax.fastmath import numpy as fastnp
import numpy as np
import pandas as pd
import random as rnd

# set random seeds
trax.supervised.trainer_lib.init_random_number_generators(34)
rnd.seed(34)

Notice that for this assignment Trax's numpy is referred to as fastnp, while regular numpy is referred to as np.

You will now load the data set. We have done some preprocessing for you. If you have taken the Deep Learning Specialization, note that the training method used here is slightly different from the one you saw there. If you have not, don't worry; we will explain everything.

# load the question pairs into a DataFrame
data = pd.read_csv("questions.csv")
N = len(data)
print('Number of question pairs: ', N)
# preview the first few rows
data.head()

Number of question pairs:  404351
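Since the data set is labeled, a quick look at the label balance is a reasonable first check. The sketch below uses a tiny made-up DataFrame rather than questions.csv, and the column names `question1`, `question2`, and `is_duplicate` are assumptions about the file's layout, so adjust them to whatever `data.head()` actually shows.

```python
import pandas as pd

# Toy stand-in for questions.csv. The rows are invented and the column
# names (question1, question2, is_duplicate) are assumed, not confirmed.
toy = pd.DataFrame({
    "question1": [
        "How do I learn Python?",
        "What is an LSTM?",
        "How old is the Sun?",
    ],
    "question2": [
        "What is the best way to learn Python?",
        "How do I bake bread?",
        "What is the age of the Sun?",
    ],
    "is_duplicate": [1, 0, 1],
})

# Count how many pairs carry each label.
label_counts = toy["is_duplicate"].value_counts()

# Keep only the pairs labeled as duplicates.
duplicates = toy[toy["is_duplicate"] == 1]

print(label_counts)
print(len(duplicates), "of", len(toy), "pairs are labeled as duplicates")
```

On the real data, the same two lines applied to `data` would show how many of the pairs are duplicates, provided the label column has that name.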