Assignment 1: Logistic Regression

Welcome to week one of this specialization. This week covers logistic regression: you will implement it for sentiment analysis on tweets, deciding whether a given tweet has a positive or a negative sentiment. Specifically, you will:

Learn how to extract features for logistic regression given some text
Implement logistic regression from scratch
Apply logistic regression on a natural language processing task
Test your logistic regression model
Perform error analysis

We will be using a data set of tweets. Hopefully you will get more than 99% accuracy.
Run the cell below to load in the packages.

Import functions and data

# run this cell to import nltk
import nltk
from os import getcwd

Imported functions

Download the data needed for this assignment. Check out the documentation for the twitter_samples dataset.

twitter_samples: if you're running this notebook on your local computer, you will need to download it using:

nltk.download('twitter_samples')

stopwords: if you're running this notebook on your local computer, you will need to download it using:

nltk.download('stopwords')

Import some helper functions that we provided in the utils.py file:

process_tweet(): cleans the text, tokenizes it into separate words, removes stopwords, and converts words to stems.
build_freqs(): counts how often each word in the 'corpus' (the entire set of tweets) is associated with a positive label '1' or a negative label '0', then builds the freqs dictionary, where each key is a (word, label) tuple and each value is the number of times that pair occurs in the corpus of tweets.
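
To make the shape of the freqs dictionary concrete, here is a minimal sketch of the counting idea. It is not the code in utils.py: the name build_freqs_sketch is hypothetical, tweets are split on whitespace instead of being passed through process_tweet(), and labels are assumed to be plain integers.

```python
from collections import defaultdict


def build_freqs_sketch(tweets, labels, tokenize=str.split):
    """Count (word, label) pairs.

    Hypothetical stand-in for utils.build_freqs: it uses lowercasing and
    whitespace splitting (an assumption) rather than process_tweet().
    """
    freqs = defaultdict(int)
    for tweet, label in zip(tweets, labels):
        for word in tokenize(tweet.lower()):
            # Each key is a (word, label) tuple; the value is its count.
            freqs[(word, int(label))] += 1
    return dict(freqs)


# Toy corpus: one positive tweet (label 1) and one negative tweet (label 0).
tweets = ["happy happy day", "sad day"]
labels = [1, 0]
freqs = build_freqs_sketch(tweets, labels)
print(freqs)
# {('happy', 1): 2, ('day', 1): 1, ('sad', 0): 1, ('day', 0): 1}
```

Note that the same word can appear under both labels ('day' above), which is why the key must include the label and not just the word.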

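# numerical and tabular tools
import numpy as np
import pandas as pd
# corpus of labeled positive and negative tweets
from nltk.corpus import twitter_samples

# helper functions provided with the assignment
from utils import process_tweet, build_freqs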