Introduction to NLP with Python - Bag of Words

Prerequisites

Understanding this project requires familiarity with the following:

- Python
- Pandas
- Scikit-learn

NLP - Natural Language Processing

Natural Language Processing (NLP) is the field of study concerned with how computers understand human, or "natural", language. Its goal is to build machines that can understand and respond to text or speech, or generate text or speech that sounds human. Applications range from voice assistants like Siri and Alexa to the auto-complete feature in Google's search engine.

Components of NLP

Natural Language Understanding (NLU)

Understanding statements said or written by humans. This is difficult because a word can have several meanings (lexical ambiguity), a sentence can be parsed in more than one way (syntactic ambiguity), and the same person or place can be referred to in different ways from one sentence to the next (referential ambiguity).

Natural Language Generation (NLG)

Generating sentences that sound human-made. This involves creating a knowledge base for the machine to draw from, choosing the proper words to form the idea of the sentence and finally using proper sentence structure for readability.

Bag Of Words Method

The method explored in this project is the "Bag-Of-Words" (BOW) representation of text data. It converts text into fixed-length vectors by counting how many times each word appears in a given piece of text, whether that is a document or a single sentence. While not suitable for very complex processing tasks, BOW is still used because of its simplicity: it serves as a baseline that gives an idea of performance before more powerful methods are tried.
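
To make the counting concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization and lowercasing, which are simplifications; the project itself uses NLTK and Scikit-learn later on. The names `build_vocabulary` and `vectorize` are illustrative and not part of any library.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every distinct word across the documents to a fixed column index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def vectorize(document, vocabulary):
    """Return a fixed-length list of word counts, one slot per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

documents = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(documents)
vectors = [vectorize(doc, vocabulary) for doc in documents]

print(vocabulary)  # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Both vectors have six entries because the vocabulary has six words, regardless of how long each sentence is.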

Implementation & Outline

1. Install and import the necessary libraries
2. Download and explore the dataset to be used in this project
3. Text preprocessing
4. Perform vectorization
5. Cosine similarity
6. Create Recommender
7. Summary
8. References

1. Install/Import Libraries

!pip install nltk pandas numpy opendatasets scikit-learn --quiet
import nltk
import pandas as pd
import numpy as np
import opendatasets as od
import os

2. Download Dataset From Kaggle And Analyze

od.download('https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata/download?datasetVersionNumber=2')
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: srinathnanduri97
Your Kaggle Key: ········
Downloading tmdb-movie-metadata.zip to ./tmdb-movie-metadata
100%|██████████| 8.89M/8.89M [00:00<00:00, 64.3MB/s]

Convert the datasets from .csv files to pandas DataFrames.

os.listdir('tmdb-movie-metadata')
['tmdb_5000_credits.csv', 'tmdb_5000_movies.csv']
raw_df_movies = pd.read_csv('tmdb-movie-metadata/tmdb_5000_movies.csv')
raw_df_credits = pd.read_csv('tmdb-movie-metadata/tmdb_5000_credits.csv')
raw_df_movies.head(5)
raw_df_credits.head(5)

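The two datasets share a movie ID and a title, so we can merge them into one DataFrame with 24 columns and 4803 rows. Note that the merge below matches original_title from the movies table against title from the credits table. The info() output further down shows only 4542 non-null movie_id values, so 261 movies receive no credits, most likely because their original title differs from their release title.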

raw_df = pd.merge(raw_df_movies, raw_df_credits, how='left', left_on=['id', 'original_title'], right_on=['movie_id', 'title'])
raw_df.head(5)
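
The merge above pairs `original_title` from the movies table with `title` from the credits table. In a left merge, a row whose two titles differ finds no match and gets NaN in the credits columns. The sketch below, on made-up toy data, uses the `indicator=True` option of `pd.merge` to count unmatched rows and contrasts this with merging on the ID alone. Whether an ID-only merge is appropriate for the real dataset is an assumption to verify, not something the original notebook does.

```python
import pandas as pd

# Toy stand-ins for the movies and credits tables (all values are made up).
movies = pd.DataFrame({
    "id": [1, 2, 3],
    "original_title": ["Film A", "Film B (original)", "Film C"],
})
credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Film A", "Film B", "Film C"],
})

# Merging on ID and title fails to match the second row, because the titles differ.
both_keys = pd.merge(movies, credits, how="left",
                     left_on=["id", "original_title"],
                     right_on=["movie_id", "title"], indicator=True)

# Merging on the ID alone matches every row.
id_only = pd.merge(movies, credits, how="left",
                   left_on="id", right_on="movie_id", indicator=True)

unmatched_both = int((both_keys["_merge"] == "left_only").sum())
unmatched_id = int((id_only["_merge"] == "left_only").sum())
print(unmatched_both, unmatched_id)  # 1 0
```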

Get an overview of the columns in the DataFrame

raw_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   budget                4803 non-null   int64
 1   genres                4803 non-null   object
 2   homepage              1712 non-null   object
 3   id                    4803 non-null   int64
 4   keywords              4803 non-null   object
 5   original_language     4803 non-null   object
 6   original_title        4803 non-null   object
 7   overview              4800 non-null   object
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object
 10  production_countries  4803 non-null   object
 11  release_date          4802 non-null   object
 12  revenue               4803 non-null   int64
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object
 15  status                4803 non-null   object
 16  tagline               3959 non-null   object
 17  title_x               4803 non-null   object
 18  vote_average          4803 non-null   float64
 19  vote_count            4803 non-null   int64
 20  movie_id              4542 non-null   float64
 21  title_y               4542 non-null   object
 22  cast                  4542 non-null   object
 23  crew                  4542 non-null   object
dtypes: float64(4), int64(4), object(16)
memory usage: 938.1+ KB

Check how many unique values each column contains.

raw_df.nunique()
budget                   436
genres                  1175
homepage                1691
id                      4803
keywords                4222
original_language         37
original_title          4801
overview                4800
popularity              4802
production_companies    3697
production_countries     469
release_date            3280
revenue                 3297
runtime                  156
spoken_languages         544
status                     3
tagline                 3944
title_x                 4800
vote_average              71
vote_count              1609
movie_id                4542
title_y                 4540
cast                    4501
crew                    4515
dtype: int64

Describe the numerical columns in the DataFrame.

round(raw_df.describe().T, 2)

3. Text Preprocessing

a. Tokenization

The first preprocessing step is to tokenize the text. Tokenization extracts "tokens" by splitting the text into a list in which each word and punctuation mark is a separate item, here using the word_tokenize function from the nltk library. Working with tokens rather than full sentences makes further preprocessing possible and makes patterns easier to identify.

from nltk.tokenize import word_tokenize
nltk.download('punkt')
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
True
raw_df.overview[0]
'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'
overview_1_tokenize = word_tokenize(raw_df.overview[0])
overview_1_tokenize
['In',
 'the',
 '22nd',
 'century',
 ',',
 'a',
 'paraplegic',
 'Marine',
 'is',
 'dispatched',
 'to',
 'the',
 'moon',
 'Pandora',
 'on',
 'a',
 'unique',
 'mission',
 ',',
 'but',
 'becomes',
 'torn',
 'between',
 'following',
 'orders',
 'and',
 'protecting',
 'an',
 'alien',
 'civilization',
 '.']
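
As a rough illustration of what a tokenizer does, the sketch below splits text with a single regular expression. It is only an approximation under simple assumptions and is not how `word_tokenize` works internally; for instance, it handles contractions such as "don't" differently. The name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("In the 22nd century, a Marine is dispatched.")
print(tokens)
# ['In', 'the', '22nd', 'century', ',', 'a', 'Marine', 'is', 'dispatched', '.']
```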

b. Stopwords Removal

The second preprocessing step is to remove "stopwords": very common words such as "the", "is" and "and" that carry little meaning on their own. Removing them makes patterns easier to identify without sacrificing the main idea of the sentence. Note that the stopword list is all lowercase and the check below is case-sensitive, so the capitalized 'In' is not removed.

from nltk.corpus import stopwords
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
True
english_stopwords = stopwords.words('english')
len(english_stopwords)
179
", ".join(english_stopwords)
"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mustn't, needn, needn't, shan, shan't, shouldn, shouldn't, wasn, wasn't, weren, weren't, won, won't, wouldn, wouldn't"
def remove_stopwords(tokens):
    return [token for token in tokens if token not in english_stopwords]
overview_1_stopwords = remove_stopwords(overview_1_tokenize)
overview_1_stopwords
['In',
 '22nd',
 'century',
 ',',
 'paraplegic',
 'Marine',
 'dispatched',
 'moon',
 'Pandora',
 'unique',
 'mission',
 ',',
 'becomes',
 'torn',
 'following',
 'orders',
 'protecting',
 'alien',
 'civilization',
 '.']
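
In the output above, 'In' survives because the stopword list is lowercase and the membership test is case-sensitive. A case-insensitive variant could look like the sketch below. It uses a small hand-picked stopword set for illustration (NLTK's English list is much longer), and the name `remove_stopwords_ci` is hypothetical.

```python
# A tiny hand-picked stopword set for illustration only.
STOP_WORDS = {"in", "the", "a", "is", "to", "on", "but", "and", "an", "between"}

def remove_stopwords_ci(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stopword; keep the casing of the rest."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ["In", "the", "22nd", "century", ",", "a", "paraplegic", "Marine"]
filtered = remove_stopwords_ci(tokens)
print(filtered)  # ['22nd', 'century', ',', 'paraplegic', 'Marine']
```

Storing the stopwords in a set rather than a list also makes each membership test faster, which matters when filtering thousands of overviews.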

c. Lemmatization/Stemming

The final preprocessing step is to apply either "Stemming" or "Lemmatization" to the tokens.

Stemming reduces each remaining token to its "stem" or "root" form by stripping suffixes. This simplifies the text further and lets different variations of a word match; for example, "protecting" and "protected" both become "protect". The stem does not have to be a real word, as "centuri" and "uniqu" in the output below show.

Lemmatization serves a similar purpose, but it looks words up in a vocabulary of the language to find their dictionary form (the "lemma") rather than cutting off what the algorithm perceives to be a suffix. NLTK's WordNetLemmatizer treats every word as a noun unless a part of speech is passed, which is why "orders" becomes "order" below while verbs such as "dispatched" and "protecting" are left unchanged.

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language='english', ignore_stopwords=True)
overview_1_stemmed = [stemmer.stem(word) for word in overview_1_stopwords]
overview_1_stemmed
['in',
 '22nd',
 'centuri',
 ',',
 'parapleg',
 'marin',
 'dispatch',
 'moon',
 'pandora',
 'uniqu',
 'mission',
 ',',
 'becom',
 'torn',
 'follow',
 'order',
 'protect',
 'alien',
 'civil',
 '.']
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...
True
lemmatizer = WordNetLemmatizer()
overview_1_lemmatized = [lemmatizer.lemmatize(word) for word in overview_1_stopwords]
overview_1_lemmatized
['In',
 '22nd',
 'century',
 ',',
 'paraplegic',
 'Marine',
 'dispatched',
 'moon',
 'Pandora',
 'unique',
 'mission',
 ',',
 'becomes',
 'torn',
 'following',
 'order',
 'protecting',
 'alien',
 'civilization',
 '.']
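
The difference between the two approaches can be shown with a deliberately crude sketch. The functions below are toy stand-ins written for this illustration: `toy_stem` is not the Snowball algorithm, and `TOY_LEMMAS` is a hypothetical four-entry lookup table, not WordNet. They only show the contrast between rule-based suffix stripping and dictionary lookup.

```python
def toy_stem(word):
    """Crude suffix stripping: remove the first matching suffix, with no dictionary check."""
    for suffix in ("ing", "ed", "es", "s"):
        # Require at least 3 characters to remain so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A hypothetical, hand-written lookup table standing in for a real vocabulary.
TOY_LEMMAS = {"becomes": "become", "orders": "order",
              "protecting": "protect", "better": "good"}

def toy_lemmatize(word):
    """Dictionary lookup: return the known base form, or the word unchanged."""
    return TOY_LEMMAS.get(word, word)

words = ["becomes", "orders", "protecting", "better"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['becom', 'order', 'protect', 'better']
print(lemmas)  # ['become', 'order', 'protect', 'good']
```

The stemmer produces the non-word "becom" and cannot relate "better" to "good", while the lookup handles both but only for words it already knows.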

4. Count Vectorization

Vectorization is the process of converting text into a numerical matrix. There are many vectorization methods; here we use "Count Vectorization", which counts the number of times each word appears in each statement.

Before vectorizing, the dataset needs to be prepared so that every relevant column contributes. The code below strips the formatting from each relevant column and converts it into a list of tokens (words or names). The lists are then combined into a single column titled 'tags'.

import ast

Create a new DataFrame with only the necessary columns.

# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
movies_df = raw_df[['movie_id','original_title','overview','genres','keywords','cast','crew']].dropna().copy()
movies_df.isna().sum()
movie_id          0
original_title    0
overview          0
genres            0
keywords          0
cast              0
crew              0
dtype: int64
movies_df.head(5)

Functions to get information out of the columns, format the columns and to tokenize the statements.

def get_cast(text):
    """Get a list of the first 5 cast members"""
    counter = 0
    cast= []
    for i in ast.literal_eval(text):
        if counter != 5:
            cast.append(i['name'])
            counter += 1
        else:
            break
    return cast

def get_director(text):
    """Get the name of the director of the movie"""
    director = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            director.append(i['name'])
            break
    return director

def get_tags(text):
    """get a list of all of the tags in the keywords column"""
    tags = []
    for i in ast.literal_eval(text):
        tags.append(i['name'])
    return tags

def remove_spaces(text):
    """Remove spaces in list items"""
    return [i.replace(" ", "") for i in text]

def split_text(text):
    """Split strings into a list using empty space as a separator"""
    return text.split()

def tokenize(text):
    """Perform the tokenization and stemming of a sentence"""
    tokenized = [stemmer.stem(word) for word in word_tokenize(text)]
    punc = [',', '.', '?', "'"]
    return [i for i in tokenized if i not in punc]
movies_df['genres'] = movies_df['genres'].apply(get_tags)
movies_df['keywords'] = movies_df['keywords'].apply(get_tags)
movies_df['cast'] = movies_df['cast'].apply(get_cast)
movies_df['crew'] = movies_df['crew'].apply(get_director)
movies_df['overview'] = movies_df['overview'].apply(split_text)
movies_df['cast'] = movies_df['cast'].apply(remove_spaces)
movies_df['crew'] = movies_df['crew'].apply(remove_spaces)
movies_df['genres'] = movies_df['genres'].apply(remove_spaces)
movies_df['keywords'] = movies_df['keywords'].apply(remove_spaces)
movies_df.head(5)
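
To see what the helper functions do to a single cell, the sketch below walks one made-up value through the same steps. The string is only an illustration in the shape of a 'genres' cell (a string holding a list of dictionaries); it is not copied from the dataset.

```python
import ast

# A made-up value shaped like one cell of the 'genres' column.
raw_cell = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'

parsed = ast.literal_eval(raw_cell)                  # string -> list of dicts
names = [item["name"] for item in parsed]            # keep only the 'name' field
joined = [name.replace(" ", "") for name in names]   # 'Science Fiction' -> 'ScienceFiction'

print(names)   # ['Action', 'Science Fiction']
print(joined)  # ['Action', 'ScienceFiction']
```

Removing the spaces turns a multi-word name into a single token, so that, for example, an actor's first name is not counted as a separate word shared with unrelated people.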

Create a new column named 'tags' with all of the information from the necessary columns.

movies_df['tags'] = movies_df['overview'] + movies_df['genres'] + movies_df['keywords'] + movies_df['cast'] + movies_df['crew']
movies_df['tags'][0]
['In',
 'the',
 '22nd',
 'century,',
 'a',
 'paraplegic',
 'Marine',
 'is',
 'dispatched',
 'to',
 'the',
 'moon',
 'Pandora',
 'on',
 'a',
 'unique',
 'mission,',
 'but',
 'becomes',
 'torn',
 'between',
 'following',
 'orders',
 'and',
 'protecting',
 'an',
 'alien',
 'civilization.',
 'Action',
 'Adventure',
 'Fantasy',
 'ScienceFiction',
 'cultureclash',
 'future',
 'spacewar',
 'spacecolony',
 'society',
 'spacetravel',
 'futuristic',
 'romance',
 'space',
 'alien',
 'tribe',
 'alienplanet',
 'cgi',
 'marine',
 'soldier',
 'battle',
 'loveaffair',
 'antiwar',
 'powerrelations',
 'mindandsoul',
 '3d',
 'SamWorthington',
 'ZoeSaldana',
 'SigourneyWeaver',
 'StephenLang',
 'MichelleRodriguez',
 'JamesCameron']

Create a final DataFrame with 3 columns, the movie ID, the title and the tags

# Copy the selected columns so the assignments below act on an independent DataFrame
final_df = movies_df[['movie_id', 'original_title', 'tags']].copy()
# Join each list of tags into one lowercase string
final_df['tags'] = final_df['tags'].apply(lambda x: " ".join(x).lower())
final_df['movie_id'] = final_df['movie_id'].astype(np.int32)
final_df.head(5)

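To perform count vectorization, the CountVectorizer class is imported from Scikit-learn. The vectorizer is initialized with the tokenize function from the previous section, with stop_words set to 'english', and with max_features=1000 so that only the 1000 most frequent tokens are kept. Because tokenize stems every token while the built-in stop word list is unstemmed, Scikit-learn emits a warning that the two are inconsistent; the warning does not stop the fit.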

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(tokenizer=tokenize,
                        stop_words='english',
                        max_features=1000)
cv_matrix = cv.fit_transform(final_df['tags']).toarray()
cv_matrix.shape
/opt/conda/lib/python3.9/site-packages/sklearn/feature_extraction/text.py:396: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['afterward', 'alon', 'alreadi', 'alway', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becom', 'besid', 'cri', 'describ', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'otherwis', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev'] not in stop_words. warnings.warn(
(4539, 1000)
cv_matrix
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

5. Cosine Similarity

Cosine similarity measures how similar two vectors are by the cosine of the angle between them: the dot product of the vectors divided by the product of their lengths. In this case, each movie's tags have been converted into a row of 1000 word counts. Because counts are never negative, the score ranges from 0 (no shared words) to 1 (identical word proportions). Comparing two rows therefore measures how similar two movies are, which is the basis for recommending one movie given another.

from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(cv_matrix)
similarity
array([[1.        , 0.1467348 , 0.12792043, ..., 0.09988907, 0.04264014,
        0.        ],
       [0.1467348 , 1.        , 0.09176629, ..., 0.03582872, 0.        ,
        0.        ],
       [0.12792043, 0.09176629, 1.        , ..., 0.0624695 , 0.        ,
        0.        ],
       ...,
       [0.09988907, 0.03582872, 0.0624695 , ..., 1.        , 0.0624695 ,
        0.02567481],
       [0.04264014, 0.        , 0.        , ..., 0.0624695 , 1.        ,
        0.0328798 ],
       [0.        , 0.        , 0.        , ..., 0.02567481, 0.0328798 ,
        1.        ]])
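
For a single pair of vectors, the calculation can be written out by hand. The sketch below is a plain-Python illustration on made-up count vectors, not Scikit-learn's implementation; returning 0.0 for an all-zero vector is a convention chosen here to avoid dividing by zero.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for an all-zero vector
    return dot / (norm_u * norm_v)

movie_a = [1, 1, 0, 2]
movie_b = [2, 2, 0, 4]   # same direction as movie_a, with twice the counts
movie_c = [0, 0, 3, 0]   # shares no words with movie_a

print(cosine(movie_a, movie_b))  # approximately 1.0
print(cosine(movie_a, movie_c))  # 0.0
```

The first pair scores about 1 even though the counts differ, because cosine similarity depends on the proportions of the words and not on the length of the text.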

6. Build Recommender

The recommender takes the title of a movie entered by the user, looks up that movie's row in the similarity matrix, and returns the 5 most similar movies according to the cosine similarity computed earlier.

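def recommender(movie):
    """Return the titles of the 5 movies most similar to `movie`."""
    # Use the row position, not the index label: labels have gaps after dropna()
    matches = np.flatnonzero((final_df['original_title'] == movie).to_numpy())
    if len(matches) == 0:
        raise ValueError(f"Movie not found: {movie!r}")
    movie_pos = matches[0]
    scores = similarity[movie_pos]
    # Sort by similarity, highest first; the top hit is normally the movie itself, so skip it
    movies_list = sorted(enumerate(scores), reverse=True, key=lambda x: x[1])[1:6]
    return [final_df.iloc[i].original_title for i, _ in movies_list]
recommender('Avatar')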
['Independence Day',
 'Beowulf',
 'Aliens vs Predator: Requiem',
 'Jupiter Ascending',
 'Small Soldiers']
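
Two details are easy to get wrong in a recommender like this. The row used to index the similarity matrix must be a position, not a DataFrame label (labels can have gaps after dropna()), and the query movie is better excluded explicitly than by assuming it always sorts first. The sketch below shows both on made-up data; the titles, scores and the name `recommend` are hypothetical.

```python
titles = ["A", "B", "C", "D"]  # hypothetical movie titles, in matrix row order

# A made-up symmetric similarity matrix; row i holds the scores for titles[i].
similarity_toy = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]

def recommend(title, k=2):
    """Return the k titles most similar to `title`, excluding the title itself."""
    if title not in titles:
        raise ValueError(f"Unknown title: {title!r}")
    row = titles.index(title)  # position in the matrix, not a DataFrame label
    scored = [(score, i) for i, score in enumerate(similarity_toy[row]) if i != row]
    scored.sort(reverse=True)  # highest similarity first
    return [titles[i] for _, i in scored[:k]]

print(recommend("A"))       # ['C', 'D']
print(recommend("B", k=1))  # ['D']
```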

Advantages and Disadvantages

Advantages:
  • Often accurate on tasks where the frequency or occurrence of words is a predictive feature
  • Easy and quick to implement
  • Helpful when working on a small set of domain-specific documents (e.g. sentiment analysis of political news data from Twitter)
Disadvantages:
  • Scales poorly to large documents: a large vocabulary (a wide variety of words) makes the vectors expensive to compute and hard to tell apart
  • Unsuitable when the sequence of words matters (e.g. text generation)
  • Has trouble capturing meaning: sentence structure is ignored, so sentences with similar words but different meanings get similar vector representations
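
The last disadvantage is easy to demonstrate. In the sketch below, two made-up sentences with opposite meanings produce exactly the same bag of words, because only the counts are kept; the helper `bag` is illustrative and assumes whitespace tokenization.

```python
from collections import Counter

def bag(sentence):
    """Word counts for a sentence, ignoring word order and case."""
    return Counter(sentence.lower().split())

first = bag("the dog bit the man")
second = bag("the man bit the dog")

print(first == second)  # True: word order is lost, so the two bags are identical
```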

Summary & Conclusion

In summary, this was an introduction to Natural Language Processing that used the TMDB 5000 Movie Dataset from Kaggle.com to build a movie recommender. We used the overview, genres, keywords, cast and director columns as tags to measure the similarity between every pair of movies in the dataset and to output the 5 movies most similar to the one entered.

The outline we followed was:

1. Install and import the necessary libraries
2. Download and explore the dataset to be used in this project
3. Text preprocessing
4. Perform vectorization
5. Cosine similarity
6. Create Recommender

References

nsrinath97
Srinath Nanduri, 2 months ago