Introduction to NLP with Python - Bag of Words
Prerequisites
Understanding this project requires familiarity with the following:
- Python
- Pandas
- Scikit-learn
NLP - Natural Language Processing
Natural Language Processing (NLP) is a field of study focused on enabling computers to understand human, or "natural", language. The goal is to build machines capable of understanding, responding to, and generating text or speech that sounds human. Applications range from voice assistants like Siri and Alexa to the auto-complete feature in Google's search engine.
Components of NLP
Natural Language Understanding (NLU)
Understanding statements said or written by humans. This is difficult because a word can have many meanings (lexical ambiguity), a sentence can be parsed in two different ways (syntactic ambiguity), and the same person or place can be referred to in different ways across sentences (referential ambiguity).
Natural Language Generation (NLG)
Generating sentences that sound human-made. This involves building a knowledge base for the machine to draw from, choosing the proper words to express the idea of the sentence, and finally applying proper sentence structure for readability.
Bag Of Words Method
The method we will be exploring in this project is the "Bag-of-Words" (BOW) representation of text data. This method converts text into fixed-length vectors by counting how many times each word appears in a given section of text, whether a document or a sentence. While not suitable for very complex processing tasks, BOW is still widely used because of its simplicity, and it serves as a benchmark to gauge performance before moving to more powerful methods.
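As a minimal sketch of the idea (toy sentences of our own, not from the dataset used below), BOW can be implemented by hand in a few lines:
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Shared vocabulary across all documents, in a fixed order
vocab = sorted(set(" ".join(docs).split()))

# Each document becomes a fixed-length vector of word counts
vectors = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]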
Implementation & Outline
1. Install and import the necessary libraries
2. Download and explore the dataset to be used in this project
3. Text preprocessing
4. Perform vectorization
5. Cosine similarity
6. Create Recommender
7. Summary
8. References
1. Install/Import Libraries
!pip install nltk pandas numpy opendatasets scikit-learn --quiet
import nltk
import pandas as pd
import numpy as np
import opendatasets as od
import os
2. Download Dataset From Kaggle And Analyze
od.download('https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata/download?datasetVersionNumber=2')
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: srinathnanduri97
Your Kaggle Key: ········
Downloading tmdb-movie-metadata.zip to ./tmdb-movie-metadata
100%|██████████| 8.89M/8.89M [00:00<00:00, 64.3MB/s]
Convert the datasets from .csv files to pandas DataFrames.
os.listdir('tmdb-movie-metadata')
['tmdb_5000_credits.csv', 'tmdb_5000_movies.csv']
raw_df_movies = pd.read_csv('tmdb-movie-metadata/tmdb_5000_movies.csv')
raw_df_credits = pd.read_csv('tmdb-movie-metadata/tmdb_5000_credits.csv')
raw_df_movies.head(5)
raw_df_credits.head(5)
These two datasets have two columns in common, therefore we can merge them to create one DataFrame with 24 columns and 4803 rows.
raw_df = pd.merge(raw_df_movies, raw_df_credits, how='left', left_on=['id', 'original_title'], right_on=['movie_id', 'title'])
raw_df.head(5)
Get an overview of the columns in the DataFrame
raw_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 budget 4803 non-null int64
1 genres 4803 non-null object
2 homepage 1712 non-null object
3 id 4803 non-null int64
4 keywords 4803 non-null object
5 original_language 4803 non-null object
6 original_title 4803 non-null object
7 overview 4800 non-null object
8 popularity 4803 non-null float64
9 production_companies 4803 non-null object
10 production_countries 4803 non-null object
11 release_date 4802 non-null object
12 revenue 4803 non-null int64
13 runtime 4801 non-null float64
14 spoken_languages 4803 non-null object
15 status 4803 non-null object
16 tagline 3959 non-null object
17 title_x 4803 non-null object
18 vote_average 4803 non-null float64
19 vote_count 4803 non-null int64
20 movie_id 4542 non-null float64
21 title_y 4542 non-null object
22 cast 4542 non-null object
23 crew 4542 non-null object
dtypes: float64(4), int64(4), object(16)
memory usage: 938.1+ KB
Check how many unique values each column contains.
raw_df.nunique()
budget 436
genres 1175
homepage 1691
id 4803
keywords 4222
original_language 37
original_title 4801
overview 4800
popularity 4802
production_companies 3697
production_countries 469
release_date 3280
revenue 3297
runtime 156
spoken_languages 544
status 3
tagline 3944
title_x 4800
vote_average 71
vote_count 1609
movie_id 4542
title_y 4540
cast 4501
crew 4515
dtype: int64
Describe the numerical columns in the DataFrame.
round(raw_df.describe().T, 2)
3. Text Preprocessing
a. Tokenization
The first step of preprocessing is to tokenize the statement to be processed. Tokenization extracts "tokens" from the statement by separating each word and piece of punctuation into its own list item, here using the word_tokenize function from the nltk library. Breaking the statement down this way enables further preprocessing and makes patterns easier to identify than in full sentences.
from nltk.tokenize import word_tokenize
nltk.download('punkt')
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
True
raw_df.overview[0]
'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'
overview_1_tokenize = word_tokenize(raw_df.overview[0])
overview_1_tokenize
['In',
'the',
'22nd',
'century',
',',
'a',
'paraplegic',
'Marine',
'is',
'dispatched',
'to',
'the',
'moon',
'Pandora',
'on',
'a',
'unique',
'mission',
',',
'but',
'becomes',
'torn',
'between',
'following',
'orders',
'and',
'protecting',
'an',
'alien',
'civilization',
'.']
b. Stopwords Removal
The second step of preprocessing is to remove "stopwords": common function words (articles, pronouns, prepositions, auxiliaries) that add very little to the meaning of the text. Removing them further improves pattern identification without sacrificing the overall idea of the sentence or statement.
from nltk.corpus import stopwords
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
True
english_stopwords = stopwords.words('english')
len(english_stopwords)
179
", ".join(english_stopwords)
"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mustn't, needn, needn't, shan, shan't, shouldn, shouldn't, wasn, wasn't, weren, weren't, won, won't, wouldn, wouldn't"
def remove_stopwords(tokens):
    """Drop stopwords; note the comparison is case-sensitive, so capitalized words like 'In' survive."""
    return [token for token in tokens if token not in english_stopwords]
overview_1_stopwords = remove_stopwords(overview_1_tokenize)
overview_1_stopwords
['In',
'22nd',
'century',
',',
'paraplegic',
'Marine',
'dispatched',
'moon',
'Pandora',
'unique',
'mission',
',',
'becomes',
'torn',
'following',
'orders',
'protecting',
'alien',
'civilization',
'.']
c. Lemmatization/Stemming
The final preprocessing step is to apply either "lemmatization" or "stemming" to the statement.
Stemming reduces the remaining tokens to their "stem" or "root" form. This simplifies the statement further and allows different variations of a word to be matched; for example, "protecting" and "protected" both become "protect".
Lemmatization performs a similar function; however, it uses the language's full vocabulary to represent the root of a word more accurately, rather than simply cutting off what the algorithm perceives to be a suffix. Note that WordNetLemmatizer treats every word as a noun unless a part of speech is supplied, which is why verbs such as "becomes" and "protecting" are left unchanged in the output below.
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language='english', ignore_stopwords=True)
overview_1_stemmed = [stemmer.stem(word) for word in overview_1_stopwords]
overview_1_stemmed
['in',
'22nd',
'centuri',
',',
'parapleg',
'marin',
'dispatch',
'moon',
'pandora',
'uniqu',
'mission',
',',
'becom',
'torn',
'follow',
'order',
'protect',
'alien',
'civil',
'.']
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...
True
lemmatizer = WordNetLemmatizer()
overview_1_lemmatized = [lemmatizer.lemmatize(word) for word in overview_1_stopwords]
overview_1_lemmatized
['In',
'22nd',
'century',
',',
'paraplegic',
'Marine',
'dispatched',
'moon',
'Pandora',
'unique',
'mission',
',',
'becomes',
'torn',
'following',
'order',
'protecting',
'alien',
'civilization',
'.']
4. Count Vectorization
Vectorization is the process of converting text into a numerical matrix. There are many vectorization methods, but in this case we will use "count vectorization", which simply counts the number of times each word appears in each statement.
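Before we prepare the movie data, here is a small sketch of what CountVectorizer does on two toy strings of our own (get_feature_names_out assumes scikit-learn 1.0+; older versions expose get_feature_names instead):
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["space marine protects alien tribe",
            "marine follows orders in space"]
toy_cv = CountVectorizer()
toy_matrix = toy_cv.fit_transform(toy_docs)
print(toy_cv.get_feature_names_out())  # ['alien' 'follows' 'in' 'marine' 'orders' 'protects' 'space' 'tribe']
print(toy_matrix.toarray())            # [[1 0 0 1 0 1 1 1]
                                       #  [0 1 1 1 1 0 1 0]]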
Before vectorization is performed, the dataset needs to be prepared using all of the other relevant columns. The code below essentially tokenizes every relevant column by removing its formatting and converting it into a list of tokens/words/names; the results are then combined into a single column titled 'tags'.
import ast
Create a new DataFrame with only the necessary columns.
movies_df = raw_df[['movie_id','original_title','overview','genres','keywords','cast','crew']].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
movies_df.dropna(inplace=True)
movies_df.isna().sum()
movie_id 0
original_title 0
overview 0
genres 0
keywords 0
cast 0
crew 0
dtype: int64
movies_df.head(5)
Functions to get information out of the columns, format the columns and to tokenize the statements.
def get_cast(text):
    """Get a list of the first 5 cast members"""
    counter = 0
    cast = []
    for i in ast.literal_eval(text):
        if counter != 5:
            cast.append(i['name'])
            counter += 1
        else:
            break
    return cast

def get_director(text):
    """Get the name of the director of the movie"""
    director = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            director.append(i['name'])
            break
    return director

def get_tags(text):
    """Get a list of all of the tags in the keywords column"""
    tags = []
    for i in ast.literal_eval(text):
        tags.append(i['name'])
    return tags

def remove_spaces(text):
    """Remove spaces in list items"""
    return [i.replace(" ", "") for i in text]

def split_text(text):
    """Split strings into a list using whitespace as the separator"""
    return text.split()

def tokenize(text):
    """Tokenize and stem a sentence, dropping basic punctuation"""
    tokenized = [stemmer.stem(word) for word in word_tokenize(text)]
    punc = [',', '.', '?', "'"]
    return [i for i in tokenized if i not in punc]
movies_df['genres'] = movies_df['genres'].apply(get_tags)
movies_df['keywords'] = movies_df['keywords'].apply(get_tags)
movies_df['cast'] = movies_df['cast'].apply(get_cast)
movies_df['crew'] = movies_df['crew'].apply(get_director)
movies_df['overview'] = movies_df['overview'].apply(split_text)
movies_df['cast'] = movies_df['cast'].apply(remove_spaces)
movies_df['crew'] = movies_df['crew'].apply(remove_spaces)
movies_df['genres'] = movies_df['genres'].apply(remove_spaces)
movies_df['keywords'] = movies_df['keywords'].apply(remove_spaces)
movies_df.head(5)
Create a new column named 'tags' with all of the information from the necessary columns.
movies_df['tags'] = movies_df['overview'] + movies_df['genres'] + movies_df['keywords'] + movies_df['cast'] + movies_df['crew']
movies_df['tags'][0]
['In',
'the',
'22nd',
'century,',
'a',
'paraplegic',
'Marine',
'is',
'dispatched',
'to',
'the',
'moon',
'Pandora',
'on',
'a',
'unique',
'mission,',
'but',
'becomes',
'torn',
'between',
'following',
'orders',
'and',
'protecting',
'an',
'alien',
'civilization.',
'Action',
'Adventure',
'Fantasy',
'ScienceFiction',
'cultureclash',
'future',
'spacewar',
'spacecolony',
'society',
'spacetravel',
'futuristic',
'romance',
'space',
'alien',
'tribe',
'alienplanet',
'cgi',
'marine',
'soldier',
'battle',
'loveaffair',
'antiwar',
'powerrelations',
'mindandsoul',
'3d',
'SamWorthington',
'ZoeSaldana',
'SigourneyWeaver',
'StephenLang',
'MichelleRodriguez',
'JamesCameron']
Create a final DataFrame with three columns: the movie ID, the title, and the tags.
# reset_index(drop=True) returns a fresh frame (no SettingWithCopyWarning) whose row
# labels line up with the row positions of the similarity matrix built below
final_df = movies_df[['movie_id', 'original_title', 'tags']].reset_index(drop=True)
final_df['tags'] = final_df['tags'].apply(lambda x: " ".join(x))
final_df['tags'] = final_df['tags'].apply(lambda x: x.lower())
final_df['movie_id'] = final_df['movie_id'].astype(np.int32)
final_df.head(5)
To perform count vectorization, the CountVectorizer class is imported from Scikit-learn. The vectorizer is initialized with the tokenize function from the previous section, the stop_words argument set to 'english', and the maximum number of features restricted to 1000.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(tokenizer=tokenize,
                     stop_words='english',
                     max_features=1000)
cv_matrix = cv.fit_transform(final_df['tags']).toarray()
cv_matrix.shape
/opt/conda/lib/python3.9/site-packages/sklearn/feature_extraction/text.py:396: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['afterward', 'alon', 'alreadi', 'alway', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becom', 'besid', 'cri', 'describ', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'otherwis', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev'] not in stop_words.
warnings.warn(
(4539, 1000)
cv_matrix
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 1, 0, 0],
...,
[0, 0, 1, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
5. Cosine Similarity
Cosine similarity is a measure of similarity between two sequences of numbers: the cosine of the angle between two vectors. Here, each movie's tags have been converted into a 1000-dimensional vector of word counts, so we can measure the similarity between two sets of tags, and therefore how similar two movies are and whether to recommend one based on the other.
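Under the hood this is just the dot product of two vectors divided by the product of their magnitudes. A minimal NumPy sketch with made-up vectors (Scikit-learn's cosine_similarity, used below, computes this for every pair of rows at once):
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1, 2, 0, 1])
b = np.array([2, 1, 1, 0])
print(cosine_sim(a, b))  # 0.666...; identical vectors score 1.0, vectors sharing no words score 0.0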
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(cv_matrix)
similarity
array([[1. , 0.1467348 , 0.12792043, ..., 0.09988907, 0.04264014,
0. ],
[0.1467348 , 1. , 0.09176629, ..., 0.03582872, 0. ,
0. ],
[0.12792043, 0.09176629, 1. , ..., 0.0624695 , 0. ,
0. ],
...,
[0.09988907, 0.03582872, 0.0624695 , ..., 1. , 0.0624695 ,
0.02567481],
[0.04264014, 0. , 0. , ..., 0.0624695 , 1. ,
0.0328798 ],
[0. , 0. , 0. , ..., 0.02567481, 0.0328798 ,
1. ]])
6. Build Recommender
The recommender takes the name of a movie entered by the user, compares its similarity with every movie in the DataFrame, and outputs the 5 most similar movies based on the cosine similarity matrix computed above.
def recommender(movie):
    """Return the 5 movies most similar to the given title."""
    movie_index = final_df[final_df['original_title'] == movie].index[0]
    distances = similarity[movie_index]
    # Sort all movies by similarity, skip position 0 (the movie itself), keep the next 5
    movies_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]
    return [final_df.iloc[i[0]].original_title for i in movies_list]
recommender('Avatar')
['Independence Day',
'Beowulf',
'Aliens vs Predator: Requiem',
'Jupiter Ascending',
'Small Soldiers']
Advantages and Disadvantages
Advantages:
- Often performs with a high level of accuracy on tasks where the frequency or occurrence of words is a predictive feature
- Easy and quick to implement
- Helpful when working with a small set of domain-specific documents (e.g. sentiment analysis of political news data from Twitter)
Disadvantages:
- Doesn't work well for large document collections: when the vocabulary is large (a wide variety of words used), the vectors become high-dimensional and sparse, causing problems with computation and with differentiating between vectors
- Unsuitable when the sequence of words matters (e.g. text generation)
- Has trouble capturing the meaning of text, since sentence structure is not taken into account; sentences with similar words but a different meaning receive similar, or even identical, vector representations, as the example below shows
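As a quick toy illustration of that last point (sentences of our own, not from the dataset), two sentences with opposite meanings but the same words map to exactly the same vector:
from sklearn.feature_extraction.text import CountVectorizer

pair = ["the dog bit the man", "the man bit the dog"]
vectors = CountVectorizer().fit_transform(pair).toarray()
print(vectors)                           # both rows are [1, 1, 1, 2] over ['bit', 'dog', 'man', 'the']
print((vectors[0] == vectors[1]).all())  # True: word order is discarded, so the vectors are identical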
Summary & Conclusion
In summary, this was an introduction to Natural Language Processing using the TMDB 5000 Movie Dataset from Kaggle.com to build a movie recommender. We used the overview, genres, keywords, cast, and director columns as tags, measured the similarity between every pair of movies in the database, and output the 5 movies most similar to the one entered into the application.
The outline we followed was:
1. Install and import the necessary libraries
2. Download and explore the dataset to be used in this project
3. Text preprocessing
4. Perform vectorization
5. Cosine similarity
6. Create Recommender
References
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
- https://www.geeksforgeeks.org/cosine-similarity/
- https://en.wikipedia.org/wiki/Cosine_similarity
- https://jovian.ai/learn/nautral-language-processing-zero-to-nlp/lesson/text-classification-with-bag-of-words#C29
- https://www.analyticsvidhya.com/blog/2021/05/natural-language-processing-step-by-step-guide/
- https://www.tutorialspoint.com/natural_language_processing/natural_language_processing_python.htm#
- https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
- https://www.kaggle.com/code/abdoashref/nlp-movie-recommender-system
- https://www.kaggle.com/code/codefreaksubhamml/movie-recommendation-system-vectorization-bow
- https://www.englishbix.com/stop-words-list/
- https://pythonwife.com/stemming-in-nlp/
- https://towardsdatascience.com/stemming-vs-lemmatization-2daddabcb221
- https://towardsdatascience.com/nlp-in-python-vectorizing-a2b4fc1a339e
- https://www.analyticsvidhya.com/blog/2021/08/a-friendly-guide-to-nlp-bag-of-words-with-python-example/
- https://blog.quantinsti.com/bag-of-words/
- https://aiml.org/what-are-the-advantages-and-disadvantages-of-bag-of-words-model/
- https://www.analyticsvidhya.com/blog/2021/07/bag-of-words-vs-tfidf-vectorization-a-hands-on-tutorial/