Deeplearning Lab11 Embeddings Imdb - Notebook by JonathanIsCoding (jonathaniscodeing)

Updated 4 years ago

Lab 11 - Working with Word Embeddings

1. Load Data

If this code cell fails to run, uncomment the first line and re-run it. This ensures that the tensorflow-datasets package is installed.

# !pip install -q tensorflow-datasets
import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete80OQW9/imdb_reviews-train.tfrecord

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete80OQW9/imdb_reviews-test.tfrecord

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete80OQW9/imdb_reviews-unsupervised.tfrecord

WARNING:absl:Dataset is using deprecated text encoder API which will be removed soon. Please use the plain_text version of the dataset and migrate to `tensorflow_text`.

Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.

2. Data Preparation

Define lists to hold the training and testing sentences and labels.
Iterate over the training data and extract each sentence and label. Here s and l are tensors, so calling numpy() extracts their values.
Do the same for the test data.

import numpy as np

train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []

testing_sentences = []
testing_labels = []

# s.numpy() returns a bytes object; str() turns it into a Python 3 string (the b"..." wrapper is kept)
for s,l in train_data:
  training_sentences.append(str(s.numpy()))
  training_labels.append(l.numpy())
  
for s,l in test_data:
  testing_sentences.append(str(s.numpy()))
  testing_labels.append(l.numpy())
  
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)

# Print the first item in training_sentences and see what the review looks like.
INDEX = 0
review = training_sentences[INDEX].split('.')
for r in review:
  print(r)
#print(training_sentences[INDEX])
print(training_labels[INDEX])

b"This was an absolutely terrible movie
 Don't be lured in by Christopher Walken or Michael Ironside
 Both are great actors, but this must simply be their worst role in history
 Even their great acting could not redeem this movie's ridiculous storyline
 This movie is an early nineties US propaganda piece
 The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions
 Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning
 I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name
 I could barely sit through it
"
0
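
The b" at the start of the printed review and the stray " on its last line appear because str() applied to a bytes object returns its printed representation rather than the text itself. The following is a minimal sketch of an alternative, not part of the original lab: it assumes the reviews are UTF-8 encoded and uses a shortened, hypothetical stand-in for the value returned by s.numpy().

```python
# Hypothetical stand-in for s.numpy(): a plain Python bytes object.
raw = b"This was an absolutely terrible movie. Don't be lured in."

# str() on bytes keeps the b"..." wrapper, which is what the lab output shows.
as_repr = str(raw)

# decode() returns only the text (assumes the bytes are UTF-8 encoded).
as_text = raw.decode("utf-8")

# Split into sentences, trimming whitespace and dropping empty pieces.
sentences = [part.strip() for part in as_text.split(".") if part.strip()]

for sentence in sentences:
    print(sentence)
```

If the decoded form were used in the loops above, the tokenizer in later steps would see the review text without the b prefix or the surrounding quotes. Whether that changes the results of this lab is not something the notebook shows.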