Learn practical skills, build real-world projects, and advance your career

Movies Example - Using a StratifiedSample

Author: Andreas Traut

In this example I used a dataset which contains the following columns:

Rank | Title | Year | Score | Metascore | Genre | Vote | Director | Runtime | Revenue | Description | RevCat

My aim was to predict the Revenue based on the other information. As there are some "NaN"-values in the column "Revenue" I did the following steps:

  • I filled the "NaN"-value in column "Revenue" with Median-values,
  • then I drew a stratified sample (based on "Revenue") on the whole dataset,
  • created a pipeline to fill the "NaN"-value in other columns (e.g. "Metascore", "Score").
  • split the dataset into a training dataset ("movies_training") and a testing dataset ("movies_test"):