Dealing with Large Datasets using Pandas
It is no lie that 'Data is the new Oil', and the amount of data produced every day is mind-boggling: at our current pace, about 2.5 quintillion bytes of data are created each day. Astonishingly, 90% of the world's data was created in the last two years alone.
Being able to handle and engineer such vast amounts of data is power.
In this tutorial we will cover the following topics:
- Loading datasets into Google Colab.
- Speeding up data loading with pandas.DataFrame.
- Saving memory with pandas (chunking).
- Saving datasets to intermediate file formats.
- Speeding up data loading with other libraries.
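As a preview of the chunking topic above, here is a minimal sketch of reading a CSV in pieces with `pandas.read_csv(chunksize=...)` so that only one chunk is in memory at a time. The in-memory CSV here is hypothetical stand-in data for a large file on disk.

```python
import io

import pandas as pd

# A small in-memory CSV standing in for a large on-disk file (hypothetical data).
csv_data = io.StringIO(
    "id,value\n" + "\n".join(f"{i},{i * 0.5}" for i in range(10))
)

# chunksize controls how many rows are loaded per iteration,
# capping peak memory use instead of reading everything at once.
total_rows = 0
for chunk in pd.read_csv(csv_data, chunksize=4):
    total_rows += len(chunk)  # process each chunk, then let it be freed

print(total_rows)  # 10 rows processed across chunks of 4, 4, and 2
```

With a real file you would pass its path instead of the `StringIO` buffer; everything else stays the same.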
Prerequisites:
- You should be familiar with pandas Series and DataFrames. If you are not familiar with these concepts, have a quick look at this helper notebook.
- You can find out how to run this notebook on Google Colab with this helper notebook.
import pandas as pd
opendatasets is a Python library for downloading datasets from online sources like Kaggle and Google Drive with a simple Python command.
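A minimal sketch of how a download helper might look, assuming `opendatasets` is installed (`pip install opendatasets`). Kaggle URLs will prompt for your Kaggle API credentials on first use; the URL in the comment is a placeholder, not a real dataset.

```python
def download_dataset(url: str, data_dir: str = ".") -> None:
    """Download a dataset from a source such as Kaggle or Google Drive.

    Assumes the `opendatasets` package is installed; Kaggle URLs also
    require Kaggle API credentials (kaggle.json).
    """
    # Imported lazily so this sketch can be defined without the package present.
    import opendatasets as od

    od.download(url, data_dir=data_dir)

# Example (placeholder URL, requires network access and credentials):
# download_dataset("https://www.kaggle.com/datasets/<owner>/<dataset>")
```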
There are multiple ways to load your dataset into Colab; I describe two below. That said, downloading the dataset directly from a link is usually the ideal approach.
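For the direct-link approach, note that `pandas.read_csv` accepts an HTTP(S) URL as well as a local path, so in many cases you can skip a separate download step entirely. A small sketch (the URL in the comment is hypothetical):

```python
import pandas as pd

def load_csv(path_or_url: str) -> pd.DataFrame:
    # pd.read_csv handles local file paths and http(s)/ftp URLs alike.
    return pd.read_csv(path_or_url)

# Example with a direct link (hypothetical URL, requires network access):
# df = load_csv("https://example.com/dataset.csv")
```

The same helper works unchanged for files already uploaded to the Colab filesystem.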