Dealing with Large Datasets using Pandas
It is no lie that 'Data is the new Oil', and the amount of data produced every day is mind-boggling: at our current pace, about 2.5 quintillion bytes of data are created each day. Astonishingly, 90% of the world's data was created in the last two years alone.
Being able to handle and engineer such vast amounts of data is power.
In this tutorial we will cover the following topics:
- Loading datasets into Google Colab.
- Speeding up data loading with pandas.DataFrame.
- Saving memory with pandas (chunking).
- Saving datasets to intermediate file formats.
- Speeding up data loading with other libraries.
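As a preview of the chunking topic above, here is a minimal sketch of reading a CSV in pieces with `pandas.read_csv(chunksize=...)` so that only one chunk is in memory at a time. The in-memory CSV here is hypothetical stand-in data for a large file on disk.

```python
import io

import pandas as pd

# A small in-memory CSV standing in for a large on-disk file (hypothetical data).
csv_data = io.StringIO(
    "id,value\n" + "\n".join(f"{i},{i * 0.5}" for i in range(10))
)

# chunksize controls how many rows are loaded per iteration,
# capping peak memory use instead of reading everything at once.
total_rows = 0
for chunk in pd.read_csv(csv_data, chunksize=4):
    total_rows += len(chunk)  # process each chunk, then let it be freed

print(total_rows)  # 10 rows processed across chunks of 4, 4, and 2
```

With a real file you would pass its path instead of the `StringIO` buffer; everything else stays the same.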
Prerequisites:
- You should be familiar with pandas Series and DataFrames. If you are not familiar with these concepts, have a quick look at this helper notebook.
- You can find out how to run this notebook on Google Colab with this helper notebook.
import pandas as pd
opendatasets is a Python library for downloading datasets from online sources like Kaggle and Google Drive with a simple Python command.
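A minimal sketch of how a download helper might look, assuming `opendatasets` is installed (`pip install opendatasets`). Kaggle URLs will prompt for your Kaggle API credentials on first use; the URL in the comment is a placeholder, not a real dataset.

```python
def download_dataset(url: str, data_dir: str = ".") -> None:
    """Download a dataset from a source such as Kaggle or Google Drive.

    Assumes the `opendatasets` package is installed; Kaggle URLs also
    require Kaggle API credentials (kaggle.json).
    """
    # Imported lazily so this sketch can be defined without the package present.
    import opendatasets as od

    od.download(url, data_dir=data_dir)

# Example (placeholder URL, requires network access and credentials):
# download_dataset("https://www.kaggle.com/datasets/<owner>/<dataset>")
```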
There are multiple ways to load your dataset into Colab; I describe two below. That said, downloading the dataset directly from a link is usually the ideal approach.
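For the direct-link approach, note that `pandas.read_csv` accepts an HTTP(S) URL as well as a local path, so in many cases you can skip a separate download step entirely. A small sketch (the URL in the comment is hypothetical):

```python
import pandas as pd

def load_csv(path_or_url: str) -> pd.DataFrame:
    # pd.read_csv handles local file paths and http(s)/ftp URLs alike.
    return pd.read_csv(path_or_url)

# Example with a direct link (hypothetical URL, requires network access):
# df = load_csv("https://example.com/dataset.csv")
```

The same helper works unchanged for files already uploaded to the Colab filesystem.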