Analyzing Tabular Data using Python and Pandas

Part 7 of "Data Analysis with Python: Zero to Pandas"

This tutorial is the seventh in a series on introduction to programming and data analysis using the Python language. These tutorials take a practical coding-based approach, and the best way to learn the material is to execute the code and experiment with the examples. Check out the full series here:

  1. First Steps with Python and Jupyter
  2. A Quick Tour of Variables and Data Types
  3. Branching using Conditional Statements and Loops
  4. Writing Reusable Code Using Functions
  5. Reading from and Writing to Files
  6. Numerical Computing with Python and Numpy
  7. Analyzing Tabular Data using Pandas

How to run the code

This tutorial hosted on Jovian.ml, a platform for sharing data science projects online. You can "run" this tutorial and experiment with the code examples in a couple of ways: using free online resources (recommended) or on your own computer.

This tutorial is a Jupyter notebook - a document made of "cells", which can contain explanations in text or code written in Python. Code cells can be executed and their outputs e.g. numbers, messages, graphs, tables, files etc. can be viewed within the notebook, which makes it a really powerful platform for experimentation and analysis. Don't afraid to experiment with the code & break things - you'll learn a lot by encoutering and fixing errors. You can use the "Kernel > Restart & Clear Output" menu option to clear all outputs and start again from the top of the notebook.

Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing this notebook is to click the "Run" button at the top of this page, and select "Run on Binder". This will run the notebook on mybinder.org, a free online service for running Jupyter notebooks. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on Google Colab or Kaggle to use these platforms.

Option 2: Running on your computer locally

You'll need to install Python and download this notebook on your computer to run in locally. We recommend using the Conda distribution of Python. Here's what you need to do to get started:

  1. Install Conda by following these instructions. Make sure to add Conda binaries to your system PATH to be able to run the conda command line tool from your Mac/Linux terminal or Windows command prompt.

  2. Create and activate a Conda virtual environment called zerotopandas which you can use for this tutorial series:

conda create -n zerotopandas -y python=3.8 
conda activate zerotopandas

You'll need to create the environment only once, but you'll have to activate it every time want to run the notebook. When the environment is activated, you should be able to see a prefix (python-pandas-data-analysis) within your terminal or command prompt.

  1. Install the required Python libraries within the environmebt by the running the following command on your terminal or command prompt:
pip install jovian jupyter numpy pandas matplotlib seaborn --upgrade
  1. Download the notebook for this tutorial using the jovian clone command:
jovian clone aakashns/python-pandas-data-analysis

The notebook is downloaded to the directory python-pandas-data-analysis.

  1. Enter the project directory and start the Jupyter notebook:
cd python-pandas-data-analysis
jupyter notebook
  1. You can now access Jupyter's web interface by clicking the link that shows up on the terminal or by visiting http://localhost:8888 on your browser. Click on the notebook python-pandas-data-analysis.ipynb to open it and run the code. If you want to type out the code yourself, you can also create a new notebook using the "New" button.

Reading a CSV file using Pandas

Pandas is typically used for working in tabular data (simliar to the data stored in a spreadsheet). Pandas provides helper functions to read data from various file formates like CSV, Excel spreadsheets, HTML tables, JSON, SQL and more. Let's download a file italy-covid-daywise.txt which contains daywise Covid-19 data for Italy in the following format:

date,new_cases,new_deaths,new_tests
2020-04-21,2256.0,454.0,28095.0
2020-04-22,2729.0,534.0,44248.0
2020-04-23,3370.0,437.0,37083.0
2020-04-24,2646.0,464.0,95273.0
2020-04-25,3021.0,420.0,38676.0
2020-04-26,2357.0,415.0,24113.0
2020-04-27,2324.0,260.0,26678.0
2020-04-28,1739.0,333.0,37554.0
...

This format of storing data is known as comma separated values or CSV.

CSVs: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)

We'll download this file using the urlretrieve function from the urllib.request module.

from urllib.request import urlretrieve
urlretrieve('https://hub.jovian.ml/wp-content/uploads/2020/09/italy-covid-daywise.csv', 
            'italy-covid-daywise.csv')
('italy-covid-daywise.csv', <http.client.HTTPMessage at 0x7fc5f1c599a0>)