
An Exploratory Analysis of COVID-19 in Texas


In this analysis, I am going to look at COVID-19 case and death counts in Texas, along with metrics that can be computed from those counts: cases per thousand, deaths per thousand, and the case fatality rate.

Since a large portion of Texas is rural and sparsely populated, I'm going to concentrate the majority of the analysis on the metropolitan statistical areas in Texas. My main reason for this decision is that for rural counties, the computed metrics can easily be skewed by the small size of the population. For example, Marion County in Texas currently has a population of 9,860 and has 147 cases and 13 deaths. This results in the following data:

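| County | Population | Cases | Deaths | Cases per thousand | Deaths per thousand | Case fatality rate (%) |
|---|---|---|---|---|---|---|
| Marion | 9,860 | 147 | 13 | 14.91 | 1.32 | 8.84 |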

The deaths per thousand and the case fatality rate are significantly higher than the results for Texas as a whole.
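As a rough sketch of how the three computed metrics follow from the raw counts, the Marion County figures quoted above can be plugged into a small helper. The function name `covid_rates` and the dictionary keys are my own illustrative choices, not part of the original notebook.

```python
def covid_rates(population, cases, deaths):
    """Return cases per thousand, deaths per thousand, and case fatality rate (%)."""
    return {
        # Counts scaled to a population of 1,000 people.
        "cases_per_thousand": cases / population * 1000,
        "deaths_per_thousand": deaths / population * 1000,
        # Share of confirmed cases that ended in death, as a percentage.
        "case_fatality_rate_pct": deaths / cases * 100,
    }


# Marion County figures as quoted in the text.
marion = covid_rates(population=9860, cases=147, deaths=13)
for name, value in marion.items():
    print(f"{name}: {value:.2f}")
```

With a population this small, a single additional death moves the deaths per thousand by about 0.1, which is why the rural figures are so easy to skew.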

What is a metropolitan statistical area?

In the United States, a metropolitan statistical area or MSA is a geographical region with a relatively high population density at its core and close economic ties throughout the area. MSAs are defined by the US Office of Management and Budget and are used by the Census Bureau and other government agencies for statistical purposes.


We will be using the MSAs in Texas for the majority of our analysis because each one describes a metropolitan area, which can contain more than one county. This helps to give a clearer picture of the data than the county-level numbers alone.
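One way to roll county data up to the MSA level is sketched below with pandas. The counts are made up purely for illustration, and the county-to-MSA mapping, the MSA label, and the column names are assumptions rather than values taken from the actual data sources.

```python
import pandas as pd

# Hypothetical county-level counts; these numbers are invented for illustration.
counties = pd.DataFrame({
    "county": ["Travis", "Williamson", "Hays", "Marion"],
    "population": [1_200_000, 600_000, 230_000, 9_860],
    "cases": [12_000, 4_000, 3_000, 147],
    "deaths": [150, 60, 30, 13],
})

# Illustrative mapping only; the real one would come from the MSA definition table.
county_to_msa = {
    "Travis": "Austin-Round Rock",
    "Williamson": "Austin-Round Rock",
    "Hays": "Austin-Round Rock",
}

# Counties without an MSA (such as rural Marion) get NaN and are dropped.
counties["msa"] = counties["county"].map(county_to_msa)
msa = (
    counties.dropna(subset=["msa"])
    .groupby("msa")[["population", "cases", "deaths"]]
    .sum()
)

# Compute rates from the summed totals, not by averaging county-level rates.
msa["cases_per_thousand"] = msa["cases"] / msa["population"] * 1000
msa["deaths_per_thousand"] = msa["deaths"] / msa["population"] * 1000
msa["case_fatality_rate_pct"] = msa["deaths"] / msa["cases"] * 100
print(msa)
```

Summing first and dividing afterwards weights each county by its population, so a small county cannot dominate the MSA figure.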

Data Sources

The data that I am going to be using comes from the following sources:

This dataset contains the historical counts of cases and deaths broken down by date and county within the United States. We are going to narrow the data down to just the data for the counties in Texas.
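Narrowing a national dataset down to Texas might look like the sketch below. The small DataFrame stands in for the real file, and its column names (`date`, `county`, `state`, `cases`, `deaths`) and values are assumptions made for illustration.

```python
import pandas as pd

# Tiny stand-in for the national dataset; column names and values are assumed.
us = pd.DataFrame({
    "date": ["2020-07-01", "2020-07-01", "2020-07-02", "2020-07-02"],
    "county": ["Harris", "Orleans", "Harris", "Orleans"],
    "state": ["Texas", "Louisiana", "Texas", "Louisiana"],
    "cases": [100, 50, 120, 55],
    "deaths": [2, 1, 3, 1],
})

# Parse the dates so the rows can later be sorted and plotted as a time series.
us["date"] = pd.to_datetime(us["date"])

# Keep only the Texas rows; the state column is then redundant and can be dropped.
texas = us[us["state"] == "Texas"].drop(columns="state").reset_index(drop=True)
print(texas)
```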

This is the population estimate data by county from the State of Texas. We are going to use the latest 2019 estimates for our county populations.

This PDF document has a table that defines which counties are in each of the MSAs in the state of Texas.


We are going to be using a Jupyter Notebook, with Python code cells to display tables and charts and Markdown cells to describe our work and analysis.

The main Python modules that we will be using are pandas, numpy, Matplotlib, and seaborn. We will be using pandas mainly for its DataFrame data structure, which closely resembles a spreadsheet with columns and rows of data. The pandas module has extensive support for operating on and manipulating the data in a DataFrame object. In addition, pandas is built on top of numpy and uses numpy arrays extensively under the hood. We will be using numpy where needed to help us with the analysis. For our graphs, we will mainly be using seaborn and Matplotlib.
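As a small illustration of how pandas and numpy fit together, the sketch below pulls the numpy array out of a DataFrame column and applies a numpy function directly to that column. The data and column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up counts, used only to demonstrate the pandas/numpy relationship.
df = pd.DataFrame({"cases": [10, 100, 1000]})

# A DataFrame column can be exposed as a plain numpy array.
values = df["cases"].to_numpy()
print(type(values))

# numpy functions accept a pandas Series and return a Series with the same index.
df["log10_cases"] = np.log10(df["cases"])
print(df)
```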