Learn practical skills, build real-world projects, and advance your career

Python for Data 25: Chi-Squared Tests

Last lesson we introduced the framework of statistical hypothesis testing and the t-test for investigating differences between numeric variables. In this lesson, we turn our attention to a common statistical test for categorical variables: the chi-squared test.

Chi-Squared Goodness-Of-Fit Test

In our study of t-tests, we introduced the one-way t-test to check whether a sample mean differs from the an expected (population) mean. The chi-squared goodness-of-fit test is an analog of the one-way t-test for categorical variables: it tests whether the distribution of sample categorical data matches an expected distribution. For example, you could use a chi-squared goodness-of-fit test to check whether the race demographics of members at your church or school match that of the entire U.S. population or whether the computer browser preferences of your friends match those of Internet uses as a whole.

When working with categorical data, the values themselves aren't of much use for statistical testing because categories like "male", "female," and "other" have no mathematical meaning. Tests dealing with categorical variables are based on variable counts instead of the actual value of the variables themselves.

Let's generate some fake demographic data for U.S. and Minnesota and walk through the chi-square goodness of fit test to check whether they are different:

import numpy as np
import pandas as pd
import scipy.stats as stats