About the project

We are going to use the Free Spoken Digit Dataset (FSDD) to create a ResNet that identifies spoken digits.

The dataset consists of recordings of spoken digits stored as WAV files sampled at 8 kHz. The recordings are trimmed so that they have near-minimal silence at the beginning and end.

At the time of writing, the dataset has 3,000 recordings from a total of 6 speakers (50 of each digit per speaker).
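The counts above can be sanity-checked, and labels recovered from file names, with a short sketch. It assumes the recordings are named `<digit>_<speaker>_<index>.wav`; that pattern, the helper `parse_fsdd_name`, and the example file names are illustrative assumptions rather than something stated in this notebook, so check them against the downloaded files.

```python
# Sketch only: assumes file names look like "<digit>_<speaker>_<index>.wav".
def parse_fsdd_name(filename):
    """Split an assumed FSDD-style file name into (digit, speaker, index)."""
    stem = filename.rsplit(".", 1)[0]          # drop the ".wav" extension
    digit, speaker, index = stem.split("_")    # three underscore-separated parts
    return int(digit), speaker, int(index)

# Hypothetical file names, used only to illustrate the parsing.
example_names = ["7_jackson_32.wav", "0_jackson_0.wav", "7_nicolas_3.wav"]
labels = [parse_fsdd_name(name)[0] for name in example_names]
print(labels)

# Expected dataset size from the description: speakers * digits * recordings each.
n_expected = 6 * 10 * 50
print(n_expected)
```

The digit parsed from each name would serve as the classification label.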

Audio data can be represented in many forms, for example as a time-series vector or as a spectrogram (an image). Here we use Mel-frequency cepstral coefficients (MFCCs), which are often found to be a better representation of sound for deep learning.
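To make the MFCC representation concrete, here is a minimal NumPy-only sketch of the usual pipeline: framing, power spectrum, mel filterbank, log, and a DCT. It is not necessarily how this project computes its features. The function names and the parameter values (`n_fft=256`, `hop=128`, `n_mels=26`, `n_mfcc=13`) are assumptions chosen for illustration, and a dedicated audio library would normally be used in practice.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters with centres evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, centre, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        fbank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def mfcc(signal, sr=8000, n_fft=256, hop=128, n_mels=26, n_mfcc=13):
    """Return an array of shape (n_frames, n_mfcc). Parameter values are illustrative."""
    # 1. Slice the signal into overlapping frames and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Mel filterbank energies on a log scale (small offset avoids log(0)).
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # 4. Unnormalised DCT-II along the mel axis; keep the first n_mfcc coefficients.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.arange(n_mfcc)[:, None] * (2 * n + 1) / (2 * n_mels))
    return log_mel @ basis.T

# Usage on a synthetic half-second 440 Hz tone sampled at 8 kHz.
tone = np.sin(2 * np.pi * 440 * np.arange(4000) / 8000)
features = mfcc(tone)
print(features.shape)
```

Each row of `features` describes one short frame of audio, so a recording becomes a small 2-D array that a convolutional network such as a ResNet can treat like an image.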

# To play audio files inside the notebook
from IPython.display import Audio

Downloading the dataset