Learn to identify spoken digits with a ResNet trained on the Free Spoken Digit Dataset. This post explains how to process the audio data and generate MFCCs for deep learning.
We are going to use the Free Spoken Digit Dataset (FSDD) to create a ResNet that identifies spoken digits.
The dataset consists of WAV recordings of spoken digits sampled at 8 kHz, trimmed so that they contain near-minimal silence at the beginning and end. At the time of writing, it has 3,000 recordings from a total of 6 speakers (50 of each digit per speaker).
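Each FSDD recording is named `{digit}_{speaker}_{index}.wav` (for example, `7_jackson_32.wav`), so the label can be read straight from the filename. A minimal sketch of label extraction; the `recordings` directory path is an assumption about where the repository was cloned:

```python
from pathlib import Path

# Hypothetical path to a local clone of the FSDD repository (assumption)
DATA_DIR = Path("free-spoken-digit-dataset/recordings")

def parse_label(path):
    """FSDD files are named {digit}_{speaker}_{index}.wav; the digit is the label."""
    return int(Path(path).stem.split("_")[0])

# "7_jackson_32.wav" is a recording of the digit 7 by speaker jackson
print(parse_label("7_jackson_32.wav"))  # 7
```

With this helper, labels for the whole dataset can be built by iterating over `DATA_DIR.glob("*.wav")`.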
Audio data can be represented in many forms: for example, as a raw time-series vector, or as a spectrogram (an image). Here we use Mel-frequency cepstral coefficients (MFCCs), which have been found to be an effective representation of sound for deep learning.
```python
# To play audio files inside the notebook
from IPython.display import Audio
```