project_name = 'emotional-speech-classification2d-resnet'

Emotional speech classification using PyTorch

The dataset is taken from Kaggle: https://www.kaggle.com/uwrfkaggler/ravdess-emotional-speech-audio

The dataset doesn't come with a separate test set.

So, for the sake of comparing results, I created train and test splits and saved them to files (train.csv, test.csv).

If you use this dataset, please use the CSV files I created so that we can compare models fairly.

I have also provided the code that creates test.csv and train.csv, which you can find below.

Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

Speech audio-only files (16-bit, 48 kHz .wav) from the RAVDESS. The full dataset of speech and song, audio and video (24.8 GB), is available from Zenodo. Construction and perceptual validation of the RAVDESS is described in our Open Access paper in PLoS ONE.

Files

  • Speech dataset: This portion of the RAVDESS contains 1440 files: 60 trials per actor x 24 actors = 1440. The RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.

  • Song dataset: This portion of the RAVDESS contains 1012 files: 44 trials per actor x 23 actors = 1012. The RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Song emotions include calm, happy, sad, angry, and fearful expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.

File naming convention (same for both datasets)

Each of the 1440 files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 03-01-06-01-02-01-12.wav). These identifiers define the stimulus characteristics:

Filename identifiers

  1. Modality (01 = full-AV, 02 = video-only, 03 = audio-only).

  2. Vocal channel (01 = speech, 02 = song).

  3. Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).

  4. Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.

  5. Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").

  6. Repetition (01 = 1st repetition, 02 = 2nd repetition).

  7. Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

Filename example: 03-01-06-01-02-01-12.wav

  1. Audio-only (03)
  2. Speech (01)
  3. Fearful (06)
  4. Normal intensity (01)
  5. Statement "dogs" (02)
  6. 1st Repetition (01)
  7. 12th Actor (12)
  8. Female, as the actor ID number is even.
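The decoding steps above can be sketched as a small helper. This function is not part of the notebook; it is a hypothetical utility that splits a RAVDESS filename into its 7 identifier fields following the naming scheme described here.

```python
# Hypothetical helper (not from the notebook): decode a RAVDESS filename
# into its 7 identifier fields, following the naming convention above.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess_filename(filename):
    """Split e.g. '03-01-06-01-02-01-12.wav' into its labelled parts."""
    parts = filename.rsplit(".", 1)[0].split("-")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        "modality": modality,                                  # 03 = audio-only
        "vocal_channel": "speech" if channel == "01" else "song",
        "emotion": EMOTIONS[emotion],
        "intensity": "normal" if intensity == "01" else "strong",
        "statement": "kids" if statement == "01" else "dogs",
        "repetition": int(repetition),
        "actor": int(actor),
        # Odd-numbered actors are male, even-numbered are female.
        "gender": "male" if int(actor) % 2 == 1 else "female",
    }

info = parse_ravdess_filename("03-01-06-01-02-01-12.wav")
print(info["emotion"], info["gender"])  # fearful female
```

A dict like this is convenient for building the label column of train.csv/test.csv.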

How to cite the RAVDESS

Academic citation

If you use the RAVDESS in an academic publication, please use the following citation: Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.

All other attributions

If you use the RAVDESS in a form other than an academic publication, such as in a blog post, school project, or non-commercial product, please use the following attribution: "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)" by Livingstone & Russo is licensed under CC BY-NC-SA 4.0.

Objective of this notebook

Use the audio to generate spectrograms, then pass those spectrograms to a neural network designed for images.

Previously I tried a direct approach to emotion classification, which didn't turn out as I expected. link to the notebook

Setting up the notebook