This project implements a Speech Emotion Recognition (SER) system using deep learning. It extracts audio features from the CREMA-D dataset and trains both 1D and 2D Convolutional Neural Networks (CNNs) to classify emotions from speech.
- 1D Feature Space: Zero Crossing Rate (ZCR) and Energy
- 2D Feature Space: Mel Spectrogram
- Data Augmentation: Applied only to 1D features (ZCR, Energy)
- Model Architectures: 1D CNN for 1D features, 2D CNN for Mel Spectrograms
- Evaluation: Accuracy, F1 Score, and Confusion Matrix
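The two 1D features can be illustrated with a minimal NumPy sketch. This is not the notebook's code: the function names are hypothetical, "Energy" is assumed here to mean per-frame RMS energy, and the frame and hop sizes (2048 and 512 samples) are assumed values. Unlike librosa's default behavior, the frames here are not centered, so values near the clip edges will differ.

```python
import numpy as np

def frame_signal(y, frame_length=2048, hop_length=512):
    """Slice a 1D signal into overlapping frames of shape (n_frames, frame_length)."""
    y = np.asarray(y, dtype=np.float64)
    if len(y) < frame_length:
        # Zero-pad very short clips so that at least one full frame exists.
        y = np.pad(y, (0, frame_length - len(y)))
    n_frames = 1 + (len(y) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)[:, None]
    return y[starts + np.arange(frame_length)[None, :]]

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    signs = np.signbit(frames)
    return (signs[:, 1:] != signs[:, :-1]).mean(axis=1)

def rms_energy(frames):
    """Root-mean-square amplitude of each frame (assumed meaning of 'Energy')."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def extract_1d_features(y, frame_length=2048, hop_length=512):
    """Return a (2, n_frames) array: row 0 is ZCR, row 1 is RMS energy."""
    frames = frame_signal(y, frame_length, hop_length)
    return np.stack([zero_crossing_rate(frames), rms_energy(frames)], axis=0)

# Synthetic stand-in for a speech clip: one second of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
features = extract_1d_features(tone)
print(features.shape)
```

Each row of the result is a time series over frames, which is the kind of input a 1D CNN consumes.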
Speech-Emotion-Recognition/
│
├── Speech_Emotion_Recognition.ipynb # Main Jupyter Notebook
├── model_1d.h5 # Saved 1D CNN model (after training)
├── model_2d.h5 # Saved 2D CNN model (after training)
└── README.md # This file
- Install Requirements: Make sure you have Python 3.8+, then install the required packages:
  `pip install numpy pandas librosa matplotlib scikit-learn tensorflow keras tqdm seaborn`
- Dataset: Download the CREMA-D dataset and place the `.wav` files in the `Crema` folder inside your project directory.
- Run the Notebook:
  - Open `Speech_Emotion_Recognition.ipynb` in Jupyter Notebook or VS Code.
  - Run all cells to extract features, train models, and evaluate performance.
- The 1D CNN test accuracy: `Test accuracy 1D: 56.05%`
- The 2D CNN test accuracy: `Test accuracy 2D: 87.66%`
- Feature Extraction: Extracts ZCR, Energy, and Mel Spectrogram from each audio file.
- Data Augmentation: Augments only the 1D features (not the Mel Spectrogram).
- Data Preparation: Pads and stacks features for model input.
- Model Training: Trains both 1D and 2D CNNs.
- Evaluation: Plots accuracy/loss curves, confusion matrices, and prints F1 scores.
- Confusion Matrices: Visualizes model performance and highlights the most frequently confused emotion pairs.
- Most Confusing Classes: Automatically identified and printed after evaluation.
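The evaluation steps above (confusion matrix, F1 score, most confused classes) can be sketched in plain NumPy. This is an illustrative sketch rather than the notebook's implementation, which may rely on scikit-learn instead. The function names and the label order in `EMOTIONS` are hypothetical, the F1 score is assumed to be macro-averaged, and the labels in the example are toy values, not model output.

```python
import numpy as np

# Hypothetical label order; CREMA-D covers these six emotion categories.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad"]

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with true labels on rows and predicted labels on columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def macro_f1(cm):
    """Unweighted mean of per-class F1, where F1 = 2*TP / (predicted + actual)."""
    tp = np.diag(cm).astype(float)
    denom = cm.sum(axis=0) + cm.sum(axis=1)
    # Classes with no true and no predicted samples get an F1 of 0.
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return float(f1.mean())

def most_confused_pair(cm):
    """Return (class_a, class_b, count) for the pair with the most mutual errors."""
    off_diag = cm.copy()
    np.fill_diagonal(off_diag, 0)
    mutual = off_diag + off_diag.T  # errors in both directions
    i, j = np.unravel_index(np.argmax(mutual), mutual.shape)
    return int(min(i, j)), int(max(i, j)), int(mutual[i, j])

# Toy labels for illustration only.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 0, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
a, b, count = most_confused_pair(cm)
print(cm)
print(f"macro F1: {macro_f1(cm):.3f}")
print(f"most confused: {EMOTIONS[a]} <-> {EMOTIONS[b]} ({count} errors)")
```

Summing errors in both directions treats "angry predicted as disgust" and "disgust predicted as angry" as the same confusion; keeping them separate is an equally valid choice if direction matters.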
- Data augmentation is not applied to the 2D Mel Spectrogram features.
- The notebook is designed to run efficiently, but processing the full CREMA-D dataset (7,442 clips) may require significant memory.
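Because memory use depends largely on how features are padded and stacked, a rough sketch of that step may help when sizing a run. The helper `pad_and_stack` is hypothetical, as are the `(channels, time)` layout and the choice of `float32`; the notebook may pad differently.

```python
import numpy as np

def pad_and_stack(features, max_len=None, dtype=np.float32):
    """Zero-pad (or truncate) each (channels, time) array along time and stack
    the results into one (n_samples, channels, max_len) array."""
    if max_len is None:
        max_len = max(f.shape[-1] for f in features)
    n_channels = features[0].shape[0]
    batch = np.zeros((len(features), n_channels, max_len), dtype=dtype)
    for i, f in enumerate(features):
        t = min(f.shape[-1], max_len)
        batch[i, :, :t] = f[:, :t]
    return batch

# Two toy clips with 2 feature channels and different numbers of frames.
clips = [np.ones((2, 3)), np.ones((2, 5))]
batch = pad_and_stack(clips)
print(batch.shape, batch.dtype, f"{batch.nbytes} bytes")
```

Checking `batch.nbytes` on a small subset gives a quick estimate of the footprint for the full dataset. Storing features as `float32` rather than `float64` halves that footprint, and capping `max_len` bounds the cost of unusually long clips. Keras `Conv1D` layers expect `(batch, steps, channels)` by default, so an array in this layout would need its last two axes swapped before training.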