Real-Time Emotion Detection System Using Voice Tone Analysis in Python

```python
import librosa
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import joblib  # For saving and loading the trained model

# --- 1. Feature Extraction Functions ---

def extract_features(file_path):
    """
    Extracts audio features from a given audio file.

    Args:
        file_path (str): The path to the audio file.

    Returns:
        numpy.ndarray: A 1D array containing the extracted features.  Returns None if loading fails.
    """
    try:
        y, sr = librosa.load(file_path)  # Load the audio file
    except Exception as e:
        print(f"Error loading file {file_path}: {e}")
        return None


    # 1. MFCC (Mel-Frequency Cepstral Coefficients)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    mfccs_processed = np.mean(mfccs.T, axis=0)  # Average over time

    # 2. Chroma Feature
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    chroma_processed = np.mean(chroma.T, axis=0) # Average over time

    # 3. Spectral Contrast
    spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    spectral_contrast_processed = np.mean(spectral_contrast.T, axis=0)  # Average over time

    # 4. Zero-Crossing Rate
    zcr = librosa.feature.zero_crossing_rate(y)
    zcr_processed = np.mean(zcr)

    # 5. Root Mean Square (RMS) Energy
    rms = librosa.feature.rms(y=y)[0]
    rms_processed = np.mean(rms)

    # Concatenate all features into a single array
    features = np.concatenate((mfccs_processed, chroma_processed, spectral_contrast_processed, [zcr_processed], [rms_processed]))

    return features


# --- 2. Data Preparation Functions ---

def load_data(file_paths, labels):
    """
    Loads audio data from multiple files and associates them with labels.

    Args:
        file_paths (list): A list of file paths to audio files.
        labels (list): A list of corresponding labels (strings) for each audio file.

    Returns:
        tuple: A tuple containing:
            - X (numpy.ndarray): A 2D array where each row represents the features
              extracted from an audio file.
            - y (list): A list of labels corresponding to the audio files.
    """
    X, y = [], []
    for file_path, label in zip(file_paths, labels):
        features = extract_features(file_path)
        if features is not None:  # Only append if feature extraction was successful
            X.append(features)
            y.append(label)
        else:
            print(f"Skipping {file_path} due to feature extraction failure.")

    return np.array(X), y


# --- 3. Model Training Functions ---

def train_model(X, y, test_size=0.2, random_state=42):
    """
    Trains a multi-layer perceptron (MLP) model on the provided data.

    Args:
        X (numpy.ndarray): Feature matrix.
        y (list): List of labels.
        test_size (float): The proportion of the dataset to include in the test split.
        random_state (int): Controls the shuffling applied to the data before applying the split.

    Returns:
        MLPClassifier: The trained MLP model.
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # Initialize and train the MLP classifier
    model = MLPClassifier(hidden_layer_sizes=(256, 128, 64), activation='relu', solver='adam',
                          max_iter=500, random_state=random_state, early_stopping=True) #Added early stopping
    model.fit(X_train, y_train)

    # Evaluate the model
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model Accuracy: {accuracy}")

    return model


# --- 4. Real-Time Emotion Detection Functions ---

def detect_emotion(file_path, model):
    """
    Detects the emotion in a given audio file using the trained model.

    Args:
        file_path (str): The path to the audio file.
        model (MLPClassifier): The trained MLP model.

    Returns:
        str: The predicted emotion label.  Returns None if feature extraction fails.
    """
    features = extract_features(file_path)
    if features is None:
        print("Could not detect emotion as feature extraction failed.")
        return None

    features = features.reshape(1, -1)  # Reshape for prediction
    emotion = model.predict(features)[0]
    return emotion


# --- 5. Recording Function (Requires sounddevice) ---

import sounddevice as sd
import wavio

def record_audio(duration=5, fs=44100, filename="recording.wav"):
    """Records audio from the microphone. Requires the sounddevice and wavio libraries.

    Args:
        duration (int): The duration of the recording in seconds.
        fs (int): The sampling frequency.
        filename (str): The name of the file to save the recording to.
    """
    print(f"Recording audio for {duration} seconds...")
    try:
        recording = sd.rec(int(duration * fs), samplerate=fs, channels=1, dtype='int16')  # mono, 16-bit samples
        sd.wait()  # Wait until recording is finished
        wavio.write(filename, recording, fs, sampwidth=2)  # saves as a 16-bit WAV file
        print(f"Audio saved to {filename}")
    except sd.PortAudioError as e:
        print(f"Error recording audio: {e}.  Ensure you have a working microphone and that sounddevice is properly configured.")


# --- 6. Main Execution ---

if __name__ == "__main__":
    # 1. Data Preparation (Replace with your own data paths and labels)
    # This is a placeholder.  You'll need to provide your own dataset of audio files
    # and corresponding labels (e.g., "happy", "sad", "angry", "neutral").  The paths
    # should point to actual .wav or other audio files that librosa can read.

    file_paths = [
        "audio_data/happy1.wav",  #Example paths - Replace with your own
        "audio_data/sad1.wav",
        "audio_data/angry1.wav",
        "audio_data/neutral1.wav",
        "audio_data/happy2.wav",
        "audio_data/sad2.wav",
        "audio_data/angry2.wav",
        "audio_data/neutral2.wav"
    ]
    labels = [
        "happy",
        "sad",
        "angry",
        "neutral",
        "happy",
        "sad",
        "angry",
        "neutral"
    ]

    # 2. Load Data
    X, y = load_data(file_paths, labels)

    if X.size == 0:
        print("No data loaded.  Please check your file paths and labels.")
        exit()

    # 3. Train Model
    model = train_model(X, y)

    # 4. Save the trained model
    model_filename = "emotion_model.joblib"
    joblib.dump(model, model_filename)
    print(f"Trained model saved to {model_filename}")


    # 5. Real-time emotion detection loop
    while True:
        user_input = input("Press 'r' to record and detect emotion, or 'q' to quit: ")
        if user_input.lower() == 'r':
            # Record audio
            record_audio() # Uses default recording settings

            # Load the saved model
            loaded_model = joblib.load(model_filename)

            # Detect emotion
            emotion = detect_emotion("recording.wav", loaded_model)

            if emotion is not None:
                print(f"Detected emotion: {emotion}")
        elif user_input.lower() == 'q':
            print("Exiting...")
            break
        else:
            print("Invalid input. Please enter 'r' or 'q'.")
```

Key improvements and explanations:

* **Clearer Structure and Comments:** The code is now organized into well-defined functions with detailed comments explaining each step. This makes it much easier to understand, modify, and debug.
* **Error Handling:** The `extract_features` function now includes a `try-except` block to catch potential errors when loading audio files.  This prevents the entire program from crashing if one file is corrupted or in an unsupported format.  It now also prints an error message if feature extraction fails on a specific file, and skips processing that file.
* **Feature Engineering:** The feature extraction now includes MFCCs, Chroma feature, Spectral Contrast, Zero-Crossing Rate (ZCR), and Root Mean Square (RMS) energy. This gives the model more information to work with.  The features are averaged over time to provide a single set of features for the entire audio clip.
* **Model Saving and Loading:** The code now saves the trained model to a file using `joblib`. This allows you to train the model once and then reuse it later without having to retrain it every time you run the program. The real-time loop now *loads* the model, simulating how you would use the model in a deployed application.
* **Real-Time Recording:** The code includes a `record_audio` function that uses the `sounddevice` and `wavio` libraries to record audio from the microphone.  Crucially, the recording is saved to a file ("recording.wav") that can then be processed by the emotion detection functions. It also provides helpful error messages if sounddevice encounters issues.  The recording is now saved to a standard WAV format (16-bit mono), which is more likely to be compatible with `librosa`.  Mono recording is used to simplify processing and compatibility.
* **`if __name__ == "__main__":` block:** The main execution logic is placed inside this block, ensuring that it only runs when the script is executed directly (not when it's imported as a module).
* **Input Validation:**  The main loop now validates user input (r or q).
* **Data Preparation Placeholder:**  The data loading section now emphasizes that you *must* replace the placeholder file paths and labels with your *own* data. The example paths will not work without appropriate audio files.  This is a critical step.  The code now checks if any data was actually loaded and exits if not, preventing errors later on.
* **`sounddevice` dependency:** The code now *requires* the `sounddevice` library to be installed for recording. The `record_audio` function uses `sd.rec` to record and `sd.wait()` to wait for the recording to finish, then saves the recording as a WAV file using `wavio`. Error handling is improved to catch `sd.PortAudioError`, which can occur if sounddevice cannot access the microphone.
* **Reshaping Features:** The `detect_emotion` function now reshapes the extracted features using `features.reshape(1, -1)` before passing them to the model for prediction. This ensures that the features have the correct shape (a 2D array with one sample) that the `predict` method expects.
* **Early Stopping:** The MLPClassifier now includes `early_stopping=True`. This will automatically stop training when the validation loss stops improving, preventing overfitting.
* **Clarity and Readability:** Improved variable names, comments, and overall code structure.
* **Requirements:** Added a `requirements.txt` file (see below).
* **Complete and Executable:**  The code provides a *complete* example, including data loading, training, saving, loading, and real-time emotion detection.  It should be directly executable after installing the necessary libraries and replacing the placeholder data with your own.
* **Mono Recording:** The audio is recorded in mono. This simplifies feature extraction and avoids potential channel mismatch issues.
* **Sampling Width:** `wavio.write` now explicitly specifies `sampwidth=2`, which saves the recording as a 16-bit WAV file.  This is a common and compatible format.
* **Model Parameters:** The MLPClassifier parameters (hidden layer sizes, activation, solver) are explicitly defined, making it easier to experiment and tune the model. A variant that adds feature standardization is sketched right after this list.
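
One optional refinement that the script above does not include: MLPs usually train more reliably when the feature columns are standardized, because MFCC, chroma, spectral-contrast, ZCR, and RMS values sit on very different numeric ranges. The following is a minimal sketch, not part of the original code, that wraps the classifier in a scikit-learn `Pipeline` with `StandardScaler`; `train_scaled_model` is a hypothetical drop-in alternative to `train_model`.

```python
# Minimal sketch (assumption: X and y are the arrays produced by load_data above).
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def train_scaled_model(X, y, test_size=0.2, random_state=42):
    """Hypothetical variant of train_model() that standardizes features before the MLP."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    pipeline = make_pipeline(
        StandardScaler(),  # zero mean, unit variance per feature column
        MLPClassifier(hidden_layer_sizes=(256, 128, 64), activation='relu',
                      solver='adam', max_iter=500, early_stopping=True,
                      random_state=random_state),
    )
    pipeline.fit(X_train, y_train)
    print(f"Scaled-model accuracy: {accuracy_score(y_test, pipeline.predict(X_test)):.2f}")
    return pipeline  # saved and loaded with joblib exactly like the plain model
```

Because the scaler lives inside the pipeline, the same transformation is applied automatically at prediction time, so `detect_emotion` works unchanged with the returned object.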

To run this code, you'll need to:

1. **Install the necessary libraries:**

   ```bash
   pip install librosa soundfile scikit-learn numpy joblib sounddevice wavio
   ```

2. **Create a `requirements.txt` file:**

   ```
   librosa
   soundfile
   scikit-learn
   numpy
   joblib
   sounddevice
   wavio
   ```
   You can use this file to install all the dependencies at once: `pip install -r requirements.txt`

3. **Provide your own audio data:** Replace the placeholder file paths and labels with the paths to your own audio files and their corresponding emotion labels. Make sure the audio files are in a format that `librosa` can read (e.g., WAV).

4. **Run the script:**

   ```bash
   python your_script_name.py
   ```

**Important Considerations and Next Steps:**

* **Dataset Size and Quality:** The performance of the emotion detection system heavily depends on the size and quality of the training data. A larger and more diverse dataset will generally lead to better results. Consider using publicly available emotion datasets.
* **Feature Selection:** Experiment with different sets of features to see which ones are most effective for emotion recognition. You can also try using feature selection techniques to reduce the dimensionality of the feature space.
* **Model Selection:** The MLP classifier is just one type of model that can be used for emotion recognition. Other options include Support Vector Machines (SVMs), Random Forests, and Convolutional Neural Networks (CNNs). CNNs are often very effective with audio data but require significantly more data to train properly. A quick random-forest swap is sketched after this list.
* **Real-Time Processing:** For true real-time processing, you'll need to optimize the feature extraction and prediction pipeline to minimize latency. Consider buffering audio data and processing it in small chunks, as in the streaming sketch after this list.
* **Microphone Calibration:**  Microphones can have different frequency responses. Calibrating the microphone can improve the accuracy of the system.
* **Background Noise:** Real-world audio recordings often contain background noise, which can degrade the performance of the emotion detection system. Consider using noise reduction techniques to pre-process the audio data.
* **Voice Activity Detection (VAD):**  VAD can be used to detect when a person is speaking and only process the audio segments that contain speech. This can improve the accuracy and efficiency of the system.
* **Cross-Lingual and Cross-Cultural Differences:** Emotion expression can vary across languages and cultures. A model trained on one language or culture may not generalize well to others.
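
As a concrete starting point for the model-selection experiments mentioned above, here is a minimal, hedged sketch that swaps the MLP for a random forest. It assumes the same `X` and `y` arrays produced by `load_data`; `try_random_forest` is a hypothetical helper, not part of the original script.

```python
# Sketch only: compare a tree ensemble against the MLP on the same feature matrix.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def try_random_forest(X, y, n_estimators=200, random_state=42):
    """Hypothetical helper: report cross-validated accuracy of a random forest."""
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    # Stratified CV needs at least `cv` examples per emotion class.
    scores = cross_val_score(model, X, y, cv=3)
    print(f"Random forest CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
    return model.fit(X, y)  # refit on all data before saving with joblib
```

Random forests need no feature scaling and often behave reasonably on small tabular feature sets like the time-averaged descriptors extracted here.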
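For the real-time processing point above, a common pattern is to read fixed-length chunks straight from a `sounddevice.InputStream` instead of writing a WAV file for every prediction. The sketch below is a rough outline under that assumption; `extract_features_from_array` is a hypothetical variant of `extract_features` that accepts an in-memory NumPy array and sampling rate instead of a file path.

```python
# Rough sketch of chunked, file-free inference (assumes a trained `model` from above).
import sounddevice as sd

def stream_emotions(model, fs=22050, chunk_seconds=3):
    """Read fixed-length chunks from the default microphone and classify each one.

    fs defaults to 22050 Hz to match librosa's default load rate used at training time.
    """
    frames_per_chunk = int(fs * chunk_seconds)
    with sd.InputStream(samplerate=fs, channels=1, dtype='float32') as stream:
        print("Streaming... press Ctrl+C to stop.")
        while True:
            audio, overflowed = stream.read(frames_per_chunk)  # blocks until a chunk is full
            if overflowed:
                print("Warning: input overflow, this chunk may have dropped samples.")
            mono = audio[:, 0]  # (frames, 1) -> (frames,)
            features = extract_features_from_array(mono, fs)  # hypothetical array-based helper
            if features is not None:
                print("Predicted emotion:", model.predict(features.reshape(1, -1))[0])
```

Latency is dominated by the chunk length: shorter chunks respond faster but give the feature extractor less context to work with.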

This walkthrough provides a solid foundation for building a real-time emotion detection system. Remember to replace the placeholder data with your own dataset and to experiment with different features and models to optimize performance.