I’m passionate about speech technologies so I talk too much.
Self-supervised speech models such as wav2vec 2.0 and HuBERT are driving rapid progress in Automatic Speech Recognition (ASR). However, it has not yet been clearly established that they also improve performance on speech emotion recognition.
At Zaion Lab, as we work continuously on improving our emotion recognition engine, the idea of exploring wav2vec 2.0 and HuBERT for this task is appealing.
So in our work, we adapted wav2vec 2.0 and HuBERT to emotion recognition by fine-tuning their pre-trained models. The results were excellent: we achieved state-of-the-art performance on the public IEMOCAP emotion recognition dataset, with 79.58% weighted accuracy.
Let’s see how we did that!
Wav2vec 2.0 (Facebook, 2020) and HuBERT (Facebook, 2021) are two of the most powerful and most representative self-supervised models used in speech tasks, especially ASR. First, a brief introduction to self-supervised learning. Quoting Yann LeCun, one of its most prominent advocates: “In Self-Supervised Learning, the system learns to predict part of its input from other parts of its input”. By constructing auxiliary (pretext) tasks that derive supervision from unlabelled data, self-supervised learning trains networks to learn representations that are valuable for downstream tasks. For example, in wav2vec 2.0, some frames of the audio are masked at random, and the model is trained to predict these masked frames correctly. In this way the model learns to represent audio without any labelled data.
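This span-masking pretext task can be sketched in a few lines of NumPy. The function name and the masking probability/span length below are illustrative, not the exact wav2vec 2.0 hyperparameters:

```python
import numpy as np

def mask_frames(features, mask_prob=0.065, mask_length=10, rng=None):
    """Pick random span starts and mask contiguous frame spans,
    mimicking wav2vec 2.0-style span masking (values are illustrative)."""
    rng = rng or np.random.default_rng(0)
    num_frames = features.shape[0]
    mask = np.zeros(num_frames, dtype=bool)
    # each frame is a span start with probability mask_prob
    starts = rng.random(num_frames) < mask_prob
    for start in np.flatnonzero(starts):
        mask[start:start + mask_length] = True
    masked = features.copy()
    masked[mask] = 0.0  # replace masked frames with a placeholder embedding
    return masked, mask

features = np.random.default_rng(1).normal(size=(200, 768))  # (time, dim)
masked, mask = mask_frames(features)
# pretext task: predict the original (quantized) frames at masked positions
```

The pretraining objective then rewards the model for recovering what was hidden, so no human labels are needed.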
HuBERT shares the same idea as wav2vec 2.0, but frames the pretext task as classification, forcing the model to assign masked frames to discrete labels. To generate labels for the first iteration, k-means clustering is applied to 39-dimensional MFCC features. To generate better targets for subsequent iterations, k-means clustering is then run on the latent features extracted from the HuBERT model pre-trained in the previous iteration.
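The pseudo-labeling step can be illustrated with a tiny Lloyd's k-means on dummy MFCC frames. The function `kmeans_labels` and the small cluster count are illustrative only; HuBERT uses many more clusters and a production-grade k-means implementation:

```python
import numpy as np

def kmeans_labels(feats, k=8, iters=10, seed=0):
    """Tiny Lloyd's k-means returning one discrete cluster id per frame.
    These ids play the role of HuBERT's masked-prediction targets."""
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest centroid
        dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # recompute centroids (keep the old one if a cluster emptied)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = feats[labels == c].mean(0)
    return labels

mfcc = np.random.default_rng(1).normal(size=(500, 39))  # 39-dim MFCC frames
targets = kmeans_labels(mfcc)  # discrete targets for masked classification
```

Each masked frame is then classified into one of these cluster ids, and better features from one iteration yield better targets for the next.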
For more details about wav2vec 2.0 and HuBERT, please check:
In our work, we compared four released wav2vec 2.0 pre-trained models from Meta AI: the wav2vec 2.0 base model (12 transformer blocks, 768-dimensional embeddings) and its ASR fine-tuned version, and the wav2vec 2.0 large model (24 transformer blocks, 1024-dimensional embeddings) and its ASR fine-tuned version. Both base and large models are pre-trained on 960 hours of LibriSpeech data, which is also used for their ASR fine-tuning.
Industry has high expectations for emotion recognition; successfully detecting a speaker's emotional state is of great significance for Zaion's solutions.
Speech is the main medium of communication in daily life: it not only conveys thoughts but also expresses the emotional state of the speaker. The goal of speech emotion recognition (SER) is to identify human emotional states from speech. It is an utterance-level task: the input is the audio of an utterance, and the output is an emotion class such as neutral, anger, etc. The main challenge is extracting powerful emotional features; past work relied on hand-crafted features such as MFCCs, spectrograms and prosodic features, while more recent research commonly uses features from deep networks.
In the SUPERB benchmark (NTU, MIT, Facebook AI, JHU, Amazon AI, CMU), wav2vec 2.0 and HuBERT are used as frozen feature extractors: none of their pre-trained parameters are changed. The extracted features are then used for a 4-class classification (neutral, anger, sadness, happiness); the best model (HuBERT large) achieved 67.62% weighted accuracy on IEMOCAP.
In our work, however, we explored partial fine-tuning and entire fine-tuning of wav2vec 2.0 and HuBERT, meaning that their parameters are adapted to the emotion recognition task during training.
The two fine-tuning methods are shown below:
As shown on the left of the figure, for partial fine-tuning the wav2vec 2.0/HuBERT model is divided into two parts: a CNN-based feature encoder and a transformer-based contextualized encoder. We freeze the CNN-based feature encoder, fixing all the parameters of its CNN blocks, and fine-tune only the parameters of the transformer blocks. Partial fine-tuning can be viewed as domain-adaptation training of the top layers, which avoids interfering with or damaging the bottom CNN layers that are already expressive.
For entire fine-tuning, shown on the right, both the CNN and transformer modules are fine-tuned during downstream training. By also adapting the general features at the bottom level, entire fine-tuning allows the higher-level representations to be more complete and more targeted.
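The freezing logic behind partial fine-tuning can be sketched in PyTorch with stand-in modules; the layers below are placeholders, not the actual wav2vec 2.0/HuBERT architecture:

```python
import torch.nn as nn

# stand-in modules; in practice these are the pre-trained model's two parts
feature_encoder = nn.Sequential(
    nn.Conv1d(1, 512, kernel_size=10, stride=5),  # CNN feature encoder
    nn.GELU(),
)
transformer = nn.TransformerEncoder(  # transformer contextualized encoder
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)

# partial fine-tuning: freeze the CNN feature encoder,
# so the optimizer updates only the transformer blocks
for p in feature_encoder.parameters():
    p.requires_grad = False

frozen = [p for p in feature_encoder.parameters() if not p.requires_grad]
trainable = [p for p in transformer.parameters() if p.requires_grad]
```

For entire fine-tuning, one would simply skip the freezing loop; for the frozen setting used in SUPERB, both parts would be frozen and only the downstream head trained.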
For the downstream structure, an average time pooling and one linear layer are added as a simple classifier. The average time pooling compresses the variable-length time dimension into a single vector, and the linear layer then performs utterance-level classification by minimizing the cross-entropy loss.
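A minimal NumPy sketch of this downstream head, using random stand-in features and weights in place of real wav2vec 2.0/HuBERT outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 4, 768  # neutral, anger, sadness, happiness

# hypothetical frame-level features from the upstream model: (time, dim)
frames = rng.normal(size=(312, dim))

# average time pooling: collapse the variable-length time axis to one vector
utterance = frames.mean(axis=0)            # (dim,)

# single linear layer producing class logits (weights here are random)
W = rng.normal(size=(num_classes, dim))
b = np.zeros(num_classes)
logits = W @ utterance + b                 # (num_classes,)
pred = int(logits.argmax())                # predicted emotion index
```

Because the pooling averages over time, utterances of any length map to the same fixed-size input for the classifier.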
The dataset chosen in our experiments is IEMOCAP, one of the most widely used and the most representative datasets for speech emotion recognition.
The Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset contains approximately 12 hours of data and consists of scripted and improvised dialogues by 10 speakers. As is common in speech emotion recognition research, we used 4 emotion classes in our work: anger, happiness, sadness and neutral. The evaluation metric is weighted accuracy (WA), and results are averaged over 5 different seeds.
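As a reminder of the metric, weighted accuracy is the fraction of all utterances classified correctly, so frequent classes weigh more, in contrast to unweighted accuracy, which averages per-class recalls. A small sketch:

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all utterances classified correctly
    (classes with more samples contribute more)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class recalls, shown here for comparison."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))

y_true = [0, 0, 0, 1, 2, 3]
y_pred = [0, 0, 1, 1, 2, 3]
# WA counts 5/6 correct; UA averages per-class recalls (2/3, 1, 1, 1)
```

The two metrics diverge when the class distribution is imbalanced, which is why SER papers often report both.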
Before looking at the results, here is the naming scheme for the models we compare:
EF/PF/Frozen: Entirely Fine-tuned/Partially Fine-tuned/Not fine-tuned
w2v/hbt: wav2vec 2.0/HuBERT based model
base/large: base/large pre-trained model
-/960h: without/with ASR fine-tuning on 960h of LibriSpeech data
We compare our results with the current state of the art, Attention Pooling, which offers a fair comparison since it reports a competitive weighted accuracy using speech only.
As the figure below shows, we were pleasantly surprised that all our fine-tuned models performed well: the partially fine-tuned HuBERT large model reaches the best weighted accuracy of 79.58%, improving on the state-of-the-art baseline by 7.83% absolute and far surpassing the frozen models listed in the SUPERB benchmark. We also observe that partial fine-tuning appears to work better than entire fine-tuning; since IEMOCAP is a small dataset with only 12 hours of data, training too many parameters can easily cause overfitting. Moreover, we noticed that ASR fine-tuning does not help the downstream SER task, suggesting a loss of prosodic information during ASR fine-tuning.
In this blog post, we have shown the power of self-supervised models through their fine-tuning on speech emotion recognition. In upcoming posts, we will show that self-supervised models can also be adapted to speaker recognition and spoken language understanding tasks with state-of-the-art performance!
The paper was submitted to Interspeech 2022; an arXiv version can be found here. Our paper has been cited 6 times so far, and our results have been used as state-of-the-art baselines by other researchers.