Yingzhi Wang, Deep Learning Engineer at Zaion, SpeechBrain contributor, CentraleSupélec engineer.
I'm passionate about speech technology, which is why I love to chat.
Self-supervised speech models such as wav2vec 2.0 and HuBERT are making revolutionary advances in Automatic Speech Recognition (ASR). However, it has not yet been fully proven that they deliver improved performance in emotion recognition.
At Zaion Lab, we're working relentlessly to improve our emotion recognition engine, which led us to explore wav2vec 2.0 and HuBERT.
For our project, we therefore adapted wav2vec 2.0 and HuBERT to emotion recognition by fine-tuning their pre-trained models. The results were excellent! We achieved state-of-the-art performance on the IEMOCAP public emotion recognition dataset, with a WA (Weighted Accuracy) of 79.58%.
Let's see how we did it!
A brief introduction to Wav2vec 2.0/HuBERT
Wav2vec 2.0 (Facebook, 2020) and HuBERT (Facebook, 2021) are two of the most successful and representative self-supervised models used in speech tasks, particularly for ASR. Let's start with a brief introduction to self-supervised learning. Quoting Yann LeCun, who popularized this approach around 2019: "In self-supervised learning, the system learns to predict part of its input from other parts of its input." By constructing pretext tasks that derive supervisory signals from unlabeled data, self-supervised learning trains neural networks on these tasks so that they learn representations that are valuable for downstream tasks. For example, in wav2vec 2.0, certain parts of the audio are randomly masked, and the model is trained to correctly predict these masked parts. In this way, the model learns to represent audio without any labeled data.
HuBERT shares the same idea as wav2vec 2.0, but applies a classification task, forcing the model to assign masked frames to pre-defined cluster categories. To generate labels for the first iteration, k-means clustering is applied to 39-dimensional MFCC features. To generate more accurate labels for subsequent iterations, k-means clustering is then applied to representations extracted from the HuBERT model pre-trained in the previous iteration.
For more information on wav2vec 2.0 and HuBERT, please see:
(wav2vec2) https://arxiv.org/abs/2006.11477
(HuBERT) https://arxiv.org/abs/2106.07447
In our work, we compared four Meta AI wav2vec 2.0 pre-trained models: the base wav2vec 2.0 model (12 transformer blocks and an embedding dimension of 768) and its ASR fine-tuned version, and the large wav2vec 2.0 model (24 transformer blocks and an embedding dimension of 1024) and its ASR fine-tuned version. The base and large models were pre-trained on the LibriSpeech 960h data, which was also used for their ASR fine-tuning.
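To make this concrete, here is a minimal sketch of how such pre-trained checkpoints can be loaded with the Hugging Face transformers library. The checkpoint names below are assumptions for illustration and are not necessarily the exact ones used in our experiments.

```python
# Minimal sketch (not our training code): loading pre-trained wav2vec 2.0 checkpoints
# with Hugging Face transformers. Checkpoint names are assumptions for illustration.
import torch
from transformers import AutoModel

checkpoints = {
    "w2v-base":       "facebook/wav2vec2-base",        # 12 transformer blocks, dim 768
    "w2v-base-960h":  "facebook/wav2vec2-base-960h",   # + ASR fine-tuning on LibriSpeech 960h
    "w2v-large":      "facebook/wav2vec2-large",       # 24 transformer blocks, dim 1024
    "w2v-large-960h": "facebook/wav2vec2-large-960h",  # + ASR fine-tuning on LibriSpeech 960h
}

waveform = torch.randn(1, 16000)  # one second of dummy 16 kHz audio

for name, ckpt in checkpoints.items():
    model = AutoModel.from_pretrained(ckpt)
    with torch.no_grad():
        hidden = model(waveform).last_hidden_state  # (batch, frames, dim)
    print(name, tuple(hidden.shape))
```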
Why is there interest in speech emotion recognition?
Emotion recognition has raised great expectations within Customer Relations; successful detection of the human emotional state is a major asset for Zaion's solutions.
Speech is the primary means of communication in everyday life: it not only conveys thoughts, but also expresses the speaker's emotional state. The aim of speech emotion recognition is to identify human emotional states from speech, a task performed at the utterance level: the input is an audio utterance, and the output is an emotion class such as neutral, angry, etc. The main challenge is to extract powerful emotional features, whether hand-crafted features such as MFCCs, spectrograms and prosody, or, as in more recent research, features learned by deep networks.
Exploring wav2vec 2.0/HuBERT + speech emotion recognition
In the SUPERB benchmark (NTU, MIT, Facebook AI, JHU, Amazon AI, CMU), wav2vec 2.0 and HuBERT are used as frozen feature extractors, and during this process, no parameters of the pre-trained wav2vec 2.0 and HuBERT are modified. The extracted features are then used for a 4-class classification (neutral, angry, sad, happy), with the best-performing model (HuBERT large) achieving a WA (Weighted Accuracy) of 67.62% on IEMOCAP.
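As an illustration of what "frozen feature extractor" means in practice, here is a minimal sketch assuming the Hugging Face transformers HuBERT implementation (not the SUPERB code itself): the pre-trained model runs purely in inference mode, and only a lightweight classifier on top would be trained.

```python
# Minimal sketch of the frozen-feature-extractor setting (assumed transformers API,
# not the SUPERB code): no pre-trained parameter is updated.
import torch
from transformers import HubertModel

hubert = HubertModel.from_pretrained("facebook/hubert-large-ll60k")  # assumed checkpoint
hubert.eval()
for p in hubert.parameters():
    p.requires_grad = False  # the whole upstream model stays frozen

waveform = torch.randn(1, 16000)  # dummy 16 kHz utterance
with torch.no_grad():
    features = hubert(waveform).last_hidden_state  # (batch, frames, 1024)
# `features` would then feed a small trainable classifier (see the downstream head below).
```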
However, in our research, we explored partial fine-tuning and full fine-tuning of wav2vec 2.0 and HuBERT, adapting their parameters to the emotion recognition task during the training process.
The two fine-tuning methods are shown below:
As shown on the left of this illustration, for partial fine-tuning the wav2vec 2.0/HuBERT model is divided into two parts: a CNN-based feature encoder and a transformer-based contextualized encoder. We freeze the CNN-based feature encoder, fixing all the parameters of its CNN blocks, and fine-tune only the parameters of the transformer blocks. Partial fine-tuning can be seen as domain-adaptation training for the top layers, intended to prevent interference with, and damage to, the lower CNN layers, which already have strong representational capability.
In the case of full fine-tuning, illustrated on the right, the CNN and transformer modules are both updated during the downstream training process. By also adapting the general low-level features, full fine-tuning allows the higher-level representations to be more complete and better focused on the task.
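Here is a minimal sketch of how the two settings can be implemented, assuming the Hugging Face transformers implementation of wav2vec 2.0 (our actual recipes may differ); `feature_extractor` refers to the CNN feature encoder in that implementation.

```python
# Minimal sketch (assumed transformers API): partial vs. full fine-tuning of wav2vec 2.0.
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # assumed checkpoint

def set_finetuning_mode(model, mode="partial"):
    """partial: freeze the CNN feature encoder, train the transformer blocks only.
    full: train everything."""
    for p in model.parameters():
        p.requires_grad = True
    if mode == "partial":
        # `feature_extractor` is the CNN feature encoder in the transformers implementation
        for p in model.feature_extractor.parameters():
            p.requires_grad = False

set_finetuning_mode(model, mode="partial")
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")
```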
For the downstream structure, an average-pooling layer and a linear layer are added as a simple classifier. Average pooling compresses the variable-length sequence of frame representations into a single vector, then the linear layer performs utterance-level classification, trained by minimizing the cross-entropy loss.
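A minimal sketch of such a downstream head in plain PyTorch (the layer sizes are illustrative assumptions, not our exact configuration):

```python
# Minimal sketch of the downstream head: mean pooling over time + linear classifier.
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    def __init__(self, hidden_dim=768, num_classes=4):  # 4 classes: neutral, angry, sad, happy
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_features):           # (batch, frames, hidden_dim)
        pooled = frame_features.mean(dim=1)      # average pooling over the time axis
        return self.classifier(pooled)           # (batch, num_classes) logits

head = EmotionHead()
logits = head(torch.randn(2, 149, 768))          # e.g. features from wav2vec 2.0 base
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 2]))
```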
Dataset
The dataset chosen in our experiments is IEMOCAP, one of the most widely used and representative datasets for speech emotion recognition.
The IEMOCAP (Interactive Emotional Dyadic Motion Capture) dataset contains around 12 hours of data and consists of scripted and improvised dialogues by 10 speakers. As is generally the case in speech emotion recognition research, we also used 4 emotion classes in our work:
- anger
- happiness
- sadness
- neutral
The evaluation metric used is weighted accuracy (WA), and results are averaged over 5 different seeds.
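For reference, here is a minimal sketch of the metric, assuming the common SER convention that weighted accuracy is simply the overall accuracy over all utterances (as opposed to unweighted accuracy, the mean of per-class recalls):

```python
# Minimal sketch (assumed convention): weighted accuracy = overall utterance-level accuracy.
import numpy as np

def weighted_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

print(weighted_accuracy([0, 1, 2, 3, 0], [0, 1, 2, 1, 0]))  # 0.8
```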
And... the state of the art in speech emotion recognition
Before examining the results, we name the models we are comparing using the following scheme:
EF/PF/Frozen: entirely fine-tuned / partially fine-tuned / not fine-tuned
w2v/hbt: model based on wav2vec 2.0/HuBERT
base/large: base/large pre-trained model
-/960h: without/with ASR fine-tuning on the LibriSpeech 960h data
We compare our results with the current state of the art, Attention Pooling, which provides a comparable and competitive result using speech only.
We were surprised to find that all our fine-tuned models achieved satisfactory results, as shown in the table below, where the partially fine-tuned HuBERT large model achieves the best WA of 79.58%, an improvement of 7.83% over the state of the art, far exceeding the frozen models listed in the SUPERB benchmark. Furthermore, we observe that partial fine-tuning appears to be a more effective fine-tuning method than full fine-tuning. We attribute this to IEMOCAP being a small dataset with only 12 hours of data, so too many trainable parameters can easily cause overfitting. In addition, we noted that ASR fine-tuning does not help the downstream SER task, suggesting a loss of prosodic information during ASR fine-tuning.
Next episode
In this blog, we presented the performance of self-supervised models adapted to speech emotion recognition. In future blogs, we will demonstrate that self-supervised models can also be adapted to speaker recognition and spoken language understanding tasks with state-of-the-art performance!
The article was submitted to Interspeech 2022. An arXiv version can be found here. Our paper has been cited 10 times so far, and its results have been used as a baseline by other researchers.