Speech Emotion Diarization (paper link) is a task proposed for fine-grained speech emotion recognition. Just as Speaker Diarization answers the question "Who speaks when?", Speech Emotion Diarization answers the question "Which emotion appears when?".
The Speech Emotion Diarization task takes an utterance as input and aims to determine whether particular emotions are present within the utterance and, if so, to locate their temporal boundaries. A comparison of traditional utterance-level Speech Emotion Recognition (SER) and Speech Emotion Diarization (SED) can be found in the image above.
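To make the contrast concrete, here is a minimal sketch of the two output shapes: SER collapses an utterance into a single label, while SED produces a sequence of labeled time intervals. The type and variable names are illustrative, not from any released codebase.

```python
from dataclasses import dataclass


@dataclass
class EmotionSegment:
    """One emotion interval within an utterance (illustrative structure)."""
    emotion: str
    start: float  # seconds
    end: float    # seconds


# Utterance-level SER: everything is collapsed into one label.
ser_prediction = "happy"

# SED: which emotion appears when, as labeled intervals.
sed_prediction = [
    EmotionSegment("neutral", 0.0, 1.797),
    EmotionSegment("happy", 1.797, 3.2),
]

for seg in sed_prediction:
    print(f"{seg.emotion}: {seg.start:.3f}s -> {seg.end:.3f}s")
```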
The lack of available datasets is a major obstacle to the study of fine-grained speech emotion recognition. We released the Zaion Emotion Dataset (ZED), which is annotated with discrete emotion labels and frame-level emotion boundaries for each spoken utterance.
This example presents an audio clip in which happiness is perceived from 1.797 seconds to the end of the clip, while the first 1.797 seconds are neutral. Note that for each utterance, only non-neutral emotions are listed in ZED.json; the rest of the utterance is neutral by default.
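Because only non-neutral intervals are stored, a consumer of the annotations has to fill in "neutral" everywhere else. Below is a small sketch of that expansion into frame-level labels; the interval representation and the 20 ms frame length are assumptions for illustration, not the official ZED schema.

```python
def to_frame_labels(intervals, duration, frame_len=0.02):
    """Expand sparse annotations into per-frame labels.

    intervals: list of (emotion, start_s, end_s) tuples; any time span
    not covered by an interval is neutral by default, as in ZED.json.
    """
    n_frames = int(round(duration / frame_len))
    labels = ["neutral"] * n_frames
    for emotion, start, end in intervals:
        first = int(start / frame_len)
        last = min(n_frames, int(round(end / frame_len)))
        for i in range(first, last):
            labels[i] = emotion
    return labels


# Example from the text: happiness from 1.797 s to the end of the clip
# (a 3.0 s total duration is assumed here for illustration).
labels = to_frame_labels([("happy", 1.797, 3.0)], duration=3.0)
print(labels[0], labels[-1])  # neutral happy
```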
Some basic statistics of ZED:
| Statistic | Value |
|---|---|
| Number of utterances | 180 |
| Emotions | Happy, Sad, Angry, Neutral |
If you are interested in helping us enlarge the dataset or expand it to other languages, please contact us.
The new Speech Emotion Diarization component has been integrated into our Voice Analytics platform and will soon be available to our customers!
Recognizing your users' emotions with high accuracy and fine granularity allows you to identify key moments in conversations and precisely spot pain points in your customer service.
The ZED dataset is free to download: Download Dataset