A detailed guide to boosting the performance of the Whisper model with your own data.

Machine Learning Engineer at Zaion, working on automatic speech recognition (ASR) and natural language processing (NLP).
TL;DR
According to OpenAI, the robustness and accuracy of English speech recognition have never been closer to human level than with its new Whisper model. In this blog post, we explain our approach to fine-tuning the Whisper model to reach this level of performance in other languages. It was this work that won us first prize in the Whisper Fine-Tuning Event organized by Hugging Face 🤗 and Lambda Labs, for both French and German. The models and demos from this event are available on the Hugging Face Hub.

Photo: George Marks/Retrofile/Getty Images.
Foreword
I'm a Machine Learning engineer at Zaion, the European market leader in AI solutions for customer relations. One of Zaion's goals is to provide accurate and precise transcription of customer service conversations, so it's crucial for us to have a speech recognition system that is reliable, robust and ready for use in real-life conditions.
At Zaion Lab, Zaion's research team, we're constantly on the lookout for the latest trends and innovations in speech recognition. This gave me the opportunity to take part in the Whisper Fine-Tuning Event organized by Hugging Face 🤗 and Lambda Labs, which aimed to democratize the use of the Whisper model and make it robust for as many languages as possible. I took part in both the French and German challenges, and won first prize for both (see the leaderboard).
Introduction
In September 2022, OpenAI released a pre-trained automatic speech recognition (ASR) model called Whisper. Self-supervised learning models, such as wav2vec 2.0, are typically pre-trained on masked prediction tasks using unlabeled audio data, then fine-tuned on labeled data for various downstream tasks, including automatic speech recognition. Whisper models, on the other hand, are trained directly on a large amount of weakly labeled data collected from the web.
This corpus consists of 680,000 hours of multilingual, multitask data, covering transcription in several languages, translation from these languages into English, and timestamp prediction. Trained at this scale, the model is highly robust to accents, background noise and technical language.
Whisper is a sequence-to-sequence model, a Transformer-based encoder-decoder that maps the spectrogram representation of a recording to the corresponding sequence of sub-lexical units (BPE tokens). This acoustic representation is computed by a mathematical transformation of the raw signal waveform and then analyzed by the Transformer encoder. The decoder autoregressively predicts the next output token (sub-lexical unit), conditioned on the previous outputs and the encoder's hidden states. The figure below summarizes the architecture of the model.

The architecture of the Whisper model. Source: OpenAI Whisper Blog
In this blog post, we'll show you how to fine-tune the Whisper medium model on French. This version of the model has 24 encoder layers, 24 decoder layers and 769 million parameters. The full code can be found here.
Data and model preparation
Loading the model
Let's start by loading the medium version of the pre-trained Whisper model:
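Something along these lines, assuming the standard "openai/whisper-medium" checkpoint from the Hugging Face Hub:

```python
from transformers import WhisperForConditionalGeneration

# Load the pre-trained Whisper medium checkpoint
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")
```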

You'll find certain arguments already defined in the Whisper model, such as "forced_decoder_ids" and "suppress_tokens". These arguments are defined in the GenerationConfig for the generation task; they are not used during training, so that the model learns them on its own.
We also disable the use_cache option in the Whisper decoder. Caching lets the decoder reuse the computed keys and values of the self-attention and cross-attention blocks to speed up each decoding step, but it is incompatible with gradient checkpointing, which we will apply in a later step to reduce the memory footprint.
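A sketch of these configuration changes (attribute names follow the standard Transformers API):

```python
# Do not force language/task tokens during generation; let the model learn them
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

# Disable the decoding cache: it is incompatible with gradient checkpointing
model.config.use_cache = False
```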

Loading datasets
We will use the 🤗 Datasets library to download and prepare the datasets. We mix training data from Common Voice 11.0 and Multilingual LibriSpeech to form a larger training set, and use only the Common Voice 11.0 test data for evaluation.
It is always recommended to collect as much training data as possible. Other speech recognition datasets are available on the Hugging Face Hub, such as VoxPopuli and Fleurs. If you want to load your own local corpus, take a look at this page.
The audio sampling rate is 48 kHz in Common Voice and 16 kHz in Multilingual LibriSpeech. We make sure all audio samples are resampled to 16 kHz, not only to unify the sampling rate of the different datasets, but also because 16 kHz is the sampling rate of the 680,000 hours of Whisper's pre-training corpus. Resampling can easily be performed on the fly using the "cast_column" method and the "Audio" feature of 🤗 Datasets.
In order to mix different datasets, we also need to make sure they all share the same data fields. Here, we keep only the audio column and the sentence (transcription) column from the two datasets.
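A sketch of this step; the dataset identifiers are the public ones on the Hub, but the exact splits and the Multilingual LibriSpeech column name ("transcript" below) are assumptions:

```python
from datasets import Audio, DatasetDict, concatenate_datasets, load_dataset

# Training data: Common Voice 11.0 + Multilingual LibriSpeech (French)
cv_train = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="train+validation")
mls_train = load_dataset("facebook/multilingual_librispeech", "french", split="train")
# Evaluation data: Common Voice 11.0 test split only
cv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test")

# Keep only the audio and transcription columns, under a common column name
cv_train = cv_train.select_columns(["audio", "sentence"])
mls_train = mls_train.select_columns(["audio", "transcript"]).rename_column("transcript", "sentence")
cv_test = cv_test.select_columns(["audio", "sentence"])

dataset = DatasetDict()
dataset["train"] = concatenate_datasets([cv_train, mls_train])
dataset["test"] = cv_test

# Resample every audio sample to 16 kHz on the fly
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
```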

If you have disk space constraints, you can load the datasets on the fly in streaming mode.
Data augmentation
We have found that the audio samples in the Multilingual LibriSpeech dataset are quite intelligible. To make the model robust to noisy environments and help it generalize to different speakers, we augment the data with the Audiomentations library. Several augmentations are applied to the audio samples, including TimeStretch, Gain, PitchShift, and one of AddBackgroundNoise or AddGaussianNoise.
This augmentation pipeline is set up as shown below.
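A sketch of such a pipeline; the transform probabilities and parameter values are illustrative, the keyword names depend on the Audiomentations version, and the background-noise folder path is a placeholder:

```python
from audiomentations import (
    AddBackgroundNoise,
    AddGaussianNoise,
    Compose,
    Gain,
    OneOf,
    PitchShift,
    TimeStretch,
)

augment = Compose([
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.2, leave_length_unchanged=False),
    Gain(min_gain_in_db=-6, max_gain_in_db=6, p=0.1),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.2),
    # Pick one of the two noise transforms when this branch is applied
    OneOf(
        [
            AddBackgroundNoise(sounds_path="./background_noise", min_snr_in_db=1.0, max_snr_in_db=5.0, p=1.0),
            AddGaussianNoise(min_amplitude=0.005, max_amplitude=0.015, p=1.0),
        ],
        p=0.2,
    ),
])
```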

Next, we apply the augmentation to all the training examples using the "map" method:
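For instance, assuming the "augment" pipeline defined above (keeping the original training set alongside the augmented copy, as explained in the note below):

```python
import numpy as np

def augment_dataset(batch):
    # Run the augmentation pipeline on the raw waveform
    audio = batch["audio"]
    batch["audio"]["array"] = augment(
        samples=np.asarray(audio["array"]), sample_rate=audio["sampling_rate"]
    )
    return batch

# Augment a copy of the training set, then concatenate it with the original
augmented_train = dataset["train"].map(augment_dataset)
dataset["train"] = concatenate_datasets([dataset["train"], augmented_train])
```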

Note: Data augmentation is only performed on the training set. We also keep an original copy of the training set and concatenate it with the augmented training set.
Text standardization
"While diversity in audio quality can help a model be robust, diversity in transcription quality is not as beneficial."
The diversity here lies in the transcription format: case and punctuation exist in the Common Voice dataset but not in Multilingual LibriSpeech. When using them together, we need to make sure the transcriptions are lowercased and stripped of punctuation. This simplifies the task, since the model no longer needs to distinguish between upper- and lower-case characters or to predict punctuation marks.
However, if you want transcriptions that are easy to read or that require case and punctuation, it's best to keep them and train only on datasets that provide case and punctuation, such as Common Voice and Fleurs.
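A minimal normalization sketch in that spirit; the exact rules (e.g. how French apostrophes and hyphens are handled) are an assumption:

```python
import re
import string

def normalize_text(batch):
    # Lowercase and strip punctuation so that both corpora share the same format
    text = batch["sentence"].lower()
    text = re.sub(rf"[{re.escape(string.punctuation)}«»]", " ", text)
    batch["sentence"] = re.sub(r"\s+", " ", text).strip()
    return batch

# Apply the normalization to both the training and test sets
dataset = dataset.map(normalize_text)
```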

Note: The model is always evaluated on standardized transcriptions, i.e. without capital letters and without punctuation.
You can find the standardization used for English and other languages in Appendix C of the Whisper article.
Data pre-processing
As we showed in the introduction, the Whisper model takes the log-Mel spectrogram as input and produces BPE tokens. We therefore need to prepare our data in the appropriate format. This is achieved by two utility classes, WhisperFeatureExtractor and WhisperTokenizer, used respectively on the audio inputs and on the transcriptions or model predictions. The Transformers library wraps these two classes into a single WhisperProcessor class, which can be loaded as shown below:
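For example, with the same checkpoint as before:

```python
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-medium", language="french", task="transcribe"
)
```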

All we need to do is specify the target language and task, and WhisperTokenizer will prefix the corresponding language and task tokens when encoding transcriptions into label IDs.
Let's see what's in our data preparation function:
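A sketch of such a function, assuming the column names used earlier; the extra "input_length" field is an addition used to filter long audio in a later step:

```python
def prepare_dataset(batch):
    audio = batch["audio"]

    # Compute the log-Mel spectrogram from the raw waveform
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]

    # Keep the duration in seconds to filter overly long audio later
    batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]

    # Encode the transcription into label ids (language/task tokens are prefixed)
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch
```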

Next, we apply the data preparation function to all the examples in the dataset, using the "map" method:
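For instance, dropping the raw audio and text columns once the model inputs have been computed:

```python
dataset = dataset.map(prepare_dataset, remove_columns=dataset["train"].column_names)
```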

Filtering out long audio
In the previous step, examples with audio longer than 30 seconds were truncated by the WhisperFeatureExtractor. While the audio is truncated, the transcription is not, which seriously destabilizes training. Here, we define a function to filter out all audio longer than 30 seconds:
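A sketch based on the "input_length" field computed in prepare_dataset above:

```python
MAX_INPUT_DURATION = 30.0  # seconds

def is_audio_in_length_range(input_length):
    # "input_length" is the audio duration in seconds
    return input_length < MAX_INPUT_DURATION
```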

We then apply our filter function to all the examples using the "filter" method:
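For instance, using the column name assumed above:

```python
dataset["train"] = dataset["train"].filter(
    is_audio_in_length_range, input_columns=["input_length"]
)
```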

Filtering out long transcriptions
The Whisper decoder uses learned positional embeddings with a maximum length of 448 tokens, so it cannot decode a transcription of more than 448 token ids. Here, we define a filter function on the label ids:
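A minimal sketch:

```python
MAX_LABEL_LENGTH = 448  # maximum number of decoder positions

def is_label_in_length_range(labels):
    return len(labels) < MAX_LABEL_LENGTH
```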

Then apply it to all examples using the "filter" method:
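For instance:

```python
dataset["train"] = dataset["train"].filter(
    is_label_in_length_range, input_columns=["labels"]
)
```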

Training and evaluation
Data collator
The data collator takes a list of pre-processed samples and assembles them into a batch of PyTorch tensors. We need to make sure that all the audio features in the batch have the same length, and the same rule applies to all the labels in the batch.
The audio features are already padded or truncated to a fixed dimension by the WhisperFeatureExtractor, so we only need to convert them to PyTorch tensors using its "pad" method. Label ids, on the other hand, are not yet padded. We first pad them to the maximum length in the batch using the tokenizer's "pad" method, then replace the padding tokens with -100 so that they are not taken into account in the loss computation.
Let's define our data collator as follows:
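A collator in the spirit of the one from the 🤗 fine-tuning tutorial referenced at the end of this post:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union

import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # Audio features already have a fixed size: simply convert them to tensors
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad the label ids to the longest sequence in the batch
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Replace padding tokens with -100 so that they are ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # If a beginning-of-sequence token was prepended during tokenization,
        # cut it off here: the model adds it back during training
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch
```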

Next, we can initialize the data collator we have just defined:
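For instance:

```python
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
```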

Evaluation metrics
We use the word error rate (WER) metric to evaluate model performance. The WER metric can simply be loaded with 🤗 Evaluate:
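For example:

```python
import evaluate

wer_metric = evaluate.load("wer")
```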

We then need to define a function that takes the true label ids and the model predictions and returns the WER metric. In this function, we first replace -100 with the pad_token_id (undoing the step in the data collator where padding tokens were marked to be ignored in the loss) so that the label ids can be properly decoded back into strings.
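A sketch of such a function, assuming the "wer_metric" loaded above:

```python
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Put the padding token back so the labels can be decoded into strings
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * wer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
```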

Training configuration
In this step, we define all the parameters related to training. For details of the other training arguments, please refer to the "Seq2SeqTrainingArguments" documentation.
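A sketch of typical arguments for this kind of fine-tuning run; the output directory is a placeholder and the hyperparameter values below are illustrative, not necessarily the ones used for the winning models:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-fr",   # placeholder output directory
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    gradient_checkpointing=True,        # reduces memory at the cost of speed
    fp16=True,
    evaluation_strategy="steps",
    eval_steps=1000,
    save_steps=1000,
    per_device_eval_batch_size=16,
    predict_with_generate=True,
    generation_max_length=225,
    logging_steps=25,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
)
```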

Training
In the last step, we initialize the trainer by passing it the model, the datasets, the data collator, the training arguments and the metric computation function.
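For instance, putting together the pieces defined above:

```python
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
```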

Let's start training!
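A single call launches the fine-tuning:

```python
trainer.train()
```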

Don't forget to save your model and processor once you've finished training:
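For example, reusing the placeholder output directory from the training arguments above:

```python
trainer.save_model("./whisper-medium-fr")
processor.save_pretrained("./whisper-medium-fr")
```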

Let's recap!
In this blog, we presented a step-by-step guide to fine-tuning Whisper for ASR on French data. The WER of the Whisper medium version was reduced from 16.00% to 9.03% on Common Voice. With the Whisper large version, it was reduced from 13.90% to 8.15%.
Here you can find a demo for French ASR using fine-tuned Whisper models.
You can also fine-tune Whisper in other languages: simply collect and clean datasets in that language, then specify the corresponding language code when loading "WhisperProcessor".
References
- Robust Speech Recognition via Large-Scale Weak Supervision
- Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers
- Whisper Fine-Tuning Event
Finally, I'd like to thank my Zaion Lab colleagues for their invaluable help and constructive comments: Mohamed Bouaziz, Tiphaine Fievet, Imed Laaridh, Lorraine Vanel, Yingzhi Wang and Alya Yacoubi!