How does Automatic Speech Recognition (ASR) work?
From Audrey, the very first system capable of recognizing the numbers 0 to 9, to sophisticated voice assistants such as Amazon's Alexa and Google Assistant, speech recognition has taken a huge leap forward. This artificial intelligence technology is now found everywhere, in our homes, phones, and cars, to the point where it is difficult to imagine living without it.
Are you interested in learning more about automatic speech recognition (ASR)? In this article, we’ll look at its history, how it works, the models behind it, and its applications.
What is speech recognition?
Speech recognition is based on more than 70 years of scientific research and is still generating a lot of buzz! What is it and what is it used for? Keep on reading to find out.
Definition of automatic speech recognition (ASR)
Automatic Speech Recognition (ASR) is an artificial intelligence technology that enables a machine to understand spoken natural language. It captures the human voice through a microphone and analyzes the spoken words, taking intonation and accent into account. It then transcribes the speech into text, or into a file or request that a computer can process. This process is also called speech recognition or speech-to-text.
From voice dictation to voice command
In 1952, Bell Laboratories launched Audrey, the very first voice recognition system. It was able to identify the numbers 0 to 9 when pronounced separately with a success rate of 99%. This machine transcribed the human voice, and it marked the start of voice dictation.
This first achievement paved the way for Shoebox, IBM's first voice assistant, in 1962. It was a kind of calculator that understood simple spoken arithmetic problems and solved them on the spot. The machine responded to a request, which made it an early step toward voice command.
While early automatic speech recognition systems were slow, cumbersome, and expensive, recent software has become far more capable. Thanks to machine learning, it can now handle different voices, accents, and even emotions much more easily. The two most popular uses of ASR are voice dictation and voice command.
Speech recognition is not to be confused with speech synthesis, a technology that generates speech from written text, also known as text-to-speech. Many AI systems combine voice command and text-to-speech software to answer requests verbally. This is the case, for instance, with callbots in customer service, which act as conversational phone agents.
What are the applications of speech recognition?
One thing is for sure: speech recognition has become an important part of our daily lives. We use it at home and at work, often without even realizing it. So why is it so successful? The reason is simple: all it needs is our voice. With speech recognition, we are free to move around. We do not have to press keys on a keypad, as with a traditional IVR, or stare at a screen for it to work. There is no need to know how to write, or even to use formal language, because the machine learning software copes with accents and language mistakes and adapts accordingly. On top of that, speech lets us convey information much more quickly than writing. In short, speech recognition saves us time.
Today it is used in numerous industries. Here are some examples of applications:
- Making an appointment over the phone 24/7
- Checking your account balance
- Dictating medical consultation reports
- Arranging a replacement vehicle in the event of a claim
How does speech recognition work?
Automatic Speech Recognition (ASR) is a complex technology designed to make life easier. We will briefly explain how it works.
The 5 models of automatic speech recognition
In order to understand natural language, the software generally combines 5 models specific to ASR:
- Acoustic preprocessing: identifies instances of speech in the recording
- The pronunciation model: maps the words known to the system to their phonetic pronunciations
- The acoustic model: predicts the most likely phonemes
- The language model: predicts the most likely sequence of words
- The decoder: combines the predictions into a suggested text transcription.
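To make the list above more concrete, here is a minimal Python sketch of how the five pieces could fit together. It is a toy illustration, not how any real ASR product is implemented: the lexicon, the probabilities, the helper names (`make_frame`, `preprocess`, `decode`), and the assumption that each phoneme lasts exactly one audio frame are all invented for this example. The acoustic model is stubbed out, so each frame simply carries its phoneme probabilities.

```python
import math

def make_frame(energy, **probs):
    """Hypothetical audio frame: loudness plus stubbed phoneme probabilities."""
    return {"energy": energy, "probs": probs}

SILENCE = make_frame(0.0)

def preprocess(frames, threshold=0.1):
    """Acoustic preprocessing: split the recording into runs of speech frames."""
    segments, current = [], []
    for fr in frames:
        if fr["energy"] >= threshold:
            current.append(fr)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

# Pronunciation model: words known to the system and their phonemes (toy lexicon).
LEXICON = {
    "me": ("M", "IY"),
    "two": ("T", "UW"),
    "too": ("T", "UW"),
    "ten": ("T", "EH", "N"),
}

def acoustic_log_prob(fr, phoneme):
    """Acoustic model (stub): log-probability of a phoneme for one frame."""
    return math.log(fr["probs"].get(phoneme, 1e-6))

# Language model: toy bigram probabilities P(word | previous word).
BIGRAMS = {
    ("<s>", "me"): 0.6, ("<s>", "two"): 0.3,
    ("me", "too"): 0.5, ("me", "two"): 0.1,
}
UNSEEN = 0.01  # assumed fallback probability for word pairs not in the table

def decode(frames):
    """Decoder: pick, per speech segment, the word with the best combined score."""
    words, prev = [], "<s>"
    for segment in preprocess(frames):
        best_word, best_score = None, -math.inf
        for word, phonemes in LEXICON.items():
            if len(phonemes) != len(segment):  # toy assumption: one frame per phoneme
                continue
            acoustic = sum(acoustic_log_prob(fr, p) for fr, p in zip(segment, phonemes))
            language = math.log(BIGRAMS.get((prev, word), UNSEEN))
            if acoustic + language > best_score:
                best_word, best_score = word, acoustic + language
        if best_word is not None:
            words.append(best_word)
            prev = best_word
    return " ".join(words)

recording = [
    SILENCE,
    make_frame(0.8, M=0.9, T=0.1), make_frame(0.9, IY=0.8, UW=0.2),
    SILENCE,
    make_frame(0.7, T=0.9, M=0.1), make_frame(0.9, UW=0.9, IY=0.1),
    SILENCE,
]

print(decode(recording))      # both words of the recording
print(decode(recording[3:]))  # only the second word, without its context
```

In this sketch, "two" and "too" share the same pronunciation, so the acoustic scores alone cannot separate them. The bigram table tips the balance: after "me" the decoder prefers "too", whereas at the start of an utterance it prefers "two". Real systems work on far larger vocabularies and search over many candidate word sequences at once rather than deciding one word at a time.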