A few months ago the internet was on fire with the Yanny versus Laurel debate. Both sets of sounds are present in the recording, but some listeners focus on the higher-frequency sounds of Yanny and cannot seem to hear the lower-frequency sounds of Laurel. Regardless of which name you hear, the bottom line is that the human ear has trouble telling the two apart – yet we expect speech recognition software to get it right every time!
How people can listen to the same recording at the same time yet hear different things goes to the heart of speech recognition. Understanding these differences in hearing helps drive customer engagement solutions like interactive voice response (IVR), mobile applications and virtual assistants. These solutions are supposed to provide a stress-free customer experience, so the speech technology behind them needs to be accurate.
Some people's ears do a better job of hearing lower frequencies, while others are tuned to higher ones. This difference can be due to how their native language has tuned their hearing, or to hearing loss that muffles certain high- or low-frequency sounds. Another variable is the quality of the audio and the equipment people use to listen to a sound file.
Most speech recognition technologies use neural networks: interconnected groups of nodes, loosely modelled on the vast network of neurons in the brain, where each node is a simple processing unit that passes signals on to the nodes it connects to. These networks need to learn the various ways that people pronounce phonemes. Phonemes matter to speech recognition because they are the building blocks of words – the distinct units of sound in a given language that distinguish one word from another. For example, the final /p/ in "tap" separates that word from "tab," "tag" and "tan".
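The minimal-pair idea behind phonemes can be sketched in a few lines of Python. This is only an illustrative toy – it compares letters as a stand-in for sounds, whereas real phonemic analysis works on phonetic transcriptions, and a production recognizer would not represent phonemes this way:

```python
def is_minimal_pair(word_a: str, word_b: str) -> bool:
    """Return True if the two words differ in exactly one position.

    Letters stand in for sounds here purely to keep the sketch
    self-contained; phonemes are units of sound, not spelling.
    """
    if len(word_a) != len(word_b):
        return False
    differences = sum(1 for a, b in zip(word_a, word_b) if a != b)
    return differences == 1


# "tap" vs. "tab", "tag" and "tan": each pair differs only in the
# final sound, so that one phoneme is what tells the words apart.
for other in ("tab", "tag", "tan"):
    print("tap /", other, "->", is_minimal_pair("tap", other))
```

Running the loop reports `True` for each pair, since a single final sound is all that separates the words.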
Each language has nuances, and it's challenging to get software to learn them. And it doesn't end there: there are dialects within each language, and dialects sometimes have different vocabulary – like chesterfield in Canada and couch in the US. These variations are handled by training the speech recognition system on hours of human speech that demonstrates them.
In short: with enough training, a neural network can learn how people pronounce each phoneme. However, as the Yanny versus Laurel debate demonstrates, there are still differences in how humans hear and interpret sounds.
What did you hear? Yanny or Laurel? For more information on speech recognition and IVR contact firstname.lastname@example.org.