Why WaveNet's TTS Technology Is Making Waves

Google's AI branch recently announced WaveNet, a new text-to-speech (TTS) technology capable of generating more natural-sounding artificial speech.

studiostoks | Shutterstock.com

DeepMind, a London-based subsidiary of Google specializing in artificial intelligence, recently announced WaveNet, a new TTS technology capable of generating more natural-sounding artificial speech.

By now, we all know an awkward robot voice when we hear one.

Apple’s Siri, Microsoft’s Cortana, Amazon’s Alexa…

Although some of the most familiar robot voices are modeled on actual human speech using Text-to-Speech, or “TTS”, technology (Siri is based on voiceover artist Susan Bennett), current techniques often produce speech patterns that sound unnatural to human ears.

Concatenative TTS

Classic AI voices like Siri are built with a process called concatenative synthesis. This method first records a large database of speech fragments from a single speaker, ranging from individual sounds to whole words and phrases, which are known as “units”. It then builds sentences by selecting and joining the units that fit together best. The downside of concatenative synthesis is that the voice is hard to modulate: changing its emphasis or emotion generally means recording new units, so the output often has awkward (and sometimes unpleasant) speech patterns.
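
To make the idea concrete, here is a minimal Python sketch of concatenative synthesis. It is a toy under stated assumptions, not how Siri or any production system works: the unit database, the sine tones standing in for recordings, the sample rate, and the crossfade length are all hypothetical.

```python
import math

SAMPLE_RATE = 8000  # hypothetical sample rate for this toy example

def tone(freq_hz, n_samples):
    """Stand-in for a recorded unit: a short sine tone."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]

# Hypothetical unit database. A real system would store human recordings.
UNIT_DB = {
    "turn": tone(220, 800),
    "left": tone(330, 800),
    "right": tone(440, 800),
}

def concatenate(units, crossfade=40):
    """Join unit waveforms end to end, with a linear crossfade at each seam."""
    out = []
    for name in units:
        wave = UNIT_DB[name]  # raises KeyError if the unit was never recorded
        if out and crossfade:
            k = min(crossfade, len(out), len(wave))
            start = len(out) - k
            for i in range(k):
                w = (i + 1) / (k + 1)  # fade weight, rising toward the new unit
                out[start + i] = (1 - w) * out[start + i] + w * wave[i]
            out.extend(wave[k:])
        else:
            out.extend(wave)
    return out

speech = concatenate(["turn", "left"])
print(len(speech), "samples")
```

The sketch can only rearrange what is already in the database. A word that was never recorded cannot be spoken, and a different tone of voice would require a different set of recordings, which is the limitation described above.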

DeepMind is moving away from concatenative synthesis by pioneering a technique based on deep learning.

WaveNet’s Revolutionary Approach

Instead of stitching together prerecorded units, DeepMind is training WaveNet to model the raw waveform of the audio signal directly, one sample at a time. Each new sample is predicted from the samples that came before it, so WaveNet learns fine-grained structure in both the text and the sound samples it receives rather than replaying fixed recordings, which is particularly important for languages like English.
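
The sample-by-sample idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `predict_next` is a hypothetical stand-in for the trained neural network, replaced here by a fixed two-sample recurrence that happens to reproduce a sine wave. The real WaveNet model is far larger and is learned from speech data.

```python
import math

OMEGA = 0.2  # hypothetical angular frequency per sample for the toy signal

def predict_next(history):
    """Hypothetical stand-in for a trained model.

    Predicts the next sample from the two previous ones using the identity
    sin((t+1)w) = 2*cos(w)*sin(t*w) - sin((t-1)w).
    """
    return 2 * math.cos(OMEGA) * history[-1] - history[-2]

def generate(seed, n_samples):
    """Grow a waveform one sample at a time, feeding each output back in."""
    samples = list(seed)
    for _ in range(n_samples):
        samples.append(predict_next(samples))
    return samples

# Two seed samples are enough to start this particular recurrence.
wave = generate([0.0, math.sin(OMEGA)], 200)
print(len(wave), "samples generated")
```

The point of the loop is that nothing is looked up in a database: every sample is computed from the ones before it, which is what lets a learned model shape the sound continuously instead of joining prerecorded pieces.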

WaveNet even captures sounds as subtle as breathing and mouth movements. These details not only allow for more flexible voice modulation but also help create a more natural-sounding speech pattern overall.

Like traditional AI voices, WaveNet is based on real human voices and speech patterns. What’s different about WaveNet (and really exciting) is that it builds machine learning into the process: the AI learns from exposure to human speech samples and then uses what it has learned to generate new speech.

The result is speech that is less choppy and more natural than that of any other artificial voice known to date.

Google will likely make use of WaveNet in new versions of classic applications like Maps, and other companies are sure to follow suit. DeepMind’s accomplishments suggest that more human-sounding AI voices, like Tony Stark’s natural-language interface J.A.R.V.I.S., may be just around the corner.

Zayan Guedim

Trilingual poet, investigative journalist, and novelist. Zed loves tackling the big existential questions and all-things quantum.
