DeepMind, the London-based Google subsidiary specializing in artificial intelligence, recently announced WaveNet, a new Text-to-Speech technology capable of generating much more natural-sounding artificial speech.
By now, we all know an awkward robot voice when we hear one.
Apple’s Siri, Microsoft’s Cortana, Amazon’s Alexa…
Although some of the most familiar robot voices are modeled on actual human speech using Text-to-Speech, or “TTS”, technology (the original Siri voice was based on recordings of voiceover artist Susan Bennett), current techniques often produce speech patterns that sound unnatural to human ears.
Concatenative TTS
Classic AI voices like Siri are built with a process called Concatenative Synthesis. A voice artist first records a large database of short speech fragments, known as individual “units”. The system then builds sentences by selecting units and linking them together in the most seamless way possible. The downside of Concatenative Synthesis is that the voice is hard to modulate once recorded, resulting in awkward (and often unpleasant) speech patterns.
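To make the idea concrete, here is a minimal sketch of concatenative synthesis in Python. The unit database and the `synthesize` helper are hypothetical stand-ins; a production system stores thousands of recorded fragments, chooses among many candidate units, and smooths the joins between them.

```python
import numpy as np

# Hypothetical unit database: in a real system these would be recorded
# speech fragments; here they are silent placeholder waveforms.
UNIT_DB = {
    "good": np.zeros(4000),
    "morning": np.zeros(6000),
    "everyone": np.zeros(7000),
}

def synthesize(words):
    """Build an utterance by looking up each unit and concatenating them."""
    units = [UNIT_DB[w] for w in words]
    return np.concatenate(units)

speech = synthesize(["good", "morning", "everyone"])
print(len(speech))  # total samples in the stitched-together utterance
```

Because the voice lives entirely in those pre-recorded units, changing its pacing or intonation after the fact is difficult, which is exactly the limitation described above.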
DeepMind is breaking with the Concatenative TTS paradigm by pioneering a technique based on Deep Learning.
WaveNet’s Revolutionary Approach
Instead of stitching pre-recorded units together, DeepMind is training its WaveNet AI to model the raw waveform of the audio signal directly, one sample at a time. This means that WaveNet learns from both the text and the audio samples it receives, then generates speech sample by sample, which helps it capture the natural rhythm of a language like English.
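In outline, that sample-by-sample generation looks like the loop below. This is a sketch only: `predict_next_sample` is a hypothetical stand-in for the trained network, which in WaveNet is a stack of causal convolutions that conditions on every sample generated so far.

```python
import numpy as np

def predict_next_sample(history: np.ndarray) -> float:
    """Stand-in for the trained model: WaveNet predicts a distribution
    over the next audio sample given all previous samples."""
    return 0.0  # placeholder; a real model would sample from its output

def generate(num_samples: int) -> np.ndarray:
    """Generate audio autoregressively, one sample at a time."""
    waveform = np.zeros(num_samples)
    for t in range(1, num_samples):
        waveform[t] = predict_next_sample(waveform[:t])
    return waveform

audio = generate(16000)  # roughly one second at a 16 kHz sample rate
```

Generating thousands of samples for every second of audio is expensive, but it is precisely this fine-grained control that lets the model shape the waveform in ways unit concatenation cannot.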
WaveNet even accounts for sounds as subtle as breathing and tongue movement. Modeling these details not only allows for more flexibility in voice modulation but also helps create a more natural-sounding speech pattern as a whole.
Like traditional AI voices, WaveNet is based on real human voices and speech patterns. What’s different about WaveNet (and really exciting) is that it incorporates Machine Learning into the process: the model learns from exposure to human speech samples and uses that information to improve its own output.
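One concrete detail of how that learning is made tractable comes from the WaveNet paper rather than from this article: raw audio is first quantized into 256 discrete levels using mu-law companding, so predicting the next sample becomes a classification problem the network can learn from recorded speech. A minimal sketch:

```python
import numpy as np

def mu_law_encode(x: np.ndarray, mu: int = 255) -> np.ndarray:
    """Compress samples in [-1, 1] and quantize to mu + 1 = 256 levels."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((compressed + 1) / 2 * mu).astype(np.int64)

def mu_law_decode(q: np.ndarray, mu: int = 255) -> np.ndarray:
    """Map the 256 integer levels back to samples in [-1, 1]."""
    compressed = 2 * q.astype(np.float64) / mu - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu

samples = np.array([-1.0, -0.1, 0.0, 0.1, 1.0])
restored = mu_law_decode(mu_law_encode(samples))  # close to the originals
```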
The result is speech that is less choppy and more natural than that of any other artificial voice assistant known to date.
Google will certainly make use of WaveNet in new versions of classic applications like Maps, and other companies are sure to follow suit. DeepMind’s accomplishments show that more human-sounding AI voices, like Tony Stark’s natural-language interface J.A.R.V.I.S., are just around the corner.