IBM's New Neural Speech Synthesis Method Improves TTS Systems

To make text-to-speech (TTS) systems less dependent on large and complex neural network models, IBM researchers developed a new method of neural speech synthesis based on a modular architecture.

The team’s method combines three deep neural networks (DNNs) with intermediate signal processing of the networks’ output to produce high-quality speech. The new TTS architecture is reportedly lightweight and can synthesize high-quality speech in real time.

In their paper on arXiv.org, the IBM researchers described how each network learns a different aspect of a speaker’s voice, which makes it easier to train each component efficiently and independently.

“Once the base networks are trained, they can be easily adapted to a new speaking style or voice, such as for branding and personalization purposes, even with small amounts of training data,” the team wrote.

New Method of Neural Speech Synthesis

IBM’s new method of neural speech synthesis involves three DNNs: a prosody prediction network, an acoustic feature prediction network, and a neural vocoder.

The prosody prediction network learns prosody features during training, which allows it to predict them at synthesis time from the textual features extracted by the front end.

“Prosody is extremely important, not only for helping the speech sound natural and lively,” the IBM researchers noted, “but also to best-represent the specific speaker’s style in the training or adaptation data.”

The acoustic feature prediction network, in turn, provides a spectral representation of the speech in short frames of about ten milliseconds each. The actual audio is later generated from these frames.

The network learns the acoustic features at training time so that, during speech synthesis, it can predict them from the phonetic labels and prosody features.

“The DNN model created represents the voice of the speaker in the training or adaptation data.”

Last but not least is the neural vocoder. This network is responsible for producing the actual speech samples from the acoustic features.

The IBM researchers trained the neural vocoder on the speaker’s natural speech samples together with their corresponding acoustic features. The vocoder is called LPCNet, and the IBM team claims to be the first to use this lightweight, high-quality neural vocoder in a fully commercialized text-to-speech system.
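To make the data flow between the three stages concrete, here is a minimal sketch in Python. It is not IBM's implementation: every function is a hypothetical stand-in that returns constant values where a trained network would run, and the sample rate, phoneme duration, and pitch are assumptions chosen for illustration. Only the three-stage order and the ten-millisecond frame length come from the description above.

```python
from dataclasses import dataclass
from typing import List, Sequence

FRAME_MS = 10          # frame length mentioned in the article
SAMPLE_RATE = 16_000   # assumed sample rate, not stated in the article


@dataclass
class Prosody:
    duration_ms: int   # how long the phoneme lasts
    pitch_hz: float    # target pitch for the phoneme


def predict_prosody(phonemes: Sequence[str]) -> List[Prosody]:
    """Stand-in for network 1: one prosody record per phoneme (constants)."""
    return [Prosody(duration_ms=80, pitch_hz=120.0) for _ in phonemes]


def predict_acoustic_features(
    phonemes: Sequence[str], prosody: Sequence[Prosody]
) -> List[List[float]]:
    """Stand-in for network 2: one feature vector per 10 ms frame."""
    frames: List[List[float]] = []
    for _phoneme, p in zip(phonemes, prosody):
        n_frames = p.duration_ms // FRAME_MS
        # A trained model would emit spectral features; here we store pitch only.
        frames.extend([p.pitch_hz] for _ in range(n_frames))
    return frames


def vocode(frames: Sequence[Sequence[float]]) -> List[float]:
    """Stand-in for network 3: one block of samples per frame (silence here)."""
    samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000
    return [0.0] * (samples_per_frame * len(frames))


def synthesize(phonemes: Sequence[str]) -> List[float]:
    """Chain the three stages: prosody -> acoustic frames -> audio samples."""
    prosody = predict_prosody(phonemes)
    frames = predict_acoustic_features(phonemes, prosody)
    return vocode(frames)
```

Because each stage only consumes the previous stage's output, any one of the stand-ins could be retrained or swapped without touching the others, which is the modularity the researchers describe.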

The team wrote:

“The novelty of this vocoder is that it doesn’t try to predict the complex speech signal directly by a DNN. Instead, the DNN only predicts the less-complex glottal tract residual signal and then uses LPC filters to convert it to the final speech signal.”
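The filtering step the quote describes can be illustrated with a textbook all-pole LPC synthesis filter, in which each output sample is the residual plus a weighted sum of the previous output samples. The sketch below is an assumption-laden illustration, not LPCNet code: the function name, the coefficient sign convention, and the toy input are all hypothetical, and the residual is simply passed in as a list rather than predicted by a network.

```python
from typing import List, Sequence


def lpc_synthesize(residual: Sequence[float], lpc_coeffs: Sequence[float]) -> List[float]:
    """Rebuild a signal from its residual with an all-pole LPC filter.

    Implements s[n] = e[n] + sum_k a[k] * s[n - k], for k = 1..len(lpc_coeffs),
    where e is the residual and a are the LPC coefficients (assumed convention).
    """
    speech: List[float] = []
    for n, excitation in enumerate(residual):
        # Linear prediction from the samples that have already been generated.
        prediction = sum(
            a * speech[n - k]
            for k, a in enumerate(lpc_coeffs, start=1)
            if n - k >= 0
        )
        speech.append(excitation + prediction)
    return speech


# Toy usage: a single impulse through a first-order filter decays geometrically.
signal = lpc_synthesize([1.0, 0.0, 0.0, 0.0], [0.5])
```

The linear filter carries the predictable part of the waveform, so the network only has to model the residual, which is the simplification the quote points to.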

Once trained, IBM’s DNNs can be quickly adapted to a new voice using only a small amount of data from the target speaker.

The team’s listening tests showed that the three networks maintained both high quality and high similarity to the original speaker, even when the voice was adapted on as little as five minutes of speech.

The team’s work is the basis for the new Watson TTS service.