Following Google’s introduction of its AI speech translation tool Translatotron, Microsoft has unveiled its latest text-to-speech AI system that can reportedly generate realistic speech. The technology was developed in partnership with a team of Chinese researchers.
In their paper published on GitHub, the team reported that the TTS AI relies on two key components: a Transformer and a denoising auto-encoder. A Transformer is a type of neural network architecture introduced by researchers at Google Brain. Rather than reading a sentence one element at a time, it uses an attention mechanism to weigh the relationships between every pair of elements in a sequence, which lets the TTS AI system process complex sentences efficiently.
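To make the attention idea concrete, here is a minimal toy sketch (not Microsoft's or Google's implementation) of scaled dot-product attention, the core operation inside a Transformer. Every name below is illustrative, and real models operate on learned, high-dimensional projections rather than raw 2-dimensional vectors:

```python
# Toy sketch of scaled dot-product attention, the building block of
# the Transformer architecture. Each output is a weighted mix of the
# value vectors, weighted by how well the query matches each key.
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    d = len(keys[0])  # dimension used for the 1/sqrt(d) scaling
    out = []
    for q in queries:
        # Dot-product similarity between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Convex combination of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Self-attention: the same three token embeddings act as queries,
# keys, and values, so every token "looks at" every other token.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attention(x, x, x)
```

Because each output row is a weighted average of the inputs, every token's new representation blends information from the whole sequence at once, which is what lets Transformers handle long-range dependencies without recurrence.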
The denoising auto-encoder is a neural network trained to reconstruct data that has been deliberately corrupted. It relies on unsupervised learning, a branch of machine learning that extracts structure from unclassified, unlabeled, and uncategorized data sets.
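The training setup can be illustrated with a toy example (this is not the paper's model): corrupt a clean sequence by randomly masking tokens, then score a candidate reconstruction against the original. A real denoising auto-encoder learns a network that inverts the corruption; here the "model" just echoes its noisy input, so it scores below perfect wherever a token was masked:

```python
# Toy illustration of the denoising auto-encoder training signal:
# corrupt clean data, then measure how well it is reconstructed.
import random

def corrupt(tokens, mask_prob=0.3, seed=0):
    """Replace each token with '<mask>' with probability mask_prob."""
    rng = random.Random(seed)  # fixed seed for a reproducible demo
    return [t if rng.random() > mask_prob else "<mask>" for t in tokens]

def reconstruction_accuracy(original, reconstructed):
    """Fraction of positions reconstructed correctly -- the quantity
    a denoising auto-encoder is trained to maximize."""
    hits = sum(o == r for o, r in zip(original, reconstructed))
    return hits / len(original)

clean = "the quick brown fox jumps over the lazy dog".split()
noisy = corrupt(clean)
# An untrained "model" that echoes the corrupted input loses points
# at every masked position:
score = reconstruction_accuracy(clean, noisy)
```

The same recipe works on speech by masking spans of audio frames instead of words, which is how the technique applies to both modalities in a speech system.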
With the help of these neural network systems, Microsoft's TTS AI reached a 99.84 percent word-level intelligibility rate for text-to-speech and an 11.7 percent phoneme error rate (PER) for automatic speech recognition.
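Phoneme error rate is a standard ASR metric: the edit distance (substitutions, insertions, and deletions) between the predicted and reference phoneme sequences, divided by the reference length. A minimal sketch, with a made-up three-phoneme example:

```python
# Phoneme error rate (PER) via the classic dynamic-programming
# Levenshtein edit distance, computed in O(len(hyp)) extra space.
def edit_distance(ref, hyp):
    dp = list(range(len(hyp) + 1))  # distances from empty reference
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,        # deletion
                      dp[j - 1] + 1,    # insertion
                      prev + (r != h))  # substitution (free if match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

# "cat" /K AE T/ misheard as /K AH T/: one substitution out of three
# reference phonemes, i.e. a PER of 1/3.
per = phoneme_error_rate(["K", "AE", "T"], ["K", "AH", "T"])
```

An 11.7 percent PER therefore means roughly one phoneme in nine is wrong relative to the reference transcription.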
Microsoft’s Text-to-Speech AI
Built on Transformers, Microsoft's text-to-speech AI can treat either speech or text as input or output, letting a single model handle both recognition and synthesis. For training data, the team sourced the LJSpeech data set, which reportedly contains over 13,000 English audio snippets and their transcripts.
From LJSpeech, the researchers randomly selected just 200 paired clips to train the AI system, then leveraged the denoising auto-encoder to reconstruct corrupted speech and text in the data set. Surprisingly, the combination enabled the TTS AI to generate realistic-sounding speech, outperforming three baseline algorithms. The team wrote in their paper:
“Our method consists of several key components, including denoising auto-encoder, dual transformation, bidirectional sequence modeling, and a unified model structure to incorporate the above components. We can achieve 99.84% in terms of word-level intelligible rate and 2.68 MOS for TTS, and 11.7% PER for ASR with just 200 paired data on LJSpeech dataset, demonstrating the effectiveness of our method.”
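The "dual transformation" the quote mentions can be sketched with a toy example (not the paper's implementation, and with purely illustrative stand-in models): the TTS model turns unpaired text into synthetic speech, and each (synthesized speech, original text) pair then serves as pseudo training data for the ASR model, and vice versa:

```python
# Hedged toy sketch of dual transformation: two inverse "models"
# (here just lookup tables) bootstrap training pairs for each other.
text_to_speech = {"hello": "[audio:hello]", "world": "[audio:world]"}
speech_to_text = {v: k for k, v in text_to_speech.items()}

def dual_transform(unpaired_text):
    """Generate pseudo (speech, text) pairs from unpaired text by
    running TTS, then treating the output as labeled ASR data."""
    return [(text_to_speech[t], t) for t in unpaired_text]

pairs = dual_transform(["hello", "world"])
# Round-trip consistency: running ASR on the synthesized speech
# should recover the original text.
consistent = all(speech_to_text[s] == t for s, t in pairs)
```

In the real system both directions are learned networks that improve together: as TTS gets better, the pseudo-pairs it produces give ASR cleaner training data, and vice versa, which is how the method stretches only 200 genuinely paired examples.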
Here are a few audio samples of the speech produced by the AI: