
Microsoft's Latest Text-to-Speech AI Generates Realistic Speech

efes / Pixabay

Following Google’s introduction of its AI speech translation tool Translatotron, Microsoft has unveiled a new text-to-speech (TTS) AI system that can reportedly generate realistic speech. The technology was developed in partnership with a team of Chinese researchers.

In their paper published on GitHub, the team reported that the TTS AI relies on two key components: a Transformer and a denoising auto-encoder. A Transformer is a neural network architecture introduced by researchers at Google, including scientists from Google Brain. It uses an attention mechanism to relate every element of an input sequence to every other element, which helps the TTS AI system process long, complex sentences efficiently.
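To give a sense of what a Transformer computes, here is a minimal sketch of scaled dot-product attention, the core operation of the architecture. It is a plain-Python illustration with made-up toy vectors, not code from Microsoft's system; the function names `softmax` and `attention` are chosen here for readability.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Toy example: one query attending over two positions.
result = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
```

Because both keys in the toy example match the query equally, the output is simply the average of the two value vectors. In a real model the weights differ per position, which is how each output can draw on the most relevant parts of the sentence.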

The denoising auto-encoder is a neural network trained to reconstruct the original data from a corrupted copy. It is trained with unsupervised learning, a branch of machine learning that learns from unlabeled, uncategorized data sets.
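The data side of that idea can be sketched in a few lines. The example below only builds (corrupted input, clean target) pairs by masking random tokens; the network that would learn to undo the corruption is left out. The `<mask>` token, the 30 percent masking rate, and the function names are assumptions for illustration, not details from the paper.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def corrupt(tokens, mask_prob=0.3, seed=0):
    """Replace each token with MASK with probability mask_prob."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def make_training_pair(tokens, mask_prob=0.3, seed=0):
    """Return (noisy input, clean target) for a denoising objective."""
    return corrupt(tokens, mask_prob, seed), list(tokens)


sentence = "the quick brown fox jumps over the lazy dog".split()
noisy, target = make_training_pair(sentence)
# A denoising auto-encoder would be trained to map `noisy` back to `target`.
```

No labels are needed to build these pairs, which is why the approach counts as unsupervised: the clean data serves as its own training target.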

With the help of these neural network systems, Microsoft’s AI reached a 99.84 percent word-level intelligibility rate for text-to-speech and an 11.7 percent phoneme error rate (PER) for automatic speech recognition (ASR).
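For readers unfamiliar with the metric, phoneme error rate is commonly computed as the edit distance between the recognized phoneme sequence and the reference sequence, divided by the reference length. The sketch below follows that common definition; the article does not say exactly how the researchers scored their system, and the phoneme strings are invented examples.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance divided by the number of reference phonemes."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


reference = ["HH", "AH", "L", "OW"]   # invented reference transcription
recognized = ["HH", "EH", "L", "OW"]  # one phoneme substituted
per = phoneme_error_rate(reference, recognized)
```

Under this definition, an 11.7 percent PER means roughly one phoneme-level error for every eight or nine reference phonemes.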

Microsoft’s Text-to-Speech AI

Thanks to its Transformer components, Microsoft’s text-to-speech AI can take either speech or text as input and produce either as output. To create their training data, the team used the LJSpeech data set, which reportedly contains over 13,000 English audio snippets with transcripts.

The researchers built a data set of 200 clips chosen at random from LJSpeech and used it to train the AI system. They then used the denoising auto-encoder to reconstruct corrupted speech and text in the data set. The combination enabled the TTS AI system to generate realistic speech, even outperforming three baseline algorithms. The team wrote in their paper:

“Our method consists of several keys components, including denoising auto-encoder, dual transformation, bidirectional sequence modeling, and a unified model structure to incorporate the above components. We can achieve 99.84% in terms of word-level intelligible rate and 2.68 MOS for TTS, and 11.7% PER for ASR with just 200 paired data on LJSpeech dataset, demonstrating the effectiveness of our method.”
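The 200-clip selection described above amounts to a simple random sample. The sketch below shows one way to draw it; the clip IDs are invented stand-ins, 13,000 is used only as a round figure for "over 13,000", and treating the leftover clips as unpaired data is an assumption made here rather than something the article states.

```python
import random


def split_paired_unpaired(clip_ids, n_paired=200, seed=0):
    """Randomly pick n_paired clips to keep their transcripts; return the rest separately."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    paired = set(rng.sample(clip_ids, n_paired))
    unpaired = [cid for cid in clip_ids if cid not in paired]
    return sorted(paired), unpaired


# Hypothetical IDs standing in for LJSpeech clips.
all_clips = [f"clip_{i:05d}" for i in range(13000)]
paired_clips, unpaired_clips = split_paired_unpaired(all_clips)
```

The point of the exercise is scale: 200 paired clips is under 2 percent of the data set, which is what makes the reported scores notable.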


Chelle Fuertes

Chelle is the Product Management Lead at INK. She's an experienced SEO professional as well as UX researcher and designer. She enjoys traveling and spending time anywhere near the sea with her family and friends.
