Google Unveils Translatotron, a Speech-to-Speech Translation System

Last week, Google unveiled Translatotron, an end-to-end speech translation model that converts speech from one language into another while retaining the speaker’s voice. It replaces the three separate stages of a conventional cascaded translation system (automatic speech recognition, machine translation, and text-to-speech synthesis) with a single model.

In a statement, Google AI software engineers Ye Jia and Ron Weiss explained:

“In ‘Direct speech-to-speech translation with a sequence-to-sequence model,’ we propose an experimental new system that is based on a single attentive sequence-to-sequence model for direct speech-to-speech translation without relying on intermediate text representation.

This system avoids dividing the task into separate stages, providing a few advantages over cascaded systems, including faster inference speed, naturally avoiding compounding errors between recognition and translation, making it straightforward to retain the voice of the original speaker after translation, and better handling of words that do not need to be translated.”

How Translatotron Works

According to Google, Translatotron has two primary goals: to eliminate the intermediate speech-to-text step and to avoid replacing the speaker’s voice with a generic one. In their paper, published on arXiv, the Google engineers describe a neural network that takes spectrograms of the original speech as input and generates spectrograms of the translated speech, reproducing the speaker’s voice.
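
To make the term concrete, here is a minimal sketch of how a magnitude spectrogram can be computed from raw audio samples. It is not Translatotron's own front end: the function name `magnitude_spectrogram`, the frame size, the hop length, and the test tone are all assumptions chosen for illustration, and a plain discrete Fourier transform is used for readability rather than speed.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Return one list of frequency-bin magnitudes per overlapping frame.

    Hypothetical helper for illustration only; real systems use an FFT
    and usually a mel-scaled filterbank on top.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Hann window to reduce spectral leakage at the frame edges.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        bins = []
        # Only bins 0..frame_size/2 are needed for a real-valued signal.
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            bins.append(abs(acc))
        frames.append(bins)
    return frames


# Assumed toy input: a 100 Hz sine sampled at 800 Hz.
sample_rate = 800
tone = [math.sin(2 * math.pi * 100 * t / sample_rate) for t in range(256)]
spec = magnitude_spectrogram(tone)

# The strongest bin of the first frame should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Each row of `spec` describes how much energy each frequency carries during one short slice of time. A model of the kind described above would consume a sequence of such rows for the source speech and emit a new sequence for the translated speech.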

The Google AI team reported that the system also relies on two separately trained components: a neural vocoder, which converts the output spectrograms to time-domain waveforms, and a speaker encoder, which preserves the original speaker’s voice in the synthesized translation.
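
The vocoder's job, turning spectrogram frames back into a waveform, can be illustrated with a deliberately naive stand-in. The sketch below is not a neural vocoder and is not what Google uses: it simply sums one sinusoid per frequency bin, and the function name `naive_vocoder` and all parameter values are assumptions made for the example.

```python
import math


def naive_vocoder(spectrogram, sample_rate, frame_size, hop):
    """Convert magnitude frames to a waveform by additive sinusoid synthesis.

    Hypothetical stand-in for a neural vocoder. Phase is derived from the
    global sample index, so each sinusoid stays continuous across frames.
    """
    waveform = [0.0] * (hop * len(spectrogram))
    for i, frame in enumerate(spectrogram):
        for k, magnitude in enumerate(frame):
            if magnitude == 0.0:
                continue  # skip silent bins
            freq = k * sample_rate / frame_size  # centre frequency of bin k
            for n in range(i * hop, (i + 1) * hop):
                waveform[n] += magnitude * math.sin(
                    2 * math.pi * freq * n / sample_rate
                )
    return waveform


# Assumed toy spectrogram: four frames with energy only in bin 8,
# which corresponds to 100 Hz at these settings.
frames = [[1.0 if k == 8 else 0.0 for k in range(33)] for _ in range(4)]
audio = naive_vocoder(frames, sample_rate=800, frame_size=64, hop=32)
```

A magnitude spectrogram discards phase information, which is one reason practical systems rely on learned vocoders or iterative phase estimation rather than a direct sum like this one.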

[Audio samples: original speech; Translatotron translation in canonical voice; Translatotron translation in original voice]

The team evaluated translation quality using the BLEU score, a metric that scores machine-translated text by its n-gram overlap with reference translations. The results still trailed those of a conventional cascaded system, but the engineers were satisfied that they had demonstrated the feasibility of end-to-end direct speech-to-speech translation.
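
For readers unfamiliar with BLEU, the following is a simplified, single-reference sketch of the idea: clipped n-gram precision combined with a brevity penalty. It is not the evaluation code used by the Google team; the function names `ngrams` and `simple_bleu` and the example sentences are assumptions, and production implementations add smoothing, tokenization rules, and corpus-level aggregation. Because BLEU operates on text, speech output would first have to be transcribed before it could be scored this way.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
perfect = simple_bleu("the cat sat on the mat", reference)
partial = simple_bleu("the cat sat on a mat", reference)
unrelated = simple_bleu("hello world", reference)
```

An exact match scores highest, a near match lands somewhere between 0 and 1, and a candidate that shares no words with the reference scores 0.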