Speech to speech translation with translatotron: A state of the art review
Date
2025-10-20
Publisher
Elsevier B.V.
Abstract
Speech-to-speech translation using cascade-based methods has long been considered the benchmark. Still, it suffers from several issues, such as translation latency and compounding errors. These issues arise because cascade-based methods chain several components: speech recognition (speech-to-text transcription), text-to-text translation, and finally text-to-speech synthesis. Google proposed Translatotron, a sequence-to-sequence direct speech-to-speech translation model designed to address the compounding errors associated with cascade-based models. Today, there are three versions of the Translatotron model: Translatotron 1, Translatotron 2, and Translatotron 3. Translatotron 1 is a proof of concept that demonstrates direct speech-to-speech translation. This first approach was found to be less effective than the cascade model, but it produced promising results. Translatotron 2 is an improved version of Translatotron 1, with results similar to those of the cascade-based model. Translatotron 3, the latest version, significantly improves translation quality and outperforms the cascade model in some respects. This paper presents a complete review of speech-to-speech translation using Translatotron models. We also show that Translatotron is the best model to bridge the language gap between African languages and other well-formalized languages.
Description
This article provides a comprehensive review of Translatotron models.
• It explores the architecture, innovations, and performance of Translatotron models compared to traditional cascaded systems.
• It compares Translatotron models to other speech-to-speech translation (S2ST) models and presents them as potential candidates for African language translation.
Keywords
Translatotron, Speech-to-speech, BLEU, Cascade