Meta has unveiled SeamlessM4T, an AI model that can translate and transcribe nearly 100 languages across text and speech. Available in open source along with SeamlessAlign, a new translation dataset, the company claims that SeamlessM4T represents a "significant breakthrough" in the field of AI-powered speech-to-speech and speech-to-text applications.
One of SeamlessM4T's underlying features is implicit source language identification, meaning it does not need a separate language ID model, according to Meta in a blog post shared with TechCrunch: "Our single model translates on the fly, which means speakers of different languages can better communicate with each other."
SeamlessM4T is something of a spiritual successor to Meta's No Language Left Behind, a text-to-text machine translation model, and to Universal Speech Translator, one of the few direct speech-to-speech translation systems to support the Hokkien language. It also builds on Massively Multilingual Speech, Meta's framework providing speech recognition, language identification, and speech synthesis technology across more than 1,100 languages.
Meta isn't alone in investing resources in sophisticated AI translation and transcription tools.
Beyond the commercial services and open-source models available from Amazon, Microsoft, OpenAI, and a number of startups, Google is working on what it calls the Universal Speech Model, part of a larger effort by the tech giant to build a model that can understand the world's 1,000 most spoken languages. Mozilla, meanwhile, spearheaded Common Voice, currently the largest multilingual collection of voices for training automatic speech recognition algorithms.
But SeamlessM4T is one of the more ambitious attempts to date to combine translation and transcription capabilities into a single model.
During its development, Meta claims to have scraped publicly available text (on the order of "tens of billions" of sentences) and speech (4 million hours) from the web. In an interview with TechCrunch, Juan Pino, a research scientist at Meta's AI research division and a contributor to the project, declined to reveal more about the data's origins, saying only that there was "a variety of" sources.
Not all content creators agree with the concept of using public information to train models that could be used for commercial purposes. Some have sued companies building AI tools on top of available data, arguing the vendors should be forced to provide credit if not compensation — and clear ways to opt out.
But Meta claims that the data it mined (which, the company admits, might contain personally identifiable information) wasn't copyrighted and came primarily from open source or licensed sources.
Whatever the case, Meta used the scraped text and speech to build the training dataset for SeamlessM4T, dubbed SeamlessAlign. Researchers aligned 443,000 hours of speech with texts and created 29,000 hours of "speech-to-speech" alignments, which "taught" SeamlessM4T how to transcribe speech to text, translate text, generate speech from text, and even translate words spoken in one language into words in another.
Meta says that on an internal benchmark, SeamlessM4T performed better against "background noises" and "speaker variations" in speech-to-text tasks than the current state-of-the-art speech transcription model. It attributes this to the rich combination of speech and text data in the training dataset, which Meta believes gives SeamlessM4T an edge over speech-only and text-only models.
Citing its state-of-the-art results, Meta's blog post called SeamlessM4T an important breakthrough in the AI community's quest to create universal multitask systems.
One might wonder, though, what biases the model contains.
A recent article in The Conversation details many of the serious failings of AI translation, including gender bias. For instance, Google Translate once assumed doctors were male and nurses female in certain languages, while Bing's translator rendered phrases like "the table is soft" with the feminine "die Tabelle" in German, which refers to a table of figures rather than furniture.
Speech recognition algorithms, too, are often fraught with biases. According to a study published in the Proceedings of the National Academy of Sciences, speech recognition systems from leading companies were twice as likely to incorrectly transcribe audio from Black speakers as from white speakers.
SeamlessM4T is no exception.
In a whitepaper published alongside the blog post, Meta acknowledges that the model "overgeneralizes to masculine forms when translating from neutral terms" and, for most languages, performs better when translating from the masculine reference (i.e., nouns used with masculine personal pronouns in English).
In other words, when it lacks gender information, it defaults to the masculine form about 10 percent of the time, perhaps, Meta speculates, because of an "overrepresentation of masculine lexica" in the training data.
Meta also claims that SeamlessM4T doesn't produce an inordinate amount of toxic text in its translations, a common flaw of translation and generative AI text models. But it isn't flawless. In some languages, such as Bengali and Kyrgyz, SeamlessM4T generated more toxic or offensive translations concerning socioeconomic status and culture. And in general, SeamlessM4T was more toxic in translations dealing with sexual orientation and religion.
Meta notes that the public demo of SeamlessM4T includes a filter for toxicity in input speech, as well as a filter for potentially toxic output speech. That filter isn't present by default in the open source release of the model, however.
A larger issue with the widespread use of AI translators, one that goes unaddressed in the whitepaper, is the loss of lexical richness. Unlike AI, human translators make choices unique to them when converting one language into another, often explicating, normalizing, or condensing and summarizing, producing stylistic fingerprints known informally as "translationese." AI may generate more "accurate" translations, but at the cost of translation variety and diversity.
That's probably why Meta advises against using SeamlessM4T for long documents and sworn translations, such as those submitted to a government agency or translation authority to verify the authenticity of a foreign text.
Meta doesn't recommend SeamlessM4T for medical or legal uses — I suppose an insurance policy in case the AI messes up.
That makes sense; after all, there have been at least a few cases in which AI mistranslations led authorities astray. In September 2012, police wrongfully detained a Kurdish man for financing terrorism over a mistranslated text message. And in 2017, a cop in Kansas used Google Translate to ask a Spanish speaker whether he could search his car for drugs, but because the translation was inaccurate, the driver didn't fully understand what he was agreeing to and the case was ultimately dismissed.
"This unified system approach reduces errors and delays, increases efficiency and quality of the translation process, bringing us closer to making seamless translation possible," Pino said. "In the future, we want to explore how this foundational model can enable new communication capabilities — ultimately bringing us closer to a world where everyone can be understood."
Let's hope humans aren't left completely out of the loop in that future.