12/10/2022
Last month, Intento published its much-anticipated State of Machine Translation report for 2022. This year’s report was produced in partnership with e2f, a language solutions company specializing in AI and data.
The report provides a broad and in-depth look at the landscape of current machine translation vendors, analyzes how they perform against one another, and gives businesses insight into how best to leverage this information in their own strategies.
The study offers a wealth of information, which this article hopes to usefully summarize. Without further ado, let’s dive in.
There is a robust supply of machine translation engines on the market today; this report evaluated 31 of them. These include Meta’s much-talked-about No Language Left Behind (NLLB) model, which was released as an open-source resource earlier this year and boasts support for 200 languages.
You can learn more about NLLB here: No Language Left Behind: Meta’s massive multilingual machine translation ambition pays it forward
This year sees a huge expansion in the total number of language pairs covered. Last year’s report covered 26,000 unique language pairs; this year, an additional 125,075 have been added.
This year, the researchers evaluated MT engines across 11 language pairs and 9 different domains. For the purposes of evaluation, they used only the stock models of each MT engine, meaning models trained on generic data rather than tuned for any particular domain.
Data collection. The first step was to source language data for the test. Intento sought out English-language data in each of the 9 domains and extracted sentence-level segments suitable for the purpose. It bears emphasizing that the collected data was open-source and licensed for this kind of use under the terms of its license agreements.
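To make this step concrete, here’s a minimal sketch of how running text might be cut into sentence-level segments. The regex-based splitter below is an illustrative simplification; the report doesn’t describe the actual segmentation tooling Intento used.

```python
import re

def split_into_segments(text: str) -> list[str]:
    """Naive sentence segmentation: split on ., !, or ? followed by
    whitespace. Real pipelines use more robust tooling; this is only
    an illustrative stand-in."""
    segments = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in segments if s]

sample = ("Machine translation has improved rapidly. "
          "Stock models are trained on generic data. "
          "Domain-tuned models can do even better!")
for segment in split_into_segments(sample):
    print(segment)
```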
Data cleaning. Intento then ran an initial cleaning pass, applying a number of criteria to weed out bad data and ensure that the remaining text segments were all of high quality.
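The report doesn’t enumerate the exact criteria, but a cleaning pass of this kind typically filters on length, punctuation, and markup debris. Here’s a hedged sketch; every threshold and check below is an assumption for illustration, not Intento’s actual rule set.

```python
import re

def is_clean_segment(segment: str, min_words: int = 4, max_words: int = 40) -> bool:
    """Illustrative quality filter for candidate English segments.
    The specific checks and thresholds are assumptions; the report
    does not publish Intento's criteria."""
    words = segment.split()
    # Discard fragments too short or too long to be useful sentences.
    if not (min_words <= len(words) <= max_words):
        return False
    # Require sentence-final punctuation.
    if not segment.rstrip().endswith((".", "!", "?")):
        return False
    # Reject segments containing URLs or markup debris.
    if re.search(r"https?://|<[^>]+>", segment):
        return False
    # Reject segments dominated by digits.
    if sum(ch.isdigit() for ch in segment) > 0.3 * len(segment):
        return False
    return True

candidates = [
    "The patient reported mild symptoms after the first dose.",
    "click here >>",
    "Visit https://example.com for details.",
]
print([s for s in candidates if is_clean_segment(s)])  # only the first survives
```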
Translation. Once the dataset was ready, it was turned over to e2f, which translated all the text into 11 different languages with the help of native-speaking translators with expertise in both their language and the relevant domain.
Review and proofreading. The translations were then passed to native-language experts for review and proofreading.
Automated QA. Here’s one particularly noteworthy step in the process: to ensure the quality of the translations, e2f deployed its MT Detection tool to estimate the likelihood that a translator had used raw machine translation or post-editing. Segments that exceeded the system’s threshold were re-translated and reviewed again. As a result, the report claims that the resulting dataset “does not bear the trace of MTPE”.
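e2f’s MT Detection tool is proprietary, but the underlying idea (flagging human translations that are suspiciously close to raw MT output) can be sketched in a few lines. The similarity measure and threshold below are toy assumptions, not e2f’s actual method.

```python
from difflib import SequenceMatcher

def looks_machine_translated(human_translation: str,
                             raw_mt_output: str,
                             threshold: float = 0.9) -> bool:
    """Toy MT/MTPE detector: flag a segment whose human translation
    is nearly identical to raw MT output. e2f's real tool is
    proprietary and certainly more sophisticated."""
    similarity = SequenceMatcher(None, human_translation, raw_mt_output).ratio()
    return similarity >= threshold

# A translation this close to the raw MT output would be sent back
# for re-translation and another round of review.
mt_output = "El gato se sentó en la alfombra."
translation = "El gato se sentó en la alfombra."
print(looks_machine_translated(translation, mt_output))  # True
```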
The result of this process was 500 unique segments per domain, with corresponding translations in 11 different languages. That works out to 500 segments × 9 domains × 11 languages = 49,500 translated segments in total!
In addition, because of how these segments were selected and processed, the researchers could be confident that none of them had been used to train the MT engines beforehand, making for a fairer playing field.
After preparing the data, the researchers had to decide which metric to use to evaluate the MT engines’ performance. Out of six candidate metrics, one stood out: COMET.
COMET is an open-source MT evaluation framework first introduced in late 2020. The researchers chose COMET because its scores showed a higher correlation with the judgments of human evaluators than the alternatives.
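For readers who want to try COMET themselves, scoring a segment with the open-source unbabel-comet package looks roughly like the sketch below. The checkpoint name is only an example, the exact predict API varies between package versions, and the report doesn’t say which model Intento used.

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Example checkpoint; the report does not specify which COMET model
# Intento used in its evaluation.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "The cat sat on the mat.",           # source segment
    "mt": "El gato se sentó en la alfombra.",   # machine translation
    "ref": "El gato se sentó en la alfombra.",  # human reference
}]

# COMET returns segment-level scores plus a system-level score;
# higher scores track human judgments more closely.
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)
```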
After running the tests, the following results came out:
Six major contenders. The study identified six statistically significant leaders across the 11 language pairs: Google, DeepL, Amazon, Yandex, Youdao, and Naver.
Two for full coverage. When domains are left out of the equation, DeepL and Google together cover the best option for all 11 language pairs.
More leaders when domains are included. In addition to the six, 10 more MT engines were found to be statistically significant leaders when domains were taken into consideration. A combination of six engines was found to provide minimal coverage for all languages and domains: Google, DeepL, Amazon, Microsoft, Tencent, and Naver (see the sketch after this list for how such a minimal combination can be derived).
Best-performing languages. Most engines performed particularly well for English to Spanish and English to Chinese.
Lowest-scoring domains. MT engines scored lowest in the Entertainment and Colloquial domains.
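To give a feel for how such a minimal combination of engines can be derived, here’s a toy greedy set-cover over (language pair, domain) cells. The leader assignments below are invented for illustration; the real per-cell leaders are in the full report.

```python
# Invented leader data: which (language pair, domain) cells each
# engine leads in. The report's actual tables are far larger.
leaders = {
    "Google": {("en-es", "Legal"), ("en-zh", "Healthcare"), ("en-de", "IT")},
    "DeepL":  {("en-de", "Legal"), ("en-es", "Healthcare")},
    "Amazon": {("en-zh", "IT")},
}

uncovered = set().union(*leaders.values())
chosen = []
while uncovered:
    # Greedily pick the engine that covers the most remaining cells.
    best = max(leaders, key=lambda engine: len(leaders[engine] & uncovered))
    chosen.append(best)
    uncovered -= leaders[best]

print(chosen)  # ['Google', 'DeepL', 'Amazon']
```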
It should be noted, once again, that the evaluation was done on the stock models of each MT engine. Twelve engines feature vertical models (pre-trained for certain domains), and many others offer customization support; neither was considered in this evaluation. Engines that offer customization may show better performance, particularly in the low-scoring domains.
There is much more information provided in the report, which can be accessed here. The Intento report gives us a clear view of the current landscape of machine translation, with a lot of actionable insight for businesses and other stakeholders in the field.
We will also be taking a deeper look at different parts of the report in the future, so stay tuned for updates on our blog.