11/08/2022
Say you’re a business that has decided to invest in a machine translation system. You’ve done some basic research and found that there are many options to choose from. Each one claims a certain score on certain metrics, but you don’t know what the numbers really mean. How do you know which one is the best fit for you?
You need to understand how machine translation evaluation works.
This article takes an in-depth look at machine translation evaluation. It will help you understand what it is, why you need it, and the different types of evaluation, so you can make a well-informed decision when choosing an MT system to invest in.
Machine translation evaluation refers to the different processes of measuring the performance of a machine translation system.
It’s a way of scoring the quality of MT so that it’s possible to know how good a system is and to compare how effective different MT systems are on a solid basis. To do this, machine translation evaluation makes use of quantifiable metrics.
There are two main reasons why evaluating the performance of an MT system needs to be done. The first is to check if it’s good enough for real-world application. The second is to serve as a guide in research and development.
First, of course, is to determine whether the MT system works at a level that is good enough for actual use. This is the reason most directly relevant to end users. If the machine translation system performs poorly, users are more likely to choose something else.
Industrial sectors that use MT would also want concrete metrics for deciding what MT system to get. After all, MT is an investment, and businesses need to get the best value for their money.
As such, MT developers need to evaluate whether the machine translation system’s quality is good enough for them to send it out to clients.
MT systems are, ideally, not static entities. The technology behind MT is continually improving over time, so it makes sense to expect MT systems to improve as well.
This is where research comes in, and researchers need a guide on where to look. Measurable metrics allow them to compare whether a particular approach is better than another, helping them fine-tune the system.
This is especially useful for seeing how the system deals with recurring translation errors. Measurable metrics can show, in a more controlled setting, whether or not a particular approach is able to deal with these kinds of errors.
There are two different ways to determine how well an MT system performs. Human evaluation is carried out by human experts making manual assessments, while automatic evaluation uses AI-based metrics specially developed for assessing translation quality without human intervention. Each has its own advantages and disadvantages. We’ll go into further detail on both kinds of MT evaluation later in this article, but first, here’s a quick overview of the two types of machine translation evaluation, as well as the approaches toward MT evaluation that make use of them.
Human evaluation of machine translation means that the assessment of translation quality is done by professional human translators. This is the most effective option when it comes to determining the quality of machine translations down to the level of individual sentences. But human evaluation, as with human translation, is by nature more costly and time-consuming.
Automatic evaluation, on the other hand, uses programs built specifically to assess the quality of machine translation according to different methods. It’s not as reliable as human evaluation on the sentence level, but is a good scalable option when evaluating the overall quality of translation on multiple documents.
The approaches toward machine translation evaluation are based on the concept of granularity. That is, the different levels at which the scoring might be considered significant.
Sentence-based approach. Under this approach, each sentence is given a score saying whether its translation is good (1) or not good (0), and the scores are averaged. This is most commonly done in human evaluation.
Document-based approach. Also known as the corpus-based approach, this likewise gives sentences individual scores, but the significant score is the total or average across a larger set of documents (see the short scoring sketch after these approaches). This is the smallest level at which automated MT evaluation can be considered significant, as it depends heavily on statistics from a wide dataset.
Context-based approach. This approach differs from the previous ones in that it considers how well the overall MT output suits the purpose it’s put to, rather than relying on average scores based on sentences. As such, it might be considered a holistic approach to MT evaluation.
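To make the scoring idea concrete, here’s a minimal sketch in Python. The function name and the sample judgments are purely illustrative; they’re not part of any standard evaluation tool.

```python
# Minimal sketch: each sentence gets a binary judgment (1 = good, 0 = not good),
# and the document-level score is simply the average of those judgments.

def document_level_score(sentence_judgments):
    """Average the per-sentence judgments across a document or corpus."""
    return sum(sentence_judgments) / len(sentence_judgments)

# A human judge marked 4 out of 5 sentences in a document as good translations.
judgments = [1, 1, 0, 1, 1]
print(document_level_score(judgments))  # 0.8
```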
Machine translation evaluation is a difficult process. This is because language itself is a very complex thing.
For one, there can be multiple correct translations. Take, for example, the following sentence:
The quick brown fox jumped over the lazy dog.
An MT system might generate the following translation instead:
The fast brown fox pounced over the indolent dog.
This is a technically correct translation, and in human evaluation it would normally be marked as such. But in automatic evaluation, which compares the output against a reference translation, the words that don’t match would be penalized.
Small details can also completely change a sentence’s meaning.
The quick brown fox jumped on the lazy dog.
Here, only one word has been changed. But that one word changes the meaning of the sentence completely. Automatic evaluation is likely to score it higher than the previous example. Human evaluators are likely to catch the error, but some might consider it correct.
And that’s because language can be subjective. Even human evaluators can differ in their judgments on whether a translation is good or not.
Now that we’ve gone over the basics, let’s take an in-depth look at the two types of MT evaluation, beginning with human evaluation.
At the most basic level, the goal of machine translation is to translate text from a source language into a target language on a level that humans can understand. As such, humans are the best point of reference for evaluating the quality of machine translation.
There are a number of different ways that human evaluation is done, which we’ll go into now:
Direct assessment is the simplest kind of human evaluation: machine translation output is scored at the sentence level.
The challenge with direct assessment is that different judges will vary widely in the way that they score. Some may tend to go for the extremes in terms of scoring, marking translations as either very bad or very good. Others may play it more conservatively, marking the same sentences with scores closer to the middle.
Another challenge is, again, subjectivity. In judging whether a sentence is a bad translation or not, evaluators need to make decisions on language that is ambiguous. Going back to the example sentence:
The quick brown fox jumped over the lazy canine.
Here, canine isn’t necessarily wrong, but it isn’t the best fit either. Some evaluators may consider it good enough, while others might flag it as completely wrong. For example, if the scoring is done on a 5-point scale, one evaluator might mark it a 4, while another might give it only a 2.
These challenges can be offset by employing a larger pool of evaluators, which allows the scores to be normalized in statistical terms.
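One common way to do this, sketched below, is to standardize each evaluator’s scores so that individual scoring tendencies cancel out. This is only an illustration of the idea; the data and function name are made up, and real evaluation campaigns may use more sophisticated normalization.

```python
from statistics import mean, stdev

def normalize_scores(raw_scores):
    """Standardize one evaluator's scores to zero mean and unit variance,
    so that 'harsh' and 'generous' scorers become comparable."""
    mu, sigma = mean(raw_scores), stdev(raw_scores)
    return [(score - mu) / sigma for score in raw_scores]

# Two evaluators scoring the same five sentences on a 5-point scale:
evaluator_a = [5, 1, 5, 2, 5]  # tends toward the extremes
evaluator_b = [4, 2, 4, 3, 4]  # plays it closer to the middle
print(normalize_scores(evaluator_a))
print(normalize_scores(evaluator_b))  # similar patterns once normalized
```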
Another way to assess machine translation systems through human evaluation is ranking.
In this case, evaluators don’t provide individual scores for sentences, but instead compare translations from different MT systems. They then decide which one is the best translation, which is second best, and so on.
The advantage of this method over direct assessment is that it immediately provides a direct comparison, as opposed to comparing scores that have been generated over different trials and possibly by different evaluators.
However, it does still suffer from the challenge of subjectivity. Different MT systems are likely to come up with different errors. For example:
The quick green fox jumped over the lazy dog.
Quick brown fox jumped over lazy dog.
The quick brown fox jump over the lazy dog.
Each sentence has a simple error. The first one has a mistranslation. The second omits articles. The third one gets the verb tense wrong.
Evaluators now need to decide which errors matter more than others, and again, evaluators may have different opinions on the matter.
If the user’s purpose for an MT system is to prepare documents for post-editing, there are also ways to evaluate it according to the amount of effort it takes to post-edit.
The fundamental purpose of post-editing is to allow a translator to work faster than if they were to translate a text from scratch. As such, the simplest way to assess an MT system for post-editing is by measuring the time it takes for the translator to correct the machine-translated output.
Another way to measure post-editing effort is by counting the number of keystrokes it would take to replace the machine-translated text with a human reference translation. This is independent of time constraints, but it also does not take into consideration the possibility of multiple correct translations.
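A rough proxy for this keystroke-based measure is the character-level edit distance between the MT output and the post-edited (or reference) text. The sketch below is a plain implementation of the standard Levenshtein distance, offered as an illustration rather than a production post-editing tool.

```python
def edit_distance(mt_output, reference):
    """Character-level Levenshtein distance: the minimum number of insertions,
    deletions, and substitutions needed to turn mt_output into reference."""
    m, n = len(mt_output), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all remaining characters
    for j in range(n + 1):
        dp[0][j] = j  # insert all remaining characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if mt_output[i - 1] == reference[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

print(edit_distance("The fast brown fox pounced over the indolent dog.",
                    "The quick brown fox jumped over the lazy dog."))
```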
Then there’s task-based evaluation which, as the name suggests, assesses an MT system based on how well it's suited to the task at hand. For example, if it's used in a multilingual webinar setting, participants could be asked to rate their experience with a machine-translated transcript. This means that they are rating the success of the MT system as a whole.
The problem with this approach is that it's very open to the introduction of other uncontrolled elements that may affect the rating evaluators give. As such, the use of task-based evaluation is very situational.
As you might be able to see, the different types of human evaluation of MT come with their own challenges. There are also some challenges that they share broadly, and these have to do with consistency or agreement.
Inter-annotator agreement refers to the consistency of scores between different evaluators. As we mentioned earlier, different evaluators will have varying tendencies in the way they score the same segments of text. Some may score them at the extremes, others toward the middle. When ranking different MT engines, their opinions can also vary. This is why it’s important to have multiple evaluators, so that the distribution of scores can be normalized.
Intra-annotator agreement measures how consistently a single evaluator scores the same text. An evaluator might score a sentence as good or bad the first time around, but change their mind upon repeating the same test. A high level of intra-annotator agreement indicates that the chosen evaluator can be considered consistent and reliable.
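Agreement of both kinds is usually reported with a chance-corrected statistic such as Cohen’s kappa. The sketch below uses scikit-learn’s cohen_kappa_score on made-up binary judgments; both the data and the choice of kappa are illustrative assumptions rather than a fixed standard.

```python
from sklearn.metrics import cohen_kappa_score

# Two evaluators judging the same ten sentences as good (1) or bad (0).
evaluator_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
evaluator_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

# Inter-annotator agreement: how well two evaluators agree with each other,
# corrected for the agreement expected by chance.
print(cohen_kappa_score(evaluator_a, evaluator_b))

# Intra-annotator agreement: the same evaluator re-judging the same sentences
# on a second pass (hypothetical repeat judgments).
evaluator_a_repeat = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
print(cohen_kappa_score(evaluator_a, evaluator_a_repeat))
```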
Human evaluation is considered the gold standard when it comes to evaluating the quality of machine translation. However, it’s a costly endeavor in terms of effort and time. This is why researchers in the field have developed different means of evaluating MT quality through automated processes.
These processes are designed to approximate how humans will evaluate the MT system. Of course, they are far from perfect at this, but automatic evaluation still has very important use cases.
The main advantage of automatic evaluation over human evaluation is its scalability. It’s much faster to run hundreds of instances of automatic evaluation than even one round of human evaluation. This makes it ideal when tweaking or optimizing the MT system, where quick results are needed.
Unlike humans, machines aren’t equipped to handle the different nuances of language usage. Automatic evaluation systems are premised on the MT having an exact match with a reference text, and minor differences can have an impact on the final score. These differences can include deviations in morphology, the use of synonyms, and grammatical order.
Anything a human evaluator would consider technically correct, or at least acceptable, can be penalized in automatic evaluation. Nonetheless, the number of exact matches, especially when considering a large sample of text, is often enough to make automatic evaluation feasible for use.
There are a number of different automatic evaluation metrics available today. Here are some examples of the ones in use:
BLEU (Bilingual Evaluation Understudy)
NIST (from the National Institute of Standards and Technology)
METEOR (Metric for Evaluation of Translation with Explicit Ordering)
LEPOR (Length-Penalty, Precision, n-gram Position Difference Penalty and Recall)
COMET
PRIS
TER (Translation Error Rate)
Each metric is built on a different algorithm, and as such each handles the process of automatic evaluation differently. That means they have different strengths and weaknesses, and differ as to which kinds of errors they give higher or lower penalties to.
Of all the metrics listed above, BLEU is the one that is most commonly used. It was one of the first metrics to achieve a high level of correlation with human evaluation, and it has spawned many different variations.
How it works is that individual sentences are scored against a set of high-quality reference translations. These scores are then aggregated, and the resulting number is the final BLEU score for that MT system. This score represents how closely the MT system’s output matches the human reference translations, which serve as the marker of quality.
The scores are calculated using units called n-grams, which are sequences of consecutive words. Going back to the earlier sample sentence, for example:
The quick brown fox jumped over the lazy dog.
This can be divided into n-grams of different lengths. A 2-gram, for example, would be “The quick”, “quick brown”, or “brown fox”. A 3-gram would be “The quick brown” or “quick brown fox”. A 4-gram would be “The quick brown fox”. And so on.
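As a quick illustration, here’s how n-grams might be extracted in Python. The whitespace tokenization is a simplifying assumption; real metrics use their own tokenizers.

```python
def ngrams(sentence, n):
    """Return the list of n-grams (tuples of consecutive words) in a sentence."""
    tokens = sentence.split()  # naive whitespace tokenization
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The quick brown fox jumped over the lazy dog."
print(ngrams(sentence, 2))  # [('The', 'quick'), ('quick', 'brown'), ...]
print(ngrams(sentence, 3))  # [('The', 'quick', 'brown'), ...]
```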
It’s a complex mathematical process, but in basic terms BLEU’s algorithm calculates the score by checking for the number of overlaps between n-grams. The calculated score will be between 0 and 1, with 1 representing a completely identical match between the reference and the output sentence. Now take the following variation on the sample sentence:
The fast brown fox jumped over the lazy dog.
All of the n-grams will match except the ones that have the word “fast”. Another example:
The quick brown fox jumped over the dog.
In this example, the word “lazy” is missing, so that also impacts the overlap negatively. In both cases, the BLEU score would still be high, but less than 1.
In practice, not many sentences will show this high level of overlap with the reference. As such, BLEU scores become statistically significant only when taken over a large sample of text, or corpus.
There are, of course, other factors that go into calculating the BLEU score, such as penalties for extra words or very short sentences. Other derivative scoring systems have been developed to compensate for its shortcomings, but BLEU remains highly rated and continues to be the most widely used MT evaluation system today.
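For a sense of what this looks like in practice, here’s a minimal sketch of computing a corpus-level BLEU score with the sacrebleu library, using the example variations above as pretend MT output. Note that sacrebleu reports BLEU on a 0–100 scale rather than 0–1, and scores on a sample this tiny aren’t meaningful; this is only to show the shape of the workflow.

```python
import sacrebleu  # pip install sacrebleu

# Pretend MT outputs and their human reference translations.
hypotheses = [
    "The fast brown fox jumped over the lazy dog.",
    "The quick brown fox jumped over the dog.",
]
references = [
    "The quick brown fox jumped over the lazy dog.",
    "The quick brown fox jumped over the lazy dog.",
]

# corpus_bleu takes the list of hypotheses and a list of reference streams
# (one stream per reference set, since BLEU allows multiple references).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)  # corpus-level BLEU: high, but well below the maximum of 100
```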
And that covers the basics of machine translation evaluation. As we have shown, assessing an MT system can be done through human evaluation or automatic evaluation. Both processes have their advantages and disadvantages.
Human evaluation is the gold standard in terms of quality, but it is expensive and time-consuming. Automatic evaluation is not as accurate, but it’s quick and scalable. As such, both types have specific use cases where they shine.