08/06/2023
Language is at the heart of human communication, shaping our cultures, histories, and identities. The advent of technology, particularly machine learning (ML) and natural language processing (NLP), has transformed the way we communicate, making it more convenient and comprehensive.
It can’t be denied that the world has become more interconnected, especially with the development of ML and NLP. However, while the technological revolution has largely benefitted high-resource languages, it has left many low-resource languages unexplored, under-resourced, and, therefore, under-represented. This has raised concerns about how these under-represented languages should be handled and how they can be made accessible to the general public in an ethical way.
Besides the ethical considerations, there is a race against time to gather and preserve data from low-resource languages. The Guardian reported that of the roughly 7,000 languages spoken worldwide, over half are expected to be extinct by the end of the century, and many of them are low-resource languages. One of the reasons for this crisis is climate change, as many small communities are affected by rising sea levels.
Despite this, under-represented languages should be a part of the advancement of machine translation, and we will discuss why in this article.
So let’s first define what makes a language low-resource. The Internet has become more and more multilingual over the years, but that doesn’t mean that all languages can be found online.
Low-resource languages have very limited or non-existent digital content, making it hard for researchers to use them in machine translation models. These resources range from online text data to machine translation tools, speech recognition software, and more.
What’s the difference between an endangered and a low-resource language?
An endangered language is at risk of falling out of use, mostly because its speakers are switching to other languages. A low-resource language is not necessarily endangered, and vice versa. But both endangered and low-resource languages can experience a power imbalance in Information and Communication Technologies, wherein a lack of “equality in the control of resources and information” prevents local communities from defining their own views, needs, and goals.
Despite being less visible online, low-resource languages are important because they play a vital role in diversifying our global linguistic heritage by containing unique cultural knowledge and acting as social connectors for communities. Hence, there is a pressing need to develop resources for these languages and make them digitally visible and accessible.
The realm of low-resource languages presents unique challenges, community-based approaches, and ethical considerations, which field experts can shed light on. The primary challenge is the scarce digital resources available for these languages, which hinders the development of NLP tools and technologies. Other complexities include small speaker populations, a broad spectrum of dialects, and the absence of standardized orthographies.
We wanted to learn more about low-resource languages and their effect on the advancement of NLP and machine translation, so we interviewed three experts who have conducted and published extensive research on the matter. In our interviews, we discussed the major challenges in developing machine translation for low-resource languages and the actions they have taken to address or overcome them.
Utsa Chattopadhyay, a researcher at Nanyang Technological University, has studied this matter and published an article entitled “Neural Machine Translation For Low Resource Languages.” In our interview, Utsa explained that one of the most pressing issues is the lack of ground-truth parallel data, which means he has to rely on monolingual texts most of the time.
“[T]his unavailability of data is a challenge that we tackled through our scientific work and gave improvements on base level work by a bigger margin. Another thing is the requirement of High processing power which is, though readily available but at a bigger price, not suitable for research most of the time, making it time-consuming (cloud-based servers might be a short-term solution),” Utsa explained.
It’s a similar issue for Vassilina Nikoulina, a researcher at NAVER Labs Europe, who has published a study entitled “SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages.” She stated that the lack of high-quality resources, both for training models and for evaluating them, was a major challenge. She further explained that when resources are available, they are often specialized content, such as translations of religious texts, and can be of quite low quality.
“[The] recent trend for multilingual models partially addresses this problem because low-resource languages benefit from knowledge transfer from high-resource languages when trained jointly. However, in practice, very little attention is paid to the real low-resource languages in such models: most of them are evaluated on high-resource datasets, and low-resource languages represent just a tiny fraction of the training data. We believe that one could benefit much more from the knowledge transfer if various training factors are considered with caution: eg. the proportion of high-resource languages vs. low-resource languages in the training dataset, model size, and the training procedure. If our final target is low-resource language translation, we should pay attention to the choice of these factors,” Vassilina said.
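Vassilina’s point about the training mix can be made concrete. One common way to balance high- and low-resource languages in a jointly trained model is temperature-based sampling, which flattens the raw corpus proportions so that low-resource pairs are seen more often during training. Below is a minimal illustrative sketch; the corpus sizes are invented and the temperature value is just an example, not a recommendation from her study.

```python
# Temperature-based language sampling: smooth raw corpus proportions so that
# low-resource language pairs get a larger share of each training batch.
# The corpus sizes below are invented purely for illustration.

corpus_sizes = {"en-fr": 40_000_000, "en-sw": 1_200_000, "en-si": 60_000}

def sampling_probs(sizes, temperature=5.0):
    """Per-language sampling probabilities.

    temperature=1.0 reproduces the raw proportions; larger values flatten the
    distribution toward uniform, over-sampling low-resource pairs.
    """
    total = sum(sizes.values())
    smoothed = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(smoothed.values())
    return {lang: p / norm for lang, p in smoothed.items()}

print(sampling_probs(corpus_sizes, temperature=1.0))  # raw proportions
print(sampling_probs(corpus_sizes, temperature=5.0))  # flattened toward uniform
```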
Professor Surangika Ranathunga from the University of Moratuwa told us that the lack of monolingual and parallel data was one of the challenges she encountered during her research, which she later published in an article entitled “Neural Machine Translation for Low-Resource Languages: A Survey.” Surangika explained that there are initiatives to crawl bitext from the web, but the released parallel corpora are noisy and contain relatively little data for LRLs, so more effort is needed to create data for these languages.
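To give a sense of what cleaning noisy crawled bitext can look like in practice, here is a small, hypothetical sketch of two cheap filtering heuristics (length ratio and copy detection). Real pipelines typically add language identification and model-based scoring on top; none of this is taken from Surangika’s survey.

```python
# Minimal heuristics for cleaning noisy web-crawled parallel data.
# Real pipelines add language identification and model-based scoring on top.

def keep_pair(src: str, tgt: str, max_ratio: float = 2.5,
              min_len: int = 1, max_len: int = 200) -> bool:
    src_toks, tgt_toks = src.split(), tgt.split()
    # Drop empty, extremely short, or extremely long segments.
    if not (min_len <= len(src_toks) <= max_len and min_len <= len(tgt_toks) <= max_len):
        return False
    # Drop pairs whose lengths are wildly mismatched (often misalignments).
    if max(len(src_toks), len(tgt_toks)) / min(len(src_toks), len(tgt_toks)) > max_ratio:
        return False
    # Drop pairs where the "translation" is just a copy of the source.
    if src.strip().lower() == tgt.strip().lower():
        return False
    return True

pairs = [("hello, how are you?", "hola, ¿cómo estás?"),
         ("click here to subscribe", "click here to subscribe")]
clean = [(s, t) for s, t in pairs if keep_pair(s, t)]
print(clean)  # only the first pair survives
```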
“Our recent experiments show that the use of multilingual Large Language Models (LLMs) is much better than training Transformers from scratch. However, these LLMs mostly contain 100 odd languages, which is far below what we need. Scaling these models to thousands of languages (even if the data was there) is not practical with the current technology. Therefore we should build multilingual LLMs specific to language families and/or covering a set of languages used in a region. IndicBART is a very good example,” Surangika said.
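For readers who want to experiment with a pretrained multilingual translation model along these lines, the sketch below shows the general usage pattern with the Hugging Face transformers library, using the publicly available mBART-50 many-to-many checkpoint as a stand-in. IndicBART and other regional models follow a similar pattern but with their own tokenizers and language tags, so check the relevant model card before adapting this.

```python
# Sketch: translating with a pretrained multilingual seq2seq checkpoint.
# Uses the public mBART-50 many-to-many model as a stand-in; regional models
# such as IndicBART use their own tokenizers and language tags.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

checkpoint = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(checkpoint)
model = MBartForConditionalGeneration.from_pretrained(checkpoint)

tokenizer.src_lang = "en_XX"                      # source language code
inputs = tokenizer("The weather is nice today.", return_tensors="pt")

# Force the decoder to start with the target-language token (Sinhala here).
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["si_LK"],
    max_new_tokens=64,
    num_beams=4,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```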
From our discussions, it became clear that engaging with local communities is essential to addressing these challenges. So we asked the researchers what role community-based approaches play in the development of machine translation systems for low-resource languages and how they could be leveraged effectively. According to Utsa, “community-based approaches might play a vital role” because purely automated translation leaves no room for the output to be edited or corrected.
“To allow a community to translate low resource languages requires a bit of faith, but can yield quality results and can further improve the quality of the raw data that in turn will again produce robust models. One of the sites which leveraged this opportunity is TIDBITS which actually has been a great introduction to this kind of approach,” Utsa explained.
For Vassilina, one of the important factors for any language task is ensuring the availability of high-quality datasets.
“Creation of such datasets, both for training and evaluation, can significantly boost progress on low-resource languages. Another type of contribution is, of course, making multilingual models more available to the community: e.g., open-sourcing high-quality models, making them smaller and therefore more accessible to researchers or practitioners with little computational resources available, should help to progress,” Vassilina said.
As for Surangika, she stated that data collection should involve the community and native language speakers to gain higher-quality data. The best example of this, according to her, is the Masakhane project from Africa.
“I see that still, some researchers are reluctant to share their data/code/models. If we set up local community initiatives, resources will get shared, at least among the community. Going by the saying, ‘unity is power,’ rather than individual researchers working in silos, if they organize as a community, more knowledge sharing would happen,” Surangika stated.
Surangika added that setting up local community initiatives and sharing resources, at least within the community, could also lead to more research funding from organizations and governments. Communities with these sorts of initiatives can influence their government to invest in language technologies and establish the required data protection laws.
Lastly, we discussed the ethical considerations when developing machine translation systems for low-resource languages and how they should be addressed.
According to Utsa, local communities are concerned about machine translation systems collecting, using, and analyzing data about their language and culture. Historically, the languages of indigenous peoples have been weaponized or used against them, so it is understandable why many communities are wary of letting outsiders collect and study their language and run it through machine translation.
“For example, pro-gay texts or religious texts for Arab or Hebrew translation systems. In general, there are actually no rules and norms for this when we are speaking about ‘machines,’ in such but few things can be taken into mind, such as impartiality and cultural sensitivity, but when we are speaking about ethics, it mostly comes to the human mind and not for any AI system, where the thin margin still exist on taking a decision where logic is required, same goes to ethics. We can construe this as an open problem for future research as well,” Utsa explained.
Vassilina noted that because the available data for low-resource languages is often drawn from religious texts, the resulting translations risk carrying biases and an overly religious style.
“We can often observe cases of ‘hallucinations’ for such languages when the model generates the text that has nothing to do with the original input text. This implies that such systems should be used with a lot of caution. We also observed that low-resource languages are the first to suffer from quality decreases when we start to compress them for more efficient usage. Therefore, one needs to be cautious about how compression (eg. quantization, pruning, distillation) could potentially impact low-resource languages and whether it would increase bias already present even further. Developing compression techniques that could take such considerations into account would definitely be beneficial for low-resource languages,” Vassilina stated.
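One concrete way to probe the effect Vassilina describes is to compress a translation model and compare its output quality per language pair before and after. The sketch below uses PyTorch dynamic int8 quantization and sacrebleu with the same public mBART-50 checkpoint and a toy test set; it illustrates the idea only and is not the setup from the SMaLL-100 study.

```python
# Sketch: measure how dynamic int8 quantization shifts translation quality per
# language pair. Checkpoint and test sentences are stand-ins for illustration.
import torch
import sacrebleu
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

checkpoint = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(checkpoint)
full_model = MBartForConditionalGeneration.from_pretrained(checkpoint)

# Dynamic quantization: linear layers are stored and executed in int8 on CPU.
int8_model = torch.quantization.quantize_dynamic(
    full_model, {torch.nn.Linear}, dtype=torch.qint8
)

def translate(model, sentences, src_lang, tgt_lang):
    tokenizer.src_lang = src_lang
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model.generate(**batch,
                             forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
                             max_new_tokens=128)
    return tokenizer.batch_decode(out, skip_special_tokens=True)

# Toy test sets: {pair: (sources, references, source code, target code)}.
test_sets = {
    "en->fr": (["The weather is nice today."], ["Il fait beau aujourd'hui."], "en_XX", "fr_XX"),
    "en->si": (["The weather is nice today."], ["(reference translation here)"], "en_XX", "si_LK"),
}

# Low-resource pairs often lose more quality after compression than high-resource ones.
for pair, (sources, refs, src_code, tgt_code) in test_sets.items():
    for name, model in [("full", full_model), ("int8", int8_model)]:
        hypotheses = translate(model, sources, src_code, tgt_code)
        score = sacrebleu.corpus_bleu(hypotheses, [refs]).score
        print(f"{pair} {name}: BLEU = {score:.1f}")
```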
As for Surangika, she explained that low-resource languages often come from developing countries, which may not have established data protection laws. There is also a risk of worker exploitation, wherein the people employed to gather and collect data are paid low wages. However, she believes this can be regulated by governments, and policies can be passed to prevent such exploitation and protect local data. She also stated that researchers should be mindful “not to exploit these loopholes” or the lack of regulation on these matters.
“Researchers should be mindful in releasing the trained models for public use. In particular, when using pre-trained multilingual LLMs, some biases appearing in high-resource languages of the model may transfer to low-resource languages. And some of these biases (especially religious and ethnic biases) could be really harmful in some countries/communities. Therefore thorough testing for potential biases in the final models is very important,” Surangika explained.
Low-resource languages, while underrepresented in digital spaces, play a vital role in maintaining our global linguistic diversity. Advancing NLP and machine translation requires addressing the unique challenges of gathering and analyzing data from these languages, and, as the experts we interviewed made clear, those challenges remain substantial, from scarce and noisy data to ethical risks.
In the coming years, everyone must come together to ensure that, as machine translation systems are developed, native speakers and their languages and cultural identities aren't left behind or exploited, but rather are part of the process and have a voice in how technology represents them, their language, and their culture.