How does Google Translate, which supports 108 languages, use AI to keep improving translation quality?

Google says it has made progress in improving the quality of its language translation. In an upcoming blog post, the company details the new technologies behind Google Translate's support for 108 languages, especially data-poor languages such as Yoruba and Malayalam; the service translates an average of 150 billion words per day.

In the 13 years since Google Translate first launched, technologies such as neural machine translation, rewriting-based paradigms, and on-device processing have produced measurable leaps in the platform's translation accuracy. But until recently, even its latest algorithms lagged behind human performance. Efforts outside Google illustrate the difficulty of the problem: the Masakhane project, which aims to make the thousands of languages spoken on the African continent automatically translatable, has not yet moved beyond the data collection and transcription stages; and Common Voice (a crowdsourcing project launched by Mozilla to build a free database for speech recognition software) has vetted only about 40 voices since its launch in June 2017.

Google said that its breakthroughs in translation quality are not driven by a single technology, but by a combination of technologies targeting low-resource languages, high-resource languages, overall quality, latency, and inference speed. Measured between May 2019 and May 2020 by human evaluation and BLEU (a metric based on the similarity between a system's translation and human reference translations), Google Translate improved by an average of 5 or more points across all languages, and by an average of 7 or more points across the 50 lowest-resource languages. In addition, Google said the translator has become more robust to "machine translation hallucination", a phenomenon in which the model produces strange output when fed nonsense input: given the Telugu characters "ష ష ష ష ష ష", the old model hallucinated "Shenzhen Shenzhen Shaw International Airport (SSH)", whereas the improved model now outputs "Sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh".
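BLEU, mentioned above, scores a system translation by its n-gram overlap with human reference translations. As a rough illustration only (real evaluations use corpus-level BLEU with multiple references and smoothing, e.g. sacreBLEU), here is a minimal single-reference sentence-level sketch:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference.

    Computes clipped n-gram precision for n = 1..max_n, takes the
    geometric mean, and applies the brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_ngrams.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 2))  # 1.0
```

A "5-point BLEU gain" refers to this score expressed on a 0–100 scale.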

Hybrid model and data miner

The first of these technologies is a translation model architecture: a hybrid consisting of a Transformer encoder and a recurrent neural network (RNN) decoder, implemented in Lingvo, a TensorFlow framework for sequence modeling.

In machine translation, an encoder typically encodes words and phrases into internal representations, which a decoder then uses to generate text in the desired language. The Transformer-based models that Google researchers first proposed in 2017 are more effective at this than RNNs, but Google says its work shows that most of the quality gains come from just one component of the Transformer: the encoder. That may be because, while both RNNs and Transformers are designed to handle ordered sequences of data, Transformers do not need to process a sequence in order. In other words, if the data in question is natural language, the Transformer does not need to process the beginning of a sentence before processing its end.

However, the RNN decoder is still "much faster" at inference time than the Transformer decoder. Recognizing this, the Google Translate team optimized the RNN decoder before coupling it with the Transformer encoder, creating a low-latency, higher-quality hybrid model that is more stable than the four-year-old RNN-based neural machine translation model it replaces.
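The structural difference driving this design can be sketched in a few lines of NumPy. This is a toy single-head illustration under assumed dimensions, not Google's implementation: the self-attention encoder contextualizes all source positions in one parallel pass, while the recurrent decoder must step through target positions one at a time, which is exactly the step that benefits from a faster RNN.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # model dimension (toy value)
src_len = 5    # source sentence length in tokens

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_encoder(x, wq, wk, wv):
    """One single-head self-attention layer: all positions attend at once."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(d))  # (src_len, src_len)
    return scores @ v                       # contextualized source states

def rnn_decoder_step(y_prev, h_prev, memory, wy, wh, wm):
    """One recurrent decoder step: consumes the previous hidden state,
    attends over the encoder memory, and returns the next hidden state."""
    attn = softmax(h_prev @ memory.T) @ memory  # attention over encoder output
    return np.tanh(y_prev @ wy + h_prev @ wh + attn @ wm)

# Random toy parameters (illustrative only, untrained).
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
wy, wh, wm = (rng.normal(size=(d, d)) for _ in range(3))

src = rng.normal(size=(src_len, d))               # embedded source sentence
memory = self_attention_encoder(src, wq, wk, wv)  # one parallel pass

# The decoder, by contrast, must run sequentially, token by token.
h = np.zeros(d)
for t in range(3):
    y_prev = rng.normal(size=d)  # stand-in for the embedded previous token
    h = rnn_decoder_step(y_prev, h, memory, wy, wh, wm)
print(memory.shape, h.shape)  # (5, 8) (8,)
```

Because decoding is inherently sequential, its per-step cost dominates serving latency, which is why a cheap RNN step paired with a parallel Transformer encoder is an attractive trade.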

Figure: BLEU scores of Google Translate models since its launch in 2006. (Image source: Google)

In addition to the novel hybrid architecture, Google upgraded the longtime crawler it uses to compile training sets from millions of example translations drawn from articles, books, documents, and web search results. The new miner is embedding-based for 14 major language pairs rather than dictionary-based, meaning it uses vectors of real numbers to represent words and phrases, and it places more emphasis on precision (the proportion of retrieved data that is relevant) than on recall (the proportion of all relevant data that is actually retrieved). Google says this increased the number of sentences the miner extracted by an average of 29% in production.
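The idea of embedding-based, precision-focused mining can be sketched as follows. This is a minimal toy, not Google's miner: sentences in two languages are assumed to already have vector embeddings, and a pair is kept only when cosine similarity clears a high threshold, trading recall for precision.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_pairs(src_embs, tgt_embs, threshold=0.9):
    """Keep only (source, target) index pairs whose embeddings nearly match.

    A high threshold favors precision (fewer, cleaner pairs) over recall
    (finding every true pair)."""
    pairs = []
    for i, se in enumerate(src_embs):
        best_j = max(range(len(tgt_embs)), key=lambda j: cosine(se, tgt_embs[j]))
        if cosine(se, tgt_embs[best_j]) >= threshold:
            pairs.append((i, best_j))
    return pairs

# Toy 2-d "sentence embeddings": indices 0 and 1 align well; index 2 has
# no good counterpart and is dropped rather than mined noisily.
src = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]
tgt = [(0.98, 0.05), (0.05, 0.98), (-1.0, 0.2)]
print(mine_pairs(src, tgt))  # [(0, 0), (1, 1)]
```

Raising the threshold drops borderline matches like the third sentence, which is the precision-over-recall behavior the article describes.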

Noisy data and transfer learning

Another translation performance improvement comes from modeling methods that better handle noise in the training data. Having observed that noisy data (data containing large amounts of information that cannot be correctly understood or interpreted) harms translation quality, the Google Translate team deployed a system that uses trained models to assign scores to training examples, separating noisy data from "clean" data. In effect, models begin by training on all of the data and then gradually train on smaller, cleaner subsets, an approach the AI research community calls curriculum learning.
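The schedule described above can be sketched in a few lines. This is a hedged toy, assuming quality scores already exist (in practice they would come from a trained scoring model) and an arbitrary three-stage schedule:

```python
def curriculum_schedule(examples, scores, stages=(1.0, 0.5, 0.25)):
    """Yield progressively smaller, cleaner training subsets.

    Each stage keeps the top fraction of examples ranked by quality
    score, so training starts on everything and ends on clean data."""
    ranked = [ex for _, ex in sorted(zip(scores, examples), reverse=True)]
    for frac in stages:
        k = max(1, int(len(ranked) * frac))
        yield ranked[:k]

data = ["clean pair A", "clean pair B", "noisy pair C", "garbage pair D"]
quality = [0.9, 0.8, 0.4, 0.1]  # pretend these came from a trained scorer
for stage, subset in enumerate(curriculum_schedule(data, quality), 1):
    print(f"stage {stage}: train on {subset}")
```

The first stage sees all four examples; the last stage trains only on the single cleanest pair.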

For resource-poor languages, Google implemented a back-translation scheme in Translate to augment the parallel training data, in which each sentence in one language is paired with its translation. (Machine translation traditionally relies on corpus statistics over paired sentences in the source and target languages.) In this scheme, the training data is automatically aligned with synthetic parallel data in which the target text is natural language but the source is generated by a neural translation model. The result is that Google Translate trains its models on more abundant monolingual text data, which Google says is particularly useful for improving fluency.
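The back-translation pipeline is simple to sketch. In this minimal toy the "reverse model" is a word-level dictionary lookup standing in for a trained target-to-source neural model; the key property shown is that the target side stays natural text while the source side is synthetic:

```python
def back_translate(monolingual_targets, reverse_model):
    """Create synthetic parallel data from monolingual target-side text.

    The target side remains natural language; the source side is
    produced by a reverse (target -> source) translation model."""
    return [(reverse_model(sent), sent) for sent in monolingual_targets]

# Stand-in reverse model: a toy English -> French word lookup
# (a real system would use a trained neural model in that direction).
toy_dict = {"hello": "bonjour", "world": "monde"}
def toy_reverse_model(sentence):
    return " ".join(toy_dict.get(w, w) for w in sentence.split())

english_monolingual = ["hello world", "hello hello"]
synthetic_pairs = back_translate(english_monolingual, toy_reverse_model)
print(synthetic_pairs)
# [('bonjour monde', 'hello world'), ('bonjour bonjour', 'hello hello')]
```

A French-to-English model trained on these pairs sees imperfect synthetic French inputs but fluent, human-written English targets, which is why the technique chiefly improves output fluency.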

Figure: Google Maps with translation function.

Google Translate now also uses M4 modeling, in which a single large model, M4, translates between many languages and English. (M4 was first proposed in a paper last year, which showed that after training on more than 25 billion sentence pairs across more than 100 languages, it improved translation quality for more than 30 low-resource languages.) M4 modeling makes transfer learning possible in Google Translate: what the model learns from training on high-resource languages such as French, German, and Spanish (with billions of parallel examples) can be applied to translating low-resource languages such as Yoruba, Sindhi, and Hawaiian (with only tens of thousands of examples).
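One common mechanism behind this kind of massively multilingual model, used in Google's published multilingual NMT work, is to prepend a target-language token to each source sentence so that one shared set of parameters serves every direction. The sketch below uses toy sentences and a hypothetical `<2xx>` token format to show only the data-preparation idea:

```python
def tag_examples(pairs, target_lang):
    """Prepend a target-language token to each source sentence so a
    single shared model can learn many translation directions at once."""
    return [(f"<2{target_lang}> {src}", tgt) for src, tgt in pairs]

# High-resource pairs (billions in reality) and low-resource pairs
# (tens of thousands) train the same parameters, so representations
# learned from French can transfer to a language like Yoruba.
french = [("the house", "la maison")]          # toy example pair
yoruba = [("the house", "ile")]                # toy placeholder pair
mixed_training_set = tag_examples(french, "fr") + tag_examples(yoruba, "yo")
print(mixed_training_set)
```

Because all languages flow through one model, gradients from the abundant French data shape the encoder representations that the scarce Yoruba data then reuses.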

Looking to the future

According to Google, translation quality has improved by at least 1 BLEU point per year since 2010, but automatic machine translation is by no means a solved problem. Google admits that even its enhanced models are prone to errors, including confusing different dialects of a language, producing overly literal translations, and performing poorly on particular subject matter and on informal or spoken language.

Google is trying to address these problems in various ways, including through its Google Translate Community program, which recruits volunteers to help improve the translation quality of low-resource languages by translating words and phrases or checking whether translations are correct. In February alone, the program, combined with emerging machine learning techniques, led Translate to add five languages spoken by a combined 75 million people: Kinyarwanda, Odia, Tatar, Turkmen, and Uyghur.

Google is not the only one pursuing truly universal translation. In August 2018, Facebook released an AI model that uses a combination of word-for-word translation, language models, and back-translation to perform better on language pairings. More recently, researchers at MIT's Computer Science and Artificial Intelligence Laboratory proposed an unsupervised model that learns from data that has never been explicitly labeled or categorized, and that can translate between texts in two languages without direct translation data between them.

In a statement, Google diplomatically thanked "academia and industry" for machine translation research, some of which informed its own work. The company said: "We achieved this (the recent improvements in Google Translate) by synthesizing and expanding on a variety of recent advances. With this update, we are proud to provide automatic translations that are relatively coherent, even for the lowest-resource of the 108 supported languages."