Facebook releases 'LASER', a toolkit for accelerating machine translation across more than 90 languages


by Dmitry Ratushny

To accelerate natural language processing (NLP) tasks such as translation across many languages, Facebook has open-sourced a toolkit called "LASER (Language-Agnostic SEntence Representations)" and released it on GitHub. LASER supports more than 90 languages written in 28 different alphabets.

LASER natural language processing toolkit - Facebook Code
https://code.fb.com/ai-research/laser-multilingual-sentence-embeddings/

The GitHub page below shows that Facebook has published LASER together with a multilingual test set covering more than 100 languages.

GitHub - facebookresearch/LASER: Language-Agnostic SEntence Representations
https://github.com/facebookresearch/LASER

LASER enables "zero-shot" transfer, in which an NLP model built for one language can be applied to many other languages. The idea is related to zero-shot translation, a technique that Google announced in November 2016 and that attracted considerable attention. The researchers behind it asked: if a translation system is trained on two-way English-Japanese translation and two-way English-Korean translation, could it translate between Japanese and Korean without going through English? The result was that the system could produce "reasonable" translations between two languages for which it had received no explicit training or mapping.



LASER's vector representations of sentences are generic with respect to both the input language and the NLP task. LASER aims to "place the same sentence close together regardless of language," and maps sentences from every language into a single high-dimensional space. In that space, the distance between two sentences reflects how close or far apart their meanings are.
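
As a rough illustration of "distance reflects meaning," the sketch below compares sentence vectors with cosine similarity. The choice of cosine similarity is an assumption made for this example, and the tiny 4-dimensional vectors are invented by hand; they are not produced by LASER.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only.
en_dog = [0.9, 0.1, 0.0, 0.2]  # "The dog is barking."
fr_dog = [0.8, 0.2, 0.1, 0.2]  # "Le chien aboie."
en_tax = [0.0, 0.9, 0.7, 0.1]  # "Taxes are due in April."

same_meaning = cosine_similarity(en_dog, fr_dog)
different_meaning = cosine_similarity(en_dog, en_tax)

# Sentences with the same meaning should score higher than unrelated ones.
print(same_meaning > different_meaning)  # True
```

With vectors from a real multilingual encoder, the same comparison could be used to check whether a sentence and its translation land close together.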

The figure below shows how LASER works. The left side shows the case of a single language, and the right side shows the multilingual case. On the right, you can see that sentences with the same meaning written in different languages are located near one another.



Like other neural machine translation systems, LASER adopts an encoder/decoder sequence-to-sequence (Seq2Seq) model, with a single input encoder and a single output decoder shared across all languages. The encoder is a five-layer bidirectional LSTM (bidirectional long short-term memory, BiLSTM) network. Unlike typical neural machine translation, however, it does not use an attention layer; instead, each input sentence is represented as a fixed-size 1,024-dimensional vector.
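
One way an encoder without attention can turn sentences of any length into a vector of fixed size is to pool its per-token outputs. The sketch below assumes element-wise max-pooling over time steps, which the article itself does not specify, and uses hand-written 3-dimensional "encoder outputs" in place of real 1,024-dimensional BiLSTM states. The function name `max_pool_over_time` is hypothetical.

```python
def max_pool_over_time(token_states):
    """Collapse a list of per-token vectors into one vector of the same width."""
    if not token_states:
        raise ValueError("need at least one token state")
    # zip(*...) groups the values of each dimension across all tokens.
    return [max(dim_values) for dim_values in zip(*token_states)]

# Hypothetical encoder outputs, one row per token (invented values).
short_sentence = [[0.1, -0.4, 0.3],
                  [0.5, 0.2, -0.1]]
long_sentence = [[0.0, 0.7, -0.2],
                 [0.3, -0.1, 0.4],
                 [-0.6, 0.2, 0.1],
                 [0.2, 0.0, 0.9]]

short_vec = max_pool_over_time(short_sentence)
long_vec = max_pool_over_time(long_sentence)

print(short_vec)  # [0.5, 0.2, 0.3]
print(long_vec)   # [0.3, 0.7, 0.9]
```

Both results have the same width even though the sentences differ in length, which is the property a fixed-size sentence vector needs.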



LASER is the first library of its kind to handle so many languages with a single model, including low-resource languages such as Kabyle and Uighur, and it is expected to be useful not only at Facebook but also in a wide range of features and services that use NLP. For example, a movie review written in one language could be translated into 100 languages and published immediately.

LASER performed zero-shot cross-lingual natural language inference accurately in 13 of the 14 languages of the XNLI corpus, and also showed excellent results in cross-lingual document classification. The sentence embedding (distributed representation) technology developed by Facebook is also strong at parallel corpus mining, and tests on data from more than 100 languages in the Tatoeba corpus showed that it performs well at multilingual similarity search even for low-resource languages.
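
Multilingual similarity search can be pictured as follows: for each sentence in one language, find the nearest sentence vector in the other language and check whether it is the actual translation. The sketch below is a minimal version under stated assumptions: the vectors are hand-made 2-dimensional unit vectors rather than LASER output, the dot product is used as the similarity score, and the error-rate metric and function names are chosen for this example rather than taken from the article.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest_index(query, candidates):
    """Index of the candidate with the highest dot product with `query`."""
    return max(range(len(candidates)), key=lambda i: dot(query, candidates[i]))

def search_error_rate(source_vecs, target_vecs):
    """Fraction of source sentences whose nearest target is not their translation.

    Assumes source_vecs[i] and target_vecs[i] embed a translation pair.
    """
    misses = sum(
        1 for i, vec in enumerate(source_vecs)
        if nearest_index(vec, target_vecs) != i
    )
    return misses / len(source_vecs)

# Hypothetical unit-length embeddings (invented values).
source = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
target = [[0.96, 0.28], [0.28, 0.96], [0.6, 0.8], [-0.6, 0.8]]

error_rate = search_error_rate(source, target)
print(error_rate)  # 0.25
```

Here the fourth target vector is deliberately placed far from its source, so one of the four lookups fails. The same nearest-neighbor lookup, applied to large unaligned text collections, is the basic idea behind mining parallel sentences.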

Other merits of LASER are as follows.

· Fast performance, processing up to 2,000 sentences per second on a GPU
· A sentence encoder implemented in PyTorch
· Languages with limited resources benefit from joint training with many other languages
· The model supports the use of multiple languages within a single sentence
· Performance improves as new languages are added, because the system learns to recognize the characteristics of language families

in Software,   Science, Posted by darkhorse_log