Oct 04, 2019 06:00:00

Google releases datasets that help natural language processing models handle paraphrases


Natural language processing algorithms have been poor at understanding word order and sentence structure. To address this, Google has released new datasets. According to Google, training with them improves a machine learning model's text classification accuracy from about 50% to more than 80%.

Google AI Blog: Releasing PAWS and PAWS-X: Two New Datasets to Improve Natural Language Understanding Models
https://ai.googleblog.com/2019/10/releasing-paws-and-paws-x-two-new.html

Google uses natural language processing in machine translation and speech recognition, but even the most advanced algorithms cannot reliably tell “Flight from New York to Florida” apart from “Flight from Florida to New York.” Researchers have pointed out that the weakness of existing algorithms lies in identifying paraphrases.
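To see why such a pair is hard, consider a minimal sketch in Python. It is an illustration only, not the method Google describes: the helper names `bag_of_words` and `same_word_order` are hypothetical, and tokenization is a plain whitespace split. A comparison that ignores word order treats the two flight sentences as identical, while an order-aware comparison does not.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count lowercase, whitespace-separated tokens, ignoring their order."""
    return Counter(sentence.lower().split())


def same_word_order(a: str, b: str) -> bool:
    """Return True only if both sentences have the same tokens in the same order."""
    return a.lower().split() == b.lower().split()


a = "Flight from New York to Florida"
b = "Flight from Florida to New York"

# The two sentences use exactly the same words...
print(bag_of_words(a) == bag_of_words(b))  # True
# ...but in a different order, which changes the meaning.
print(same_word_order(a, b))               # False
```

Any model that relies mostly on word overlap would see these two sentences as the same, which is the kind of mistake the article describes.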

To address this weakness with more diverse training data, Google released a corpus called “Paraphrase Adversaries from Word Scrambling (PAWS)” on Wednesday, October 2, 2019. Since PAWS covers only English, Google also released “PAWS-X,” which extends it to French, Spanish, German, Chinese, Japanese, and Korean. According to Google, PAWS and PAWS-X, which consist of paraphrase and non-paraphrase pairs, improve the accuracy with which algorithms capture word order and structure from the previous 50% to 85-89%.


The PAWS dataset contains 108,463 human-labeled sentence pairs in English, sourced from Quora Question Pairs and Wikipedia. PAWS-X, on the other hand, contains 23,659 human-translated PAWS pairs and 296,640 machine-translated training pairs.

According to Google researcher Yuan Zhang and software engineer Yinfei Yang, even machine learning models that understand sentences in complex contexts have difficulty learning specific sentence patterns. “The new data set provides an effective means to measure the sensitivity of machine learning models to the order and structure of words,” the two wrote on the Google blog.

Researchers trained multiple models to investigate the impact of the corpus on natural language processing accuracy. The BERT and DIIN models in particular reportedly showed a “significant” improvement over the baseline. BERT's classification accuracy was originally 33.5%, but training with PAWS and PAWS-X reportedly raised it to 83.1%.
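For readers unfamiliar with how such accuracy figures are computed, the following is a rough sketch, assuming accuracy means the share of sentence pairs whose predicted label matches the human label. The four pairs are invented toy examples, not rows from PAWS, and `overlap_predictor` is a hypothetical word-overlap baseline, not BERT or DIIN.

```python
from collections import Counter
from typing import Callable, List, Tuple

# (sentence 1, sentence 2, label): 1 = paraphrase, 0 = not a paraphrase.
# Invented toy pairs for illustration; they are not taken from PAWS.
LabeledPair = Tuple[str, str, int]

TOY_PAIRS: List[LabeledPair] = [
    ("Flight from New York to Florida", "Flight from Florida to New York", 0),
    ("Can a bad person become good", "Can a good person become bad", 0),
    ("The cat chased the dog", "The dog chased the cat", 0),
    ("He said the food was tasty", "The food was tasty he said", 1),
]


def overlap_predictor(a: str, b: str) -> int:
    """Hypothetical baseline: predict 'paraphrase' when both sentences use the same words."""
    same_words = Counter(a.lower().split()) == Counter(b.lower().split())
    return 1 if same_words else 0


def accuracy(pairs: List[LabeledPair], predict: Callable[[str, str], int]) -> float:
    """Fraction of pairs for which the predicted label equals the human label."""
    correct = sum(1 for a, b, label in pairs if predict(a, b) == label)
    return correct / len(pairs)


print(accuracy(TOY_PAIRS, overlap_predictor))  # 0.25
```

Because every toy pair shares the same words, the overlap baseline labels all four as paraphrases and gets only one right. Pairs with high word overlap but different meaning are the kind of case the reported figures are meant to probe.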

“Our hope is that this dataset will improve sentence structure, context extraction, pairwise comparisons, etc., and will bring great progress to the multilingual model research community,” said Zhang and Yang.


Oct 04, 2019 06:00:00 in AI, Software, Web Service, Science, Posted by darkhorse_log