Meta releases Sphere, an open source AI knowledge tool based on 134 million web pages

Facebook's parent company, Meta, has announced Sphere, an AI tool that draws on the vast amount of information on the Internet to provide a knowledge base for AI and other systems.

Introducing Sphere: Meta AI's web-scale corpus for better knowledge-intensive NLP
https://ai.facebook.com/blog/introducing-sphere-meta-ais-web-scale-corpus-for-better-knowledge-intensive-nlp/

How AI could help make Wikipedia entries more accurate
https://tech.fb.com/artificial-intelligence/2022/07/how-ai-could-help-make-wikipedia-entries-more-accurate/

On July 11, 2022 local time, Meta announced Sphere together with what it calls the first AI model that can automatically verify hundreds of thousands of citations at once. Meta describes Sphere as 'a search engine composed of 134 million web pages as a source of knowledge,' and says that, as an AI knowledge tool of this kind, it is 'an order of magnitude larger' and significantly more intricate than existing ones.

Voice assistants such as Siri on smartphones perform question-answering and fact-finding tasks known as knowledge-intensive natural language processing (KI-NLP): they search a database for relevant information and return an answer to the user. However, existing KI-NLP systems have several problems, one of which is that they rely on commercial search engines whose algorithms for finding relevant web knowledge are opaque.

To solve multiple KI-NLP tasks at once, and to make better use of real-world knowledge by drawing on more open web data, Meta says it considered it important to build a 'universal', 'uncurated', and 'unstructured' knowledge base, and so developed the AI knowledge tool 'Sphere'. Unlike traditional KI-NLP setups, the database is independent of any search engine, so AI researchers using Sphere can explore and control the corpus and scale and optimize it in a variety of ways. Meta explains that it can also contribute to the advancement of search technology.

Sphere's database holds 134 million documents published on the Internet, split into about 906 million passages of about 100 tokens each, so it is said to provide far more data than the knowledge sources used in existing KI-NLP.
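
As a rough illustration of what cutting documents into passages of about 100 tokens might look like, here is a minimal Python sketch. It is not Meta's actual preprocessing code: the function name `split_into_passages` is hypothetical, and plain whitespace splitting stands in for whatever tokenizer Sphere really uses. The last lines simply divide the two figures quoted in the article to get the average number of passages per document.

```python
def split_into_passages(text, passage_len=100):
    """Split text into consecutive passages of at most `passage_len` tokens.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + passage_len])
        for i in range(0, len(tokens), passage_len)
    ]


# A toy 250-token "document" made of placeholder words.
document = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(document)

print(len(passages))                       # 3
print([len(p.split()) for p in passages])  # [100, 100, 50]

# Average passages per document implied by the article's figures.
avg_passages_per_doc = 906e6 / 134e6
print(round(avg_passages_per_doc, 1))      # 6.8
```

In other words, the published numbers work out to roughly seven passages per document on average.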

Sphere is an open source AI knowledge tool, so it's published on GitHub.

GitHub - facebookresearch/Sphere: Web-scale retrieval for knowledge-intensive NLP
https://github.com/facebookresearch/sphere

Sphere is built from CCNet, a variant of Common Crawl, the project that crawls the web and freely provides its archives and datasets to the public; CCNet discards redundant material and scores pages based on writing quality. Because Sphere does not depend on any particular search engine, it can be freely used for cutting-edge natural language processing research. Access to the entire corpus is open, so researchers can examine all of the text in Sphere. They can use this to build architectures that eliminate specific weaknesses, which in turn makes it possible to build a universal KI-NLP model.
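
To give a feel for what removing redundant material from a web crawl can mean in practice, the following Python sketch drops paragraphs whose normalized text has already been seen. It is an illustrative assumption, not the actual pipeline: the helper names `normalize` and `deduplicate` are hypothetical, and the quality-scoring step mentioned above is not shown.

```python
import hashlib


def normalize(paragraph):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(paragraph.lower().split())


def deduplicate(paragraphs):
    """Keep the first occurrence of each paragraph, judged by a hash of its normalized text."""
    seen = set()
    unique = []
    for paragraph in paragraphs:
        key = hashlib.sha1(normalize(paragraph).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(paragraph)
    return unique


paragraphs = [
    "Subscribe to our newsletter.",
    "Sphere is a web-scale corpus.",
    "subscribe   to our NEWSLETTER.",
]
print(deduplicate(paragraphs))
# ['Subscribe to our newsletter.', 'Sphere is a web-scale corpus.']
```

Storing hashes rather than full paragraphs keeps the memory footprint small, which matters when the same boilerplate appears across millions of pages.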

The open corpus also allows experiments with new architectures such as the 'dense retriever'. In dense retrieval, documents and queries are represented as vectors that can be supplied directly to the reader model. In other words, the reader and the retriever speak the same language, which makes their interaction easier to optimize. Traditional search engines, by contrast, are designed for human use, so a system must communicate with them in natural language, increasing the likelihood of translation errors.
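
The idea of representing queries and documents as vectors can be sketched in a few lines of Python. This is a toy under stated assumptions, not Sphere's retriever: a real dense retriever uses a learned neural encoder, whereas the hypothetical `embed` function below just hashes words into a fixed-size vector so that the example runs with the standard library alone. The names `embed`, `dot`, and `retrieve` are all invented for illustration.

```python
import hashlib
import math

DIM = 64  # toy embedding size; real encoders use learned, larger vectors


def embed(text, dim=DIM):
    """Map text to a unit-length vector by hashing each word into a bucket.

    Assumption: this stands in for a learned encoder purely for illustration.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, passages, k=2):
    """Return the k passages whose vectors score highest against the query vector."""
    q = embed(query)
    scored = [(dot(q, embed(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


passages = [
    "sphere is a corpus of web pages",
    "dense retrievers encode queries and passages as vectors",
    "voice assistants answer questions from users",
]
for score, passage in retrieve("sphere is a corpus of web pages", passages):
    print(f"{score:.3f}  {passage}")
```

Because the query and the passages live in the same vector space, the ranked vectors could be handed to a downstream reader model without passing through a natural-language search interface.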

Meta says there is no guarantee that traditional search engines will continue to grant AI researchers access for building KI-NLP models, and that it is releasing Sphere to accelerate further experimentation in this area as part of its ongoing effort to support the AI community. According to Meta, Sphere can help researchers train retrievers to handle a wider range of documents and tackle one of the web's thorniest challenges: misinformation, noise, and incoherent text. In the real world, such models could curb harmful content and, when combined with a well-designed UI, help people strengthen their digital literacy and critical thinking skills.

In addition, Sphere is being used to automatically scan articles on Wikipedia and verify the web pages cited as their sources.

in Software, Posted by logu_ii