Microsoft announces 'Kosmos-1', an AI that understands visual content as well as text and can answer IQ test questions, a step toward the development of artificial general intelligence



In recent years, AI that demonstrates superior capabilities in specific fields such as image generation and dialogue with humans has been attracting attention. The longer-term goal is to develop 'Artificial General Intelligence (AGI)', a single system that can handle all of these tasks together. Microsoft has now announced the multimodal AI 'Kosmos-1', which handles not only language processing but also images and other visual content, and can even answer IQ test questions that use figures.

[2302.14045] Language Is Not All You Need: Aligning Perception with Language Models
https://doi.org/10.48550/arXiv.2302.14045

Microsoft introduces Kosmos-1, a Multimodal Large Language Model that achieves impressive performance | BigTechWire
https://www.bigtechwire.com/2023/03/01/microsoft-introduces-kosmos-1-a-multimodal-large-language-model-that-achieves-impressive-performance/

Microsoft unveils AI model that understands image content, solves visual puzzles | Ars Technica
https://arstechnica.com/information-technology/2023/03/microsoft-unveils-kosmos-1-an-ai-language-model-with-visual-perception-abilities/

Advances in technology have made it possible for AI to generate images and text of a quality comparable to human work, but humans have the strength of being able to perform a wide variety of tasks on their own, and at the time of writing AI is inferior to humans in terms of versatility. Some AI developers are working toward AGI, and Sam Altman, CEO of AI development company OpenAI, wrote in a blog post in February 2023, 'The transition to a world with super-intelligent AGI is perhaps the most important, hopeful, and terrifying project in human history' and 'We want to provide the world with AGI that will bring prosperity to humanity in ways no one has yet imagined.'

CEO of OpenAI, developer of 'ChatGPT' and 'DALL-E 2', announces his outlook on 'artificial general intelligence' - GIGAZINE



Against this backdrop, Microsoft has announced the AI 'Kosmos-1', which handles image processing as well as natural language processing and can answer questions that combine images and text. In a paper published on the preprint server arXiv, the Microsoft research team wrote, 'Multimodal perception, a fundamental part of intelligence, enables artificial general intelligence in terms of acquiring knowledge and connecting to the real world. We don't just need the language, we need to match the perception to the language model,' and explained that Kosmos-1 is a multimodal large language model (MLLM).

Kosmos-1 was trained using The Pile, an 825GB text dataset, and web data culled from Common Crawl. After training, Microsoft evaluated Kosmos-1 on language comprehension, language generation, text recognition without OCR, image caption generation, and question answering about visual content.

Examples of the tests described by the research team in the paper are as follows. Question (1) asks, 'Please explain why this image is interesting.' In response, Kosmos-1 gives a fairly accurate answer: 'Because the cat is wearing a mask that looks like it is laughing.' Likewise, for question (3), 'What is the blonde's hairstyle called?', it gives the correct answer, 'ponytail.'



In question (6), which asked for the answer to an image showing '5 + 4', Kosmos-1 answered '5 + 4 = 9' perfectly, and in question (7), which asked for the heart rate shown in an image of a smartwatch, it also gave the correct heart rate.



Also, when shown an image from the 2008 movie 'WALL-E', produced by Pixar Animation Studios and Walt Disney Pictures, and asked, 'Please explain the details about this image,' Kosmos-1 was even able to give the title and explain that Pixar Animation Studios produced it.



In addition, Microsoft tested Kosmos-1 on figure-based problems known as Raven's Progressive Matrices, which are also used in IQ tests. Shown a sequence of figures that follows a regular pattern and asked 'Which figure will come next?', Kosmos-1 achieved a correct answer rate of 22 to 26%. Since the correct answer rate for random guessing is 17%, Kosmos-1 answered correctly at a rate above chance, although the margin is small.
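The comparison with random guessing can be checked with a short calculation. The sketch below assumes that each puzzle offers six candidate answers, which would give the 17% chance level quoted above; the number of candidates is an assumption for illustration and is not stated in this article, and the function names are hypothetical.

```python
# Assumption (not stated in the article): each puzzle has six candidate answers.
NUM_CANDIDATES = 6


def chance_accuracy(num_candidates: int) -> float:
    """Expected accuracy of uniform random guessing among the candidates."""
    return 1.0 / num_candidates


def margin_over_chance(accuracy: float, num_candidates: int) -> float:
    """Percentage points by which a reported accuracy exceeds random guessing."""
    return (accuracy - chance_accuracy(num_candidates)) * 100


baseline = chance_accuracy(NUM_CANDIDATES)
print(f"chance level: {baseline:.1%}")

# The two ends of the 22 to 26% range reported for Kosmos-1.
for reported in (0.22, 0.26):
    margin = margin_over_chance(reported, NUM_CANDIDATES)
    print(f"{reported:.0%} is {margin:.1f} points above chance")
```

Under that assumption, the reported range sits roughly 5 to 9 percentage points above random guessing, which is consistent with the article's description of the margin as small.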



The Microsoft research team hopes to scale up Kosmos-1 in the future and to integrate speech capabilities as well. Microsoft plans to release Kosmos-1 to developers, but according to technology media outlet Ars Technica, no usable code had been released at the time of writing.

in Software, Science, Posted by log1h_ik