A look at Google's 11 'Gemini' how-to videos that clearly demonstrate the performance of a 'multimodal AI' that processes text, audio, and images simultaneously



On December 6, 2023, Google released Gemini, a multimodal AI that simultaneously processes text, audio, and images. Videos explaining how to use Gemini were uploaded along with the release, so I checked them out.

How it's Made: Interacting with Gemini through multimodal prompting - Google for Developers

https://developers.googleblog.com/2023/12/how-its-made-gemini-multimodal-prompting.html

Everything to know about Gemini, Google's new AI model
https://blog.google/technology/ai/gemini-collection/

◆Table of contents
01: Find common features in images
02: Ability as a multimodal AI
03: Convert image to code
04: Emoji kitchen
05: Mathematics and physics reasoning
06: Programming ability
07: Questions regarding clothes and belongings
08: Voice recognition
09: Guessing the movie title
10: Reasoning about user ideas
11: Extracting data from scientific literature

The following article explains what Gemini is. Anyone can try Gemini through the chat AI 'Bard'.

Multimodal AI 'Gemini' released, with performance exceeding GPT-4 and the ability to process text, voice, and images simultaneously for more natural interactions than with humans - GIGAZINE



To coincide with the release, Google has published videos on how to use Gemini.

◆01: Find common features in images
Testing Gemini: Finding connections - YouTube


Gemini is shown a photo of the Bosjes Chapel and Katsushika Hokusai's 'The Great Wave off Kanagawa' and asked, 'Please find commonalities between the two images.'



Gemini replied, 'Curvy and organic. The second image (Hokusai) is more fluid.'



Next is a comparison between a golf ball and the moon. In addition to the commonalities, this time the question also covers the history of both.



Gemini replied, 'In 1971, the Apollo 14 crew hit two golf balls on the moon.' This means that Gemini can not only recognize what is in an image, but also explain the background behind it.



Lastly, a comparison between striped clothes and zebras, with the added question, 'Which one looks better dressed?'



Gemini replied, 'The answer is zebras, because zebras have been wearing stripes for millions of years.'



◆02: Ability as a multimodal AI

Hands-on with Gemini: Interacting with multimodal AI - YouTube


When I drew a line and asked aloud, 'What do you see?' Gemini responded, 'I see a squiggly line.'



When I added another line and asked again, the answer changed to, 'It looks like a bird to me.'



When I drew in a water line, it said it was a duck.



When I colored the duck blue, Gemini pointed out that ducks are not usually blue.



So I took out a toy duck and asked, 'Is that true?'



Gemini changed its opinion and said, 'Looks like blue ducks are more common than I thought.'



When I asked it to guess the material, it said, 'It looks like rubber or plastic.'



Next, I squeezed the duck to make it squeak.



Gemini heard the sound and replied, 'If it squeaks, it must float!', concluding that the duck is made of rubber, which is less dense than water.



Next, I asked, 'Can you tell me how to say this in another language? Give me some examples.' Gemini returned examples in Spanish, French, Korean, Japanese, and Chinese. I then asked how to pronounce the Chinese one.



Gemini then explained that Chinese has tones and described in detail how to pronounce the word.



Next, I placed the duck on a world map. From this alone, Gemini understood that 'the duck is on the ocean' and said, 'The duck is in the middle of the ocean and there is no land nearby where it can rest or find food.'



I also asked it to 'come up with a game based on what you can see right now.'



Gemini suggested a country guessing game and immediately posed the first question with three hints: 'Home of kangaroos, koalas, and the Great Barrier Reef.'



When I pointed to Australia, it said I was correct.


For the second question, it gave the hint: 'Soccer is popular here, and the country boasts the most men's World Cup wins in history.'



When I pointed to Brazil, it played a 'ding-dong' chime to signal a correct answer.



Next we started the game of finding the paper ball under the cup.



When I put the ball under a cup, before I had said anything, Gemini guessed, 'You're trying to get me to find the paper ball under the cup.'



I shuffled the cups quickly to confuse it, but Gemini correctly answered, 'It's the cup on the left.'



This time, I simply made the 'rock, paper, scissors' gestures toward the screen, and Gemini replied, 'I know what you're doing. It's rock, paper, scissors, right?'



I made a butterfly with my hands and asked, 'What does it look like?'



'Your fingers are spread so they look like butterfly wings,' Gemini replied.



For the next hand gesture, it answered, 'Big ears and a barking mouth. It's a dog.'



When I simply placed a coin in my palm and turned my hands over, Gemini answered, 'The coin should be under your right hand.'




However, the human was one step ahead: he opened his right hand and said, 'I'm not so sure about that...'



Not to be outdone, Gemini replied, 'The coin should be in your left hand. You're using a magic trick to make it look like the coin has disappeared.'



Next, I placed a coin and a cookie side by side. Before the human could say anything, Gemini commented, 'They're both round and flat.'



I replaced the coin with a mandarin orange and asked, 'What about this?' Gemini answered, 'Both are food. Mandarin oranges are healthier than cookies.'



This time I replaced the cookie with a fidget spinner. Gemini said, 'Citrus fruits have a calming effect, and so do fidget spinners.'



I then replaced the mandarin orange with a Rubik's Cube. Gemini answered, 'Both are popular toys, enjoyed not only by children but also by adults.'



I showed it two balls of yarn and asked for ideas about what to make with them. Three examples were suggested, including a 'dragon fruit' for the green and pink yarn.



It also generated images of the suggested projects to match the colors of the yarn.



When I prepared an illustration of a fork in the road and asked, 'Which way should we go?' Gemini correctly read the situation and answered, 'We should go to the left.'



When I showed it an illustration with just dots and numbers, it said, 'This is a picture of a crab.'



By connecting the dots with lines in numerical order, a beautiful crab picture was completed.



When I showed it a picture of three celestial bodies and asked, 'Is this the correct order?' it answered, 'No, the correct order is the Sun, Earth, and Saturn.'



Gemini can also guess which of two cars is faster based on their shapes.



It can also answer the question, 'Which roller coaster seems more fun?'



When I asked it to 'think of a fitting line,' it came up with a suitably excited scream along the lines of 'Kya!'



When shown an illustration of a guitar, it plays acoustic guitar music.



When an amplifier is added to the drawing, the music changes to electric guitar music.



Adding an illustration of a palm tree makes it sound like beachy ukulele music.



It can also answer the question, 'What is this scene trying to recreate?'



When I paused the video and asked, 'What's going to happen after this?' it predicted what would come next, saying, 'It's sure to be a perfect score of 10!'



Finally, I drew a picture of a constellation and asked, 'What does this look like?' Gemini replied, 'It's a simple line drawing of Gemini. You've successfully captured the beauty of Gemini.'



◆03: Convert image to code
Testing Gemini: Turning images into code - YouTube


This video experiments with converting an image into code. First, Gemini is given an image of a tree and asked, 'Can you convert this image to SVG?'



Gemini presented an image with the comment, 'This is an image that shows the main shape of the tree.'



Next up is a slightly more complex task: Gemini is asked whether it can create an interactive demo in JavaScript. Gemini responded by presenting a 'fractal tree'.



You can move the sliders to transform the tree interactively.



Of course, Gemini also shows you the code.
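Gemini's actual JavaScript is not reproduced in this article. As a rough idea of what a 'fractal tree' computation involves, here is a minimal Python sketch that recursively generates the line segments of a binary tree. The function name `fractal_tree` and the parameters `angle_deg`, `shrink`, and `depth` are hypothetical, standing in for the kind of values a slider might control.

```python
import math


def fractal_tree(x, y, length, heading_deg, depth, angle_deg=25.0, shrink=0.7):
    """Return a list of ((x1, y1), (x2, y2)) segments for a binary fractal tree.

    Each branch ends in two shorter child branches, rotated left and right
    by angle_deg. All names and default values here are illustrative.
    """
    if depth == 0:
        return []
    rad = math.radians(heading_deg)
    x2 = x + length * math.cos(rad)
    y2 = y + length * math.sin(rad)
    segments = [((x, y), (x2, y2))]
    # Recurse into the two child branches growing from the tip of this one.
    for turn in (-angle_deg, angle_deg):
        segments += fractal_tree(
            x2, y2, length * shrink, heading_deg + turn,
            depth - 1, angle_deg, shrink,
        )
    return segments


if __name__ == "__main__":
    # A trunk pointing straight up (90 degrees) with five levels of branches.
    tree = fractal_tree(0.0, 0.0, 100.0, 90.0, depth=5)
    print(len(tree), "segments")
```

Changing `angle_deg` or `shrink` and redrawing the returned segments is the sort of interaction a slider-driven demo would provide.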



◆04: Emoji kitchen

Testing Gemini: Emoji Kitchen - YouTube


This video shows how Gemini handles 'Emoji Kitchen', the emoji mash-up service provided by Google. First, Gemini is shown an emoji made by combining a ghost and an avocado and asked, 'Do you know which emoji are combined?'



Gemini guessed correctly.



Next, I asked what the emoji looked like, and Gemini answered, 'It's a ghost, but it's green and has a big brown hole.' It's not actually green, but it does have a brown hole.



Next, Gemini is asked to give the emoji a name and a catchphrase. Gemini proposed 'Avogeist: Guacamole Ghost,' a combination of avocado and ghost.



The emoji that combines a robot and a rabbit is named 'Robo Bunny.'



◆05: Mathematics and physics reasoning

Gemini: Explaining reasoning in math and physics - YouTube


Just show Gemini your handwritten work and it will tell you what's wrong, what's right, and where more explanation is needed. Here, the section marked with an 'x' is clicked.



Gemini showed that although the formula was correct, the height calculation was wrong.



The correct height is apparently 50 meters rather than 6 meters, so I asked, 'Why is it 50 meters?'



Gemini then gave a more detailed solution.



Furthermore, Gemini can also create practice questions similar to the one that was answered incorrectly.



◆06: Programming ability

Gemini: Excelling at competitive programming - YouTube


Gemini supports programming languages such as Python, Java, C++, and Go.



You can create prototypes for apps and websites just by presenting your ideas. This time, the request was for a 'location-based web app for finding trains.' Within a minute of the request, Gemini creates a working prototype. Although the code is not perfect, it can be used as a draft.



Gemini also excels at competitive programming. Compared with the earlier competitive programming AI 'AlphaCode', 'AlphaCode 2', which is built on Gemini, performed better than 85% of competitive programming participants. The Google team said that Gemini's programming performance improves considerably when programmers simply specify properties the code should follow, and recommended that programmers do so.



◆07: Questions regarding clothes and belongings

Testing Gemini: Fit check - YouTube


A human asks, 'What is this clothing best for?'



Gemini answered, 'Perfect for staying warm in the tundra.'



Next, a spacesuit helmet is put on.



The answer is 'intergalactic travel.'



A saxophone is then added, along with the question, 'What about this?'



The answer was, 'It's perfect for going to places no one has gone before and playing jazz in a bold way.'



◆08: Voice recognition

Gemini: Processing and understanding raw audio - YouTube


Typically, when large language models work with speech, the audio is first passed through a speech recognition system, converted to text, and then handed to another model that can understand that text. However, doing so apparently loses many nuances such as tone of voice and pronunciation. Gemini, by contrast, is natively multimodal, so it can process raw audio signals directly.



Let's take a look at the Chinese example. First, Gemini is asked, 'Which is the correct pronunciation for January of the lunar calendar in Chinese?' and played audio of the word spoken with two different tones. Gemini then replies, 'The first pronunciation is correct. The first character is pronounced in the first tone.' Gemini can distinguish between the two pronunciations and determine which one is correct.



Next, Gemini is played French audio and asked for a summary in English. Gemini answers this correctly in English as well. Gemini is not just a transcription model; it can understand the content of what is said in detail.



Furthermore, as the name 'multimodal' suggests, Gemini can process not only audio alone but also multiple kinds of content, such as audio and images, at the same time. For example, when I showed the image below and asked, 'What should I do to make an omelet with these ingredients?' Gemini answered, 'First, put the eggs in a bowl and whisk them.'



Next, I showed it a photo of the cooking in progress and asked, 'Do you think it's ready yet?' Gemini replied, 'It looks almost ready. You can flip it over and sear the other side.'



When I showed it a photo taken a little later, it replied, 'That's enough. Please remove from the heat and enjoy.'



◆09: Guessing the movie title

Testing Gemini: Guess the movie - YouTube


In this game, Gemini is shown keyword images and has to guess the name of a movie. First, it is shown an image of a fried egg and pancakes and an image of a club, and asked, 'Please guess the name of the movie.'



Gemini answered, 'The Breakfast Club.' It's breakfast + club, so it's correct.



This time it's breakfast and a ring. Gemini answered correctly, 'Breakfast at Tiffany's.'



A ring and a 'no scissors' sign gave 'Uncut Gems.'



'Goldfinger'



'Moonrise Kingdom'



Forest + G + (Bump - B) gave 'Forrest Gump,' and so on. Gemini can easily play the game of guessing movie titles from pictures.



◆10: Reasoning about user ideas

Gemini: Reasoning about user intent to generate bespoke experiences - YouTube


This video explains how Gemini understands and reasons about user intent. First, when a human asks Gemini for 'ideas for my daughter's birthday party,' Gemini answers, 'What is she interested in?' and tries to dig deeper into the user's intentions.



In response, the human adds information such as 'She likes animals' and 'I want to do something outdoors.' At this point, Gemini created an illustrated interface called 'Bell Chime' to present the ideas visually.



Gemini's reasoning before showing the illustrations is also explained. First, when Gemini received the question from the human, it considered, 'Is text alone enough, or do I need some other UI as well?' It determined that this was a complex task that required organizing and displaying a lot of information, so it presented the interface shown above.



Next, Gemini tries to work out whether it has enough information. In this case, it was not initially told what kind of party was wanted or what the daughter was interested in. That's why Gemini decided, 'I don't know enough to present an idea,' and asked for that information with 'What is she interested in?'



On top of that, it reasoned that 'the user would want to choose from a variety of options,' and devised an interface that lets people pick ideas from several options. The result was the 'Bell Chime' presented above. In other words, if the human had asked a different question, a completely different interface could have been presented.



◆11: Extracting data from scientific literature

Gemini: Unlocking insights in scientific literature - YouTube


Scientists sometimes have to manually extract and utilize only the necessary and important information from thousands or tens of thousands of scientific documents.



The process demonstrated this time extracts only the necessary data from hundreds of thousands of papers related to genetics. First, the scientists gave Gemini a prompt such as, 'Please read the title and abstract of the following paper and tell me whether this paper talks about testing variants for a specific human disease.'
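The screening step described above can be sketched roughly as follows. This is a minimal illustration, not the scientists' actual pipeline: `build_prompt`, `screen_papers`, and the `ask_model` callable are hypothetical names, the 'Answer YES or NO' instruction is an added assumption to make the reply easy to parse, and no real Gemini API call is shown.

```python
from typing import Callable, Dict, Iterable, List

# Prompt wording follows the example quoted in the article; the final
# "Answer YES or NO." sentence is an assumption added for easy parsing.
PROMPT_TEMPLATE = (
    "Please read the title and abstract of the following paper and tell me "
    "whether this paper talks about testing variants for a specific human "
    "disease. Answer YES or NO.\n\n"
    "Title: {title}\n"
    "Abstract: {abstract}"
)


def build_prompt(title: str, abstract: str) -> str:
    """Fill the screening prompt with one paper's title and abstract."""
    return PROMPT_TEMPLATE.format(title=title, abstract=abstract)


def screen_papers(
    papers: Iterable[Dict[str, str]],
    ask_model: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Keep only the papers for which the model's reply starts with YES.

    ask_model is a hypothetical stand-in for whatever function sends a
    prompt to a model and returns its text reply.
    """
    relevant = []
    for paper in papers:
        reply = ask_model(build_prompt(paper["title"], paper["abstract"]))
        if reply.strip().upper().startswith("YES"):
            relevant.append(paper)
    return relevant


if __name__ == "__main__":
    # A keyword-based stub replaces the real model so the sketch runs offline.
    def stub_model(prompt: str) -> str:
        paper_text = prompt.split("Title:", 1)[1].lower()
        return "YES" if "variant" in paper_text else "NO"

    sample = [
        {"title": "Variant testing in a hereditary disease", "abstract": "..."},
        {"title": "Deep-sea fish ecology", "abstract": "..."},
    ]
    for paper in screen_papers(sample, stub_model):
        print(paper["title"])
```

Passing the model in as a plain callable keeps the filtering logic separate from any particular model or API, so it can be tested with a stub like the one above.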



Gemini succeeded in separating the relevant papers from the irrelevant ones and extracting the important data. The team also had Gemini add annotations so they could see exactly where in each paper the information was found. Gemini was able to finish work that would take hours if done manually 'over a lunch break.'



Furthermore, the team asked Gemini to create a figure based on the extracted dataset. When given a graph showing data up to 2019 and instructed to redraw it, Gemini drew a new graph that incorporated the newly extracted data from after 2019. The Google team explained, 'It can be used not only in biology, but also in any field that deals with large datasets, such as law and finance.'



in Software, Video, Posted by log1p_kr