Explaining 11 Google 'Gemini' how-to videos that clearly demonstrate the performance of 'multimodal AI' that processes text, audio, and images simultaneously



On December 6, 2023, Google released Gemini, a multimodal AI that can simultaneously process text, audio, and images. Videos explaining how to use Gemini were uploaded to coincide with the release, so I took a look.

How it's Made: Interacting with Gemini through multimodal prompting - Google for Developers

https://developers.googleblog.com/2023/12/how-its-made-gemini-multimodal-prompting.html

Everything to know about Gemini, Google's new AI model
https://blog.google/technology/ai/gemini-collection/

Table of Contents
01: Find commonalities between images
02: Multimodal AI capabilities
03: Convert images into code
04: Emoji Kitchen
05: Mathematical and Physics Reasoning
06: Programming ability
07: Questions about clothing and belongings
08: Voice recognition
09: Guess the movie title
10: Inferring user ideas
11: Extracting data from scientific literature

The article below explains what 'Gemini' is. Anyone can use Gemini through the chat AI 'Bard.'

A multimodal AI called 'Gemini' that can process text, voice, and images simultaneously and interact more naturally than humans, surpassing GPT-4, will be released - GIGAZINE



To coincide with the release, Google published a series of videos showing how to use Gemini.

◆01: Find commonalities between images
Testing Gemini: Finding connections - YouTube


Show Gemini a photo of the Bosges Chapel and Katsushika Hokusai's 'The Great Wave off Kanagawa' and ask it to 'find the similarities between the two images.'



Gemini responded, 'Curvier and organic. The second image (Hokusai) is more fluid.'



Next, we'll compare a golf ball to the moon. In addition to asking about their similarities, we'll also ask about their history.



Gemini responded, 'In 1971, the crew of Apollo 14 hit two golf balls on the surface of the moon.' This shows that Gemini can not only identify what is in an image, but also supply related background.



Finally, we compare striped clothing with zebras, asking, 'Which one looks better?'



Gemini replied, 'The answer is zebras because zebras have been wearing stripes for millions of years.'



◆02: Multimodal AI capabilities

Hands-on with Gemini: Interacting with multimodal AI - YouTube


When I drew a line and asked a voice question, 'What do you see?', Gemini replied, 'I see a squiggly line.'



When I added more lines and asked again, the answer changed to 'It looks like a bird to me.'



When I added a water line, it said it was a duck.



When I colored the duck blue, Gemini pointed out that ducks are not usually blue.



So I took out a toy duck and asked, 'Is that true?'



Gemini changed its mind, saying, 'Looks like blue ducks are more common than I thought.'



When asked to guess the material, it said, 'It looks like rubber or plastic.'



Then I squeezed the duck to make it squeak.



Gemini heard the sound and replied, 'If it squeaks, it must float!' and concluded that it was made of rubber, which is less dense than water.



Next, I ask, 'Can you tell me how to say this in another language? Give me some examples.' Gemini responds with examples in Spanish, French, Korean, Japanese, and Chinese. I then ask how to pronounce it in Chinese.



Gemini then explained that Chinese has tones and gave a detailed explanation of how to pronounce words in Chinese.



Next, I placed the duck on a world map. Gemini interpreted this as 'The duck is on the ocean' and said, 'The duck is in the middle of the ocean, with no land nearby where it can rest or find food.'



I also asked it to 'think of a game based on what you can see now.'



Gemini suggested a guess-the-country game and immediately asked the first question, giving three hints: 'Home of kangaroos, koalas, and the Great Barrier Reef.'



When I pointed to Australia, it said that was the correct answer.


For the second question, the clue is 'Soccer is popular here, and they have the most wins in the men's World Cup in history.'



When I pointed to Brazil, it played a 'ding-dong' sound effect to signal a correct answer.



Next, we started a game where we had to find the paper ball under the cup.



Even though I hadn't said anything yet about putting the ball under a cup, it guessed, 'You're trying to get me to find the paper ball under the cup.'



I tried to confuse it by quickly shuffling the cups, but Gemini correctly answered, 'The cup on the left.'



This time, all you have to do is make a 'rock-paper-scissors' gesture towards the screen, and Gemini will respond, 'I know what you're doing. Rock-paper-scissors, right?'



Make a butterfly with your hands and ask, 'What does it look like?'



'You have spread your fingers to look like butterfly wings,' replies Gemini.



For the next hand shape, the answer was, 'Big ears and a barking mouth. A dog.'



With just this simple action of placing the coin in your palm and flipping it over, Gemini will answer, 'The coin should be under your right hand.'




However, the human is one step ahead, opening his right hand and saying, 'What do you think about that...?'



Gemini responded, 'The coin should be in your left hand. You're using a magic trick to make it look like the coin has disappeared.'



Now, place the coin and the cookie next to each other. Before the human can say anything, Gemini comments, 'They're both round and flat.'



When asked, 'What is this?', Gemini replied, 'They're both foods. Tangerines are healthier than cookies.'



This time, the cookie was replaced with a fidget spinner, and Gemini commented, 'Just as citrus fruits have a calming effect, so do fidget spinners.'



Then, the tangerine was replaced with a Rubik's Cube, and Gemini answered, 'Both are popular toys enjoyed by both children and adults.'



I showed it two balls of yarn and asked for ideas for what to make. When I used green and pink yarn, it suggested three examples, including 'dragon fruit.'



It generates an example image of the work to match the color of the yarn.



When presented with an illustration of a fork in the road and asked, 'Which way should you go?', Gemini accurately guessed the situation and answered, 'You should go left.'



When shown an illustration with just dots and numbers, it said, 'This is a picture of a crab.'



By connecting the dots in numerical order, a beautiful picture of a crab was completed.



When shown a picture of three celestial bodies and asked, 'Is this the correct order?', it replied, 'No, the correct order is Sun, Earth, Saturn.'



It can also guess which car is faster based on the cars' shapes.



It can also answer the question, 'Which roller coaster looks more fun?'



When I told it to 'think of an appropriate line,' it generated a nice line like 'Eek!'



If you show it an illustration of a guitar, it will play acoustic guitar music.



When you connect an amplifier, the music changes to electric guitar music.



Adding a palm tree illustration changes the music to ukulele music with a beachy feel.



It also answers the question, 'What is this scene trying to recreate?'



If you pause a video and ask, 'What do you think will happen next?', it predicts what comes next, such as, 'You'll land on a perfect 10!'



Finally, I drew a picture of a constellation and asked, 'What does this look like?' Gemini answered, 'It's a simple line drawing of Gemini. It really captures the beauty of Gemini.'



◆03: Convert images into code
Testing Gemini: Turning images into code - YouTube


This is an experimental video of the process of converting an image into code. First, we give Gemini an image of a tree and ask it, 'Can you convert this image into SVG?'



Gemini presented an image with the comment, 'This is an image showing the main shape of the tree.'



Next, I moved on to a slightly more complex task: I asked Gemini whether it could create an interactive demo in JavaScript. Gemini suggested a 'fractal tree.'



You can also move the sliders to transform the tree interactively.



Of course, it also provides the code.
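
To give a sense of what code for such a demo might look like, here is a minimal sketch of a recursive fractal tree rendered as SVG. It is written in Python rather than the JavaScript used in the video, and it is not the code Gemini produced. The function names (`fractal_tree`, `to_svg`) and the parameters (`spread_deg`, `shrink`, `depth`) are assumptions chosen for illustration; they stand in for the kind of values an interactive control could adjust.

```python
import math


def fractal_tree(x, y, length, angle_deg, depth, spread_deg=25.0, shrink=0.7):
    """Return (x1, y1, x2, y2) line segments for a binary fractal tree."""
    if depth == 0:
        return []
    # SVG's y axis points downward, so subtract to make the tree grow upward.
    x2 = x + length * math.cos(math.radians(angle_deg))
    y2 = y - length * math.sin(math.radians(angle_deg))
    segments = [(x, y, x2, y2)]
    # Each branch splits into two shorter branches, one turned each way.
    for turn in (-spread_deg, spread_deg):
        segments += fractal_tree(
            x2, y2, length * shrink, angle_deg + turn, depth - 1, spread_deg, shrink
        )
    return segments


def to_svg(segments, width=400, height=400):
    """Serialize the segments as a standalone SVG document string."""
    lines = [
        f'<line x1="{x1:.1f}" y1="{y1:.1f}" x2="{x2:.1f}" y2="{y2:.1f}" stroke="green" />'
        for (x1, y1, x2, y2) in segments
    ]
    header = f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
    return "\n".join([header, *lines, "</svg>"])


if __name__ == "__main__":
    # Trunk starts at the bottom center and points straight up (90 degrees).
    print(to_svg(fractal_tree(200, 380, 100, 90, depth=6)))
```

Changing `spread_deg` or `shrink` and regenerating the SVG gives a differently shaped tree, which is roughly what moving a slider in an interactive version would do.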



◆04: Emoji Kitchen

Testing Gemini: Emoji Kitchen - YouTube


This video shows how Gemini handles Google's emoji fusion service, Emoji Kitchen. First, Gemini is shown an emoji made by combining a ghost and an avocado and asked, 'Can you guess which emojis are combined?'



Gemini guessed correctly.



When asked what the emoji looks like, Gemini replied, 'A ghost, but green with a big brown hole.' It's not green, but it definitely has a brown hole.



Next, it was asked to give the emoji a name and a catchphrase, and Gemini came up with 'Abogeist: The Ghost of Guacamole,' a combination of avocado and ghost.



The emoji that combines a robot and a rabbit has been named 'Robobunny.'



◆05: Mathematical and Physics Reasoning

Gemini: Explaining reasoning in math and physics - YouTube


Just show Gemini your handwritten work and it will tell you what's wrong, what's right, and what needs more explanation.



Gemini showed that the formula was correct, but the height calculation was incorrect.



Apparently the correct height was 50 meters, not 6 meters, so I asked, 'Why 50 meters?'



It then gave me a more detailed solution.



Additionally, Gemini can create practice problems similar to the calculation you made incorrectly.



◆06: Programming ability

Gemini: Excelling at competitive programming - YouTube


Gemini supports programming languages such as Python, Java, C++, and Go.



Simply submit your idea and Gemini will create a prototype for your app or website. In this case, it was asked to create a location-based web app for finding trains. Within a minute of receiving the request, it created a fully functional prototype. While the code isn't perfect, it's usable as a rough draft.



Gemini also excels at competitive programming: 'AlphaCode 2,' the Gemini-based successor to the competitive programming AI 'AlphaCode,' performed better than 85% of competition participants. The Google team said that its programming ability improves significantly when developers specify properties for the code, and encouraged programmers to do so.



◆07: Questions about clothing and belongings

Testing Gemini: Fit check - YouTube


The human asks, 'What is this outfit best for?'



Gemini replied, 'Perfect for staying warm in the tundra.'



Next, he put on a space suit helmet.



The answer is 'Intergalactic travel.'



He then held up a saxophone and asked, 'What's this?'



The answer was, 'It's the perfect way to go where no one has gone before and play bold jazz.'



◆08: Voice recognition

Gemini: Processing and understanding raw audio - YouTube


Typically, when large language models deal with speech, the audio is first run through a speech recognition system to convert it into text, and that text is then fed into another model that can understand it. However, doing this loses many of the nuances of speech and pronunciation. Gemini, on the other hand, has native multimodal capabilities, so it can process raw audio signals.



Let's look at an example in Chinese. First, ask, 'Which pronunciation is correct when saying the first month of the lunar calendar in Chinese?' and play two audio recordings of the word spoken in different tones to Gemini. Gemini will then respond, 'The first pronunciation is correct. The first character is pronounced with the first tone.' Gemini can distinguish between the two pronunciations and confirm which is correct.
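
The difference between the two approaches can be illustrated with a toy sketch. It does not process real audio or call any Gemini API. Each 'recording' is modeled as a list of (syllable, tone) pairs, `transcribe` is a hypothetical stand-in for a speech recognizer that outputs toneless text, and the tone assigned to the second recording is an assumption made for illustration.

```python
# Toy model: a "recording" is a list of (pinyin syllable, tone number) pairs.
Recording = list[tuple[str, int]]

# First recording: the first character is spoken in the first tone.
first: Recording = [("zheng", 1), ("yue", 4)]
# Second recording: same syllables, different tone (assumed for illustration).
second: Recording = [("zheng", 4), ("yue", 4)]


def transcribe(recording: Recording) -> str:
    """Hypothetical cascaded step: keep the syllables, drop the tones."""
    return " ".join(syllable for syllable, _tone in recording)


def distinguishable_after_transcription(a: Recording, b: Recording) -> bool:
    """What a text-only model downstream of the recognizer could tell apart."""
    return transcribe(a) != transcribe(b)


def distinguishable_natively(a: Recording, b: Recording) -> bool:
    """What a model that sees the full signal, tones included, could tell apart."""
    return a != b


if __name__ == "__main__":
    print(distinguishable_after_transcription(first, second))
    print(distinguishable_natively(first, second))
```

In this toy model the two recordings collapse to the same text once the tones are dropped, so a model that only receives the transcript has nothing left to compare, while one that receives the full signal can still tell them apart.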



Next, we play a French audio recording and ask Gemini to 'summarize it in English.' Again, Gemini responds correctly in English. Gemini is not just a transcription model; it's a model that can understand the details of what's being said.



Furthermore, it doesn't just process audio, but can also process multiple content types, such as audio and images, in the same way, as the name 'multimodal' suggests. For example, when shown a photo of ingredients and asked, 'What do you do first to make an omelet with these ingredients?', Gemini responded, 'First, put the eggs in a bowl and whisk them.'



Next, when shown a photo of the cooking in progress and asked, 'Do you think it's ready yet?', Gemini replied, 'It looks almost ready. You can flip it over and cook the other side.'



When shown a later photo, it replied, 'That's enough. Remove from the heat and enjoy.'



◆09: Guess the movie title

Testing Gemini: Guess the movie - YouTube


The game involves showing key images and having participants guess the name of a movie. First, they are shown images of fried eggs and pancakes, and then an image of a club, and asked, 'Guess the name of the movie.'



Gemini answered 'The Breakfast Club.' This is the correct answer because it's breakfast + club.



This time it was breakfast and a ring. Gemini answered correctly, 'Breakfast at Tiffany's.'



A 'no scissors' sign and a diamond ring give 'Uncut Gems.'



'Goldfinger'



'Moonrise Kingdom'



Forest + G + (Bump - B) gives 'Forrest Gump,' and so on. Gemini has no trouble playing along and associating movie titles with pictures.



◆10: Inferring user ideas

Gemini: Reasoning about user intent to generate bespoke experiences - YouTube


This video explains how Gemini understands and infers user intent. First, a human asks Gemini about 'ideas for a daughter's birthday party,' and Gemini responds with 'What is she interested in?', trying to dig deeper into the user's intent.



The user then adds information like 'She likes animals' or 'I want to do something outdoors.' At this point, Gemini created an interface called 'Bell Chime,' incorporating illustrations to help explain the ideas visually.



The video also explains the reasoning Gemini went through before presenting the illustrations. First, after receiving a question from a human, Gemini considered whether text alone was sufficient or whether other UI elements were necessary. It then concluded that because the task was complex, a lot of information needed to be organized and displayed, and presented the interface described above.



Next, Gemini checks whether it has enough information. In this case, it wasn't initially told what kind of party the user wanted or what the daughter was interested in. So Gemini decided it didn't know enough to offer ideas, and asked for that information with the question, 'What is she interested in?'



Based on this, Gemini reasoned that 'people would want to choose from a wide range of options,' and devised an interface suited to letting people pick ideas from a variety of options. The result was the 'Bell Chime' interface described above. In other words, if the person had asked a different question, a completely different interface might have been presented.
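
The reasoning steps described in the video can be summarized as a simple decision procedure. The sketch below only illustrates that flow and says nothing about how Gemini is actually implemented; the `Request` fields, the `plan_response` function, the `REQUIRED_DETAILS` tuple, and the returned labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    goal: str
    # Details the user has supplied so far, e.g. {"interests": "animals"}.
    details: dict = field(default_factory=dict)
    # Whether the task is complex enough that text alone would be hard to follow.
    is_complex: bool = True


# Hypothetical list of details needed before ideas can be offered.
REQUIRED_DETAILS = ("interests",)


def plan_response(request: Request) -> str:
    # Step 1: is plain text sufficient, or is a richer interface needed?
    if not request.is_complex:
        return "answer_in_text"
    # Step 2: is enough known? If not, ask a clarifying question first.
    missing = [key for key in REQUIRED_DETAILS if key not in request.details]
    if missing:
        return "ask:" + ",".join(missing)
    # Step 3: many possible ideas, so present an interface for browsing options.
    return "render_option_browser"


if __name__ == "__main__":
    party = Request(goal="ideas for a daughter's birthday party")
    print(plan_response(party))
    party.details["interests"] = "animals, outdoors"
    print(plan_response(party))
```

With no details supplied, the sketch asks about interests first; once they are filled in, it moves on to the option-browsing interface, mirroring the order of the exchange in the video.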



◆11: Extracting data from scientific literature

Gemini: Unlocking insights in scientific literature - YouTube


Scientists often have to manually extract and utilize only the essential information from thousands or even tens of thousands of scientific publications.



To extract the necessary data from hundreds of thousands of genetics papers, scientists first gave Gemini prompts like, 'Read the title and abstract of the following paper and tell me whether it describes the validation of a variant for a specific human disease.'



Gemini was able to separate the essential papers from the non-essential ones, extract the important data, and annotate it, showing exactly where in each paper the information was found. A task that would have taken hours to complete manually could be completed 'during a lunch break,' the team said.
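
The screening step can be sketched as a loop that builds the quoted prompt for each paper and keeps the papers the model flags. This is an illustration only, not the pipeline used in the video. `ask_model` is a hypothetical callable standing in for whatever model API is used, the 'Answer yes or no' instruction and the title/abstract layout are assumptions added to make the reply easy to parse, and `keyword_stub` is a trivial stand-in so the sketch runs offline.

```python
from typing import Callable

PROMPT_TEMPLATE = (
    "Read the title and abstract of the following paper and tell me whether it "
    "describes the validation of a variant for a specific human disease. "
    "Answer yes or no.\n\nTitle: {title}\nAbstract: {abstract}"
)


def build_prompt(paper: dict) -> str:
    """Fill the prompt template with one paper's title and abstract."""
    return PROMPT_TEMPLATE.format(title=paper["title"], abstract=paper["abstract"])


def screen_papers(papers: list[dict], ask_model: Callable[[str], str]) -> list[dict]:
    """Keep only the papers for which the model's reply starts with 'yes'."""
    return [
        paper
        for paper in papers
        if ask_model(build_prompt(paper)).strip().lower().startswith("yes")
    ]


def keyword_stub(prompt: str) -> str:
    """Offline stand-in for a real model call; looks only at the abstract text."""
    abstract = prompt.split("Abstract:", 1)[1].lower()
    return "yes" if "variant" in abstract else "no"


if __name__ == "__main__":
    # Placeholder papers invented for this example.
    papers = [
        {"title": "Paper A", "abstract": "We validate a variant linked to a disease."},
        {"title": "Paper B", "abstract": "A survey of sequencing hardware."},
    ]
    for paper in screen_papers(papers, keyword_stub):
        print(paper["title"])
```

Passing the model in as a plain callable keeps the sketch independent of any particular API; swapping `keyword_stub` for a real model call would not change `screen_papers`.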



Furthermore, they asked Gemini to create a figure based on the acquired dataset. When they gave it a graph showing data up to 2019 and instructed it to redraw it, Gemini created a new graph that incorporated the data acquired from 2019 onwards. The Google team explained that 'this can be used not only in biology, but also in any field that deals with large datasets, such as law and finance.'



in AI, Video, Software, Posted by log1p_kr