Google announces Gemini Robotics, an AI model that lets robots perform tasks from simple verbal instructions

Google has announced 'Gemini Robotics,' an AI model built on Gemini 2.0 that adds the ability to output physical actions, allowing it to operate robots directly. Alongside it, Google also announced 'Gemini Robotics-ER,' a model with advanced spatial understanding capabilities.
Gemini Robotics brings AI into the physical world - Google DeepMind
https://deepmind.google/discover/blog/gemini-robotics-brings-ai-into-the-physical-world/

Gemini Robotics: Bringing AI into the Physical World
(PDF file)
https://storage.googleapis.com/deepmind-media/gemini-robotics/gemini_robotics_report.pdf
Previous models in the Gemini family supported multimodal inputs such as text, images, audio, and video, but their outputs were limited to the digital domain. The newly developed Gemini Robotics adds 'physical action' as an output modality, making it possible to control robots directly.
Google categorizes Gemini Robotics as a 'vision-language-action model' (VLA model) and says it is 'able to perform a wide range of real-world tasks.'
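To make the input/output contract of a VLA model concrete, below is a minimal, purely illustrative Python sketch. Google has not published an API for Gemini Robotics, so every name and type here (Observation, ActionChunk, vla_policy) is a hypothetical assumption, not the actual model.

```python
# Hypothetical sketch of a vision-language-action (VLA) interface. Google has
# not published an API for Gemini Robotics; every name and type below is an
# illustrative assumption, not the real model.
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    rgb_image: bytes   # camera frame from the robot's viewpoint
    instruction: str   # natural-language command

@dataclass
class ActionChunk:
    joint_targets: List[float]  # target joint angles for the next control steps

def vla_policy(obs: Observation) -> ActionChunk:
    """A VLA model maps (vision, language) inputs directly to robot actions."""
    # A real system would run a forward pass through the model here; a
    # placeholder is returned to illustrate the input/output contract.
    return ActionChunk(joint_targets=[0.0] * 7)

action = vla_policy(Observation(b"", "put the bananas in the transparent container"))
print(action.joint_targets)
```

The essential point is that language and vision go in, and robot actions come out, with no hand-written task program in between.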
You can see Gemini Robotics actually performing tasks in the following video.
Gemini Robotics: Bringing AI to the physical world - YouTube
When the operator verbally instructs the robot to 'put the bananas in the transparent container,' the robot follows the instruction.

Even when the operator deliberately moves the container mid-task, the robot adapts immediately.

Thanks to the world knowledge inherited from the underlying Gemini 2.0 model, it can follow instructions that never appeared in its training data, such as 'pick up a basketball and dunk it.'

Gemini Robotics is good at adapting to new objects, diverse instructions, and new environments, and Google says it 'performs on average more than twice as well on a comprehensive generalization benchmark compared to other state-of-the-art vision-language-action models.'
In addition, the robot continuously monitors its surroundings, detects changes to the environment or to its instructions, and adjusts its behavior accordingly, which is how it completed the banana task without trouble even when the container was moved. This ability to adapt to change is important in the real world, where unexpected things happen, such as an object slipping out of a grip or someone moving an item.
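As a rough illustration of why this closed-loop behavior works, here is a minimal sketch of a perception-action control cycle. It is a simplification under stated assumptions, not Google's implementation: get_camera_frame and query_policy are hypothetical stand-ins, and the only point is that the policy is re-queried with a fresh frame on every tick.

```python
# Illustrative closed-loop sketch, not Google's implementation. The key idea:
# the policy is re-queried with a fresh camera frame on every cycle, so a
# container moved mid-task appears in the next frame and the actions adjust.
import time
from typing import List

def get_camera_frame() -> bytes:
    """Hypothetical sensor read; a real system would grab a camera image."""
    return b""

def query_policy(frame: bytes, instruction: str) -> List[float]:
    """Hypothetical stand-in for a VLA forward pass returning joint targets."""
    return [0.0] * 7

def control_loop(instruction: str, steps: int = 10, hz: float = 10.0) -> None:
    for _ in range(steps):
        frame = get_camera_frame()                 # observe the current scene
        action = query_policy(frame, instruction)  # replan from the latest frame
        # send_to_robot(action)                    # hand off to the low-level controller
        time.sleep(1.0 / hz)                       # fixed control rate

control_loop("put the bananas in the transparent container")
```

Because each cycle starts from the latest observation rather than a fixed plan, a moved container is simply part of the next input.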
Gemini Robotics can also perform actions that require dexterity; in the following video you can see it folding origami.
Gemini Robotics: Dexterous skills - YouTube
The majority of the training was done on the bi-arm ALOHA 2 platform, but the model could also accomplish tasks on other embodiments, such as bi-arm Franka arms and Apptronik's Apollo humanoid.

Alongside Gemini Robotics, Google released 'Gemini Robotics-ER,' an advanced vision-language model. Gemini Robotics-ER significantly improves on Gemini 2.0's capabilities such as pointing and 3D detection, and can work out, for example, 'how to properly grab a mug' just by looking at it.
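Gemini 2.0's pointing capability is exposed through the public Gemini API by prompting the model to return image coordinates, and the sketch below shows roughly how such a spatial query might look. The model name, prompt format, and output schema here are assumptions based on Google's published pointing examples, not a documented Gemini Robotics-ER API.

```python
# Hedged sketch of prompting a Gemini vision model for pointing, the kind of
# spatial-understanding query Gemini Robotics-ER builds on. The model name,
# prompt format, and output schema are assumptions based on Google's published
# pointing examples, not a documented Gemini Robotics-ER API.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # assumes a valid Gemini API key
model = genai.GenerativeModel("gemini-2.0-flash")

image = Image.open("mug.jpg")  # hypothetical photo of a mug on a table
prompt = (
    "Point to the mug's handle. Answer as JSON: "
    '[{"point": [y, x], "label": "handle"}] '
    "with coordinates normalized to 0-1000."
)
response = model.generate_content([image, prompt])
print(response.text)  # e.g. [{"point": [512, 340], "label": "handle"}]
```

A grasp planner could then convert such a 2D point, together with depth information, into a gripper pose.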
Google says it is also paying close attention to safety as it develops these robots, and that it will continue developing the AI in collaboration with trusted testers such as Apptronik, Agile Robots, Agility Robotics, Boston Dynamics, and Enchanted Tools.