Meta releases open-source AI 'ImageBind' that understands the real world by integrating text, images and video, audio, 3D depth, heat, and motion



When people perceive the outside world, they use multiple senses at the same time, such as sight, hearing, touch, smell, and taste. Meta AI, Meta's AI research division, has announced ImageBind, an open-source AI model that integrates data across six modalities: text, image and video, audio, 3D depth, thermal (infrared) data, and motion measured by an inertial measurement unit (IMU).

ImageBind: Holistic AI learning across six modalities

https://ai.facebook.com/blog/imagebind-six-modalities-binding-ai/

IMAGEBIND: One Embedding Space To Bind Them All
(PDF file) https://dl.fbaipublicfiles.com/imagebind/imagebind_final.pdf

GitHub - facebookresearch/ImageBind: ImageBind One Embedding Space to Bind Them All
https://github.com/facebookresearch/ImageBind
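
The GitHub repository ships a pretrained "imagebind_huge" checkpoint together with helpers for loading each modality. The sketch below is adapted from the repository's usage example and embeds text, images, and audio into the single shared space, then compares them; the exact import paths and the file names are illustrative and may differ from the current repository layout.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

# Illustrative inputs: three concepts given as text, image files, and audio clips
text_list = ["a dog", "a car", "a bird"]
image_paths = ["dog.jpg", "car.jpg", "bird.jpg"]   # hypothetical file names
audio_paths = ["dog.wav", "car.wav", "bird.wav"]   # hypothetical file names

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind (huge) model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Preprocess each modality and embed everything into the one shared space
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}
with torch.no_grad():
    embeddings = model(inputs)

# Cross-modal similarity: rows are images, columns are text prompts
print(torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1))
# ...and audio versus text, using exactly the same embeddings
print(torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1))
```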

The video below shows what ImageBind can do.

The open-source AI model 'ImageBind' announced by Meta can integrate text and images with video, sound, 3D depth, heat, and motion - YouTube


There are already many datasets that pair images and videos with text, such as those used to train image-recognition and image-generation AI.



ImageBind integrates text with four other types of self-supervised learning data, audio, 3D depth, heat, and motion, using images and videos as a bridge. According to Meta, thermal and 3D depth data correlate strongly with images, so those datasets are easy to align. Audio and IMU motion data, by contrast, correlate only weakly with images, so ImageBind relies on their natural co-occurrence with video, such as a baby's cry accompanying footage of the baby.
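
According to the paper, each non-image modality is bound to the image and video embedding space with a contrastive (InfoNCE-style) objective over such naturally paired data. The following is a minimal conceptual sketch of that idea for image-audio pairs, not Meta's actual training code; the embedding dimensions and temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def infonce_bind_loss(image_emb: torch.Tensor, audio_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss that pulls embeddings of paired image/audio
    samples together and pushes apart the non-paired ones in the batch."""
    # Normalize so dot products become cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with audio clip j
    logits = image_emb @ audio_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Matching pairs sit on the diagonal; optimize in both directions
    loss_i2a = F.cross_entropy(logits, targets)
    loss_a2i = F.cross_entropy(logits.T, targets)
    return (loss_i2a + loss_a2i) / 2

# Toy usage with random "embeddings" standing in for hypothetical encoders
batch, dim = 8, 1024
image_emb = torch.randn(batch, dim)   # e.g. output of a frozen vision encoder
audio_emb = torch.randn(batch, dim)   # e.g. output of an audio encoder being trained
print(infonce_bind_loss(image_emb, audio_emb))
```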



This kind of multimodal learning, in which ImageBind binds the six data types together around images and video, allows AI to interpret content more holistically without requiring resource-intensive training on every pair of modalities.
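
Because every modality lands in the same embedding space, even pairs of modalities that were never trained against each other directly, such as audio and images, can be compared with a simple cosine similarity. A minimal sketch of that kind of cross-modal retrieval, assuming the embeddings have already been computed as above:

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, candidate_embs: torch.Tensor, k: int = 3):
    """Return the indices of the k candidates closest to the query,
    regardless of which modality either embedding came from."""
    query_emb = F.normalize(query_emb, dim=-1)
    candidate_embs = F.normalize(candidate_embs, dim=-1)
    scores = candidate_embs @ query_emb          # cosine similarities
    return torch.topk(scores, k).indices

# Toy example: an audio embedding (e.g. rain sounds) queried against image embeddings
audio_query = torch.randn(1024)
image_library = torch.randn(100, 1024)
print(retrieve(audio_query, image_library))
```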



Conventional image-generation AI produces images and video from text, but with ImageBind it also becomes possible to generate images from sounds such as laughter or falling rain. For example, you could enter as a prompt the text 'small creature', an image of a forest, the sound of rain falling in the forest, and IMU data capturing a bird's movement.



The AI can then generate an animation of a cute little creature moving through a rain-soaked forest.
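
This example relies on composing embeddings from several modalities into a single conditioning vector for a generator. The generation pipeline itself is not part of the open-source release, so the following is only a conceptual sketch in which the combination is a weighted average in the shared space and the downstream generator is left as a hypothetical function.

```python
import torch
import torch.nn.functional as F

def compose_prompt(embeddings: dict[str, torch.Tensor],
                   weights: dict[str, float]) -> torch.Tensor:
    """Combine per-modality ImageBind-style embeddings into one prompt vector
    by weighted averaging in the shared space (one possible composition rule)."""
    combined = sum(weights[name] * F.normalize(emb, dim=-1)
                   for name, emb in embeddings.items())
    return F.normalize(combined, dim=-1)

# Hypothetical precomputed embeddings for the example in the article
prompt = compose_prompt(
    embeddings={
        "text":  torch.randn(1024),   # "small creature"
        "image": torch.randn(1024),   # photo of a forest
        "audio": torch.randn(1024),   # sound of rain in the forest
        "imu":   torch.randn(1024),   # motion of a bird
    },
    weights={"text": 1.0, "image": 0.5, "audio": 0.5, "imu": 0.25},
)
# An embedding-conditioned generator (not included here) would then decode
# this prompt vector into an image or animation.
```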



Meta said, "In this research we explored six modalities, but we believe that connecting as many senses as possible, such as touch, smell, and fMRI signals from the brain, will enable richer, more human-centric AI models." Much about multimodal learning remains unexplored, however, and Meta describes ImageBind as a first step in that research.

in Software, Video, Posted by log1i_yk