An engineer who actually used Gemini 1.5 Pro posted a review praising it, saying, 'The movie processing is especially amazing.'



Engineer Simon Willison, one of the creators of Django, has posted his impressions of actually using Gemini 1.5 Pro on his blog.

The killer app of Gemini Pro 1.5 is video

https://simonwillison.net/2024/Feb/21/gemini-pro-video/

Gemini 1.5 Pro is a multimodal AI announced by Google on February 16, 2024, and is said to be able to process up to 1 million tokens.

Google releases Gemini 1.5, can process up to 1 million tokens and handle 1 hour of movies and 700,000 words of text - GIGAZINE



Mr. Willison said, 'It's amazing that the token context size has expanded to 1 million, but the most exciting thing is that you can input movies,' and posted what happened when he actually had Gemini process a movie. For example, the movie below is a roughly 7-second clip of Mr. Willison's bookshelf.

My bookshelf - YouTube


The first amazing thing about this movie is that it consumes only 1841 tokens, yet Gemini properly reads the contents and outputs a list of the book titles written on the spines.



If you ask 'Convert to JSON', it will output the list in JSON format.



Not only that, but Mr. Willison was particularly surprised that Gemini correctly identified a book whose spine was more than half hidden, as shown in the image below, as 'Site Seeing: A Visual Approach to Web Usability' by Luke Wroblewski.



However, Mr. Willison said that one hallucination occurred.

Mr. Willison then shot the following 22-second movie. Although it was slightly longer, it still consumed only 6049 tokens.

My bookshelf 2 - YouTube


Because the number of tokens is so small, Mr. Willison seems to have initially suspected that movies are processed in a different format than images, but he changed his mind when he saw a Google blog post stating, 'Google AI Studio splits movies into images.'

In addition, Mr. Willison actually tried inputting images and confirmed that one image consumes 258 tokens. Google states that Gemini processes a 45-minute movie as 2,674 frames and 684,000 tokens, so 684,000 ÷ 2,674 works out to roughly 256 tokens per frame. Since this is almost the same as the token count of a single image, Mr. Willison concludes that there is little doubt that movies are processed by splitting them into images.
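The arithmetic above can be checked with a short Python sketch. It uses only the figures quoted in this article; the function names are made up for illustration, and the frame counts estimated for the two bookshelf movies are an inference from the 258-tokens-per-image measurement, not numbers published by Google or Mr. Willison.

```python
# Figures quoted in the article
IMAGE_TOKENS = 258           # tokens measured for a single still image
LONG_MOVIE_TOKENS = 684_000  # Google's figure for the 45-minute movie
LONG_MOVIE_FRAMES = 2_674    # Google's frame count for the same movie


def tokens_per_frame(total_tokens: int, frames: int) -> float:
    """Average number of tokens consumed by one frame."""
    return total_tokens / frames


def estimated_frames(total_tokens: int, tokens_per_image: int = IMAGE_TOKENS) -> float:
    """Rough frame count, ASSUMING each frame costs about as much as one image."""
    return total_tokens / tokens_per_image


per_frame = tokens_per_frame(LONG_MOVIE_TOKENS, LONG_MOVIE_FRAMES)
print(f"45-minute movie: {per_frame:.1f} tokens per frame")

# (length in seconds, tokens consumed) for the two bookshelf movies
clips = {"My bookshelf": (7, 1841), "My bookshelf 2": (22, 6049)}
for title, (seconds, tokens) in clips.items():
    frames = estimated_frames(tokens)
    print(f"{title}: about {frames:.1f} frames over {seconds} seconds")
```

Under that assumption, both short movies come out at roughly one frame per second of footage, which is in line with 2,674 frames for a 45-minute (2,700-second) movie.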

in Software, Video, Posted by log1d_ts