Apple announces Ferret-UI, a multimodal LLM that can recognize smartphone screens and could enable Siri to understand the UI of iPhone apps



Researchers at Apple have published a paper on arXiv, a repository for preprints that have not been peer-reviewed, describing 'Ferret-UI,' a multimodal large language model (MLLM) designed to understand smartphone app UIs.

[2404.05719] Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

https://arxiv.org/abs/2404.05719

Apple teaching an AI system to use apps; maybe for advanced Siri
https://9to5mac.com/2024/04/09/ferret-ui-advanced-siri/

Large language models (LLMs), which underpin chatbot AI systems like ChatGPT, learn from huge amounts of text, mainly collected from websites, while MLLMs like Google Gemini learn not only from text but also from non-text information such as images, video, and audio.

However, MLLMs generally do not perform well at understanding smartphone app screens. One reason is that most of the images and videos used for training have a landscape aspect ratio, unlike smartphone screens, which are tall and narrow. Another is that the UI elements that need to be recognized on a smartphone, such as icons and buttons, are much smaller than the objects that appear in natural images.

Ferret-UI, announced by the Apple researchers, is a generative AI system designed to recognize the screens of smartphone apps.



Smartphone screens usually have an elongated aspect ratio and contain small objects such as icons and text. To address this, Ferret-UI adopts a technique called 'any resolution': each screen is split into sub-images according to its aspect ratio, and these magnified sub-images are encoded alongside the full screen, allowing Ferret-UI to accurately recognize fine UI details regardless of screen resolution.
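As an illustration of how such aspect-ratio-based splitting might look in practice, here is a minimal Python sketch. The two-way split, the encoder call, and the 336-pixel input size are assumptions for illustration, not Apple's published implementation.

```python
from PIL import Image

def split_any_resolution(screenshot: Image.Image):
    """Illustrative 'any resolution' pre-processing: in addition to the
    full screen, produce sub-images chosen by aspect ratio so that small
    UI elements occupy more pixels when each crop is resized for the
    visual encoder. (Sketch only; grid choice and encoder are assumed.)"""
    w, h = screenshot.size
    if h >= w:   # portrait screen -> split into top and bottom halves
        crops = [screenshot.crop((0, 0, w, h // 2)),
                 screenshot.crop((0, h // 2, w, h))]
    else:        # landscape screen -> split into left and right halves
        crops = [screenshot.crop((0, 0, w // 2, h)),
                 screenshot.crop((w // 2, 0, w, h))]
    # The global view plus the magnified sub-views are each resized to the
    # encoder's fixed input size and encoded separately.
    return [screenshot] + crops

images = split_any_resolution(Image.open("home_screen.png"))
# features = [visual_encoder(img.resize((336, 336))) for img in images]  # hypothetical encoder
```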



The researchers also carefully collected a wide range of training examples for elementary UI tasks, such as icon recognition, text finding, and widget listing. These examples are annotated with region information, which makes it easier to tie language to specific locations on the screen and to refer to and ground UI elements precisely. In other words, Ferret-UI learns to correctly understand a wide variety of UIs from a large number of concrete, annotated examples.
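To make the idea of region-annotated training data concrete, the sketch below shows what one such sample could look like. The schema, field names, and the `<box>` tag are hypothetical; the paper only describes the tasks and the use of region annotations.

```python
# One hypothetical training sample for the elementary "widget listing" task.
# Each instruction is paired with region annotations so the model can
# ground its language to specific on-screen locations.
sample = {
    "image": "screenshots/settings_portrait.png",
    "task": "widget_listing",
    "instruction": "List the interactive widgets visible on this screen.",
    "regions": [                      # normalized [x1, y1, x2, y2] boxes
        {"label": "toggle 'Wi-Fi'",     "bbox": [0.82, 0.18, 0.95, 0.22]},
        {"label": "button 'Bluetooth'", "bbox": [0.05, 0.24, 0.95, 0.30]},
    ],
    "answer": "toggle 'Wi-Fi' <box>[0.82,0.18,0.95,0.22]</box>, "
              "button 'Bluetooth' <box>[0.05,0.24,0.95,0.30]</box>",
}
```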



According to the paper, Ferret-UI outperforms GPT-4V and other existing UI-aware MLLMs. This suggests that Ferret-UI's 'any resolution' technique, its large-scale and diverse training data, and its support for advanced tasks are highly effective for understanding and operating UIs.

To further enhance the model's reasoning capabilities, the researchers also compiled datasets for advanced tasks such as detailed description, perception/interaction conversation, and function inference. These enable Ferret-UI to go beyond simple UI recognition to more complex and abstract UI understanding and interaction.
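For intuition, an 'interaction conversation' sample for these advanced tasks might look like the following. The structure and coordinates are hypothetical and only illustrate how a question about a screen could be paired with a grounded answer.

```python
# Hypothetical example of an "interaction conversation" training sample.
# The format is an assumption, not the paper's exact schema.
conversation_sample = {
    "image": "screenshots/music_player.png",
    "dialogue": [
        {"role": "user",
         "content": "I want to listen to the next song. Where should I tap?"},
        {"role": "assistant",
         "content": "Tap the skip-forward button at the bottom right of the "
                    "playback bar <box>[0.78,0.90,0.88,0.96]</box>."},
    ],
}
```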

If Ferret-UI is put into practical use, it is expected to improve accessibility. For people with visual impairments who cannot see the smartphone screen, the AI could summarize what is displayed and convey it to them. In addition, during app development, having Ferret-UI examine the screen may make it possible to check the clarity and ease of use of an app's UI more quickly.



Furthermore, because it is a multimodal AI optimized for smartphones, it could be combined with Siri, the AI assistant built into the iPhone, to automate more advanced tasks in any app.
