'Cradle,' an AI framework that plays games by operating a mouse and keyboard like a human, has been developed
In recent years, AI performance has improved dramatically, and AI agents that can perform tasks in complex real-world scenarios have been developed. However, these AI agents often fail to perform generalized tasks across multiple scenarios, which is due to differences in the observations and actions required in each environment. Therefore, a Chinese research team has announced 'Cradle,' an AI framework that can operate games and apps like a human by using the most unified interface, the 'screen,' as input and the 'keyboard' and 'mouse' as output.
[2403.03186] Cradle: Empowering Foundation Agents Towards General Computer Control
Cradle: Empowering Foundation Agents Towards General Computer Control
https://baai-agents.github.io/Cradle/
Unlocking the Potential of General Computer Control with CRADLE: Steering Through Digital Challenges - MarkTechPost
https://www.marktechpost.com/2024/03/15/unlocking-the-potential-of-general-computer-control-with-cradle-steering-through-digital-challenges/
To generalize the AI agent across different scenarios, the team proposed 'General Computer Control (GCC),' which interacts with the software by providing mouse and keyboard output in response to screen input.
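The GCC idea of "screen in, keyboard and mouse out" can be illustrated with a small interface sketch. This is not Cradle's code: every name here (`Screenshot`, `KeyPress`, `MouseMove`, `MouseClick`, `ComputerEnvironment`, `FakeEnvironment`, `click_center_policy`, `run_step`) is hypothetical, and the environment only records actions in memory instead of sending them to a real operating system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol, Union


@dataclass(frozen=True)
class KeyPress:
    key: str


@dataclass(frozen=True)
class MouseMove:
    x: int
    y: int


@dataclass(frozen=True)
class MouseClick:
    button: str = "left"


# The whole output space: only keyboard and mouse events.
Action = Union[KeyPress, MouseMove, MouseClick]


@dataclass(frozen=True)
class Screenshot:
    """The only observation: what is shown on the screen."""
    width: int
    height: int
    pixels: bytes  # raw image data; left empty in this sketch


class ComputerEnvironment(Protocol):
    """Hypothetical GCC-style interface shared by every game or app."""
    def observe(self) -> Screenshot: ...
    def act(self, action: Action) -> None: ...


@dataclass
class FakeEnvironment:
    """In-memory stand-in that logs actions instead of executing them."""
    log: List[Action] = field(default_factory=list)

    def observe(self) -> Screenshot:
        return Screenshot(width=1920, height=1080, pixels=b"")

    def act(self, action: Action) -> None:
        self.log.append(action)


def click_center_policy(screen: Screenshot) -> List[Action]:
    """Toy policy (an assumption): move to the screen center and click."""
    return [MouseMove(screen.width // 2, screen.height // 2), MouseClick()]


def run_step(env: ComputerEnvironment,
             policy: Callable[[Screenshot], List[Action]]) -> None:
    """One observe-then-act cycle; no software-specific API is involved."""
    screen = env.observe()
    for action in policy(screen):
        env.act(action)


env = FakeEnvironment()
run_step(env, click_center_policy)
```

Because the policy sees only a `Screenshot` and emits only `Action` values, the same loop could in principle drive any software, which is the generalization the article attributes to GCC.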
Computers are the most important and universal interface between humans and the digital world, providing apps, games, and other software that AI agents can interact with while avoiding the hardware requirements, breakdowns, and other issues associated with real robots. Mastering these virtual environments is a promising way to generalize AI agents.
Achieving GCC requires several capabilities: 'appropriate understanding of visual information through a screen and decision-making based on that,' 'accurate control of a keyboard and mouse to interact with a computer,' 'retention of reasoning and experience to perform complex tasks,' and 'self-improvement to autonomously discover better strategies and solutions.' As a preliminary attempt toward GCC, the team developed Cradle, an AI framework built on large language models (LLMs).
Cradle uses OpenAI's GPT-4o as its backbone model, and the framework consists of a total of six modules: '1: Information collection module that processes multimodal inputs', '2: Self-reflection module that reconsiders past experiences', '3: Task inference module that selects the next best task to perform', '4: Skill collection module that generates and updates skills related to specific tasks', '5: Action planning module that determines keyboard and mouse actions', and '6: Memory module that stores past experiences and skills'. These modules allow Cradle to perform various tasks and play games.
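How the six modules described above might fit together in one decision step can be sketched as follows. This is a toy illustration, not Cradle's implementation: the function names, the text-based stand-in for the screen, the hard-coded skills, and the order in which the modules are called are all assumptions, and no language model is invoked.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Memory:
    """6: memory module, stores past experiences and skills."""
    episodes: List[dict] = field(default_factory=list)
    skills: Dict[str, List[str]] = field(default_factory=dict)


def gather_information(screen_text: str) -> dict:
    """1: information collection (a real system would parse a screenshot)."""
    return {"screen": screen_text}


def self_reflect(memory: Memory) -> str:
    """2: self-reflection, reconsider whether the last step succeeded."""
    if not memory.episodes:
        return "no history"
    return "ok" if memory.episodes[-1]["success"] else "retry"


def infer_task(observation: dict, reflection: str, memory: Memory) -> str:
    """3: task inference, pick the next task (toy rules, assumed)."""
    if reflection == "retry":
        return memory.episodes[-1]["task"]
    return "open_menu" if "menu" in observation["screen"] else "explore"


def curate_skills(task: str, memory: Memory) -> None:
    """4: skill collection, add a skill for the task if none is stored."""
    defaults = {"open_menu": ["key:esc"], "explore": ["key:w"]}
    memory.skills.setdefault(task, defaults.get(task, []))


def plan_actions(task: str, memory: Memory) -> List[str]:
    """5: action planning, map the task to keyboard and mouse actions."""
    return list(memory.skills[task])


def step(screen_text: str, memory: Memory, success: bool = True) -> List[str]:
    """Run all six modules once; `success` simulates the step's outcome."""
    observation = gather_information(screen_text)
    reflection = self_reflect(memory)
    task = infer_task(observation, reflection, memory)
    curate_skills(task, memory)
    actions = plan_actions(task, memory)
    memory.episodes.append({"task": task, "actions": actions, "success": success})
    return actions


memory = Memory()
first = step("main menu visible", memory, success=False)  # simulated failure
second = step("empty field", memory)  # reflection triggers a retry
```

In this sketch the second step repeats the failed task rather than reacting to the new screen, which shows how stored experience can feed back into task selection.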
The research team reports that Cradle had a fairly high success rate in simple tasks such as 'chasing NPCs' and 'going to specific locations' when playing games. On the other hand, because it was poor at spatial recognition and time-related decision-making, it was less successful in tasks such as 'navigating dangerous and winding paths' and 'performing real-time combat and search tasks.'
Several videos of Cradle actually playing games have been posted on YouTube. Below is a video of Cradle playing the open-world action game 'Red Dead Redemption 2', in which it successfully completed one mission that took 40 minutes.
Cradle Mastering Tasks in Chapter 1 of Red Dead Redemption II (at 16x speed) - YouTube
Below is a video of Cradle playing the city development simulation game 'Cities: Skylines'. Although Cradle made some mistakes, such as failing to connect water pipes and causing a water shortage in the city, it reportedly succeeded in covering the available area with residential, commercial and industrial zones.
Cradle Mastering Tasks in Cities: Skylines (at 16x speed) - YouTube
Below is a gameplay video of the slow-life experience game 'Stardew Valley.' Although Cradle had difficulty manipulating objects and interacting with characters, it was able to harvest parsnips.
Cradle Mastering Tasks in Stardew Valley (at 16x speed) - YouTube
Beyond games, Cradle also achieved some success in tasks such as 'downloading a paper in Chrome,' 'posting to X (formerly Twitter) from Chrome,' 'opening and closing a page in Chrome,' 'finding a specific email in Outlook,' and 'replying in Outlook.' However, even with standard GUIs such as Chrome and Outlook, it sometimes failed to recognize certain UI elements or lost visual context, and the success rate was even lower with other non-standard software.
The research team said, 'To our knowledge, Cradle is the first framework that enables AI agents to succeed in such diverse environments without relying on built-in APIs.' 'Although Cradle still faces challenges in certain tasks, the combination of further extensions of the framework and advances in LLMs will serve as a pioneering study for developing more powerful LLM-based general-purpose agents across computer-controlled tasks.'