Apple researchers have published another paper on artificial intelligence (AI) models, and this time the focus is on understanding and navigating smartphone user interfaces (UI). The paper, which has not yet been peer-reviewed, highlights a large language model (LLM) called Ferret UI that can go beyond traditional computer vision and understand complex smartphone screens.
This is not the first AI paper published by the tech giant's research division; it has previously published a research paper on multimodal LLMs (MLLMs) and another on on-device AI models.
A pre-print version of the paper has been published on arXiv, an open-access online repository for scientific papers. The paper is titled “Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs” and focuses on expanding the use cases of MLLMs.
It highlights that most language models with multimodal capabilities cannot understand anything beyond natural images, leaving their functionality “constrained,” and argues that AI models capable of understanding complex and dynamic interfaces, such as those found on a smartphone, are needed.
According to the paper, Ferret UI is designed to “execute fine-grained referring and grounding tasks specific to UI screens, while adeptly interpreting and acting upon open-ended language instructions.” In simple terms, the vision language model can not only parse a smartphone screen containing multiple elements that represent different pieces of information, but can also describe them to the user when queried.
Based on the images shared in the paper, the model can understand and classify UI elements and recognize icons. It can also answer questions such as “Where is the launch icon?” and “How do I open the Reminders app?” This suggests that the AI can not only explain the screen it sees, but can also navigate to different parts of an iPhone based on a prompt.
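Neither the paper nor Apple describes a public API for Ferret UI, but a purely illustrative sketch of what these referring and grounding queries could look like in code is shown below; the FerretUIModel class, its methods, and the hard-coded answers are hypothetical placeholders rather than anything Apple has released.

```python
# Illustrative-only sketch of referring/grounding queries to a UI-aware model.
# `FerretUIModel`, its methods, and the hard-coded return values are
# hypothetical placeholders; the paper does not describe a public API.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UIElement:
    kind: str                        # e.g. "icon", "text", "toggle"
    description: str                 # e.g. "Reminders app icon"
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in screen pixels


class FerretUIModel:
    """Placeholder for a UI-grounded multimodal LLM (hypothetical API)."""

    def query(self, screenshot_path: str, prompt: str) -> str:
        # Referring/QA: answer an open-ended question about the screen.
        return "Tap the Reminders icon in the second row of the home screen."

    def ground(self, screenshot_path: str, prompt: str) -> List[UIElement]:
        # Grounding: return the on-screen elements the prompt refers to,
        # along with pixel bounding boxes.
        return [UIElement("icon", "Reminders app icon", (120, 840, 200, 920))]


if __name__ == "__main__":
    model = FerretUIModel()
    print(model.query("home_screen.png", "How do I open the Reminders app?"))
    print(model.ground("home_screen.png", "Where is the Reminders icon?"))
```

The split between query and ground mirrors the two capabilities the paper emphasises: answering open-ended questions about a screen versus pointing to the exact on-screen elements a request refers to.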
To train Ferret UI, Apple researchers themselves generated data of varying complexity, which helped the model learn basic tasks and understand single-step operations. “For advanced tasks, we use GPT-4 [40] to generate data, including detailed description, conversation perception and interaction, and functional reasoning,” the paper explained. “These advanced tasks prepare the model to engage in more nuanced discussions about visual components, formulate action plans with specific goals in mind, and explain the overall purpose of the screen.”
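The paper does not share its exact data-generation prompts, but the general recipe of feeding GPT-4 a textual description of a screen's elements and asking it for description, conversation, or reasoning samples could look roughly like the sketch below; the prompt wording, the screen-description format, and the helper function are illustrative assumptions, not Apple's actual pipeline.

```python
# Rough sketch of GPT-4-based generation of "advanced task" training data,
# in the spirit the paper describes. The prompt wording, screen-description
# format, and helper name are illustrative assumptions, not Apple's pipeline.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def generate_advanced_sample(screen_description: str, task: str) -> str:
    """Ask GPT-4 for one training sample (e.g. a detailed description,
    a perception/interaction conversation, or a functional-reasoning answer)
    about a single annotated UI screen."""
    prompt = (
        "You are annotating a mobile UI screen to train a multimodal model.\n"
        f"Screen elements (type, text, bounding box):\n{screen_description}\n\n"
        f"Task: {task}. Respond with the training sample only."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    screen = "icon 'Reminders' (120, 840, 200, 920); text 'Today' (40, 120, 180, 160)"
    print(generate_advanced_sample(screen, "a detailed description of the screen"))
```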