Apple releases benchmark results showing the performance of its personal AI 'Apple Intelligence,' revealing the performance gap with GPT-4-Turbo



Apple has released benchmark results for Apple Intelligence, the personal AI for Apple devices.

Introducing Apple's On-Device and Server Foundation Models - Apple Machine Learning Research
https://machinelearning.apple.com/research/introducing-apple-foundation-models



Apple Intelligence was announced during the keynote at Apple's annual developer conference 'WWDC24,' held from 2:00 a.m. on June 11, 2024 (Japan time). You can find out more about Apple Intelligence in the following article.

Apple announces new personal AI 'Apple Intelligence', Siri supports ChatGPT in partnership with OpenAI - GIGAZINE



Apple Intelligence comprises two foundation models: an on-device model with approximately 3 billion parameters that runs on devices such as the iPhone, and a larger, more powerful server model. Apple handles every stage of creating both models, from collecting the training data to training and optimization, and both are built with privacy as a foundation.



The foundation model behind Apple Intelligence is trained with Apple's AXLearn framework on licensed data, which is filtered to remove personally identifiable information such as credit card details, as well as profanity and low-quality content.
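Apple has not published its filtering pipeline, but as a rough sketch of what this kind of preprocessing can look like, the Python snippet below drops documents that contain credit-card-like numbers or banned words and applies a crude quality heuristic. Every pattern, word list, and threshold here is an illustrative assumption, not Apple's actual criteria.

```python
import re

# Illustration only: Apple's real filtering pipeline is not public.

# Rough pattern for credit-card-like numbers (13-16 digits, optional separators).
CREDIT_CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

PROFANITY = {"examplecurse"}  # placeholder word list, not Apple's


def is_low_quality(text: str) -> bool:
    """Crude quality heuristic: very short docs or mostly non-alphabetic text."""
    if len(text) < 200:
        return True
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    return alpha_ratio < 0.5


def keep_document(text: str) -> bool:
    """Return True if the document survives all filters."""
    if CREDIT_CARD_RE.search(text):
        return False  # likely contains payment card numbers (PII)
    if any(word in text.lower() for word in PROFANITY):
        return False  # profanity filter
    return not is_low_quality(text)


corpus = ["Some licensed training document ...", "4111 1111 1111 1111 call now"]
filtered = [doc for doc in corpus if keep_document(doc)]
```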

The base model is fine-tuned for users' everyday activities, and adapters that can be 'plugged into' different layers of the model enhance its ability to handle specific tasks.
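Apple has not released its adapter implementation, but the general 'plug-in adapter' idea is well illustrated by LoRA-style low-rank adaptation: a frozen layer of the base model is augmented with a small trainable update. The PyTorch sketch below shows that pattern under assumed names and dimensions; it is not Apple's code.

```python
import torch
import torch.nn as nn


class LoRAAdapter(nn.Module):
    """Sketch of a plug-in adapter: a frozen base linear layer plus a small
    trainable low-rank update, in the spirit of LoRA. Not Apple's code."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base model weights stay frozen
        # Low-rank factors: A projects down to `rank`, B projects back up.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the scaled low-rank correction.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale


# Plugging an adapter into one layer: only A and B are trained per task.
layer = nn.Linear(512, 512)
adapted = LoRAAdapter(layer)
out = adapted(torch.randn(2, 512))
```

Because only the small A and B matrices are trained, a separate adapter can be kept for each task while the roughly 3-billion-parameter base model stays shared between them.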



Benchmark comparisons between the newly announced Apple Intelligence models and other models have also been published. The on-device model is compared with small open models such as Gemma-2B, Gemma-7B, Mistral-7B, and Phi-3-mini, while the server model is compared with large open models such as DBRX-Instruct and Mixtral-8x22B, as well as OpenAI's commercial models GPT-3.5-Turbo and GPT-4-Turbo.

The figure below shows the results of a human evaluation in which raters compared responses to a range of real-world prompts and judged which was better. The on-device model's responses were clearly preferred over those of every model it was compared with, and the server model outperformed DBRX-Instruct, GPT-3.5-Turbo, and Mixtral-8x22B, though it lost to GPT-4-Turbo.
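This kind of pairwise 'which is better?' evaluation reduces to tallying preference rates for each comparison. The minimal sketch below, using invented verdict labels rather than Apple's data, shows how such win/tie/loss fractions are typically computed.

```python
from collections import Counter

# Hypothetical rater verdicts for one pairwise comparison, one per prompt:
# "win" = Apple's response preferred, "loss" = competitor's preferred.
verdicts = ["win", "win", "tie", "loss", "win", "tie", "win", "loss"]

counts = Counter(verdicts)
total = len(verdicts)
for outcome in ("win", "tie", "loss"):
    print(f"{outcome}: {counts[outcome] / total:.1%}")
```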



When comparing the likelihood of outputting harmful content in response to adversarial prompts that attempt to circumvent safety measures, the Apple Intelligence models were found to be the least likely of the models compared to generate harmful content, in both the on-device and server categories.



When the evaluation was narrowed to responses to prompts designed to elicit harmful content, the Apple Intelligence models' responses were rated significantly more favorably than those of the comparison models.



In addition, on the IFEval benchmark, which measures how well a model can follow instructions, the on-device model recorded the highest score among the models compared, while the server model scored on par with GPT-4-Turbo.
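IFEval works by giving models instructions whose fulfillment a script can verify automatically, without human judgment. As a hedged illustration only, the toy checker below grades two invented constraints, a minimum word count and a required keyword; the real benchmark covers a much broader set of verifiable instructions.

```python
def follows_instructions(response: str, min_words: int, keyword: str) -> bool:
    """Toy IFEval-style check: constraints a script can grade automatically.
    The constraints here are illustrative, not IFEval's actual set."""
    word_count = len(response.split())
    return word_count >= min_words and keyword.lower() in response.lower()


# Invented (response, min_words, keyword) triples for demonstration.
responses = [
    ("Apple Intelligence runs on device and in the cloud.", 5, "device"),
    ("Too short.", 5, "device"),
]
score = sum(follows_instructions(r, n, k) for r, n, k in responses) / len(responses)
print(f"instruction-following accuracy: {score:.0%}")
```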



The figure below shows benchmark results for writing ability, covering summarization and composition. Both the on-device model and the server model achieved the highest level of performance here as well.



Apple plans to release more information soon about its broader family of models, including language models, diffusion models and coding models.

in Software, Posted by log1d_ts