An attack method has emerged that can steal hidden information and some functionality from ChatGPT and other large language models (LLMs).

AI researchers have unveiled a 'model-stealing attack' that can extract confidential information and some functionality from production models such as OpenAI's chat AI ChatGPT and Google's large language model (LLM) PaLM-2.
[2403.06634] Stealing Part of a Production Language Model
https://arxiv.org/abs/2403.06634

Google announces Stealing Part of a Production Language Model
We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the… pic.twitter.com/bgBCTYywWN
— AK (@_akhaliq) March 12, 2024
The technique, a 'model-stealing attack' that aims to extract confidential information that would otherwise remain hidden inside an AI model or LLM, was devised by a research team led by Nicholas Carlini of Google DeepMind. Other researchers involved in the work are affiliated with ETH Zurich, the University of Washington, Google Research, Cornell University, and OpenAI.
The research team first described model-stealing attacks in 2020, but it was not until October 2023 that they found the method was effective against the APIs of language models actually running in production; until then, such attacks had not been considered feasible.
The research team carried out a proof-of-concept model-stealing attack in November 2023, and in December of the same year they notified the services confirmed to be vulnerable, giving them time to fix the issue before disclosure. They also shared details of the attack with several popular services that were not vulnerable.
Following this notification, Google updated its software to address the vulnerability, and OpenAI did the same on March 3, 2024. The paper on the model-stealing attack was then published on March 11, 2024 local time.

The research team first launched model-stealing attacks against several white-box models to verify that the attack actually works. They then attacked OpenAI's GPT-3 models, including Ada, the fastest model in the GPT-3 family, and Babbage, a model designed to perform simple tasks quickly and at low cost, and successfully recovered the entire final layer of each model. Naturally, the team notified OpenAI of its intention to carry out the attacks beforehand and obtained approval.
The researchers also confirmed that the attack works against GPT-3.5-turbo-instruct and GPT-3.5-turbo-chat. As part of a responsible disclosure agreement, they did not publish the sizes of these models, but OpenAI confirmed that the hidden-layer sizes recovered from each model were accurate.
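As a rough illustration of the idea behind recovering the final layer (a simplified sketch, not the authors' exact procedure): the final layer of a transformer projects a relatively small hidden state into a much larger vocabulary-sized logit vector, so a matrix built from many full logit vectors collected through the API has low numerical rank. Counting the significant singular values reveals the hidden dimension, and the corresponding singular vectors recover the final projection layer up to an unknown linear transformation. A minimal sketch in Python, where query_full_logits() is a hypothetical helper that returns a complete logit vector for a prompt:

```python
import numpy as np

def recover_hidden_dim_and_final_layer(logit_matrix, tol=1e-4):
    """Estimate the hidden dimension and a basis for the final projection layer.

    logit_matrix: (n_prompts, vocab_size) array of full logit vectors
    collected from the target model. Each row is W @ h for the final
    projection W and some hidden state h, so the matrix has rank at most
    the hidden dimension.
    """
    # Singular values drop sharply once the true hidden dimension is exceeded.
    _, singular_values, vt = np.linalg.svd(logit_matrix, full_matrices=False)
    hidden_dim = int(np.sum(singular_values > tol * singular_values[0]))
    # The top right-singular vectors span the column space of W, i.e. they
    # recover the final layer up to an invertible linear transformation.
    recovered_w = vt[:hidden_dim].T  # (vocab_size, hidden_dim)
    return hidden_dim, recovered_w

# Hypothetical usage (query_full_logits is not a real API call):
# logits = np.stack([query_full_logits(p) for p in prompts])  # needs more prompts than hidden_dim
# hidden_dim, w_hat = recover_hidden_dim_and_final_layer(logits)
```

Only somewhat more probe prompts than the hidden dimension are needed for the rank estimate to saturate, which is why this step is cheap relative to the size of the model.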
While model-stealing attacks had traditionally been considered impractical by AI experts, this paper demonstrates that it is possible to steal parts of a model and some of its characteristics in this way. However, the research team notes that stealing a model is not necessarily more cost-effective than training your own, and that recovering a model almost perfectly with such an attack remains difficult.

The research team attributed the success of the attack to the fact that 'a small number of model providers made the logit bias parameter available,' and cited Anthropic as an example of a provider that does not offer this type of API. This case, in which a small API design decision can enable attacks on an AI model, shows that APIs need to be 'designed with security in mind,' the team argues.
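To see why exposing a logit bias matters, here is a rough sketch of how such a parameter can be abused (an illustration under stated assumptions, not the authors' exact procedure). An API that returns only the top-k log-probabilities normally hides the logits of unlikely tokens, but adding a large positive bias to an arbitrary token forces it into the top-k; comparing its biased log-probability against an unbiased reference token that also appears in the response cancels the shared softmax normalizer and reveals the token's logit relative to the reference. The api_topk_logprobs() helper below is a hypothetical stand-in for such an API:

```python
def recover_relative_logit(api_topk_logprobs, prompt, token_id, reference_id, bias=100.0):
    """Recover logit(token_id) - logit(reference_id) for the next token after `prompt`.

    api_topk_logprobs(prompt, logit_bias) is a hypothetical stand-in for a
    completion API that applies the per-token `logit_bias` mapping and returns
    a {token_id: logprob} dict for the top-k most likely next tokens. The
    reference token is assumed to be likely enough to appear in that dict.
    """
    # A large positive bias pushes the target token into the returned top-k.
    logprobs = api_topk_logprobs(prompt, logit_bias={token_id: bias})
    # Both entries share the same log-sum-exp normalizer, so it cancels out:
    #   logprob[token_id]     = logit[token_id] + bias - logsumexp(biased logits)
    #   logprob[reference_id] = logit[reference_id]    - logsumexp(biased logits)
    return logprobs[token_id] - logprobs[reference_id] - bias
```

Repeating this for every token in the vocabulary yields a full logit vector for a prompt, which is exactly the kind of input the rank-and-SVD step sketched above operates on.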
The research team also expects that more practical attack methods targeting AI models than this model-stealing attack will emerge in the future.