How open are so-called 'open source' AI models really?



In the rapidly developing field of AI, OpenAI's large language model 'GPT' is developed behind closed doors, while large language models such as Meta's LLaMA and Google's Gemma are developed openly. However, even among language models that claim to be 'open source,' it is unclear to what extent they are actually open. A research team at Radboud University has therefore reported the results of a survey on the degree of openness of language models billed as 'open source.'

Rethinking open source generative AI: open-washing and the EU AI Act | Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency
https://dl.acm.org/doi/10.1145/3630106.3659005



Not all 'open source' AI models are actually open: here's a ranking
https://www.nature.com/articles/d41586-024-02012-5

The term 'open source' implies access to the source code and no restrictions on the use or distribution of the program. However, given the complexity of large language models and the huge amount of data involved, making everything open source is no simple feat. Furthermore, revealing a model in its entirety could expose it to commercial or legal risks, as well as increase the risk of misuse.

Still, simply by slapping the 'open source' label on a model, companies developing large language models can present themselves as transparent.

To explore this practice, known as 'open-washing,' Marc Dingemanse of Radboud University and his team evaluated 40 large language models that claim to be 'open source' or 'open' against 14 parameters, including the availability of code and training data, public documentation, and ease of access to the models.

Here is an excerpt from the results of Dingemanse's research: although each model claims to be 'open,' BigScience's BloomZ is rated as 'open' in every category, while Meta's Llama 3-Instruct is open in very few.
Model               | Source code    | Model data | Model weights  | Preprint       | API
BloomZ              | Open           | Open       | Open           | Open           | Open
OLMo                | Open           | Open       | Open           | Open           | Partially open
Mistral 7B-Instruct | Partially open | Closed     | Open           | Partially open | Open
Orca 2              | Closed         | Closed     | Partially open | Partially open | Partially open
Gemma 7B-Instruct   | Partially open | Closed    | Partially open | Partially open | Closed
Llama 3-Instruct    | Closed         | Closed     | Partially open | Closed         | Partially open
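The excerpted ratings above can be condensed into a rough ranking by scoring each dimension. The following is a minimal illustrative sketch of that idea (not the research team's actual methodology, which spans 14 parameters; the scoring weights here are invented for illustration):

```python
# Hypothetical sketch: aggregate per-dimension openness ratings into one score.
# The weights (open=1.0, partial=0.5, closed=0.0) are an assumption for
# illustration, not taken from the Radboud University study.

SCORES = {"open": 1.0, "partial": 0.5, "closed": 0.0}

# Ratings follow the excerpted table above (5 of the study's 14 dimensions).
MODELS = {
    "BloomZ":              ["open", "open", "open", "open", "open"],
    "OLMo":                ["open", "open", "open", "open", "partial"],
    "Mistral 7B-Instruct": ["partial", "closed", "open", "partial", "open"],
    "Orca 2":              ["closed", "closed", "partial", "partial", "partial"],
    "Gemma 7B-Instruct":   ["partial", "closed", "partial", "partial", "closed"],
    "Llama 3-Instruct":    ["closed", "closed", "partial", "closed", "partial"],
}

def openness_score(ratings):
    """Average per-dimension scores into an openness score between 0 and 1."""
    return sum(SCORES[r] for r in ratings) / len(ratings)

# Rank models from most to least open.
ranking = sorted(MODELS, key=lambda m: openness_score(MODELS[m]), reverse=True)
for model in ranking:
    print(f"{model}: {openness_score(MODELS[model]):.2f}")
```

Even this toy aggregation reproduces the qualitative picture reported in the study: BloomZ ranks at the top and Llama 3-Instruct at the bottom.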


The research team pointed out that 'many AI models that claim to be open or open source actually only publish their weights.' This means that outside researchers can access and use the trained models, but cannot inspect or customize them. It is also difficult to fully understand how a model was fine-tuned for a specific task, for example using human feedback. 'If they don't reveal more information, we can't call it open,' said Dingemanse.

The research team also found that roughly half of the models they analyzed did not reveal any details about the dataset beyond general descriptors.

A Google spokesperson said that Gemma is 'open,' not 'open source,' and that 'we don't necessarily embrace all existing open source concepts.' A Microsoft spokesperson said, 'We strive to be as accurate as possible about what is available and to what extent. We choose to make our deliverables, including models, code, tools, and datasets, publicly available because the development and research community plays a critical role in advancing AI technology.'



'This research cuts through a lot of the hype and nonsense surrounding the current open source debate,' said Abeba Birhane, a cognitive scientist at Trinity College Dublin. Dingemanse added, 'Openness is crucial for science in terms of reproducibility. If you can't reproduce it, it's hard to call it science. The only way researchers can innovate is to work with existing models, and to do that, you need enough information.'

in Software, Posted by log1r_ut