How open are so-called 'open source' AI models really?
In the rapidly developing field of AI, large-scale language models such as those from OpenAI have attracted widespread attention, and a growing number of developers now describe their own models as 'open source.'
Rethinking open source generative AI: open-washing and the EU AI Act | Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency
https://dl.acm.org/doi/10.1145/3630106.3659005
Not all 'open source' AI models are actually open: here's a ranking
The term 'open source' implies access to the source code and no restrictions on the use or distribution of the program. However, given the complexity of large-scale language models and the huge amount of data involved, making everything open source is by no means a simple feat. Furthermore, revealing the model in its entirety could expose it to commercial or legal risks, as well as increasing the risk of it being misused.
Still, simply by attaching the 'open source' label, companies developing large-scale language models can present themselves as transparent.
To explore this practice, known as 'open-washing,' Mark Dingemanse of Radboud University and his team evaluated 40 large-scale language models that claim to be 'open source' or 'open' against 14 parameters, including the availability of code and training data, public documentation, and ease of access to the models.
Here is an excerpt from the results of Dingemanse's research: although every model claims to be 'open,' BigScience's BloomZ is rated 'open' in every category, while Meta's Llama 3-Instruct is open in very few.
| Large-scale language model | Source code | Model data | Model weights | Preprint | API |
|---|---|---|---|---|---|
| BloomZ | Open | Open | Open | Open | Open |
| OLMo | Open | Open | Open | Open | Partially open |
| Mistral 7B-Instruct | Partially open | Closed | Open | Partially open | Open |
| Orca 2 | Closed | Closed | Partially open | Partially open | Partially open |
| Gemma 7B-Instruct | Partially open | Closed | Partially open | Partially open | Closed |
| Llama 3-Instruct | Closed | Closed | Partially open | Closed | Partially open |
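The full study rates each system on 14 dimensions; as a rough illustration only, the excerpt above can be turned into a ranking by assigning points per rating. The scoring below (1 for open, 0.5 for partially open, 0 for closed) is a hypothetical simplification for demonstration, not the authors' actual method:

```python
# Hypothetical illustration: aggregate per-dimension openness ratings
# into a simple score. The real study uses 14 dimensions and its own
# scoring scheme; the point values here are assumptions.

RATING_POINTS = {"open": 1.0, "partially open": 0.5, "closed": 0.0}

# Ratings from the excerpted table: source code, model data,
# model weights, preprint, API.
models = {
    "BloomZ":              ["open", "open", "open", "open", "open"],
    "OLMo":                ["open", "open", "open", "open", "partially open"],
    "Mistral 7B-Instruct": ["partially open", "closed", "open", "partially open", "open"],
    "Orca 2":              ["closed", "closed", "partially open", "partially open", "partially open"],
    "Gemma 7B-Instruct":   ["partially open", "closed", "partially open", "partially open", "closed"],
    "Llama 3-Instruct":    ["closed", "closed", "partially open", "closed", "partially open"],
}

def openness_score(ratings):
    """Mean of per-dimension points: 1 open, 0.5 partial, 0 closed."""
    return sum(RATING_POINTS[r] for r in ratings) / len(ratings)

# Rank models from most to least open under this toy scoring.
ranking = sorted(models, key=lambda m: openness_score(models[m]), reverse=True)
for name in ranking:
    print(f"{name}: {openness_score(models[name]):.1f}")
```

Under this toy scoring, BloomZ lands at the top and Llama 3-Instruct at the bottom, matching the qualitative contrast the researchers highlight.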
The research team pointed out that 'many AI models that claim to be open or open source actually only publish their weights.' This means outside researchers can access and use the trained models, but cannot inspect or customize them, and it is difficult to fully understand how a model was fine-tuned for a specific task, for example via human feedback. 'If more information is not disclosed, we can't call it open,' Dingemanse said.
The research team also found that roughly half of the models they analyzed did not reveal any details about the dataset beyond general descriptors.
A Google spokesperson said that Gemma is 'open' rather than 'open source,' and that the company does not necessarily embrace all existing open-source concepts. A Microsoft spokesperson said, 'We strive to be as accurate as possible about what is available and to what extent. We choose to make our deliverables, including models, code, tools, and datasets, publicly available because the development and research community plays a critical role in advancing AI technology.'
'This research cuts through a lot of the hype and nonsense surrounding the current open-source debate,' said Abeba Birhane, a cognitive scientist at Trinity College Dublin. Dingemanse added, 'Openness is crucial for science in terms of reproducibility. If you can't reproduce it, it's hard to call it science. The only way researchers can innovate is by building on existing models, and to do that, you need enough information.'
in Software, Posted by log1r_ut