OSAID version 1.0, which defines open source AI, has been released; Meta's Llama does not meet the open source AI standard
The Open Source AI Definition – 1.0 – Open Source Initiative
https://opensource.org/ai/open-source-ai-definition
We finally have an 'official' definition for open source AI | TechCrunch
https://techcrunch.com/2024/10/28/we-finally-have-an-official-definition-for-open-source-ai/
Open-source AI must reveal its training data, per new OSI definition - The Verge
https://www.theverge.com/2024/10/28/24281820/open-source-initiative-definition-artificial-intelligence-meta-llama
Open source has demonstrated that by removing barriers to learning, using, sharing, and improving software systems, it brings great benefits to everyone. OSI asserts that the benefits of open source are achieved by using licenses that comply with the Open Source Definition. In the case of AI, we need the same fundamental freedoms of open source to ensure that AI developers, adopters, and end users can enjoy the benefits of autonomy, transparency, smooth reuse, and collaborative improvement.
OSI has therefore worked with academia and industry to develop version 1.0 of the Open Source AI Definition (OSAID), which defines open source AI as an AI system made available under terms and in a way that grant the freedoms to:
- Use the system for any purpose, without having to ask for permission.
- Study how the system works and inspect its components.
- Modify the system for any purpose, including to change its output.
- Share the system for others to use for any purpose, with or without modifications.
OSI also specifies the following as the preferred form for making modifications to a machine learning system:
Data Information: Sufficiently detailed information about the data used to train the system so that a skilled person could build a substantially equivalent system. Data information shall be made available under OSI-approved terms.
(1) A complete description of all data used for training, including data that cannot be shared (if any), the origin, scope and characteristics of the data, how the data was obtained and selected, labeling procedures, and data processing and filtering methods.
(2) A list of all publicly available training data and where to find it.
(3) A list of all training data available from third parties and where it can be obtained (including for a fee).
Code: The complete source code used to train and run the System. The Code shall represent a complete specification of how data is processed and filtered, and how training is performed. The Code shall be provided under an OSI-approved license.
For example, if used, this should include the code used to process and filter the data, the code used for training including the arguments and settings used, validation and testing, supporting libraries such as tokenizers and hyperparameter search code, inference code, and the model architecture.
Parameters: Model parameters such as weights and other configuration settings. Parameters shall be made available under OSI-approved terms.
For example, this might include checkpoints of key intermediate stages of training, as well as the final optimizer state.
In a machine learning system, an AI model consists of the model architecture, the model parameters, and the inference code needed to run the model. AI weights are the set of learned parameters that overlay the model architecture to produce an output from a given input. The preferred form for making modifications described above also applies to each of these individual components.
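The three components above (data information, code, parameters) can be read as a checklist. The following is a minimal Python sketch of that reading, not OSI's actual validation process: the `Release` class, its field names, and the `osaid_gaps` function are hypothetical names introduced only for illustration, and the example release is invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Release:
    """Hypothetical summary of what a model release makes available."""
    has_data_information: bool    # detailed description of the training data
    has_osi_licensed_code: bool   # training and inference code under an OSI-approved license
    has_open_parameters: bool     # weights and settings under OSI-approved terms
    restricts_use: bool = False   # e.g. a revenue cap or a non-commercial clause


def osaid_gaps(release: Release) -> List[str]:
    """Return the checklist items this release would fail (empty list if none)."""
    gaps = []
    if not release.has_data_information:
        gaps.append("data information")
    if not release.has_osi_licensed_code:
        gaps.append("code")
    if not release.has_open_parameters:
        gaps.append("parameters")
    if release.restricts_use:
        gaps.append("use restrictions")
    return gaps


# Invented example: weights are downloadable, but only under restricted terms,
# with no training-data details and no training code.
weights_only = Release(
    has_data_information=False,
    has_osi_licensed_code=False,
    has_open_parameters=False,  # restricted terms are assumed not to be OSI-approved
    restricts_use=True,
)
print(osaid_gaps(weights_only))
```

A release for which `osaid_gaps` returns an empty list satisfies this simplified checklist. Whether a real system conforms to OSAID depends on the full text of the definition, not on these four flags.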
OSI defines AI systems and machine learning as follows:
AI System: A machine-based system that, for an explicit or implicit purpose, infers from the inputs it receives how to generate outputs such as predictions, content, recommendations, or decisions that affect physical or virtual environments. Different AI systems have different levels of autonomy and adaptability once deployed.
Machine Learning: A set of techniques that enable machines to improve their performance and generate models, usually automatically, by exposing them to training data, which helps them identify patterns and regularities without explicit instructions from humans. The process of using machine learning techniques to improve the performance of a system is called 'training'.
'Our big motivation is to get policymakers and AI developers on the same page,' OSI Executive Vice President Stefano Maffulli told tech media TechCrunch. 'Regulators are already paying attention to this space. We actively reached out to a range of stakeholders and communities, not just tech industry regulars. We also tried to reach out to the organizations that talk most frequently with regulators to get early feedback.'
Under OSAID, an AI model counts as open source only if enough information about its design is provided for a person to 'substantially' recreate it, and relevant details about the training data (such as its source, how it was processed, and how it was obtained and licensed) are also disclosed. On that basis, Maffulli argued that AI models such as Meta's Llama cannot be called open source.
OSI is also in discussions with AI giants about the use of the term 'open source' in AI. While Google and Microsoft have agreed not to use the term 'open source' for AI models that are not fully open, Meta has not agreed.
Additionally, the models of Stability AI, which has long touted them as open source, do not qualify as open source under OSAID, because its license requires any company with revenues of more than $1 million to obtain an enterprise license. French AI startup Mistral's license also prohibits the commercial use of certain models and outputs, so it does not conform to OSAID either.
In fact, a study conducted in August 2023 by researchers at the AI Now Institute and Carnegie Mellon University found that many AI models calling themselves 'open source AI' are not actually open. Another study reached a similar conclusion.
How open are 'open source' AI models actually? - GIGAZINE
Meta does not agree with OSAID version 1.0, despite participating in the drafting process. A Meta spokesperson argued that Llama's license adequately acts as a guardrail against harmful use. They also cited regulations such as California's Training Transparency Act, saying the company is 'taking a cautious approach' to sharing details about its models, including details about training data.
A bill to require disclosure of AI model training data will be introduced - GIGAZINE
Organizations supporting OSAID version 1.0 include Mozilla, Intel, Stanford University, Bloomberg, Digital Public Goods Alliance, EleutherAI, Common Crawl, SUSE, LLM360, Free Software unit, and Open Source Group Japan.
in Software, Posted by logu_ii