Meta's fundamental AI research team publishes multiple research results, sharing AI models, datasets, and more



Meta's Fundamental AI Research team (Meta FAIR) has released several new research results in accordance with the principles of open science. Meta FAIR's publications include a new multimodal AI model, a dataset it helped develop, a music generation AI, and an audio watermarking technology.

Sharing new research, models, and datasets from Meta FAIR

https://ai.meta.com/blog/meta-fair-research-new-releases/



Meta FAIR's mission is to advance the state of the art of AI through open research that benefits everyone, and for more than 10 years the team has pursued that goal by deepening fundamental understanding across AI-related fields. As innovation in the field continues to accelerate, Meta noted, 'we believe that collaboration with the global AI community will become more important than ever.' By sharing its research with the community under an open science approach, Meta FAIR said, 'we can stay true to our goal of building AI systems that work for everyone and bring the world closer together.'

Based on these principles, Meta FAIR has now shared its six latest research results.

◆1: Meta Chameleon
Meta Chameleon is a multimodal large language model (LLM) that can take any combination of text and images as input and output any combination of text and images, using a single unified architecture for both encoding and decoding. Whereas most existing image generation AI relies on diffusion models, Meta Chameleon tokenizes both text and images, enabling a more integrated approach and making the model easier to design, maintain, and scale. It can, for example, generate creative captions for images, or combine text prompts and images to produce entirely new images.
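The core idea is that images are quantized into discrete tokens and placed in the same sequence as text tokens, so one autoregressive transformer handles both modalities. The snippet below is a minimal sketch of that interleaving, assuming hypothetical vocabulary sizes, special tokens, and embedding dimensions; it is not Chameleon's actual implementation.

```python
# Minimal sketch of Chameleon-style "early fusion": text and quantized image
# codes share one vocabulary and one sequence, so a single autoregressive
# transformer can model any mix of the two.
# All sizes and special tokens below are illustrative assumptions.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000          # assumed text vocabulary size
IMAGE_CODEBOOK = 8_192       # assumed size of the image quantizer's codebook
SPECIALS = {"<boi>": TEXT_VOCAB + IMAGE_CODEBOOK,      # begin-of-image marker
            "<eoi>": TEXT_VOCAB + IMAGE_CODEBOOK + 1}  # end-of-image marker
UNIFIED_VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK + len(SPECIALS)

def interleave(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Build one token sequence from text ids and quantized image codes.

    Image codes are shifted past the text vocabulary so the two modalities
    occupy disjoint id ranges in a shared embedding table.
    """
    image_ids = image_codes + TEXT_VOCAB
    boi = torch.tensor([SPECIALS["<boi>"]])
    eoi = torch.tensor([SPECIALS["<eoi>"]])
    return torch.cat([text_ids, boi, image_ids, eoi])

# One embedding table and one decoder-only transformer would then be trained
# with the usual next-token objective over the unified sequence.
embed = nn.Embedding(UNIFIED_VOCAB, 512)

text_ids = torch.randint(0, TEXT_VOCAB, (12,))          # e.g. a tokenized caption
image_codes = torch.randint(0, IMAGE_CODEBOOK, (256,))  # e.g. a 16x16 grid of codes
sequence = interleave(text_ids, image_codes)
hidden = embed(sequence)  # (len(sequence), 512), fed to a standard transformer
```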



With Meta Chameleon, you can generate text from text, images from images, images from text, text from images, text from combined text-and-image input, text-image combinations from text, text-image combinations from images, and text-image combinations from combined text-and-image input.



Meta Chameleon itself was presented in a paper published in May 2024, and key components of the Meta Chameleon 7B and 34B models were made available to researchers on June 18, 2024 local time. The two released models are designed with safety in mind and support multimodal input with text-only output, for research use.

Meta states that while it is taking steps to develop these models responsibly, it recognizes that risks remain, and therefore, at the time of writing, it is not releasing the Meta Chameleon image generation model.

◆2: Multi-token prediction
Most LLMs are trained on the simple objective of predicting the next word. This approach is simple and scalable, but it is also inefficient: for an LLM to reach the language fluency of a child, it requires far more text than a child needs to learn the language.

Meta therefore proposed an approach to building better LLMs using a technique called multi-token prediction. A language model trained with multi-token prediction predicts several future words at once, which improves the model's capabilities and training efficiency and also enables faster generation. In the spirit of open science, Meta released pre-trained models for code completion under a non-commercial, research-only license.
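One way to realize such an objective is to attach several output heads to a shared trunk, each predicting the token at a different future offset, and sum their losses. The sketch below illustrates that idea under assumed layer sizes and independent linear heads; it is not the exact architecture from Meta's paper.

```python
# Sketch of a multi-token prediction objective: a shared trunk produces one
# hidden state per position, and n heads predict the tokens at offsets
# +1, +2, ..., +n. Sizes and head design are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHead(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, n_future: int):
        super().__init__()
        self.n_future = n_future
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def loss(self, hidden: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        """hidden: (batch, seq, d_model) trunk output; targets: (batch, seq) token ids."""
        total = 0.0
        seq_len = targets.size(1)
        for k, head in enumerate(self.heads, start=1):
            # Position t predicts the token at position t + k.
            logits = head(hidden[:, : seq_len - k])   # (batch, seq-k, vocab)
            labels = targets[:, k:]                    # (batch, seq-k)
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
            )
        return total / self.n_future

head = MultiTokenHead(vocab_size=32_000, d_model=512, n_future=4)
hidden = torch.randn(2, 128, 512)             # stand-in for trunk output
targets = torch.randint(0, 32_000, (2, 128))  # token ids of the same batch
print(head.loss(hidden, targets))
```

At inference time only the next-token head is strictly needed; the extra heads are what make the additional speedups possible (e.g. via speculative decoding of several tokens at once).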

◆3: Meta JASCO
With the advent of generative AI, tools capable of generating music files from text have emerged. While existing AI models such as MusicGen rely primarily on text input for music generation, Meta JASCO (Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation) can accept various additional conditioning inputs, such as specific chords or beats, giving finer control over the generated music. Specifically, it applies an information bottleneck layer in combination with temporal blurring to extract only the relevant information from each control signal. This lets the same text-to-music model incorporate both symbolic and audio-based conditions, as sketched below.
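"Temporal blurring" here amounts to coarsening a time-aligned conditioning signal (such as an extracted chord or beat track) so that only its broad temporal structure reaches the generator. The following is a rough sketch of that idea using simple average pooling as the blur; the window size and feature shapes are assumptions, not JASCO's actual layers.

```python
# Rough sketch of temporal blurring as an information bottleneck on a
# conditioning signal: features are averaged over coarse time windows and
# stretched back to the original length, so the generator sees where chords
# or beats fall without fine-grained detail. Shapes are illustrative.
import torch
import torch.nn.functional as F

def temporal_blur(cond: torch.Tensor, window: int = 8) -> torch.Tensor:
    """cond: (batch, channels, frames) time-aligned conditioning features."""
    pooled = F.avg_pool1d(cond, kernel_size=window, stride=window)
    # Upsample back to the original frame rate so the blurred signal can be
    # combined with the generator's inputs frame by frame.
    return F.interpolate(pooled, size=cond.shape[-1], mode="nearest")

chords = torch.randn(1, 12, 500)   # e.g. a chroma-like chord representation
blurred = temporal_blur(chords)    # same shape, but temporally coarsened
```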

Meta has published the JASCO research paper, and the inference code will be released under the MIT license as part of Meta's sound generation tool, AudioCraft, in late 2024.

◆4: AudioSeal
Generative AI tools encourage people to share their creations with friends, family, and social media followers. That's why Meta said, 'As with all AI innovation, it is our role to ensure that these tools are used responsibly.' To that end, Meta has announced 'AudioSeal.' AudioSeal is a watermarking technology designed for speech-generating AI tools that can accurately identify AI-generated segments within longer audio clips.

AudioSeal departs from traditional audio watermarking techniques by focusing on detecting AI-generated content rather than on steganography. Unlike traditional methods that rely on complex decoding algorithms, AudioSeal's localized detection approach allows faster and more efficient detection, making detection up to 485 times faster than traditional methods and suitable for large-scale, real-time applications. AudioSeal also achieves state-of-the-art performance among audio watermarking techniques in terms of robustness and imperceptibility.
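Localized detection means the detector scores short spans of audio rather than decoding one hidden message from the whole clip, so watermarked (AI-generated) segments inside a longer recording can be located directly. The sketch below shows only that post-processing step on per-sample scores; the detector itself is represented by synthetic placeholder probabilities, not AudioSeal's real API.

```python
# Sketch of localized watermark detection: given per-sample probabilities
# that audio is watermarked (as a localized detector would produce),
# threshold and merge them into time segments flagged as AI-generated.
# The probabilities below are synthetic placeholders, not real detector output.
import torch

def flag_segments(probs: torch.Tensor, sample_rate: int,
                  threshold: float = 0.5) -> list[tuple[float, float]]:
    """probs: (samples,) per-sample watermark probabilities in [0, 1]."""
    flagged = probs > threshold
    segments, start = [], None
    for i, on in enumerate(flagged.tolist()):
        if on and start is None:
            start = i                      # segment opens
        elif not on and start is not None:
            segments.append((start / sample_rate, i / sample_rate))
            start = None                   # segment closes
    if start is not None:
        segments.append((start / sample_rate, len(probs) / sample_rate))
    return segments

# Example: 3 s of audio at 16 kHz where only the middle second is "watermarked".
probs = torch.zeros(48_000)
probs[16_000:32_000] = 0.95
print(flag_segments(probs, 16_000))   # -> [(1.0, 2.0)]
```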



AudioSeal has been released under a commercial license, and Meta says it is already used to watermark the voice samples generated by its foundational translation model SeamlessM4T v2 and its audio generation model Audiobox.

◆5: Partnership to support the release of the PRISM dataset
Obtaining feedback from a diverse range of people is important for improving LLMs, but the research community still has open questions about the methods, domains, and objectives of the feedback process. Working with external partners to address these questions, Meta supported the release of the PRISM dataset, which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries. Meta advised the external partners on compiling the dataset, which centers subjective and multicultural perspectives by focusing conversations on interpersonal and cross-culturally contentious topics.

Meta demonstrates the usefulness of the PRISM dataset through three case studies on dialogue diversity, preference diversity, and welfare outcomes, and points out that it matters which humans set the alignment criteria. Meta says it hopes the PRISM dataset will serve as a resource for the AI community, encouraging broader participation in AI development and a more inclusive approach to technology design.
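One simple way to use a dataset like this is to check whether ratings of the same model responses differ systematically across demographic groups. The toy snippet below sketches that kind of aggregation; the column names and values are illustrative assumptions and do not reflect PRISM's actual schema or findings.

```python
# Toy sketch of a preference-diversity check: group ratings of the same
# responses by participant region and compare averages. Column names and
# values are illustrative assumptions, not the actual PRISM schema.
import pandas as pd

ratings = pd.DataFrame({
    "response_id": [1, 1, 1, 2, 2, 2],
    "region":      ["Europe", "Africa", "Asia", "Europe", "Africa", "Asia"],
    "score":       [4, 2, 5, 3, 5, 2],   # e.g. 1-5 preference ratings
})

# Mean rating per response per region; a large spread across regions for the
# same response indicates preferences that depend on who is asked.
by_region = ratings.pivot_table(index="response_id", columns="region", values="score")
print(by_region)
print("cross-region spread:", (by_region.max(axis=1) - by_region.min(axis=1)).to_dict())
```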

◆6: Measuring and improving geographic differences in text-to-image generation systems
Reflecting the geographic and cultural diversity of the world is crucial for AI models that can generate images from text. Improving these models requires new tools to help researchers better understand where existing models fall short.

Meta developed an automated metric called 'DIG In' to assess potential geographic disparities in text-to-image AI models, and also conducted a large-scale annotation study to understand how people in different regions perceive geographic representation differently. Over 65,000 annotations and more than 20 survey responses per example were collected, covering attractiveness, similarity, consistency, and common recommendations, to improve both automatic and human evaluation of text-to-image AI models.

The study found that people rely on specific components within an image, rather than the image as a whole, when judging geographical representation in generated images. As part of Meta FAIR's collaborative approach, Meta is also guiding a team of graduate students at the University of Massachusetts Amherst in a follow-up evaluation that decomposes the automated metrics deployed so far into foreground concepts and background representations.

Based on insights from DIG In, Meta also explored ways to improve the diversity of outputs from text-to-image generation models. The result is a guidance method called Contextualized Vendi Score guidance, which expands on Meta's previously published feedback guidance work and uses inference-time intervention to steer state-of-the-art image generation models toward greater representational diversity in the generated examples while maintaining or improving image quality and prompt consistency.
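The Vendi Score underlying this guidance is a reference-free diversity measure: the exponential of the entropy of the eigenvalues of a normalized similarity matrix over samples, which behaves like an "effective number" of distinct items. The sketch below computes it over generic embeddings with cosine similarity, which is an illustrative assumption; the contextualized guidance additionally conditions on context and intervenes during sampling, which is not shown here.

```python
# Sketch of the Vendi Score, the diversity measure that Contextualized Vendi
# Score guidance builds on: the exponential of the entropy of the eigenvalues
# of a normalized similarity matrix over generated samples. Cosine similarity
# over generic embeddings is an illustrative assumption.
import torch
import torch.nn.functional as F

def vendi_score(embeddings: torch.Tensor) -> float:
    """embeddings: (n, d), one embedding per generated image."""
    x = F.normalize(embeddings, dim=1)
    k = x @ x.T                                   # cosine similarity kernel, diag == 1
    eigvals = torch.linalg.eigvalsh(k / k.shape[0]).clamp(min=0)
    entropy = -(eigvals * torch.log(eigvals.clamp(min=1e-12))).sum()
    return float(torch.exp(entropy))

# n identical samples score ~1; n fully distinct samples score close to n.
same = torch.randn(1, 128).repeat(8, 1)
diverse = torch.randn(8, 128)
print(vendi_score(same), vendi_score(diverse))    # ~1.0 vs close to 8.0
```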
