NVIDIA and other research teams release 'Sana,' an AI model that can automatically generate images with a resolution of up to 4096 x 4096 within seconds



A research team from NVIDIA, the Massachusetts Institute of Technology (MIT), and Tsinghua University has announced 'Sana,' an image generation AI that can generate images at resolutions of up to 4096 x 4096 within a few seconds.

[2410.10629] SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

https://arxiv.org/abs/2410.10629

Sana
https://nvlabs.github.io/Sana/



Below is an example of an image actually created with Sana. With the prompt 'astronaut in a jungle, cold color palette, muted colors, detailed, 8k', you can generate an image like this.



Below is the image generated by the prompt 'a cyberpunk cat with a neon sign that says "SANA"'.



When I entered the prompt 'portrait photo of a girl, photograph, highly detailed face, depth of field,' a realistic image of a person was generated.



According to the Sana development team, conventional autoencoders compress images by a factor of only 8, whereas Sana trains an autoencoder that can compress images by a factor of up to 32. This effectively reduces the number of latent tokens, which makes it possible to train on and generate ultra-high-resolution images such as 4K efficiently.
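To make the token saving concrete, the arithmetic can be sketched roughly as follows. This is an illustrative calculation rather than code from the Sana project: the function name `latent_tokens` and the `patch_size` parameter are hypothetical, and the sketch assumes one token per latent position unless a patch size is given.

```python
def latent_tokens(height: int, width: int, compression: int, patch_size: int = 1) -> int:
    """Count tokens after the autoencoder downsamples each side by `compression`.

    `patch_size` is a hypothetical knob: it groups patch_size x patch_size
    latent positions into one token (1 means one token per latent position).
    """
    stride = compression * patch_size
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by compression * patch_size")
    return (height // stride) * (width // stride)


# A 4096 x 4096 image under the two compression factors named in the article.
tokens_8x = latent_tokens(4096, 4096, compression=8)    # 512 * 512 latent positions
tokens_32x = latent_tokens(4096, 4096, compression=32)  # 128 * 128 latent positions
reduction = tokens_8x // tokens_32x

print(tokens_8x, tokens_32x, reduction)
```

Under these assumptions, raising the compression factor from 8 to 32 shrinks each side of the latent grid by 4, so the token count drops by a factor of 16.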

In addition, Sana uses the decoder-only language model Gemma as its text encoder to improve prompt understanding and reasoning. Unlike the conventionally used T5, Gemma has excellent text understanding, so it can improve image-text alignment while the training instability it brings is dealt with. A sampler called 'Flow-DPM-Solver' is also introduced, which cuts the number of sampling steps from the 28-50 needed by 'Flow-Euler-Solver' to 14-20. The team additionally uses efficient caption labeling and selection.



As a result of these efforts, Sana is competitive with the latest high-performance image generation AIs such as Flux, but can generate images more than 100 times faster. According to the development team, SANA-0.6B, which has 600 million parameters, can be deployed on a GPU-equipped laptop with 16GB of memory and takes less than a second to generate a 1024 x 1024 image.

Below is a graph comparing image generation times for Sana. SANA-1.6B, which has 1.6 billion parameters, can generate a 1024 x 1024 image in 1.2 seconds and a 4096 x 4096 image in 15.9 seconds. SANA-0.6B can generate a 1024 x 1024 image in 0.9 seconds and a 4096 x 4096 image in just 9.6 seconds.
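The quoted generation times can be put on a common footing with a short calculation. The sketch below uses only the four latency figures stated in this article; the helper names `megapixels_per_second` and `slowdown` are made up for illustration, and no hardware details are assumed beyond what the team reports.

```python
# Latency figures quoted in the article, in seconds per image.
latency_s = {
    ("SANA-0.6B", 1024): 0.9,
    ("SANA-0.6B", 4096): 9.6,
    ("SANA-1.6B", 1024): 1.2,
    ("SANA-1.6B", 4096): 15.9,
}


def megapixels_per_second(resolution: int, seconds: float) -> float:
    """Pixels generated per second for a square image, in millions."""
    return resolution * resolution / 1e6 / seconds


def slowdown(model: str) -> float:
    """How many times longer a 4096 x 4096 image takes than a 1024 x 1024 one."""
    return latency_s[(model, 4096)] / latency_s[(model, 1024)]


# A 4096 x 4096 image has 16 times as many pixels as a 1024 x 1024 image.
pixel_ratio = (4096 * 4096) // (1024 * 1024)

for (model, res), sec in latency_s.items():
    print(f"{model} {res}x{res}: {megapixels_per_second(res, sec):.2f} MP/s")

for model in ("SANA-0.6B", "SANA-1.6B"):
    print(f"{model}: {slowdown(model):.2f}x slower for {pixel_ratio}x the pixels")
```

By these figures, going from 1024 x 1024 to 4096 x 4096 multiplies the pixel count by 16 but the generation time by only about 10.7 for SANA-0.6B and about 13.3 for SANA-1.6B.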



Below is a table comparing the performance of Sana with various image generation AIs. Each Sana model is reported to have higher throughput than the other image generation AIs.



At the time of writing, Sana's source code is scheduled to be released soon.

in Software, Posted by log1r_ut