"FlexGen", a generation engine that can run large language models such as GPT-3 even on a single GPU, appears
Processing large language models such as GPT-3 is computationally and memory intensive, and typically requires multiple high-end AI accelerators. A generation engine called "FlexGen" has been released that can execute this processing even on a single GPU with limited memory capacity, such as an NVIDIA Tesla T4 with 16 GB of memory or an NVIDIA GeForce RTX 3090 with 24 GB.
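
The memory shortfall is easy to quantify: at fp16 precision every parameter occupies 2 bytes, so for the larger models the weights alone exceed any single GPU named above. A back-of-the-envelope check in Python (weights only; the KV cache and activations need memory on top of this):

    # fp16 footprint of model weights vs. single-GPU memory capacity.
    BYTES_PER_PARAM_FP16 = 2
    models = {"OPT-6.7B": 6.7e9, "OPT-30B": 30e9, "OPT-175B": 175e9}
    gpus = {"Tesla T4": 16, "RTX 3090": 24}  # capacity in GB

    for name, params in models.items():
        weight_gb = params * BYTES_PER_PARAM_FP16 / 1e9
        fits = [g for g, cap in gpus.items() if cap >= weight_gb]
        print(f"{name}: ~{weight_gb:.0f} GB of fp16 weights; "
              f"fits on: {fits if fits else 'no single GPU above'}")
    # OPT-175B needs ~350 GB for weights alone, hence offloading to RAM/SSD.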

GitHub - Ying1123/FlexGen: Running large language models like OPT-175B/GPT-3 on a single GPU. Up to 100x faster than other offloading systems.
https://github.com/Ying1123/FlexGen#readme

FlexGen is an engine created to bring the resource requirements of large language model inference down to a single GPU and to run flexibly on a variety of hardware. When running the language model OPT-175B, it is up to 100x faster than other offloading-based systems.
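
FlexGen makes this possible by offloading data that does not fit in GPU memory to CPU RAM and disk and streaming it back as needed. The sketch below illustrates the general just-in-time offloading idea in PyTorch; it is not FlexGen's implementation, and the toy model shapes are invented for the example. FlexGen's actual policy is finer-grained, choosing what fraction of the weights, activations, and KV cache to place on GPU, CPU, and disk.

    import torch

    # Illustrative only: keep each layer's weights in CPU RAM ("offloaded")
    # and copy them onto the GPU just in time for that layer's forward pass.
    layers = [torch.nn.Linear(4096, 4096) for _ in range(8)]  # toy "model"
    device = "cuda" if torch.cuda.is_available() else "cpu"

    def forward_with_offloading(x):
        x = x.to(device)
        for layer in layers:
            layer.to(device)   # load this layer's weights onto the GPU
            x = layer(x)       # compute on the GPU
            layer.to("cpu")    # evict the weights to free GPU memory
        return x

    out = forward_with_offloading(torch.randn(4, 4096))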

The benchmark results are as follows. The numbers are generation throughput in tokens per second. Testing used a T4 (16 GB) instance on GCP with 208 GB of DRAM and a 1.5 TB SSD.

System                       OPT-6.7B   OPT-30B   OPT-175B
Hugging Face Accelerate         25.12      0.62       0.01
DeepSpeed ZeRO-Inference         9.28      0.60       0.01
Petals                              -         -       0.05
FlexGen                         25.26      7.32       0.69
FlexGen (with compression)      29.12      8.38       1.12
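
The "with compression" row reflects an option that, according to the repository, quantizes both the weights and the attention (KV) cache to 4 bits with negligible accuracy loss, shrinking the data that has to be moved between GPU, CPU, and disk. Below is a minimal sketch of group-wise quantization of that kind; the group size of 64 and the round-trip helpers are assumptions for illustration, not FlexGen's actual kernels.

    import numpy as np

    # Group-wise 4-bit quantization sketch: each group of 64 values gets its
    # own (min, scale). Real kernels would pack two 4-bit values per byte.
    def quantize_groupwise(x, bits=4, group=64):
        x = x.reshape(-1, group)
        lo = x.min(axis=1, keepdims=True)
        hi = x.max(axis=1, keepdims=True)
        scale = (hi - lo) / (2**bits - 1)
        q = np.round((x - lo) / np.maximum(scale, 1e-12)).astype(np.uint8)
        return q, lo, scale

    def dequantize_groupwise(q, lo, scale):
        return q * scale + lo

    w = np.random.randn(4096, 64).astype(np.float32)
    q, lo, scale = quantize_groupwise(w)
    w_hat = dequantize_groupwise(q, lo, scale).reshape(w.shape)
    print("max abs error:", np.abs(w - w_hat).max())  # small at 4 bits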
The figure below shows the latency-throughput trade-off for the offload-based systems on OPT-175B (left) and OPT-30B (right): blue is FlexGen with compression, orange is FlexGen, green is DeepSpeed, and red is Accelerate. On OPT-175B, FlexGen reaches a new Pareto-optimal frontier, with a maximum throughput 100 times that of the other two systems, which could not push throughput any higher because they ran out of memory.
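
The high-throughput end of that frontier comes chiefly from very large effective batch sizes: the fixed cost of streaming weights in from CPU RAM and SSD is amortized over many sequences, at the price of higher per-batch latency. A toy model of the trade-off (all constants invented for illustration):

    # Toy latency/throughput trade-off under batching (numbers invented).
    fixed_cost_s = 100.0  # per-pass cost of streaming weights in, batch-independent
    per_seq_s = 0.5       # incremental cost per sequence in the batch
    gen_len = 32          # tokens generated per sequence

    for batch in (1, 8, 64, 512):
        latency = fixed_cost_s + per_seq_s * batch
        throughput = batch * gen_len / latency  # tokens per second
        print(f"batch={batch:4d}  latency={latency:7.1f}s  "
              f"throughput={throughput:6.2f} tok/s")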
Future plans for FlexGen include Apple M1/M2 support, Google Colaboratory support, and latency optimization for chatbot applications.
