"FlexGen", a generation engine that can run large language models such as GPT-3 even on a single GPU, appears
Processing large language models such as GPT-3 is computationally and memory intensive, and typically requires multiple high-end AI accelerators. A generation engine called "FlexGen" has been released that can execute this processing even on a single GPU with limited memory capacity, such as an NVIDIA Tesla T4 with 16 GB of memory or an NVIDIA GeForce RTX 3090 with 24 GB.
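
The memory shortfall is easy to quantify: at fp16 precision every parameter occupies 2 bytes, so for the larger models the weights alone exceed any single GPU named above. A back-of-the-envelope check in Python (weights only; the KV cache and activations need memory on top of this):

    # fp16 footprint of model weights vs. single-GPU memory capacity.
    BYTES_PER_PARAM_FP16 = 2
    models = {"OPT-6.7B": 6.7e9, "OPT-30B": 30e9, "OPT-175B": 175e9}
    gpus = {"Tesla T4": 16, "RTX 3090": 24}  # capacity in GB

    for name, params in models.items():
        weight_gb = params * BYTES_PER_PARAM_FP16 / 1e9
        fits = [g for g, cap in gpus.items() if cap >= weight_gb]
        print(f"{name}: ~{weight_gb:.0f} GB of fp16 weights; "
              f"fits on: {fits if fits else 'no single GPU above'}")
    # OPT-175B needs ~350 GB for weights alone, hence offloading to RAM/SSD.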

GitHub - Ying1123/FlexGen: Running large language models like OPT-175B/GPT-3 on a single GPU. Up to 100x faster than other offloading systems.
https://github.com/Ying1123/FlexGen#readme

FlexGen is an engine created to bring the resource requirements of large language model inference down to a single GPU and to run flexibly on a variety of hardware. When running the language model OPT-175B, it is up to 100x faster than other offloading-based systems.
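
FlexGen makes this possible by offloading data that does not fit in GPU memory to CPU RAM and disk and streaming it back as needed. The sketch below illustrates the general just-in-time offloading idea in PyTorch; it is not FlexGen's implementation, and the toy model shapes are invented for the example. FlexGen's actual policy is finer-grained, choosing what fraction of the weights, activations, and KV cache to place on GPU, CPU, and disk.

    import torch

    # Illustrative only: keep each layer's weights in CPU RAM ("offloaded")
    # and copy them onto the GPU just in time for that layer's forward pass.
    layers = [torch.nn.Linear(4096, 4096) for _ in range(8)]  # toy "model"
    device = "cuda" if torch.cuda.is_available() else "cpu"

    def forward_with_offloading(x):
        x = x.to(device)
        for layer in layers:
            layer.to(device)   # load this layer's weights onto the GPU
            x = layer(x)       # compute on the GPU
            layer.to("cpu")    # evict the weights to free GPU memory
        return x

    out = forward_with_offloading(torch.randn(4, 4096))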

The benchmark results are as follows. The numbers are generation throughput in tokens per second. Testing used a T4 (16 GB) instance on GCP with 208 GB of DRAM and a 1.5 TB SSD.

System                       OPT-6.7B   OPT-30B   OPT-175B
Hugging Face Accelerate         25.12      0.62       0.01
DeepSpeed ZeRO-Inference         9.28      0.60       0.01
Petals                              -         -       0.05
FlexGen                         25.26      7.32       0.69
FlexGen (with compression)      29.12      8.38       1.12
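
The "with compression" row reflects an option that, according to the repository, quantizes both the weights and the attention (KV) cache to 4 bits with negligible accuracy loss, shrinking the data that has to be moved between GPU, CPU, and disk. Below is a minimal sketch of group-wise quantization of that kind; the group size of 64 and the round-trip helpers are assumptions for illustration, not FlexGen's actual kernels.

    import numpy as np

    # Group-wise 4-bit quantization sketch: each group of 64 values gets its
    # own (min, scale). Real kernels would pack two 4-bit values per byte.
    def quantize_groupwise(x, bits=4, group=64):
        x = x.reshape(-1, group)
        lo = x.min(axis=1, keepdims=True)
        hi = x.max(axis=1, keepdims=True)
        scale = (hi - lo) / (2**bits - 1)
        q = np.round((x - lo) / np.maximum(scale, 1e-12)).astype(np.uint8)
        return q, lo, scale

    def dequantize_groupwise(q, lo, scale):
        return q * scale + lo

    w = np.random.randn(4096, 64).astype(np.float32)
    q, lo, scale = quantize_groupwise(w)
    w_hat = dequantize_groupwise(q, lo, scale).reshape(w.shape)
    print("max abs error:", np.abs(w - w_hat).max())  # small at 4 bits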
The figure below shows the latency-throughput trade-off for the offload-based systems on OPT-175B (left) and OPT-30B (right): blue is FlexGen with compression, orange is FlexGen, green is DeepSpeed, and red is Accelerate. On OPT-175B, FlexGen reaches a new Pareto-optimal frontier, with a maximum throughput 100 times that of the other two systems, which could not push throughput any higher because they ran out of memory.
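
The high-throughput end of that frontier comes chiefly from very large effective batch sizes: the fixed cost of streaming weights in from CPU RAM and SSD is amortized over many sequences, at the price of higher per-batch latency. A toy model of the trade-off (all constants invented for illustration):

    # Toy latency/throughput trade-off under batching (numbers invented).
    fixed_cost_s = 100.0  # per-pass cost of streaming weights in, batch-independent
    per_seq_s = 0.5       # incremental cost per sequence in the batch
    gen_len = 32          # tokens generated per sequence

    for batch in (1, 8, 64, 512):
        latency = fixed_cost_s + per_seq_s * batch
        throughput = batch * gen_len / latency  # tokens per second
        print(f"batch={batch:4d}  latency={latency:7.1f}s  "
              f"throughput={throughput:6.2f} tok/s")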
Future plans for FlexGen include Apple M1/M2 support, Google Colaboratory support, and latency optimization for chatbot applications.
