'Punica' serves large language models fine-tuned with LoRA at low cost and efficiently



Low-Rank Adaptation (LoRA) is a technique for adapting AI models to new tasks with only a small amount of additional computation. A research team from the University of Washington and Duke University has released 'Punica', a system that serves LoRA fine-tuned versions of pre-trained large language models at low cost and efficiently.

GitHub - punica-ai/punica: Serving multiple LoRA finetuned LLM as one

https://github.com/punica-ai/punica



[2310.18547] Punica: Multi-Tenant LoRA Serving
https://arxiv.org/abs/2310.18547



When companies and developers want a large language model suited to a specific task, they need to fine-tune a pretrained large language model. However, large language models have billions of parameters, and directly fine-tuning all of them requires enormous computational cost.
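LoRA sidesteps this cost by freezing the pretrained weights and training only a small low-rank correction. The sketch below is a minimal PyTorch illustration of that idea, with an assumed hidden size of 4096 and rank 16; it is not code from the Punica project.

```python
# Minimal conceptual sketch of a LoRA update: instead of updating the full
# weight W, LoRA trains two small matrices A and B so that the effective
# weight becomes W + (alpha / r) * B @ A. Shapes and rank are assumptions.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # the pretrained weight stays frozen
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))        # d_out x r
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the low-rank correction applied to x.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(4096, 4096))
y = layer(torch.randn(2, 4096))
print(y.shape)  # torch.Size([2, 4096])
```

Only A and B are trained, so the number of trainable parameters drops from billions to a few million per adapted layer.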

Punica reportedly includes a CUDA kernel design that batches requests for different LoRA models together. This means only one copy of the large pre-trained base model needs to be kept when serving multiple different LoRA models, which the team says significantly improves GPU cost efficiency in both memory and computation.
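The snippet below illustrates that batching idea in plain PyTorch under assumed shapes: a single shared copy of a base weight serves a batch in which each request selects its own LoRA adapter. Punica fuses the per-request adapter math into a custom batched CUDA kernel; the simple Python loop here only stands in for that kernel and is not Punica's actual API.

```python
# Conceptual sketch: one copy of the base weight, many LoRA adapters.
import torch

d_in, d_out, r, n_adapters = 4096, 4096, 16, 3
W = torch.randn(d_out, d_in) * 0.02            # shared pretrained weight (one copy)
A = torch.randn(n_adapters, r, d_in) * 0.01    # per-adapter LoRA matrices
B = torch.zeros(n_adapters, d_out, r)

x = torch.randn(5, d_in)                       # a batch of 5 requests
adapter_ids = torch.tensor([0, 2, 2, 1, 0])    # each request picks an adapter

base_out = x @ W.T                             # computed once for the whole batch
lora_out = torch.zeros_like(base_out)
for i, a in enumerate(adapter_ids.tolist()):   # per-request low-rank correction
    lora_out[i] = (x[i] @ A[a].T) @ B[a].T

y = base_out + lora_out
print(y.shape)  # torch.Size([5, 4096])
```

Because the expensive part (the base projection) is shared across all requests, adding more adapters mainly adds the cheap low-rank corrections.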

A large pre-trained language model can consume on the order of 100GB of storage, whereas a LoRA fine-tuned version adds only a few gigabytes of storage and memory overhead. With Punica, multiple LoRA fine-tuned models can reportedly be run for roughly the cost of running a single model.
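As a rough back-of-the-envelope illustration of why the adapter overhead is so small, the snippet below counts LoRA parameters under assumed dimensions; the hidden size, layer count, rank, and number of adapted matrices are illustrative guesses, not figures from the Punica paper.

```python
# Back-of-the-envelope estimate of LoRA adapter size (all values assumed).
hidden = 8192          # hidden size of a 70B-class model
layers = 80            # number of transformer layers
rank = 16              # LoRA rank
targets = 4            # e.g. q/k/v/o projections adapted per layer
bytes_per_param = 2    # fp16

# Each adapted matrix adds A (rank x hidden) and B (hidden x rank) parameters.
adapter_params = layers * targets * 2 * rank * hidden
print(f"adapter params: {adapter_params / 1e6:.0f}M")                    # ~84M
print(f"adapter size:   {adapter_params * bytes_per_param / 1e9:.2f} GB")  # ~0.17 GB
```

Even with larger ranks or more adapted matrices, an adapter stays orders of magnitude smaller than the base model, which is why many adapters can share one copy of the base weights.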

The bar graph below compares text generation throughput for HuggingFace Transformers (blue), Microsoft DeepSpeed (orange), NVIDIA FasterTransformer (green), vLLM (red), and Punica (purple) on the 7B, 13B, and 70B versions of Llama 2, Meta's large language model. The research team says Punica achieves up to 12 times the throughput of the other systems.



Punica is not the only research effort applying LoRA to large language models. On November 6, 2023, a paper on 'S-LoRA', a system that, like Punica, serves LoRA fine-tuned large language models on GPUs at low cost and efficiently, was published on arXiv, a repository for non-peer-reviewed papers.

[2311.03285] S-LoRA: Serving Thousands of Concurrent LoRA Adapters
https://arxiv.org/abs/2311.03285



In addition, Google has reportedly already anticipated the emergence of technology that uses LoRA to handle large language models cheaply and efficiently. An internal Google document pointed out that with the advent of LoRA, the performance of open source large language models will improve, and that Google's own AI models could even be overtaken by open source models.

Google's internal AI-related document has leaked, stating 'Open source is a threat,' 'The winner is Meta,' and 'OpenAI does not matter' - GIGAZINE

in Software, Posted by log1i_yk