No GPU required / 'GGML', a library that runs chat AI on an ordinary home PC with 16 GB of memory, is under intense development, and a demo running speech-recognition AI on a Raspberry Pi has already appeared

The chat AI behind services such as ChatGPT and Bard generally demands extremely high machine specs, such as tens to hundreds of gigabytes of VRAM, not only for training but also for inference. To change this situation, development of 'GGML', a library that runs chat AI without a GPU, is underway.

ggerganov/ggml: Tensor library for machine learning

The features of GGML are as follows.

・Written in C
・16-bit float support
・Integer quantization support (4-bit, 5-bit, 8-bit)
・Automatic differentiation
・Built-in optimization algorithms such as ADAM and L-BFGS
・Optimized for Apple Silicon
・Utilizes AVX and AVX2 intrinsics on x86 architectures
・Web support via WebAssembly and WASM SIMD
・No third-party dependencies
・Zero memory allocations during runtime
・Guided language output support

The GGML code is publicly available on GitHub, though the page notes in bold: 'Please note that this project is under development.'

Although GGML is still a work in progress, several demos have already been published. For example, the video below shows commands being entered by voice using GGML and whisper.cpp. On its own that would be unremarkable, but what is striking is that it is running on a Raspberry Pi, an ultra-lightweight computer.

There is also a demo that simultaneously runs four instances of a model combining the 13-billion-parameter (13B) LLaMA with Whisper on an Apple M1 Pro, demonstrating how lightweight the library is.

Running the 7-billion-parameter (7B) LLaMA model on an Apple M2 Max processes about 40 tokens per second, which is quite fast.

Other test results are as follows.

Model                        Machine                   Result
Whisper Small Encoder        M1 Pro, 7 CPU threads     600 ms/run
Whisper Small Encoder        M1 Pro, ANE via Core ML   200 ms/run
7B LLaMA (4-bit quantized)   M1 Pro, 8 CPU threads     43 ms/token
13B LLaMA (4-bit quantized)  M1 Pro, 8 CPU threads     73 ms/token
7B LLaMA (4-bit quantized)   M2 Max GPU                25 ms/token
13B LLaMA (4-bit quantized)  M2 Max GPU                42 ms/token

GGML is provided under the MIT license and is free for anyone to use. The development team is also actively recruiting contributors, stating that 'writing code and improving the library is the greatest support.'

The editorial team also tried to get GGML working, but when following the steps in the documentation, an error occurred during the build and we could not proceed.

in Software, Posted by log1d_ts