'GGML,' a library that runs chat AI on an ordinary home PC with 16GB of memory and no GPU, is under development, and a demo of speech recognition AI running on a Raspberry Pi has already been released.



Chat AIs such as ChatGPT and Bard typically demand extremely high machine specifications, on the order of tens to hundreds of GB of VRAM, not only for training but also for inference. To change this situation, development is underway on 'GGML,' a library that lets chat AI run without a GPU.
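As a rough illustration of where those memory figures come from, the following sketch estimates the size of a model's weights at different bit widths. It is a back-of-the-envelope calculation, not something taken from GGML: the function name `weight_memory_gb` is hypothetical, and the estimate ignores activations, caches, and any per-block metadata that a real quantization format adds.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Parameter counts used in the article: 7 billion and 13 billion.
for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    for bits in (16, 8, 4):
        size = weight_memory_gb(n_params, bits)
        print(f"{name} at {bits}-bit: ~{size:.1f} GB")
```

Under these assumptions, a 7B model needs about 14 GB for 16-bit weights but only about 3.5 GB at 4 bits, which is why quantized models can fit in the memory of an ordinary PC.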

ggml.ai
http://ggml.ai/



ggerganov/ggml: Tensor library for machine learning
https://github.com/ggerganov/ggml


The features of GGML are as follows:

- Written in C
- 16-bit float support
- Integer quantization support (4-bit, 5-bit, and 8-bit)
- Automatic differentiation
- Built-in optimization algorithms 'ADAM' and 'L-BFGS'
- Optimized for Apple Silicon
- Uses AVX and AVX2 on x86 architectures
- Web support via WebAssembly and WASM SIMD
- No third-party dependencies
- Zero memory allocations during runtime
- Guided language output support
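To give a feel for what 4-bit integer quantization means, here is a minimal sketch of block-wise symmetric quantization. This is a generic textbook scheme and not GGML's actual storage format or API: the block size of 32, the function names, and the choice of a single scale per block are all assumptions made for illustration.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumed block size, chosen only for this illustration


def quantize_block(values: List[float]) -> Tuple[float, List[int]]:
    """Map floats to signed 4-bit integers in [-8, 7] plus one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, quantized


def dequantize_block(scale: float, quantized: List[int]) -> List[float]:
    """Recover approximate floats from the 4-bit integers and the scale."""
    return [scale * q for q in quantized]


# Example block of weights spanning roughly -1.6 to 1.5.
weights = [0.1 * (i - 16) for i in range(BLOCK_SIZE)]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f}, max abs error={max_error:.4f}")
```

Each weight is stored in 4 bits instead of 16 or 32, at the cost of a rounding error of at most half the scale per value. The trade-off between memory footprint and precision is what makes the lower bit widths attractive on machines without a GPU.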

The GGML code is available on GitHub, but with a warning in bold: 'Please note that this project is under development.'



Although GGML is still under development, several demos have been released. For example, one demo video shows commands being entered by voice using GGML and whisper.cpp. That may sound like an ordinary example, but what makes it remarkable is that it runs on a Raspberry Pi, a small, low-power single-board computer.



There is also a demo in which four instances of the 13-billion-parameter (13B) LLaMA model run simultaneously alongside Whisper on a single Apple M1 Pro, which shows just how lightweight the library is.



On an Apple M2 Max, the 7-billion-parameter (7B) LLaMA model generates 40 tokens per second.



Other test results are as follows:

Model                           Machine                    Result
Whisper Small Encoder           M1 Pro, 7 CPU threads      600 ms / run
Whisper Small Encoder           M1 Pro, ANE via Core ML    200 ms / run
7B LLaMA (4-bit quantization)   M1 Pro, 8 CPU threads      43 ms / token
13B LLaMA (4-bit quantization)  M1 Pro, 8 CPU threads      73 ms / token
7B LLaMA (4-bit quantization)   M2 Max GPU                 25 ms / token
13B LLaMA (4-bit quantization)  M2 Max GPU                 42 ms / token
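The LLaMA results are given in milliseconds per token, while throughput is often quoted in tokens per second. The short sketch below converts between the two using the figures from the table. The helper name `tokens_per_second` is hypothetical, and the conversion assumes a steady generation rate.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token


# LLaMA rows from the results table: (description, ms per token).
results = [
    ("7B LLaMA, M1 Pro, 8 CPU threads", 43),
    ("13B LLaMA, M1 Pro, 8 CPU threads", 73),
    ("7B LLaMA, M2 Max GPU", 25),
    ("13B LLaMA, M2 Max GPU", 42),
]

for label, ms in results:
    print(f"{label}: {tokens_per_second(ms):.1f} tokens/s")
```

The 25 ms / token measured for the 7B model on the M2 Max GPU works out to 40 tokens per second, matching the figure quoted for that demo.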


GGML is released under the MIT license and is free for anyone to use. The development team is also looking for contributors, stating that 'writing code and improving the library is the greatest support.'

Our editorial department also tried to verify that it actually works, but when we followed the instructions in the documentation, an error occurred during the build and we were unable to proceed.

in AI, Software, Posted by log1d_ts