'llm.c', a tool for training large language models in pure C without PyTorch or Python, is released
Training of large language models (LLMs), which have become the centerpiece of modern AI, is usually done with PyTorch and Python, but a tool called 'llm.c' has been released that implements such training in plain C. It has not yet been optimized and is therefore not as fast as the conventional stack, but it implements GPT-2 training in roughly 1,000 lines of clean code.
GitHub - karpathy/llm.c: LLM training in simple, raw C/CUDA
https://github.com/karpathy/llm.c
The author, Andrej Karpathy, was a founding member of OpenAI and previously served as Director of AI at Tesla.
By using llm.c, it is possible to train large language models without the 245MB PyTorch dependency or the 107MB cPython dependency. When Karpathy implemented training for 'GPT-2', which can be considered the ancestor of today's large language models, on a CPU, the entire training loop fit in roughly 1,000 lines of code with minimal dependencies.
Have you ever wanted to train LLMs in pure C without 245MB of PyTorch and 107MB of cPython? No? Well now you can! With llm.c: https://t.co/w2wkY0Ho5m
—Andrej Karpathy (@karpathy) April 8, 2024
To start, implements GPT-2 training on CPU/fp32 in only ~1,000 lines of clean code. It compiles and runs instantly, and exactly…
The actual code is available on GitHub. All required memory is allocated once at the start, and memory usage does not fluctuate during training. Because the code uses no Python libraries, the forward and backward passes of every individual layer are implemented by hand.
You can look at the raw training implementation here: https://t.co/ZiiCwYurMP
—Andrej Karpathy (@karpathy) April 8, 2024
You'll see that we allocate all the required memory a single time in the beginning in one large block of 1D memory. From there on during training, no memory gets created or destroyed, so we stay at… pic.twitter.com/S92d5dPcJZ
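To give a concrete feel for this single-allocation approach, here is a minimal sketch in C (not the actual llm.c code; the struct fields and tensor sizes are hypothetical, chosen only for illustration): one large 1D float buffer is allocated up front, and each parameter tensor is just a fixed offset into it.

```c
#include <stdio.h>
#include <stdlib.h>

// Minimal sketch of the "one big 1D block" idea. All names and sizes
// here are illustrative assumptions, not taken from llm.c itself.
typedef struct {
    float *wte;   // token embedding table
    float *wpe;   // positional embedding table
    float *lnw;   // layernorm weight
    float *lnb;   // layernorm bias
} Params;

int main(void) {
    size_t vocab = 50257, channels = 768, maxT = 1024;   // GPT-2-like sizes
    size_t sizes[4] = { vocab * channels, maxT * channels, channels, channels };

    size_t total = 0;
    for (int i = 0; i < 4; i++) total += sizes[i];

    // One allocation at startup; no further malloc/free during training.
    float *block = calloc(total, sizeof(float));
    if (!block) { fprintf(stderr, "out of memory\n"); return 1; }

    // Carve the block into tensors by handing out offsets.
    Params p;
    float *cursor = block;
    p.wte = cursor; cursor += sizes[0];
    p.wpe = cursor; cursor += sizes[1];
    p.lnw = cursor; cursor += sizes[2];
    p.lnb = cursor; cursor += sizes[3];

    printf("allocated %zu floats (%.1f MB) in a single block\n",
           total, total * sizeof(float) / (1024.0 * 1024.0));
    free(block);
    return 0;
}
```

Because every tensor lives at a fixed offset in one block, the total memory footprint is known before training begins and never changes afterwards.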
Connecting the layers required writing code while making sure every pointer and tensor offset was placed correctly, which Karpathy describes as a very tedious and masochistic task.
Once you have all the layers, you just string it all together. Not gonna lie, this was quite tedious and masochistic to write because you have to make sure all the pointers and tensor offsets are correctly arranged.
—Andrej Karpathy (@karpathy) April 8, 2024
Left: we allocate a single 1D array of memory and then… pic.twitter.com/KLPz7udGni
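As an illustration of what writing a layer's forward pass "by hand" looks like, below is a simplified layernorm forward in C that walks raw float pointers with explicit offsets for each (batch, time) position. This is a sketch rather than the actual llm.c implementation, which additionally caches the mean and reciprocal standard deviation for use in the backward pass.

```c
#include <math.h>

// Hand-written layernorm forward over a (B, T, C) activation tensor stored
// as a flat float array. Simplified sketch for illustration only.
void layernorm_forward(float *out, const float *inp,
                       const float *weight, const float *bias,
                       int B, int T, int C) {
    const float eps = 1e-5f;
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // pointer arithmetic: locate this position's C-length vector
            const float *x = inp + (b * T + t) * C;
            float *o = out + (b * T + t) * C;

            // mean over the channel dimension
            float m = 0.0f;
            for (int c = 0; c < C; c++) m += x[c];
            m /= C;

            // variance over the channel dimension
            float v = 0.0f;
            for (int c = 0; c < C; c++) {
                float d = x[c] - m;
                v += d * d;
            }
            v /= C;

            // normalize, then scale and shift
            float rstd = 1.0f / sqrtf(v + eps);
            for (int c = 0; c < C; c++) {
                o[c] = (x[c] - m) * rstd * weight[c] + bias[c];
            }
        }
    }
}
```

Every layer in the network needs a pair of functions like this, one forward and one backward, all reading from and writing into the single pre-allocated block.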
At the time of writing, only CPU training code was available, but Karpathy said he was also working on a CUDA version. He expected that, once ported to CUDA and optimized, training could reach speeds comparable to PyTorch without the heavy dependencies.
Once you have the forward/backward, the rest of it (data loader, Adam update, etc) are mostly trivial.
—Andrej Karpathy (@karpathy) April 8, 2024
The real fun starts now though: I am now porting this to CUDA layer by layer so that it can be made efficient, perhaps even coming within reasonable fraction of PyTorch, but…
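For a sense of why the optimizer step is "mostly trivial" once the gradients exist, here is a hedged sketch of an AdamW-style update over flat parameter and gradient buffers in C. The function name, parameter names, and defaults are assumptions for illustration, not copied from llm.c.

```c
#include <stddef.h>
#include <math.h>

// AdamW-style update over flat parameter/gradient buffers.
// Illustrative sketch; names and hyperparameters are assumptions.
void adamw_update(float *params, const float *grads,
                  float *m, float *v,          // first/second moment buffers
                  size_t n, int step,
                  float lr, float beta1, float beta2,
                  float eps, float weight_decay) {
    for (size_t i = 0; i < n; i++) {
        float g = grads[i];
        // update biased first and second moment estimates
        m[i] = beta1 * m[i] + (1.0f - beta1) * g;
        v[i] = beta2 * v[i] + (1.0f - beta2) * g * g;
        // bias correction
        float m_hat = m[i] / (1.0f - powf(beta1, (float)step));
        float v_hat = v[i] / (1.0f - powf(beta2, (float)step));
        // decoupled weight decay plus the Adam step
        params[i] -= lr * (m_hat / (sqrtf(v_hat) + eps) + weight_decay * params[i]);
    }
}
```

Because the parameters, gradients, and optimizer state are all plain flat arrays, the whole update is a single loop with no framework machinery involved.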
Going forward, he plans to lower the precision from fp32 to fp16 and to support modern architectures such as Llama 2, Mistral, and Gemma. Karpathy also said that once the project reaches a more stable state, he plans to release a video building this code up in detail from scratch.
in Software, Posted by log1d_ts