Stanford University research group releases ThunderKittens, a domain-specific language for writing AI kernels on GPUs
A research team led by Professor Christopher Ré of Stanford University has released 'ThunderKittens,' a domain-specific language embedded in CUDA for writing high-performance AI kernels.
ThunderKittens: A Simple Embedded DSL for AI kernels · Hazy Research
https://hazyresearch.stanford.edu/blog/2024-05-12-quick-tk
GPUs Go Brrr · Hazy Research
https://hazyresearch.stanford.edu/blog/2024-05-12-tk
The research team targeted the NVIDIA H100 and set out to maximize its GPU utilization. The H100's Tensor cores deliver 989 TFLOPS of half-precision matrix multiplication, while all of its other compute capabilities combined amount to only about 60 TFLOPS. In other words, H100 utilization is almost entirely a question of how busy the Tensor cores are kept.
To keep the Tensor cores working on every GPU cycle, the research team focused on four areas: WGMMA instructions, address generation, shared memory, and occupancy.
・WGMMA instructions
The H100 adds a new family of instructions called 'warp group matrix multiply accumulate' (WGMMA). With WGMMA, a warp group of 128 threads on a streaming multiprocessor (SM) cooperatively launches asynchronous matrix multiplications whose operands are read directly from shared memory. According to the research team's microbenchmarks, without WGMMA, GPU utilization plateaus at about 63%.
The catch is that the data layouts WGMMA expects in shared memory are extremely convoluted, and the research team struggled to work them out because parts of NVIDIA's documentation were incorrect. Still, there was no way around it: forgoing WGMMA means giving up roughly 37% of the GPU's potential utilization.
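For context, the older, warp-synchronous Tensor core path that WGMMA supersedes can be used directly from CUDA through the nvcuda::wmma API available since Volta: a single warp of 32 threads multiplies 16x16 tiles and blocks until each result is ready. The sketch below is not ThunderKittens code and not the WGMMA path itself; it simply illustrates the synchronous style that, per the team's benchmarks, caps the H100 at roughly 63% utilization.

```cuda
// Minimal sketch of the pre-WGMMA, warp-synchronous Tensor core path.
// One warp of 32 threads cooperatively multiplies 16x16x16 tiles and
// waits for each multiply to complete before moving on.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Each warp computes one 16x16 output tile of C = A * B (K = 16).
__global__ void warp_mma_16x16(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);            // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // warp blocks here
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

Launched as, for example, warp_mma_16x16<<<1, 32>>>(dA, dB, dC) on 16x16 half-precision inputs in device memory (compiled for sm_70 or newer), every multiply stalls the warp until it finishes, which is exactly the serialization that the asynchronous warp-group instructions are designed to avoid.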
・Address generation
Because both the Tensor cores and the memory system of the H100 are so fast, simply generating the memory addresses needed to fetch data consumes a meaningful share of the chip's resources. The H100's Tensor Memory Accelerator (TMA), a dedicated hardware unit NVIDIA added for this purpose, lets a kernel describe a multidimensional tensor layout in global or shared memory and then fetch sub-tiles of that tensor asynchronously, which significantly reduces address generation costs.
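TMA itself is driven through tensor-map descriptors set up on the host, but the underlying pattern of issuing one asynchronous bulk copy into shared memory instead of many per-thread loads can be illustrated with the Ampere-era cooperative-groups memcpy_async API, which TMA generalizes to whole multidimensional tiles. The sketch below shows that general pattern only; it is not the H100-specific TMA path, and the kernel name is a placeholder.

```cuda
// Sketch of asynchronous global->shared staging using memcpy_async.
// TMA on the H100 extends this idea: a single descriptor describes a whole
// multidimensional tile, so threads no longer compute per-element addresses.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void staged_sum(const float* __restrict__ in, float* out) {
    extern __shared__ float tile[];                  // dynamic shared memory
    cg::thread_block block = cg::this_thread_block();

    // One bulk asynchronous copy for the whole block instead of
    // per-thread loads with individually generated addresses.
    cg::memcpy_async(block, tile, in + blockIdx.x * blockDim.x,
                     sizeof(float) * blockDim.x);
    cg::wait(block);                                 // wait for the copy to land

    float v = tile[threadIdx.x];                     // compute on staged data
    atomicAdd(out, v);
}
```

A launch such as staged_sum<<<grid, 256, 256 * sizeof(float)>>>(d_in, d_out) supplies the dynamic shared memory for the staged tile; the copy proceeds in the background and the block only synchronizes once the data is needed.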
・Shared memory
A single shared memory access has a relatively small latency of about 30 cycles, which has tended to be ignored because other parts of the pipeline were the bottleneck. When pushing for maximum utilization like this, however, even such small latencies start to matter.
The research team therefore minimized data movement between registers and shared memory as much as possible, and where movement was unavoidable they relied on the asynchronous WGMMA and TMA instructions to do it in the background.
・Occupancy
Occupancy is a measure of how many warps a GPU actually keeps resident and executing compared to the maximum number of warps it can schedule at once. The H100 relies on occupancy less than earlier GPUs, but a higher occupancy still tends to translate into better real-world performance, and hardware such as the A100 and RTX 4090 depends far more on synchronous instruction dispatch than the H100 does, so keeping occupancy high remains important there.
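Theoretical occupancy for a given kernel and block size can be checked directly with the CUDA runtime's occupancy API. The host-side sketch below shows the standard calculation; the kernel name my_kernel is just a placeholder, not anything from ThunderKittens.

```cuda
// Host-side sketch: query how many blocks of a kernel can be resident per SM
// and convert that into an occupancy percentage. "my_kernel" is a placeholder.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float* data) { data[threadIdx.x] *= 2.0f; }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int block_size = 256;
    int max_blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, my_kernel, block_size, /*dynamicSmemBytes=*/0);

    int active_warps = max_blocks_per_sm * block_size / prop.warpSize;
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Theoretical occupancy: %.1f%% (%d of %d warps per SM)\n",
           100.0 * active_warps / max_warps, active_warps, max_warps);
    return 0;
}
```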
To address these issues, the research team designed and released 'ThunderKittens' as a domain-specific language (DSL) to be embedded within CUDA.
The results of measuring the difference in computational throughput between attention kernels written in ThunderKittens and existing hand-optimized implementations show ThunderKittens matching or exceeding FlashAttention-2 on the H100.
In addition, a linear attention kernel written in ThunderKittens reaches approximately 215 TFLOPS, a significant speedup over conventional implementations.
The ThunderKittens code is available on GitHub under an open source license, so if you're interested, check it out.