A list of the AI inference performance of NVIDIA graphics cards and Apple chips, useful for deciding which graphics card or Mac to buy



Until recently, graphics cards were used mainly for 3D graphics workloads such as games, but in recent years more and more people are choosing a graphics card in order to run AI locally. I found a web page, 'GPU-Benchmarks-on-LLM-Inference', that summarizes the inference performance of the large language model 'LLaMA 3' on a large number of NVIDIA graphics cards and Apple chips, so I have summarized its contents here.

GitHub - XiongjieDai/GPU-Benchmarks-on-LLM-Inference: Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference?
https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

GPU-Benchmarks-on-LLM-Inference is a performance comparison page created by AI researcher Xiongjie Dai. It summarizes the number of tokens processed per second when running LLaMA 3 inference on various graphics cards and Apple chips. LLaMA 3 is run with 'llama.cpp', and performance is measured for four models: the 8B-parameter model, the quantized 8B-parameter model, the 70B-parameter model, and the quantized 70B-parameter model. In the tables below, 'F16' denotes the unquantized 16-bit model, 'Q4_K_M' denotes the quantized model, and all figures are tokens per second.
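
For reference, 'tokens per second' is the number of tokens produced divided by the elapsed wall-clock time. The following Python sketch illustrates that calculation only. The `generate` callable and the injectable `clock` are hypothetical stand-ins for illustration; they are not part of llama.cpp or of Dai's benchmark, whose exact measurement procedure is not described in this article.

```python
import time
from typing import Callable


def tokens_per_second(
    generate: Callable[[int], int],
    n_tokens: int,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Time one generation call and return its throughput in tokens/s.

    `generate` is a hypothetical stand-in for whatever runs the model:
    it is asked for `n_tokens` tokens and returns how many it produced.
    """
    start = clock()
    produced = generate(n_tokens)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return produced / elapsed
```

In practice `generate` would wrap a real inference call. Making the clock injectable is just a convenience so the function can be checked without running a model.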

Below are the results for gaming graphics cards, which are the easiest to obtain of the cards in the performance comparison table. The 'RTX 4090 24GB' showed the highest performance. Only three cards, the 'RTX 3090 24GB', 'RTX 4080 16GB', and 'RTX 4090 24GB', could run LLaMA 3 8B without quantization. LLaMA 3 70B, on the other hand, could not be run on any single card, even as a quantized model.

GPU               8B Q4_K_M      8B F16         70B Q4_K_M     70B F16
RTX 3070 8GB      70.94          Out of memory  Out of memory  Out of memory
RTX 3080 10GB     106.40         Out of memory  Out of memory  Out of memory
RTX 3080 Ti 12GB  106.71         Out of memory  Out of memory  Out of memory
RTX 3090 24GB     111.74         46.51          Out of memory  Out of memory
RTX 4070 Ti 12GB  82.21          Out of memory  Out of memory  Out of memory
RTX 4080 16GB     106.22         40.29          Out of memory  Out of memory
RTX 4090 24GB     127.74         54.34          Out of memory  Out of memory
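
The 'Out of memory' entries can be roughly anticipated from the size of the model weights. The following Python sketch is a back-of-the-envelope estimate, not the method used in the benchmark. It assumes nominal parameter counts of 8 billion and 70 billion, 16 bits per weight for F16, and roughly 4.85 bits per weight for Q4_K_M (an approximate figure chosen for illustration). It counts only the weights, ignores the context cache and other runtime overhead, treats each card's advertised GB as GiB, and pools the memory of multiple cards, so it gives a lower bound rather than a guarantee.

```python
GIB = 1024 ** 3

# Nominal parameter counts taken from the model names
# (assumption: the exact counts differ slightly).
PARAMS = {"8B": 8e9, "70B": 70e9}

# Bits stored per weight. 16 is exact for F16;
# 4.85 for Q4_K_M is a rough assumption.
BITS_PER_WEIGHT = {"F16": 16.0, "Q4_K_M": 4.85}


def weight_size_gib(model: str, fmt: str) -> float:
    """Approximate size of the weights alone, in GiB."""
    return PARAMS[model] * BITS_PER_WEIGHT[fmt] / 8 / GIB


def weights_fit(model: str, fmt: str, vram_gb: float, n_gpus: int = 1) -> bool:
    """True if the weights alone fit in the combined memory of the cards."""
    return weight_size_gib(model, fmt) <= vram_gb * n_gpus


if __name__ == "__main__":
    for model in PARAMS:
        for fmt in BITS_PER_WEIGHT:
            print(f"{model} {fmt}: ~{weight_size_gib(model, fmt):.1f} GiB")
```

Under these assumptions, 8B F16 comes to roughly 15 GiB and 70B Q4_K_M to roughly 40 GiB, which is consistent with the table above: the unquantized 8B model runs only on cards with 16GB or more, and the quantized 70B model does not fit on a single 24GB card.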


The results for machines equipped with multiple graphics cards are as follows. Adding cards resolves the memory shortage, but it does not increase the number of tokens processed per second. For models that already fit on a single card, throughput is in fact slightly lower.
GPU                8B Q4_K_M      8B F16         70B Q4_K_M     70B F16
RTX 3090 24GB x 2  108.07         47.15          16.29          Out of memory
RTX 3090 24GB x 4  104.94         46.40          16.89          Out of memory
RTX 3090 24GB x 6  101.07         45.55          16.93          5.82
RTX 4090 24GB x 2  122.56         53.27          19.06          Out of memory
RTX 4090 24GB x 4  117.61         52.69          18.83          Out of memory
RTX 4090 24GB x 8  116.13         52.12          18.76          6.45
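
To put a number on how little the extra cards change throughput, the short Python sketch below computes the percentage change in tokens per second between a single card and the largest multi-card configuration. The figures are the 8B Q4_K_M values copied from the tables in this article; the helper function name is made up for illustration.

```python
def relative_change_percent(single: float, multi: float) -> float:
    """Percentage change in tokens/s when going from one card to several."""
    return (multi / single - 1.0) * 100.0


# 8B Q4_K_M figures copied from the tables in this article.
rtx3090 = relative_change_percent(single=111.74, multi=101.07)  # x 1 vs x 6
rtx4090 = relative_change_percent(single=127.74, multi=116.13)  # x 1 vs x 8

print(f"RTX 3090 x 6 vs x 1: {rtx3090:+.1f}%")
print(f"RTX 4090 x 8 vs x 1: {rtx4090:+.1f}%")
```

Both results come out as a drop of roughly 9 to 10 percent, so for a model that already fits on one card, the additional cards add memory capacity rather than speed.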


The results for professional graphics cards are as follows:
GPU                    8B Q4_K_M      8B F16         70B Q4_K_M     70B F16
RTX 4000 Ada 20GB      58.59          20.85          Out of memory  Out of memory
RTX 4000 Ada 20GB x 4  56.14          20.58          7.33           Out of memory
RTX 5000 Ada 32GB      89.87          32.67          Out of memory  Out of memory
RTX 5000 Ada 32GB x 4  82.73          31.94          11.45          Out of memory
RTX A6000 48GB         102.22         40.25          14.58          Out of memory
RTX A6000 48GB x 4     93.73          38.87          14.32          4.74
RTX 6000 Ada 48GB      130.99         51.97          18.36          Out of memory
RTX 6000 Ada 48GB x 4  118.99         50.25          17.96          6.06


The inference performance of each LLaMA 3 model on GPUs designed for high-performance computing and AI processing is as follows:
GPU                 8B Q4_K_M      8B F16         70B Q4_K_M     70B F16
A40 48GB            88.95          33.95          12.08          Out of memory
A40 48GB x 4        83.79          33.28          11.91          3.98
L40S 48GB           113.60         43.42          15.31          Out of memory
L40S 48GB x 4       105.72         42.48          14.99          5.03
A100 PCIe 80GB      138.31         53.18          24.33          Out of memory
A100 PCIe 80GB x 4  117.30         51.54          22.68          7.38
A100 SXM 80GB       133.38         53.18          24.33          Out of memory
A100 SXM 80GB x 4   97.70          45.45          19.60          6.92
H100 PCIe 80GB      144.49         67.79          25.01          Out of memory
H100 PCIe 80GB x 4  118.14         62.90          26.20          9.63


The inference performance of Macs equipped with Apple chips is as follows. Comparing the M2 Ultra with the M3 Max shows the importance of memory capacity for LLM inference: only the M2 Ultra, with 192GB of memory, could run the unquantized 70B model.
GPU                         8B Q4_K_M      8B F16         70B Q4_K_M     70B F16
M1 7-Core GPU 8GB           9.72           Out of memory  Out of memory  Out of memory
M1 Max 32-Core GPU 64GB     34.49          18.43          4.09           Out of memory
M2 Ultra 76-Core GPU 192GB  76.28          36.25          12.13          4.71
M3 Max 40-Core GPU 64GB     50.74          22.39          7.53           Out of memory


Based on these results, Dai concludes: 'If you want to save money, buy an NVIDIA gaming graphics card; for business use, buy a professional graphics card; and if you want a power-efficient, quiet machine, buy a Mac.'

in Software, Hardware, Posted by log1o_hf