A roundup of the AI inference performance of NVIDIA graphics cards and Apple chips, useful when deciding which GPU or Mac to buy
Until recently, graphics cards were chosen mainly for 3D workloads such as gaming, but in recent years more and more buyers pick them specifically to 'run AI locally.' I found a webpage, 'GPU-Benchmarks-on-LLM-Inference', that summarizes the performance of a large number of NVIDIA graphics cards and Apple chips when running inference with the large language model LLaMA 3, so I have summarized its contents below.
GitHub - XiongjieDai/GPU-Benchmarks-on-LLM-Inference: Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference?
https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
GPU-Benchmarks-on-LLM-Inference is a performance comparison page created by AI researcher Xiongjie Dai that lists the number of tokens processed per second when LLaMA 3 inference runs on various graphics cards and Apple chips. LLaMA 3 is run with 'llama.cpp', and performance is measured for four configurations: an 8B-parameter model (F16), a quantized 8B model (Q4_K_M), a 70B-parameter model (F16), and a quantized 70B model (Q4_K_M).
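For reference, the repository produces these numbers with llama.cpp's bundled llama-bench tool. Below is a minimal sketch of driving it from Python; the model path is hypothetical, and while -p, -n, and -ngl are standard llama-bench options, the exact settings Dai used are documented in the repository.

```python
# Minimal sketch: running llama.cpp's llama-bench from Python and
# capturing its output. The model path is hypothetical; any GGUF
# build of LLaMA 3 (e.g. Q4_K_M or F16) can be substituted.
import subprocess

cmd = [
    "./llama-bench",
    "-m", "models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",  # hypothetical path
    "-p", "512",   # prompt-processing benchmark: 512 tokens
    "-n", "128",   # text-generation benchmark: 128 tokens
    "-ngl", "99",  # offload up to 99 layers to the GPU, i.e. all of them
]

result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(result.stdout)  # llama-bench prints a markdown table of tokens/second
```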
Below are the results for gaming graphics cards, the most readily available of the cards in the comparison table. The 'RTX 4090 24GB' delivered the highest performance, and only the three cards 'RTX 3090 24GB', 'RTX 4080 16GB', and 'RTX 4090 24GB' could run LLaMA 3 8B without quantization. LLaMA 3 70B, on the other hand, failed to run on any single gaming card, even in quantized form. All figures are tokens processed per second; a back-of-envelope size estimate explaining the out-of-memory pattern follows the table.
GPU | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
---|---|---|---|---|
RTX 3070 8GB | 70.94 | Out of memory | Out of memory | Out of memory |
RTX 3080 10GB | 106.40 | Out of memory | Out of memory | Out of memory |
RTX 3080 Ti 12GB | 106.71 | Out of memory | Out of memory | Out of memory |
RTX 3090 24GB | 111.74 | 46.51 | Out of memory | Out of memory |
RTX 4070 Ti 12GB | 82.21 | Out of memory | Out of memory | Out of memory |
RTX 4080 16GB | 106.22 | 40.29 | Out of memory | Out of memory |
RTX 4090 24GB | 127.74 | 54.34 | Out of memory | Out of memory |
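The out-of-memory pattern follows directly from the size of the model weights. A back-of-envelope estimate, assuming 2 bytes per parameter for F16 and roughly 4.5 bits effective per parameter for Q4_K_M (real usage adds KV cache and activation overhead on top):

```python
# Back-of-envelope weight sizes for the four benchmark configurations.
# Assumptions: F16 stores 2 bytes per parameter; Q4_K_M averages
# roughly 4.5 bits per parameter. KV cache and activations add more.
GIB = 1024**3

def weight_size_gib(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / GIB

for name, params, bits in [
    ("8B  Q4_K_M", 8, 4.5),
    ("8B  F16   ", 8, 16),
    ("70B Q4_K_M", 70, 4.5),
    ("70B F16   ", 70, 16),
]:
    print(f"{name}: ~{weight_size_gib(params, bits):.1f} GiB of weights")

# Roughly 4.2, 14.9, 36.7 and 130.4 GiB: 8B F16 just squeezes onto a
# 16GB card but not a 12GB one, and 70B exceeds every single gaming
# card even at Q4_K_M, matching the table above.
```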
The processing performance of machines equipped with multiple graphics cards looks like this. Adding cards pools their VRAM and resolves the out-of-memory failures, but the number of tokens processed per second hardly changes, and in fact declines slightly as cards are added; a rough capacity estimate follows the table.
GPU | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
---|---|---|---|---|
RTX 3090 24GB x 2 | 108.07 | 47.15 | 16.29 | Out of memory |
RTX 3090 24GB x 4 | 104.94 | 46.40 | 16.89 | Out of memory |
RTX 3090 24GB x 6 | 101.07 | 45.55 | 16.93 | 5.82 |
RTX 4090 24GB x 2 | 122.56 | 53.27 | 19.06 | Out of memory |
RTX 4090 24GB x 4 | 117.61 | 52.69 | 18.83 | Out of memory |
RTX 4090 24GB x 8 | 116.13 | 52.12 | 18.76 | 6.45 |
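The same arithmetic shows which multi-card configurations can fit each model: llama.cpp splits the model's layers across the cards, so pooled VRAM is what determines fit, while each generated token still passes through the layer chain serially, which is consistent with the flat tokens-per-second numbers. A sketch under the same size assumptions as above:

```python
# Sketch: minimum number of identical cards whose pooled VRAM holds the
# model weights. Ignores KV cache and per-card overhead, so it is a
# lower bound; weight sizes come from the estimate above.
import math

def min_cards(weights_gib: float, vram_per_card_gib: float) -> int:
    return math.ceil(weights_gib / vram_per_card_gib)

print(min_cards(36.7, 24))   # 70B Q4_K_M on 24GB cards -> 2, as in the table
print(min_cards(130.4, 24))  # 70B F16 on 24GB cards -> 6, matching the 6x RTX 3090 row
```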
The processing performance of workstation graphics cards aimed at professional and compute use is as follows:
GPU | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
---|---|---|---|---|
RTX 4000 Ada 20GB | 58.59 | 20.85 | Out of memory | Out of memory |
RTX 4000 Ada 20GB x 4 | 56.14 | 20.58 | 7.33 | Out of memory |
RTX 5000 Ada 32GB | 89.87 | 32.67 | Out of memory | Out of memory |
RTX 5000 Ada 32GB x 4 | 82.73 | 31.94 | 11.45 | Out of memory |
RTX A6000 48GB | 102.22 | 40.25 | 14.58 | Out of memory |
RTX A6000 48GB x 4 | 93.73 | 38.87 | 14.32 | 4.74 |
RTX 6000 Ada 48GB | 130.99 | 51.97 | 18.36 | Out of memory |
RTX 6000 Ada 48GB x 4 | 118.99 | 50.25 | 17.96 | 6.06 |
And the inference performance of each LLaMA 3 model on data center GPUs designed for high performance computing and AI processing is as follows:
GPU | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
---|---|---|---|---|
A40 48GB | 88.95 | 33.95 | 12.08 | Out of memory |
A40 48GB x 4 | 83.79 | 33.28 | 11.91 | 3.98 |
L40S 48GB | 113.60 | 43.42 | 15.31 | Out of memory |
L40S 48GB x 4 | 105.72 | 42.48 | 14.99 | 5.03 |
A100 PCIe 80GB | 138.31 | 53.18 | 24.33 | Out of memory |
A100 PCIe 80GB x 4 | 117.30 | 51.54 | 22.68 | 7.38 |
A100 SXM 80GB | 133.38 | 53.18 | 24.33 | Out of memory |
A100 SXM 80GB x 4 | 97.70 | 45.45 | 19.60 | 6.92 |
H100 PCIe 80GB | 144.49 | 67.79 | 25.01 | Out of memory |
H100 PCIe 80GB x 4 | 118.14 | 62.90 | 26.20 | 9.63 |
Also, the inference performance of Macs equipped with Apple silicon looks like this. The gap between the M2 Ultra and the M3 Max shows how much memory capacity matters for running LLMs: the M2 Ultra's 192GB of unified memory is the only configuration here that can hold the unquantized 70B model, while the 64GB M3 Max tops out at the quantized version (a rough fit check follows the table).
GPU | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
---|---|---|---|---|
M1 7-Core GPU 8GB | 9.72 | Out of memory | Out of memory | Out of memory |
M1 Max 32-Core GPU 64GB | 34.49 | 18.43 | 4.09 | Out of memory |
M2 Ultra 76-Core GPU 192GB | 76.28 | 36.25 | 12.13 | 4.71 |
M3 Max 40-Core GPU 64GB | 50.74 | 22.39 | 7.53 | Out of memory |
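The same weight-size arithmetic applies to the Macs, with one caveat: the GPU shares the machine's unified memory, and macOS only lets Metal use part of it (llama.cpp logs this as recommendedMaxWorkingSetSize, typically around 75% of RAM). A rough fit check under that assumption:

```python
# Sketch: does a model fit in the GPU-usable share of unified memory?
# The ~75% usable fraction is an assumption based on the typical macOS
# Metal working-set limit; the real value varies by machine and OS.
def fits(weights_gib: float, unified_gb: float, usable: float = 0.75) -> bool:
    return weights_gib <= unified_gb * usable

print(fits(130.4, 192))  # 70B F16 on M2 Ultra 192GB -> True (~144GB usable)
print(fits(130.4, 64))   # 70B F16 on M3 Max 64GB    -> False
print(fits(36.7, 64))    # 70B Q4_K_M on M3 Max 64GB -> True, as in the table
```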
Based on the performance comparison, Dai concludes: 'If you want to save money, buy an NVIDIA gaming graphics card; for business use, buy a professional graphics card; and if you want power-saving, quiet performance, buy a Mac.'