Summary of the basic knowledge needed to speed up numerical calculations and AI training using GPUs/CUDA, with code examples



GPUs have far more cores than CPUs and can run a huge number of operations in parallel. IT engineer Rijul Rajesh has summarized on his blog the knowledge needed to take advantage of that GPU performance.

GPU Survival Toolkit for the AI age: The bare minimum every developer must know

https://journal.hexmos.com/gpu-survival-toolkit/



The Transformer architecture used in modern AI models relies on parallel processing to achieve its performance, so anyone involved in developing such AI needs an understanding of parallel processing. CPUs are generally designed to maximize single-threaded, sequential performance and are not well suited to efficiently distributing and executing the large number of parallel calculations that complex AI models require.

While CPUs have a few large, powerful cores, GPUs have many small cores, and the more cores there are, the more work can be done in parallel at the same time. This makes GPUs well suited to tasks that depend on parallel processing, such as rendering and complex mathematical calculations. By choosing the CPU or GPU appropriately for each task, processing can be sped up dramatically; in the example shown in the figure below, a task that took 4.07 seconds on a CPU finished in 0.0046 seconds.



In addition, neural network training involves a huge number of matrix operations, and matrix operations parallelize well, so training can be accelerated by using a GPU with many cores.
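As a rough illustration of why matrix operations map so well onto a GPU, the sketch below (not from the original article; the kernel name and layout are illustrative assumptions) shows a naive CUDA matrix multiplication in which each thread independently computes one element of the result, so all output elements can be calculated in parallel.
[code]// Minimal sketch (assumed example): C = A x B for N x N matrices.
// Each thread computes exactly one element of C.
__global__ void matMul(const float* A, const float* B, float* C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}[/code]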



The differences between CPUs and GPUs can be summarized as follows.

◆CPU
CPUs are designed with an emphasis on sequential processing and are good at executing a single stream of instructions in order. They typically have between 2 and 16 cores, and although each core can process its own instruction stream independently, they cannot perform large-scale parallel processing. CPUs are optimized for tasks that require high single-threaded performance, such as:

・General-purpose computing
・System operation
・Processing complex algorithms with conditional branching

◆GPU
GPUs have up to several thousand cores, grouped into units called streaming multiprocessors. They are designed for parallel workloads, breaking a task down into smaller subtasks that run simultaneously, and can efficiently handle tasks such as:

・Graphics rendering
・Performing complex mathematical calculations
・Execution of parallelizable algorithms

Now let's actually use the GPU with 'CUDA', a parallel computing platform and programming model developed by Nvidia. First, access the Nvidia website and enter the information for the machine on which you want to install CUDA. This time, 'Linux', 'x86_64', 'Ubuntu', '22.04', and 'deb (network)' were selected.



Once you have filled in everything up to 'Installer Type', the installation commands will appear at the bottom. For the 'Base Installer', simply run the listed commands in order.



For the 'Driver Installer', you can choose one of two options. This time, because we will use GDS, which exchanges data directly with GPU memory, we install the driver with the command listed under 'the open kernel module flavor'.



Add the following to '.bashrc' in your home directory so that the CUDA tools are on your path.
[code]export PATH="/usr/local/cuda-12.3/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH"[/code]
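After reloading the shell, you can confirm the toolkit is visible, assuming CUDA 12.3 was installed at the path above:
[code]source ~/.bashrc
nvcc --version[/code]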



Also, install GDS using the command below. After installation, restart your system for the changes to take effect.
[code]sudo apt-get install nvidia-gds[/code]



◆Commands useful when programming using GPU
・'lspci | grep VGA'
You can identify and list GPUs in your system.



・'nvidia-smi'
nvidia-smi stands for 'NVIDIA System Management Interface' and provides detailed information about the NVIDIA GPU in your system, including usage, temperature, and memory usage.



・'sudo lshw -C display'
Provides detailed information about the display controllers in your system, including the graphics card.



・'inxi -G'
Provides information about the graphics subsystem, including details about the GPU and display.



・'sudo hwinfo --gfxcard'
You can get detailed information about your system's graphics card.



◆Try using the CUDA framework
In order to demonstrate GPU parallelization, we will perform the task of adding the elements of the two arrays below.

Array A: [1,2,3,4,5,6]
Array B: [7,8,9,10,11,12]

Adding each element gives the following:

Array C: [1+7,2+8,3+9,4+10,5+11,6+12]=[8,10,12,14,16,18]

Implemented on a CPU, the code looks like this: it walks through the arrays element by element and adds them in order.
[code]#include <stdio.h>

int a[] = {1,2,3,4,5,6};
int b[] = {7,8,9,10,11,12};
int c[6];

int main() {
    int N = 6; // number of elements

    // Add the arrays element by element, one iteration at a time
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    for (int i = 0; i < N; i++) {
        printf("c[%d] = %d\n", i, c[i]);
    }

    return 0;
}[/code]



To keep performance high even when handling a large quantity of numbers, the code below parallelizes the work with CUDA and performs all of the additions at the same time. '__global__' indicates that the function is a kernel that runs on the GPU, and 'threadIdx.x' is the index of the thread.
[code]// Kernel: each thread handles the addition for one array index
__global__ void vectorAdd(int* a, int* b, int* c)
{
    int i = threadIdx.x; // index of this thread within the block
    c[i] = a[i] + b[i];
}[/code]



Once the kernel function is ready, we build the body of the main function. First, declare the variables.
[code]int main(){
    int a[] = {1,2,3,4,5,6};
    int b[] = {7,8,9,10,11,12};
    int c[sizeof(a) / sizeof(int)] = {0};

    // Pointers to the corresponding arrays in GPU (device) memory
    int* cudaA = 0;
    int* cudaB = 0;
    int* cudaC = 0;[/code]



Next, use 'cudaMalloc' to allocate memory within the GPU.
[code]cudaMalloc(&cudaA,sizeof(a));
cudaMalloc(&cudaB,sizeof(b));
cudaMalloc(&cudaC,sizeof(c));[/code]
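The article's snippets omit error handling. As a hedged aside, CUDA API calls such as cudaMalloc and cudaMemcpy return a cudaError_t, so a check like the sketch below (an illustration, not part of the original code) can catch failures early.
[code]// Hedged sketch (not from the article): checking a CUDA call's return value
cudaError_t err = cudaMalloc(&cudaA, sizeof(a));
if (err != cudaSuccess) {
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}[/code]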



Copy the contents of arrays a and b to the GPU using 'cudaMemcpy'.
[code]cudaMemcpy(cudaA, a, sizeof(a), cudaMemcpyHostToDevice);
cudaMemcpy(cudaB, b, sizeof(b), cudaMemcpyHostToDevice);[/code]



Then launch the kernel function 'vectorAdd' created earlier. 'sizeof(a) / sizeof(a[0])' means 'the size of the whole array divided by the size of one element', so vectorAdd is launched with as many threads as there are elements.
[code]vectorAdd <<<1, sizeof(a) / sizeof(a[0])>>> (cudaA, cudaB, cudaC);[/code]



Copy the calculation results back from the GPU.
[code]cudaMemcpy(c, cudaC, sizeof(c), cudaMemcpyDeviceToHost);[/code]



Finally, output the calculation results in the usual way.
[code]    for (int i = 0; i < sizeof(c) / sizeof(int); i++)
    {
        printf("c[%d] = %d\n", i, c[i]);
    }

    // Release the GPU memory allocated with cudaMalloc
    cudaFree(cudaA);
    cudaFree(cudaB);
    cudaFree(cudaC);

    return 0;
}[/code]



Save the above code as GPU.cu and compile it with the 'nvcc' command. Running the resulting executable produced the output shown below. The entire code is publicly available on GitHub.
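As a concrete example of that compile-and-run step (the output file name 'GPU' is just an illustrative choice):
[code]nvcc GPU.cu -o GPU
./GPU[/code]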



In addition, Rajesh's blog post includes an example of a Mandelbrot set generation task that takes 4.07 seconds on a CPU being completed in 0.0046 seconds on a GPU, as well as an example of training a neural network. Check it out if you are interested.

in Software, Hardware, Posted by log1d_ts