An expert explains 10 years of evolution in the machine learning software behind ChatGPT and Stable Diffusion



In recent years, AI systems such as the conversational AI ChatGPT and the image-generation AI Stable Diffusion have appeared and are attracting public attention.

Dylan Patel, an analyst specializing in semiconductors, explains how the machine learning software used to develop these AIs has changed over the past 10 years.

How Nvidia's CUDA Monopoly In Machine Learning Is Breaking - OpenAI Triton And PyTorch 2.0
https://www.semianalysis.com/p/nvidiaopenaitritonpytorch

◆TensorFlow vs. PyTorch
Patel points out that Google's TensorFlow was a frontrunner in the machine learning framework ecosystem a few years ago and seemed poised to dominate the machine learning industry. Google's early development and deployment of the TPU, a processor specialized for machine learning, also appeared to give it a first-mover advantage.

However, Google has not been able to convert that first-mover advantage into dominance of the machine learning industry, and PyTorch, a Python machine learning library, has been increasing its presence. PyTorch was developed by the artificial intelligence research team at Facebook (now Meta) and, at the time of writing, is used in the image-generation AI Stable Diffusion. As for why PyTorch gained an edge over TensorFlow, Patel points to a major difference in execution models: TensorFlow adopted a script execution method called 'graph mode', while PyTorch adopted 'eager mode'.

Eager mode executes each operation line by line like any other Python code, which makes it easy to inspect the results of intermediate operations and see how the model behaves. This makes the code easier to understand and debug. Graph mode, on the other hand, has two separate phases: first defining a computation graph that represents the operations to execute, then running an optimized version of that graph. Because you cannot see what is happening until the computation finishes, the code is harder to understand and debug.
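The difference between the two styles can be sketched in plain Python (a conceptual illustration only, not the actual PyTorch or TensorFlow API):

```python
# Eager mode: each operation runs immediately, so every intermediate
# value can be printed or inspected at the point it is computed.
def eager_example(x):
    a = x * 2          # `a` is a real value here; easy to debug
    b = a + 3
    return b

# Graph mode: first record the operations into a graph, then execute
# the whole graph later. Intermediates are invisible until execution.
class Graph:
    def __init__(self):
        self.ops = []  # recorded (name, function) pairs

    def add(self, name, fn):
        self.ops.append((name, fn))
        return self

    def run(self, x):
        # An optimizer could rewrite self.ops here before executing.
        for _, fn in self.ops:
            x = fn(x)  # intermediate values exist only inside this loop
        return x

graph = Graph().add("mul2", lambda v: v * 2).add("add3", lambda v: v + 3)

print(eager_example(5))  # 13
print(graph.run(5))      # 13
```

Both produce the same result; the trade-off is that the graph version can be optimized as a whole before running, while the eager version is transparent step by step.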

Regarding the difference between the two modes, Patel compares graph mode to a compiled language such as C++, which translates code ahead of time, and eager mode to an interpreted language such as Python, which interprets and executes expressions one by one. TensorFlow later made eager mode its default, but by then PyTorch had already established itself in the research community and at major technology companies.

By prioritizing its own software stack and hardware rather than PyTorch and GPUs, Google has become somewhat isolated within the machine learning community. Large language models from OpenAI and other AI startups are said to be threatening Google's dominance in search and natural language processing, although Google still remains at the forefront of machine learning models.



◆ Bandwidth issues in machine learning training
Patel says machine learning training reduces to two simple constraints: compute (FLOPS) and memory (bandwidth). FLOPS limits how fast calculations execute within each layer, while bandwidth limits how fast data reaches the computing resources that perform those calculations. In the past, training time was limited by computing power, but NVIDIA's development of high-performance GPUs largely resolved that, and at the time of writing, bandwidth is the bottleneck.
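Whether a given operation is limited by compute or by bandwidth can be estimated with simple roofline-style arithmetic (the hardware figures below are illustrative assumptions, not numbers from the article):

```python
# Roofline-style estimate: an operation is memory-bound when the time
# to move its data exceeds the time to do its arithmetic.
def bottleneck(flops, bytes_moved, peak_flops, peak_bandwidth):
    compute_time = flops / peak_flops            # seconds doing math
    memory_time = bytes_moved / peak_bandwidth   # seconds moving data
    return "memory-bound" if memory_time > compute_time else "compute-bound"

# Hypothetical accelerator: 300 TFLOPS of compute, 2 TB/s of bandwidth.
PEAK_FLOPS = 300e12
PEAK_BW = 2e12

# Elementwise add of two 1 GB fp32 tensors: 1 FLOP per element, 12 bytes
# moved per element (read a, read b, write c) -> low arithmetic intensity.
n = 250_000_000  # elements
print(bottleneck(n, 12 * n, PEAK_FLOPS, PEAK_BW))  # memory-bound

# Large matrix multiply: O(m^3) FLOPs over O(m^2) bytes -> high intensity.
m = 8192
print(bottleneck(2 * m**3, 3 * 4 * m**2, PEAK_FLOPS, PEAK_BW))  # compute-bound
```

The pattern this sketch shows is exactly the one Patel describes: elementwise operations starve on bandwidth long before they exhaust the GPU's arithmetic units, while dense matrix multiplies can stay compute-bound.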

Bandwidth is a major issue for large language models, which require hundreds of gigabytes to tens of terabytes of memory as model sizes continue to grow exponentially with advancing AI development. The fastest memory to read from is on-chip SRAM, but even the giant 20 cm x 22 cm 'Wafer Scale Engine 2' chip developed by the AI company Cerebras tops out at 40 GB of on-chip SRAM, and the chip is said to cost several hundred million yen.

Cerebras develops a large chip 'Wafer Scale Engine 2' equipped with 2.6 trillion transistors - GIGAZINE



The next fastest option after on-chip SRAM is DRAM: its latency is an order of magnitude higher, but its cost is roughly 1/100. Even so, DRAM cost has not improved since 2012, and as of 2022 DRAM accounts for 50% of total server cost, Patel points out.

NVIDIA's next-generation H100 GPU increases FLOPS by more than 6x over the A100 GPU, but increases memory bandwidth by only 1.65x. In large language model training, even 60% FLOPS utilization is considered 'very high', and there are concerns that the H100 will push utilization even lower.
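The implied squeeze can be checked with back-of-the-envelope arithmetic using only the ratios quoted above (a sketch, not figures from NVIDIA):

```python
# If FLOPS grow 6x while bandwidth grows only 1.65x, a workload must do
# roughly 3.6x more math per byte of memory traffic just to keep its
# utilization flat on the new chip.
flops_gain = 6.0
bw_gain = 1.65

required_intensity_gain = flops_gain / bw_gain
print(f"{required_intensity_gain:.2f}x")  # 3.64x more math per byte needed

# A workload that hit 60% FLOPS utilization on the old chip and stays
# purely bandwidth-limited at the same arithmetic intensity drops to:
old_utilization = 0.60
new_utilization = old_utilization * bw_gain / flops_gain
print(f"{new_utilization:.1%}")  # purely bandwidth-limited estimate
```

This is the arithmetic behind the concern: without fusion or other bandwidth savings, faster math units simply sit idle waiting for memory.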



◆ Increasing PyTorch operators and countermeasures
While PyTorch has the edge over Google's TensorFlow, its eager mode reads data from memory, computes each operation, and writes the result back to memory before the next computation runs. Without large-scale optimization, this drives memory bandwidth consumption up significantly.

Therefore, models running in eager mode use 'operator fusion', which combines multiple functions into a single pass to minimize memory reads and writes. While fusion reduces memory bandwidth and memory capacity costs, PyTorch's operators have ballooned to over 2,000 over the past few years, making it difficult for developers to choose the right one.
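The effect of operator fusion can be sketched in plain Python by comparing passes over the data (a conceptual illustration, not PyTorch's actual fusion machinery):

```python
# Unfused: two separate passes, each reading its input and writing a full
# intermediate result back to "memory" (modeled here as a new list).
def unfused(xs):
    doubled = [x * 2 for x in xs]    # pass 1: read xs, write `doubled`
    return [d + 3 for d in doubled]  # pass 2: read `doubled`, write output

# Fused: one pass that applies both operations per element, so the
# intermediate value lives in a register and never touches memory.
def fused(xs):
    return [x * 2 + 3 for x in xs]   # single read of xs, single write

data = list(range(4))
print(unfused(data))  # [3, 5, 7, 9]
print(fused(data))    # [3, 5, 7, 9]

# Per element, the unfused version moves 4 values (read, write, read,
# write) where the fused version moves 2 -- half the memory traffic.
```

On a GPU the saving comes from keeping intermediates in registers or SRAM instead of round-tripping through DRAM, which is why fusion directly attacks the bandwidth bottleneck described above.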

Meanwhile, PyTorch moved out of Meta to an independent foundation under the Linux Foundation, and announced 'PyTorch 2.0' in December 2022.

AI framework 'PyTorch', which is also used for the image generation AI Stable Diffusion, moves to the Linux Foundation independently from Meta - GIGAZINE



PyTorch 2.0 brings many changes, but the biggest is that it adds a compiled mode supporting graph-style execution, which makes it much easier to optimize the use of hardware resources. It also ships new components: 'PrimTorch', which reduces the number of primitive operators to 250 or fewer to simplify backend implementations; 'TorchDynamo', which captures the computation graph and determines which intermediate operations can be fused and which need to be written to memory; and 'TorchInductor', which reduces the burden of writing compiler backends. Together these are said to significantly reduce memory bandwidth and capacity requirements. PyTorch 2.0 is scheduled for official release in March 2023.
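The new compiled path is exposed through `torch.compile` (a minimal sketch assuming PyTorch 2.0 or later is installed; the `backend="eager"` option exercises graph capture without code generation, so it runs without a GPU or C++ toolchain):

```python
import torch

# A small function written as ordinary eager-mode PyTorch.
def f(x):
    return torch.relu(x * 2 + 3)

# torch.compile captures the operations as a graph (via TorchDynamo) and
# hands them to a backend. "eager" skips codegen so this sketch runs
# anywhere; the default "inductor" backend generates fused kernels.
compiled_f = torch.compile(f, backend="eager")

x = torch.tensor([-2.0, 0.0, 2.0])
print(f(x))           # eager result
print(compiled_f(x))  # identical result from the compiled path
```

The point of the design is that existing eager-mode code needs only this one wrapper call to opt in to graph capture and fusion.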

◆ OpenAI's language for AI 'Triton'
In July 2021, the AI research organization OpenAI released 'Triton', an open-source programming language for neural networks. Triton is said to enable higher productivity and faster code than CUDA, the general-purpose parallel computing platform for GPUs developed and provided by NVIDIA, and to be approachable for machine learning engineers and researchers who are not familiar with GPU programming.
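As an illustration of that style, a Triton vector-add kernel looks roughly like the following (a sketch based on Triton's documented API; running it requires the `triton` package and a CUDA-capable GPU):

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized chunk of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n  # guard lanes that fall past the end of the data
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```

Unlike CUDA, loads, stores, and bounds checks are expressed over whole blocks rather than individual threads, and the compiler handles details such as memory coalescing, which is the usability gap Patel describes.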

OpenAI publishes programming language 'Triton' for open source neural networks - GIGAZINE



Patel points out that CUDA is used by specialists in accelerated computing, but not by machine learning researchers and data scientists. The reason is that using CUDA efficiently is difficult and requires a deep understanding of the hardware architecture, which can slow down the development process.

Since Triton can fill that gap, Patel believes that if Triton officially adds support for GPUs other than NVIDIA's in the near future, it could break NVIDIA's stronghold. Regarding the problem NVIDIA faces, he said: 'NVIDIA's huge software organization lacked the foresight to take advantage of its enormous dominance in machine learning hardware and software and become the default compiler for machine learning. Because NVIDIA did not emphasize usability, outsiders such as OpenAI and Meta were able to create software stacks that are portable to other hardware.'

in Software, Hardware, Posted by log1h_ik