How one company cut costs dramatically by building its own ¥800 million data center instead of using the cloud, which would have cost five times as much

You might think that owning a data center requires huge sums of money, land acquisition, and lobbying politicians, but Comma.ai explains on its blog how it built and runs its own.
Owning a $5M data center - comma.ai blog
https://blog.comma.ai/datacenter/

When a business relies on cloud computing, it places a great deal of trust in its cloud provider. Comma.ai CTO Harald Schäfer points out that once a company chooses a cloud provider, it is hard to leave, and changes in cloud pricing can have a significant impact on the business. "If you want to control your own destiny, you need to run your own computing," Schäfer says.
Schäfer also argues that running compute in-house not only gives a company more autonomy but also encourages good engineering. In machine learning, cloud users can often improve results simply by spending more money on compute, but that risks locking them into an inefficient and expensive approach. With a fixed pool of in-house resources, by contrast, Schäfer says engineers focus on making code faster and fixing fundamental problems instead of increasing the budget.
The photo below shows Comma.ai's data center. It has a very simple configuration and is built and maintained by a few engineers and technicians.

Comma.ai's data center draws up to about 450 kW of power. In 2025 the company expects to pay more than $540,000 (about 85 million yen) for electricity, which accounts for the majority of its data center costs.
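The two power figures allow a quick sanity check. The sketch below assumes the facility ran at its full 450 kW peak for all 8,760 hours of the year, which gives the lowest per-kWh price consistent with a $540,000 bill; any lower average draw implies a higher rate. The function name is hypothetical, and the article does not state comma.ai's actual electricity tariff.

```python
# Figures quoted in the article.
PEAK_KW = 450.0                    # maximum draw of the data center
ANNUAL_POWER_COST_USD = 540_000.0  # projected 2025 electricity bill
HOURS_PER_YEAR = 24 * 365          # 8,760 (leap years ignored)


def min_implied_rate(peak_kw: float, annual_cost_usd: float) -> float:
    """Lowest average $/kWh consistent with the bill.

    Assumption: the facility draws its full peak power every hour of the
    year, which maximizes energy use and therefore minimizes the rate.
    """
    max_kwh = peak_kw * HOURS_PER_YEAR
    return annual_cost_usd / max_kwh


rate = min_implied_rate(PEAK_KW, ANNUAL_POWER_COST_USD)
print(f"max energy: {PEAK_KW * HOURS_PER_YEAR:,.0f} kWh/year")  # 3,942,000
print(f"implied rate: at least ${rate:.3f}/kWh")                 # $0.137
```

Under that assumption the result is roughly $0.137 per kWh. Because the article gives only a peak figure, treat this as a floor rather than comma.ai's real rate.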

Data centers typically cool their equipment with computer room air conditioners (CRACs), but these consume a great deal of electricity. Comma.ai therefore takes advantage of the mild climate of

Most of the data center's compute comes from 75 TinyBox Pro machines, each with two CPUs and eight GPUs. They serve both as training machines for AI models and as general-purpose compute workers.
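For scale, the per-machine counts multiply out as follows. This is plain arithmetic on the article's numbers; the `MachineSpec` type and `fleet_totals` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineSpec:
    cpus: int
    gpus: int


# Per-machine counts from the article.
TINYBOX_PRO = MachineSpec(cpus=2, gpus=8)


def fleet_totals(spec: MachineSpec, count: int) -> MachineSpec:
    """Multiply one machine's CPU and GPU counts by the number of machines."""
    return MachineSpec(cpus=spec.cpus * count, gpus=spec.gpus * count)


print(fleet_totals(TINYBOX_PRO, 75))  # MachineSpec(cpus=150, gpus=600)
```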
Storage consists of several racks of Dell R630 and R730 servers with a total capacity of 4PB. Several standalone machines handle other roles, such as routers, air-conditioning controllers, data capture machines, and storage master servers.
According to Schäfer, the network consists of three interconnected 100Gbps Z9264F switches, and two InfiniBand switches connect the two TinyBox Pro groups for all-reduce training. Every server runs Ubuntu, and distributed storage is managed with minikeyvalue.
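All-reduce is the collective operation that sums values such as gradients across machines so that every worker ends up with the same result. The article does not say which all-reduce algorithm comma.ai's training stack uses; as a generic illustration only, the sketch below simulates ring all-reduce, one widely used variant, with in-memory lists standing in for machines. `ring_allreduce` is a hypothetical helper, not comma.ai's code.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate ring all-reduce (sum) over len(buffers) in-memory workers.

    Each buffer is split into n chunks. In reduce-scatter, partial sums travel
    around the ring until each worker owns one fully summed chunk; in
    all-gather, those chunks are passed on until every worker has them all.
    """
    if not buffers:
        return []
    n = len(buffers)
    length = len(buffers[0])
    # Chunk c covers indices [bounds[c][0], bounds[c][1]).
    bounds = [(c * length // n, (c + 1) * length // n) for c in range(n)]
    data = [list(buf) for buf in buffers]  # copy so inputs are not mutated

    # Phase 1: reduce-scatter. At step s, worker i sends chunk (i - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        outgoing = []
        for i in range(n):
            c = (i - step) % n
            lo, hi = bounds[c]
            outgoing.append((c, data[i][lo:hi]))  # slice = snapshot of the send
        for i, (c, payload) in enumerate(outgoing):
            dst = (i + 1) % n
            lo = bounds[c][0]
            for offset, value in enumerate(payload):
                data[dst][lo + offset] += value

    # Now worker i holds the fully summed chunk (i + 1) mod n.
    # Phase 2: all-gather. At step s, worker i forwards chunk (i + 1 - s) mod n
    # and the neighbour overwrites its copy with the finished values.
    for step in range(n - 1):
        outgoing = []
        for i in range(n):
            c = (i + 1 - step) % n
            lo, hi = bounds[c]
            outgoing.append((c, data[i][lo:hi]))
        for i, (c, payload) in enumerate(outgoing):
            dst = (i + 1) % n
            lo, hi = bounds[c]
            data[dst][lo:hi] = payload

    return data


workers = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [100.0, 200.0, 300.0, 400.0],
]
for result in ring_allreduce(workers):
    print(result)  # [111.0, 222.0, 333.0, 444.0] on every worker
```

In a ring, each worker sends about 2(n-1)/n of its buffer in total, spread evenly across the links, which is why the bandwidth between machine groups matters so much for this step.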
The main training/compute machines are shown below. Schäfer points out that owning a data center can be much cheaper than paying for cloud services, especially for a business that trains and runs its own models. Comma.ai spent about $5 million on its data center, while Schäfer estimates that doing the same in the cloud would have cost more than $25 million (about 3.92 billion yen).
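The headline "five times" follows directly from the two totals. The sketch below just divides and subtracts them; since the cloud figure is "more than $25 million," both results should be read as lower bounds. `cost_comparison` is a hypothetical helper name.

```python
# Figures quoted in the article (USD).
OWN_DATACENTER_USD = 5_000_000   # "about $5 million" actually spent
CLOUD_ESTIMATE_USD = 25_000_000  # Schäfer's estimate: "more than $25 million"


def cost_comparison(own_usd: float, cloud_usd: float) -> tuple[float, float]:
    """Return (cloud cost as a multiple of own cost, absolute difference)."""
    return cloud_usd / own_usd, cloud_usd - own_usd


ratio, savings = cost_comparison(OWN_DATACENTER_USD, CLOUD_ESTIMATE_USD)
print(f"cloud / own: {ratio:.1f}x")  # 5.0x
print(f"savings: ${savings:,}")      # $20,000,000
```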

in Hardware, Posted by log1h_ik