SkyPilot automatically selects the most cost-effective cloud for significant cost savings



Cloud computing services are becoming ever more important for business, but soaring costs, availability problems, and the difficulty of choosing among services are growing along with them. In response, a team led by Zongheng Yang, a researcher at the University of California, Berkeley, has developed "SkyPilot", an open source framework that automatically selects the most cost-effective cloud.

GitHub - skypilot-org/skypilot: SkyPilot is a framework for easily running machine learning workloads on any cloud through a unified interface.

https://github.com/skypilot-org/skypilot

SkyPilot: ML and Data Science on any cloud with massive cost savings | by Zongheng Yang | Nov, 2022 | Medium
https://medium.com/@zongheng_yang/skypilot-ml-and-data-science-on-any-cloud-with-massive-cost-savings-244189cc7c0f

UC Berkeley Launches SkyPilot to Help Navigate Soaring Cloud Costs
https://www.datanami.com/2022/12/12/uc-berkeley-launches-skypilot-to-help-navigate-soaring-cloud-costs/

There are many cloud computing services, including Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), and many people are probably unsure which one to choose.

Yang argues that organizations using cloud computing services should ideally adopt "multi-cloud", which uses several clouds, or "multi-region", which allows switching between several regions. The benefits of a multi-cloud/region setup include:

◆ Cost can be reduced
A comparison of instance prices for the NVIDIA A100 GPU, AMD CPUs, and ARM CPUs on AWS, GCP, and Azure as of November 2022 shows that prices differ by provider even for the same hardware. For the NVIDIA A100 GPU, Azure is the cheapest, AWS is 20% more expensive, and GCP is 8% more expensive. Being able to choose the cloud computing service that best fits your needs can therefore lead to significant cost savings.
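To put those percentages in concrete terms, here is a minimal sketch of the arithmetic. Only the 20% and 8% markups come from the comparison above; the base price of 1.00 is a normalized placeholder rather than an actual quote, and the function name `savings_vs` is made up for illustration.

```python
# Relative hourly prices for the same A100 instance, normalized so that the
# cheapest provider (Azure) is 1.00. The markups (20% and 8%) come from the
# article; the absolute scale is a placeholder, not a real price.
relative_price = {"Azure": 1.00, "AWS": 1.20, "GCP": 1.08}


def savings_vs(provider: str, prices: dict[str, float]) -> float:
    """Return the fraction saved by moving from `provider` to the cheapest option."""
    cheapest = min(prices.values())
    return 1.0 - cheapest / prices[provider]


for name in relative_price:
    saved = savings_vs(name, relative_price)
    print(f"{name}: {saved:.1%} saved by switching to the cheapest provider")
```

Note that a 20% markup corresponds to a saving of about 16.7% when switching away from it, because the saving is measured against the higher price.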



Similarly, there are price differences between regions and zones even for the same cloud computing service, so it is also important to switch between regions and zones appropriately.



◆ Take advantage of the best hardware
Because hardware varies in performance and efficiency, service providers increasingly offer custom hardware to differentiate themselves from competitors. Yang gives the following examples:

・ GCP's "TPU", a high-performance machine for machine learning
・ AWS's "Inferentia" for cost-effective machine learning inference
・ Azure's "Intel SGX", which encrypts data in use

The services and hardware on offer differ not only between clouds but also between regions. By choosing the best hardware for the task, you can expect both lower costs and better performance.

◆ Increase the availability of valuable resources
Growing demand for cloud computing services has made cloud instances with high-end GPUs hard to obtain, and even instances without them can run short of capacity. Using multiple clouds and regions is the best way to increase the availability of these scarce computing resources.

Related: Report that "Microsoft Azure's lack of capacity is adversely affecting startup businesses" sparks heated discussion online - GIGAZINE



However, even at the University of California, Berkeley, the complexity of operating across multiple clouds and regions has been a longstanding challenge. Yang said, "Our lab relies heavily on public clouds to run projects such as machine learning, data science, systems, databases, and security. We found it difficult, and using multiple clouds only exacerbated the burden on end users."

So Yang and his group developed SkyPilot, an open source framework that makes multi-cloud/region usage simpler and cheaper. Once the user specifies a job and its resource requirements (CPU/GPU/TPU), SkyPilot automatically identifies the clouds, regions, and zones that have the computing resources to run the job, selects the cheapest one, and executes the job there.
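The selection step described above can be sketched in a few lines. This is an illustration only and not SkyPilot's actual implementation or API: the `Offer` type, the `cheapest_offer` function, the cloud/region/zone names, and all prices are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One place a job could run. All fields here are illustrative."""
    cloud: str
    region: str
    zone: str
    accelerator: str
    hourly_usd: float  # placeholder price, not a real quote


# Hypothetical catalog of offerings across clouds, regions, and zones.
CATALOG = [
    Offer("cloud-a", "region-1", "zone-a", "A100", 4.10),
    Offer("cloud-a", "region-2", "zone-b", "A100", 3.70),
    Offer("cloud-b", "region-1", "zone-a", "A100", 3.40),
    Offer("cloud-c", "region-1", "zone-a", "V100", 2.50),
]


def cheapest_offer(catalog: list[Offer], accelerator: str) -> Offer:
    """Keep only offers that satisfy the requirement, then pick the cheapest."""
    candidates = [o for o in catalog if o.accelerator == accelerator]
    if not candidates:
        raise ValueError(f"no offer provides {accelerator}")
    return min(candidates, key=lambda o: o.hourly_usd)


best = cheapest_offer(CATALOG, "A100")
print(f"run on {best.cloud}/{best.region}/{best.zone} at ${best.hourly_usd:.2f}/h")
```

The V100 entry is cheaper per hour but is filtered out first, which reflects the order described in the article: find the locations that can run the job, then choose the cheapest among them.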



In addition, SkyPilot can automatically fail over when it hits insufficient capacity or errors, synchronize user code and files to the cluster, and manage job queuing and execution. Yang claims that using SkyPilot to choose the cheapest cloud computing service for his jobs and to clean up idle clusters automatically often reduces costs by more than a third.
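As a rough illustration of failover on insufficient capacity, the sketch below tries candidate locations from cheapest to most expensive and moves on when one has no capacity. It is not SkyPilot's code: `CapacityError`, `launch_with_failover`, the `provision` callback, and the location names and prices are all assumptions made up for this example.

```python
from typing import Callable


class CapacityError(Exception):
    """Hypothetical error raised when a location has no free capacity."""


def launch_with_failover(
    offers: list[tuple[str, float]],
    provision: Callable[[str], None],
) -> str:
    """Try locations in ascending price order; return the first that succeeds."""
    failures = []
    for location, _price in sorted(offers, key=lambda o: o[1]):
        try:
            provision(location)
            return location
        except CapacityError as exc:
            # Record the failure and fall through to the next cheapest option.
            failures.append(f"{location}: {exc}")
    raise RuntimeError("all locations failed: " + "; ".join(failures))


# Usage with a fake provisioner in which the cheapest location is full.
OFFERS = [("cloud-a/region-1", 4.10), ("cloud-b/region-1", 3.40), ("cloud-b/region-2", 3.55)]
FULL = {"cloud-b/region-1"}


def fake_provision(location: str) -> None:
    if location in FULL:
        raise CapacityError("insufficient capacity")


chosen = launch_with_failover(OFFERS, fake_provision)
print(f"launched on {chosen}")
```

In this sketch the job lands on the second-cheapest location, so failover costs a little more per hour but avoids waiting for capacity.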

Over the past few months, SkyPilot has been adopted by dozens of researchers at more than 10 organizations for a variety of use cases. For example, scientists at the Salk Institute for Biological Studies use SkyPilot to run recurring weekly batch jobs. As a result, their cost fell to one-sixth or less of what on-demand instances would cost, and job completion time also dropped significantly.

"Salk Labs users tell us that SkyPilot abstracts the cloud so researchers can focus on the science rather than learning the intricacies of how the cloud works," Yang said.

Yang said that in the coming months he will publish more details on the SkyPilot system, in-depth use cases, and implementation guidance, and that he plans to keep improving its features.

in Software,   Web Service, Posted by log1h_ik