How Does Netflix Make Its Data Infrastructure Cost-Effective?



Companies and public institutions are increasingly adopting cloud services such as AWS as their data infrastructure, but the fees for these services are by no means low. Netflix, which manages huge amounts of data in the cloud, has explained how it reduces the cost of operating its data infrastructure and keeps it cost-effective.

Byte Down: Making Netflix's Data Infrastructure Cost-Effective | by Netflix Technology Blog | Jul, 2020 | Netflix TechBlog
https://netflixtechblog.com/byte-down-making-netflixs-data-infrastructure-cost-effective-fee7b3235032

Budget caps and strict spend limits are commonly used to manage data infrastructure costs, but at Netflix, which has a highly decentralized data infrastructure and values freedom and responsibility, such methods are counterproductive and inefficient. To make its data infrastructure more cost-effective, Netflix instead developed a dashboard that aggregates cost information, making it transparent and placing cost-effectiveness information as close to the decision makers as possible.

Netflix handles two types of data: static data and dynamic data. Static data is data stored in AWS S3, Cassandra, Elasticsearch, and similar systems, and its cost is tied to storage. Dynamic data is data being processed by Keystone, Flink, and similar systems, and its cost is tied to computation.



Because this data is owned by various teams, understanding the cost per team requires more than aggregating costs across platforms. The costs must also be broken down into meaningful resource units such as tables and indexes. To make this breakdown possible, Netflix has built the following system.



Netflix obtains AWS usage charges and S3 inventory metadata from AWS. S3 inventory is a service that outputs the metadata of objects stored in S3. From inside Netflix, the system gathers information from three sources: the Netflix Data Catalog (NDC), which covers all of Netflix's data resources and provides metadata that associates costs with data; the APIs of the dynamic data services; and Atlas, a system that monitors metrics such as CPU utilization and network throughput. It then calculates cost-effectiveness and displays the results on the dashboard.

Because AWS usage charges are billed per platform, such as EC2 and S3, the cost spent on each platform has to be allocated in order to understand the cost for each team. For EC2-based platforms, Netflix first identifies the bottleneck metric for each platform, such as CPU utilization or memory utilization. It then uses Atlas to calculate each data resource's share of that metric and allocates cost on that basis. S3-based platforms use S3 inventory to allocate costs according to the amount of data each resource holds in S3 storage.
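The allocation described above amounts to splitting a platform's bill in proportion to each resource's share of a chosen metric. The following is a minimal sketch of that idea, not Netflix's actual implementation; the function name `allocate_cost`, the resource names, and all figures are hypothetical and chosen only for illustration.

```python
from typing import Dict


def allocate_cost(platform_cost: float, usage_by_resource: Dict[str, float]) -> Dict[str, float]:
    """Split a platform's total cost across resources in proportion to usage.

    usage_by_resource holds each resource's consumption of the platform's
    bottleneck metric (for example CPU-seconds for an EC2-based platform,
    or stored bytes for an S3-based one).
    """
    total_usage = sum(usage_by_resource.values())
    if total_usage <= 0:
        # Nothing was consumed, so there is no basis for allocation.
        return {name: 0.0 for name in usage_by_resource}
    return {
        name: platform_cost * usage / total_usage
        for name, usage in usage_by_resource.items()
    }


# Hypothetical figures, for illustration only.
cpu_seconds = {"table_a": 600.0, "table_b": 300.0, "table_c": 100.0}
ec2_allocation = allocate_cost(1000.0, cpu_seconds)

stored_bytes = {"table_a": 2_000_000, "table_b": 6_000_000}
s3_allocation = allocate_cost(400.0, stored_bytes)
```

The same function serves both cases because only the metric changes: a compute metric for EC2-based platforms and stored bytes for S3-based ones.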



The dashboard that displays this data uses Druid as its back end, and costs can be grouped in different ways depending on the use case. For example, costs can be displayed by organizational unit...



Costs can also be displayed by resource.



There is also a dashboard that shows costs over time. These dashboards are mainly targeted at engineers and data science teams.
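The three views mentioned here (by organizational unit, by resource, and over time) are all group-by aggregations over the same cost records. The sketch below illustrates that in plain Python rather than Druid; the record layout, the function `group_costs`, and the sample values are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record: (organization, resource, day, cost). All values are hypothetical.
CostRecord = Tuple[str, str, str, float]


def group_costs(records: Iterable[CostRecord], key_index: int) -> Dict[str, float]:
    """Sum cost by one dimension: 0 = organization, 1 = resource, 2 = day."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record[key_index]] += record[3]
    return dict(totals)


records = [
    ("org_x", "table_a", "2020-07-01", 10.0),
    ("org_x", "table_b", "2020-07-01", 5.0),
    ("org_y", "table_c", "2020-07-01", 7.0),
    ("org_x", "table_a", "2020-07-02", 12.0),
]
by_org = group_costs(records, 0)
by_resource = group_costs(records, 1)
by_day = group_costs(records, 2)
```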



In addition to visualizing costs on dashboards, Netflix is streamlining how its data is used. In Netflix's big data warehouse, retention was left up to the data owners, but they had no way to determine an optimal expiration date. To improve this situation, Netflix developed a system that automatically calculates the optimal expiration date to set on stored data.



According to Netflix, transaction tables that are updated daily account for the largest share of S3 cost. First, a system (Metacat) that collects S3 access logs and data warehouse metadata is used to identify which partition was accessed and when. Then the access history of the past 180 days is analyzed to find the longest period after which data was accessed again, and the optimal expiration date is set from that. In addition, data owners are given a dashboard that shows the recommended and currently configured expiration dates, along with the expected cost savings.
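One way to read the analysis above is: for every read in the last 180 days, measure how old the partition was when it was read, and take the largest such age as the recommended retention. The sketch below follows that reading, which is an assumption about the method rather than a description of Netflix's code; `recommend_ttl_days` and the sample log are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable, Optional, Tuple

# (partition_date, access_date): the partition's data date and the day it was read.
Access = Tuple[date, date]


def recommend_ttl_days(
    accesses: Iterable[Access], today: date, window_days: int = 180
) -> Optional[int]:
    """Return the largest partition age (in days) observed at read time
    within the look-back window, or None if nothing was read."""
    window_start = today - timedelta(days=window_days)
    ages = [
        (access_day - partition_day).days
        for partition_day, access_day in accesses
        if window_start <= access_day <= today and access_day >= partition_day
    ]
    return max(ages) if ages else None


today = date(2020, 7, 31)
log = [
    (date(2020, 7, 1), date(2020, 7, 2)),   # read 1 day after the partition date
    (date(2020, 6, 1), date(2020, 7, 15)),  # read 44 days after the partition date
    (date(2019, 1, 1), date(2019, 12, 1)),  # outside the 180-day window, ignored
]
ttl = recommend_ttl_days(log, today)
```

A table whose partitions are never read more than 44 days after they are written could, under this reading, expire data after roughly that long; a real system would likely add a safety margin.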



In addition to these dashboards and the expiration date recommendation system, Netflix has built other systems, such as one that notifies engineers when the cost of their data usage is rising. Using these systems, Netflix was able to reduce the storage usage of its data warehouse by more than 10%. Netflix says that remaining challenges include "maintaining the continuity of data when the organization or owner changes" and "persistence of the state when there is a problem with the data", and that it will continue working to make data usage more efficient.

in Software, Web Service, Posted by darkhorse_log