How is Netflix using eBPF, a Linux kernel extension, to improve performance?



Netflix's technical blog posted about how eBPF, which extends Linux kernel functions, can be used to improve performance.

Noisy Neighbor Detection with eBPF | by Netflix Technology Blog | Sep, 2024 | Netflix TechBlog
https://netflixtechblog.com/noisy-neighbor-detection-with-ebpf-64b1f4b3bbdd


Netflix's Compute and Performance Engineering team regularly investigates performance issues in our multi-tenant environment. When an issue occurs, the first step in our investigation is to isolate whether the cause is in the application or the underlying infrastructure.

Titus, the platform that supports Netflix's services, is a multi-tenant platform in which multiple services coexist on the same system. Multi-tenant environments are particularly prone to the 'noisy neighbor' problem, in which one service consumes a large share of server resources and degrades the performance of adjacent containers.

However, conventional performance analysis tools such as perf risk incurring significant overhead and degrading performance even further. Deploying an analysis tool only after a problem has occurred also makes the investigation inefficient, so it has been difficult to detect the effects of noisy neighbors and debug the resulting problems.

So Netflix's engineering team used eBPF, a technology that extends Linux kernel functionality, and attached to three hooks (sched_wakeup, sched_wakeup_new, and sched_switch) to measure how long a process waits in the scheduling queue before it is executed by the CPU.
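
The measurement described above can be sketched as a small Python model. This is only an illustration of the idea, not Netflix's eBPF program: the class name, method names, and timestamps are hypothetical, and the sketch ignores details a real kernel probe would need (for example, tasks that are preempted and put back on the queue without a wakeup). A wakeup event records when a process becomes runnable, and a switch event computes how long the process waited before getting the CPU.

```python
from dataclasses import dataclass, field


@dataclass
class RunQueueLatencyTracker:
    """Toy model of the measurement; hypothetical names, not real eBPF code."""

    # pid -> timestamp (ns) at which the process became runnable
    enqueued_at: dict = field(default_factory=dict)
    # list of (pid, wait time in microseconds)
    latencies_us: list = field(default_factory=list)

    def on_wakeup(self, pid: int, now_ns: int) -> None:
        # Stands in for sched_wakeup / sched_wakeup_new:
        # the process has just been placed on the scheduling queue.
        self.enqueued_at[pid] = now_ns

    def on_switch(self, next_pid: int, now_ns: int) -> None:
        # Stands in for sched_switch: next_pid is about to run on the CPU.
        start_ns = self.enqueued_at.pop(next_pid, None)
        if start_ns is None:
            return  # no wakeup was seen for this process, so nothing to measure
        self.latencies_us.append((next_pid, (now_ns - start_ns) / 1000))


# Made-up timestamps: pid 101 wakes up, then runs 83,400 ns later.
tracker = RunQueueLatencyTracker()
tracker.on_wakeup(101, 1_000_000)
tracker.on_switch(101, 1_083_400)
tracker.on_switch(202, 1_100_000)  # never seen waking up, so it is ignored
print(tracker.latencies_us)
```

Keeping only one timestamp per waiting process is what makes this kind of measurement cheap enough to run continuously: the state is dropped as soon as the process is scheduled.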



Continuously monitoring the wait time in the scheduling queue makes it possible to catch noisy neighbor problems promptly when they occur. For example, the figure below shows how the scheduling queue wait time changes on a server with sufficient CPU capacity to run a single container. The container's wait time, shown by the blue line, averages 83.4 microseconds, with rare spikes of around 400 microseconds.



Starting a second container that fully utilizes the host's CPU during this time caused the first container's wait time to spike by 131,000 microseconds (131 milliseconds). System preemptions, shown in green, also increased at the same time, indicating that the actual noisy neighbors are system processes. This pattern is most often seen when an application is handling HTTP traffic.



Netflix says that using eBPF allows them to continuously and efficiently monitor system performance, turning the data into actionable insights.

in Software, Posted by log1d_ts