Nov 28, 2024 16:00:00

A bug caused Cloudflare to lose about 55% of logs sent to users over a 3.5 hour period

Cloudflare, which provides services such as CDN and DDoS protection, has a service called 'Cloudflare Logs' that provides detailed logs of HTTP requests to users' websites. Cloudflare has reported that a bug in Cloudflare Logs on November 14, 2024 caused a failure in the log sending process, resulting in the loss of approximately 55% of the logs that were supposed to be sent to users over a period of three and a half hours.

Cloudflare incident on November 14, 2024, resulting in lost logs
https://blog.cloudflare.com/cloudflare-incident-on-november-14-2024-resulting-in-lost-logs/

Cloudflare says it lost 55% of logs pushed to customers for 3.5 hours
https://www.bleepingcomputer.com/news/security/cloudflare-says-it-lost-55-percent-of-logs-pushed-to-customers-for-35-hours/

Cloudflare broke its logging-service, causing data loss • The Register
https://www.theregister.com/2024/11/27/cloudflare_logs_data_loss_incident/

Cloudflare Logs is a service that collects and transmits website traffic logs, allowing users to identify security incidents and optimize their websites.

Because users often request logs from multiple servers, log files can become redundant and enormous in size. Cloudflare uses a system called 'Logpush' to consolidate logs into a predictable size and push them at an appropriate frequency.

The Logpush system works roughly as follows. First, a service called 'Logfwdr' receives event logs from the Cloudflare system and batches them into appropriate sizes. Next, 'Logreceiver' receives the batched data and sorts it according to the type of event and its purpose. 'Buftee' provides a buffer for each job in Logpush, and finally, 'Logpush' reads the logs from Buftee's buffer and sends the data to the destination configured by the customer.
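The four stages can be pictured with a minimal Python sketch. This is only an illustration of the flow described above, not Cloudflare's code: the function names, the event fields ('customer', 'dataset') and the job fields ('id', 'destination') are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical stand-ins for the four stages; none of this is Cloudflare's code.

def batch(events, size):
    """Logfwdr-like step: group incoming events into fixed-size batches."""
    return [events[i:i + size] for i in range(0, len(events), size)]

def route(batches):
    """Logreceiver-like step: sort batched events by dataset (event type)."""
    by_dataset = defaultdict(list)
    for one_batch in batches:
        for event in one_batch:
            by_dataset[event["dataset"]].append(event)
    return by_dataset

def buffer_per_job(by_dataset, jobs):
    """Buftee-like step: keep one buffer per configured push job."""
    buffers = {}
    for job in jobs:
        buffers[job["id"]] = deque(
            e for e in by_dataset.get(job["dataset"], [])
            if e["customer"] == job["customer"]
        )
    return buffers

def push(buffers, jobs):
    """Logpush-like step: drain each job's buffer to its destination."""
    delivered = defaultdict(list)
    for job in jobs:
        buf = buffers[job["id"]]
        while buf:
            delivered[job["destination"]].append(buf.popleft())
    return dict(delivered)
```

Chaining the calls as push(buffer_per_job(route(batch(events, 2)), jobs), jobs) delivers only the events that match a configured job, which is the normal, non-failure behavior.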

On November 14, Cloudflare made changes to Logpush to support additional data sets, which included updating the Logfwdr configuration. A bug in that change caused a 'blank config' to be issued, erroneously telling the system that 'no customers were configured to forward logs.'

The Cloudflare team quickly noticed the bug and reverted the change within five minutes. However, Logfwdr has a 'fail-open' safeguard intended to prevent log loss: if a problem occurs and an individual host cannot retrieve its configuration, it sends events for all customers instead of none. The blank configuration triggered this safeguard.
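The difference between failing open and failing closed can be shown with a short sketch. The function name and parameters here are assumptions made for illustration and are not taken from Logfwdr itself.

```python
def customers_to_forward(config, all_customers, fail_open=True):
    """Decide whose events get forwarded (hypothetical helper).

    config: collection of customer IDs with configured jobs; None or
            empty stands for a missing or blank configuration.
    """
    if config:
        # Normal case: forward only for customers that have jobs.
        return set(config)
    if fail_open:
        # Blank or unavailable config: forward for everyone so no logs are lost.
        return set(all_customers)
    # Fail closed: forward nothing until a valid config arrives.
    return set()
```

With a blank configuration and fail_open=True, the function returns every customer, which is the kind of fan-out that multiplies the work for the downstream stages.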

As a result, in just five minutes Logfwdr forwarded a large volume of logs covering all customers, causing a sudden increase in the load on Buftee, which creates a buffer for each customer. The system required 40 times the normal number of buffers, which led to an overload and caused a system failure. According to a graph published by Cloudflare, the buggy change was made around 15:10, after which the number of buffers created by Buftee rose dramatically.

The fix required a full reset and restart, slowing recovery and causing long-lasting impacts. As a result, during the 3.5 hours that the Cloudflare Logs service was affected, approximately 55% of logs that would normally be sent were lost.

Cloudflare pointed out that although a bug like the one in the Logfwdr configuration was expected to occur at some point, the 'fail-open' safeguard meant to handle it was itself flawed. In addition, although Buftee had a mechanism to prevent a chain of failures due to a sudden increase in buffers, it was not configured correctly. Cloudflare described the series of problems as 'like having a seat belt in your car but not fastening it.'
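One common form of such a safeguard is a hard cap on the number of buffers. The article does not describe how Buftee's mechanism actually works, so the class below is a generic sketch under that assumption, with hypothetical names throughout.

```python
class BoundedBufferPool:
    """Per-customer buffers with a hard cap on how many may exist.

    Hypothetical illustration; not Buftee's actual design.
    """

    def __init__(self, max_buffers):
        self.max_buffers = max_buffers
        self.buffers = {}
        self.rejected = 0  # events refused because the cap was reached

    def append(self, customer, event):
        """Store an event; return False if a new buffer would exceed the cap."""
        if customer not in self.buffers:
            if len(self.buffers) >= self.max_buffers:
                self.rejected += 1
                return False
            self.buffers[customer] = []
        self.buffers[customer].append(event)
        return True
```

A cap like this sheds load for unexpected customers while existing buffers keep working, but only if max_buffers is set to a sensible value, which is the 'fastening the seat belt' part.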

To prevent a recurrence, Cloudflare has promised to create alerts that catch misconfigurations and to conduct regular overload tests that simulate unexpected spikes in data volume, so that it can verify its systems withstand them.

Nov 28, 2024 16:00:00 in Web Service, Security, Posted by log1h_ik