Cloudflare publishes its analysis of a failure caused by an internal software error: what was the cause?



Cloudflare, which operates the privacy-focused public DNS service '1.1.1.1', has published a blog post summarizing the cause of the failure that hit the service on October 4, 2023, along with the measures it will take to prevent a recurrence. According to the post, the failure was not the result of an outside attack but was caused by an internal software error.

1.1.1.1 lookup failures on October 4th, 2023
https://blog.cloudflare.com/1-1-1-1-lookup-failures-on-october-4th-2023/


The failure lasted from 16:00 to 20:00 Japan time on October 4, 2023, and affected the DNS service '1.1.1.1' provided by Cloudflare, as well as services such as Cloudflare Pages and WARP that use 1.1.1.1.

DNS stands for 'Domain Name System', and its role is to look up the IP address needed to actually reach a domain name such as 'cloudflare.com'. For example, to access cloudflare.com, a resolver first has to ask the servers that manage the top-level domain (TLD) '.com' which server holds the address information for cloudflare.com. The 'root zone', which records where the TLD servers for '.com' and other TLDs are located, is published by the root servers.
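The lookup chain described above can be sketched as a toy model. This is only an illustration of the referral steps, not a real DNS client: the server names, the address, and the `resolve` function are all made up for the example.

```python
# Toy delegation data; every server name and address here is hypothetical.
ROOT_ZONE = {"com.": "tld-server.example."}  # root zone: TLD -> TLD server
TLD_SERVERS = {
    "tld-server.example.": {"cloudflare.com.": "auth-server.example."},
}
AUTH_SERVERS = {
    "auth-server.example.": {"cloudflare.com.": "192.0.2.1"},  # documentation address
}


def resolve(name: str) -> str:
    """Follow referrals from the root zone down to the authoritative answer."""
    tld = name.split(".")[-2] + "."              # "cloudflare.com." -> "com."
    tld_server = ROOT_ZONE[tld]                  # 1. root zone names the .com server
    auth_server = TLD_SERVERS[tld_server][name]  # 2. TLD server names the domain's server
    return AUTH_SERVERS[auth_server][name]       # 3. that server returns the address


print(resolve("cloudflare.com."))
```

Step 1 is the only one that depends on the root zone, which is why a resolver holding its own copy of that zone can skip a round trip to the root servers.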



Cloudflare keeps a copy of the root zone inside the '1.1.1.1' service, so it can respond quickly even when the root servers cannot be reached.



When the root zone is updated, the copy held by Cloudflare is updated as well. Under normal circumstances this happens twice a day.



On September 21, 2023, a new record type, 'ZONEMD', was added to the root zone. ZONEMD simply contains a checksum of the root zone data, but a bug in the code that parses the root zone for 1.1.1.1 meant the record could not be parsed. Because the new root zone could not be loaded, Cloudflare kept the DNS service running either with the last root zone it had cached before ZONEMD was added on September 21 or by querying the root servers directly.
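One way to keep a single unparseable record from rejecting an entire zone is to fall back to the raw text for that record. The following is a minimal sketch of that idea, not Cloudflare's parser. The ZONEMD field order (serial, scheme, hash algorithm, digest) follows RFC 8976; the `Record` type, the `PARSERS` registry, and the function names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    ttl: int
    rtype: str
    rdata: object  # parsed fields for known types, raw text otherwise


def parse_zonemd(rdata: str) -> dict:
    """Parse ZONEMD presentation format: serial, scheme, hash algorithm, hex digest."""
    serial, scheme, algorithm, *digest_parts = rdata.split()
    return {
        "serial": int(serial),
        "scheme": int(scheme),
        "hash_algorithm": int(algorithm),
        "digest": bytes.fromhex("".join(digest_parts)),
    }


PARSERS = {"ZONEMD": parse_zonemd}  # hypothetical registry of type-specific parsers


def parse_line(line: str) -> Record:
    """Parse one 'name ttl class type rdata' line without failing on odd rdata."""
    name, ttl, _class, rtype, rdata = line.split(maxsplit=4)
    parser = PARSERS.get(rtype)
    try:
        parsed = parser(rdata) if parser else rdata
    except ValueError:
        parsed = rdata  # keep the raw text rather than rejecting the whole zone
    return Record(name, int(ttl), rtype, parsed)
```

With this shape, a line such as `. 86400 IN ZONEMD 2023092100 1 1 <hex digest>` yields structured fields, while a malformed or unknown record is carried through as text and the rest of the zone still loads.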



However, at 16:00 Japan time on October 4, 2023, the DNSSEC signatures in the September 21 version of the root zone expired. Servers still using the old cached root zone could no longer validate signatures, SERVFAIL error responses increased, and the failure began.
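The expiry works because every DNSSEC signature (RRSIG record) carries an inception time and an expiration time, and a validator accepts it only inside that window. Here is a minimal sketch of that check. The timestamps are illustrative values chosen to match the timeline in this article, not values copied from the real root zone, and the function names are made up for the example.

```python
from datetime import datetime, timezone

RRSIG_TIME_FORMAT = "%Y%m%d%H%M%S"  # RRSIG timestamps in presentation format (UTC)


def parse_rrsig_time(text: str) -> datetime:
    """Convert a YYYYMMDDHHmmSS timestamp into an aware UTC datetime."""
    return datetime.strptime(text, RRSIG_TIME_FORMAT).replace(tzinfo=timezone.utc)


def signature_is_valid(inception: str, expiration: str, now: datetime) -> bool:
    """A signature is usable only while inception <= now <= expiration."""
    return parse_rrsig_time(inception) <= now <= parse_rrsig_time(expiration)


# Illustrative values only; 07:00 UTC on October 4 is 16:00 Japan time.
INCEPTION = "20230921000000"
EXPIRATION = "20231004070000"

before = datetime(2023, 10, 4, 6, 59, tzinfo=timezone.utc)
after = datetime(2023, 10, 4, 7, 1, tzinfo=timezone.utc)
print(signature_is_valid(INCEPTION, EXPIRATION, before))
print(signature_is_valid(INCEPTION, EXPIRATION, after))
```

A resolver that keeps using a zone file past the expiration time holds signatures that no longer verify, so the only safe response it can give is SERVFAIL.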



Cloudflare's graph of responses to DNS queries around the time of the failure shows that SERVFAIL, which normally stays at about 3%, rose to a peak of about 15% after the DNSSEC signatures expired at 7:00 GMT.



Cloudflare regularly restarts its resolver servers for tasks such as kernel updates. Instances restarted between September 21 and October 4 failed to load the root zone at startup and fell back to querying the root servers directly. According to the post, this fallback, together with the Serve Stale feature, which lets a resolver keep answering from stale cached data, helped limit the damage.
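The serve-stale idea (described for DNS in RFC 8767) can be sketched as a small cache that hands out an expired answer when a refresh fails. This is an illustration of the technique under simplified assumptions, not Cloudflare's implementation. The `StaleCache` class, its parameters, and the `refresh` callback are hypothetical.

```python
import time


class StaleCache:
    """Tiny cache that can return expired entries when a refresh fails."""

    def __init__(self, max_stale_seconds: float):
        self.max_stale_seconds = max_stale_seconds
        self._entries = {}  # name -> (answer, expires_at)

    def put(self, name, answer, ttl, now=None):
        now = time.time() if now is None else now
        self._entries[name] = (answer, now + ttl)

    def lookup(self, name, refresh, now=None):
        now = time.time() if now is None else now
        entry = self._entries.get(name)
        if entry and now < entry[1]:
            return entry[0]  # fresh hit
        try:
            answer, ttl = refresh(name)  # e.g. query upstream servers
        except Exception:
            if entry and now < entry[1] + self.max_stale_seconds:
                return entry[0]  # upstream failed: serve the stale answer
            raise
        self.put(name, answer, ttl, now)
        return answer
```

The `max_stale_seconds` limit matters: without it, a cache could keep serving outdated answers indefinitely, which is the same kind of unbounded staleness that caused trouble with the root zone copy.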

To prevent this from happening again, Cloudflare will take the following measures.

◆Visibility
Add alerts that fire when a resolver is serving from an old root zone.

◆Resiliency
Re-evaluate how the root zone is retrieved and distributed so that new record types can be processed without interruption.

◆Test
ZONEMD itself was being tested, but what happens when root zone parsing fails was not adequately tested, so test coverage and the related processes will be improved.

◆Design
Manage the expiration of the cached root zone more strictly, since a copy of the root zone should not be used once it is older than a certain age.
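The visibility and design measures both come down to tracking the age of the cached root zone. A minimal sketch of such a check might look like the following. The two thresholds are hypothetical, since the article does not state the values Cloudflare uses, and `root_zone_status` is a made-up name.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; the article does not give Cloudflare's actual values.
ALERT_AFTER = timedelta(days=2)   # zone normally refreshes twice a day
REJECT_AFTER = timedelta(days=7)  # stop trusting the copy well before signatures expire


def root_zone_status(loaded_at: datetime, now: datetime) -> str:
    """Classify the cached root zone by age: 'ok', 'alert', or 'reject'."""
    age = now - loaded_at
    if age >= REJECT_AFTER:
        return "reject"  # fall back to querying the root servers instead
    if age >= ALERT_AFTER:
        return "alert"   # still usable, but someone should be notified
    return "ok"


loaded = datetime(2023, 9, 21, tzinfo=timezone.utc)
print(root_zone_status(loaded, datetime(2023, 10, 4, 7, tzinfo=timezone.utc)))
```

Under these assumed thresholds, a copy loaded on September 21 would have raised an alert within days and been rejected long before October 4.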

in Web Service, Posted by log1d_ts