A recent article by Cloudflare has shed light on a fascinating issue that led to the outage of their popular 1.1.1.1 DNS service. The core of the problem? An unclear specification in the RFC (Request for Comments) standards, which caused a chain reaction of events. But here's where it gets controversial... some argue that it's not the RFC's fault, but rather a misunderstanding by developers at Cloudflare. Let's dive in and explore this intriguing debate.
The RFC Mystery
On January 8, a routine update to Cloudflare's DNS service changed the order of CNAME records in responses. This seemingly minor change had a significant impact: some DNS clients failed to resolve names. The issue? These clients expected alias (CNAME) records to come first in the answer, and the new ordering sometimes placed them last.
The Root Cause
Cloudflare's team identified the problem as an ambiguity in older DNS standards regarding record order. They proposed a clarified specification to prevent such issues in the future. But here's the twist: while most modern software treats record order as irrelevant, some implementations, like the getaddrinfo function in glibc, expect CNAME records to appear before other record types. This expectation led to the outage.
The Outage Explained
When a DNS resolver looks up a name with a CNAME record, it follows a chain of alias records linking the original name to a final address. Each step is cached with its own expiry time. If part of this chain expires in the cache, the resolver only re-fetches the expired portion and combines it with the still-valid parts to form the complete response. It is the order of records in that combined response that Cloudflare's change altered, and clients that depended on the old order started failing to resolve names, which ultimately caused the outage.
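To make the partial-refresh behavior concrete, here is a minimal Python sketch of a resolver cache that follows an alias chain and re-fetches only the links that have expired. It is an illustration under assumptions, not Cloudflare's implementation: the Record type, the resolve function, and the fetch callback are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    rtype: str         # "CNAME" or "A"
    value: str         # alias target for a CNAME, address for an A record
    expires_at: float  # absolute expiry time, in seconds


def resolve(name, cache, fetch, now):
    """Follow the alias chain for `name`, re-fetching only expired links.

    `cache` maps a name to its cached Record. `fetch(name)` is a
    hypothetical upstream lookup that returns a fresh Record.
    """
    answer = []
    seen = set()  # guards against alias loops
    while name not in seen:
        seen.add(name)
        record = cache.get(name)
        if record is None or record.expires_at <= now:
            record = fetch(name)  # only this link is refreshed
            cache[name] = record
        answer.append(record)
        if record.rtype != "CNAME":
            break  # reached the final address
        name = record.value
    return answer
```

In this sketch, if the CNAME for www.example.com is still valid but the A record for its target has expired, only the A record is fetched again, and the response is assembled from one cached record and one fresh record.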
The Change and Its Impact
Sebastiaan Neuteboom, a systems engineer at Cloudflare, explains the change: "While optimizing our cache implementation to reduce memory usage, we introduced a subtle alteration to CNAME record ordering." The change was introduced on December 2, 2025, and its global rollout began on January 7, 2026. Previously, the code created a new list, inserted the CNAME chain, and then appended the new records. To save memory, the code was changed to append the CNAMEs to the existing answer list instead. As a result, responses from 1.1.1.1 sometimes had CNAME records at the bottom, after the final resolved answer.
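The difference between the two approaches is easier to see side by side. The following Python sketch mirrors the description above; it is not Cloudflare's actual code, and the function names and the (name, type, value) tuple layout are assumptions made for illustration.

```python
def build_answer_old(cname_chain, new_records):
    """Previous behavior as described: fresh list, CNAME chain first."""
    answer = []
    answer.extend(cname_chain)
    answer.extend(new_records)
    return answer


def build_answer_new(cname_chain, new_records):
    """Changed behavior as described: reuse the existing answer list and
    append the CNAMEs to it, which avoids allocating a second list."""
    new_records.extend(cname_chain)
    return new_records


chain = [("www.example.com", "CNAME", "cdn.example.net")]

old = build_answer_old(chain, [("cdn.example.net", "A", "192.0.2.1")])
new = build_answer_new(chain, [("cdn.example.net", "A", "192.0.2.1")])

print([rtype for _, rtype, _ in old])  # ['CNAME', 'A']
print([rtype for _, rtype, _ in new])  # ['A', 'CNAME']
```

Both versions return the same set of records. Only the order differs, which is why the change looked harmless.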
The Impact on DNS Clients
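While many DNS client implementations are unaffected by record order, others, like the getaddrinfo function in glibc, handle the chain by keeping track of the name they expect next and iterating through the answer sequentially. They expect to find CNAME records before the records those aliases point to. When the CNAME appears after the address record instead, such a client has already passed over the address as belonging to a name it was not yet looking for, so the lookup comes back empty. That is how Cloudflare's change led to the outage.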
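Here is a simplified Python model of the two client styles. It is a sketch of the behavior described above, not glibc's code: the function names and the (name, type, value) tuples are hypothetical, and real resolvers handle many more record types and error cases.

```python
def resolve_sequential(query_name, answer):
    """Single forward pass that tracks the name it expects next."""
    expected = query_name
    addresses = []
    for name, rtype, value in answer:
        if name != expected:
            continue  # record for a name we are not waiting on
        if rtype == "CNAME":
            expected = value  # follow the alias
        elif rtype == "A":
            addresses.append(value)
    return addresses


def resolve_any_order(query_name, answer):
    """Order-independent variant: index the aliases, then collect addresses."""
    aliases = {name: value for name, rtype, value in answer if rtype == "CNAME"}
    target, seen = query_name, set()
    while target in aliases and target not in seen:
        seen.add(target)
        target = aliases[target]
    return [value for name, rtype, value in answer
            if rtype == "A" and name == target]


cname_first = [
    ("www.example.com", "CNAME", "cdn.example.net"),
    ("cdn.example.net", "A", "192.0.2.1"),
]
cname_last = list(reversed(cname_first))

print(resolve_sequential("www.example.com", cname_first))  # ['192.0.2.1']
print(resolve_sequential("www.example.com", cname_last))   # []
print(resolve_any_order("www.example.com", cname_last))    # ['192.0.2.1']
```

The sequential version does less work and needs no extra memory, which may be part of why the pattern exists, but it silently depends on the sender's ordering.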
The Discussion and Debate
On various online platforms, users are discussing the root cause of the issue. Some praise Cloudflare's engineering standards and post-mortem analysis, while others question its testing practices, given the global impact of the change. The debate also centers on whether the RFC is truly unclear or whether developers at Cloudflare misinterpreted it. Patrick May offers an insightful comment, referencing Hyrum's Law (with enough users, every observable behavior of a system will be depended on by somebody) and Postel's Law (be conservative in what you send and liberal in what you accept).
Cloudflare's Proposal
In an Internet-Draft to be discussed at the IETF, Cloudflare proposes a specification that explicitly defines how CNAME records should be handled in DNS responses. This proposal aims to prevent similar issues in the future.
Timeline and Resolution
Cloudflare began the global rollout on January 7, reaching 90% of servers by January 8 at 17:40 UTC. The company quickly declared the incident, began reverting the change at 18:27 UTC on January 8, and completed the rollback by 19:55 UTC. The swift response helped minimize the impact.
This incident highlights the delicate balance of technology and the importance of clear specifications. It also raises questions about testing practices and the global impact of such changes. What do you think? Is the RFC truly at fault, or was it a misunderstanding? Share your thoughts in the comments!