Breaking the Internet: Swapping Backhoes for BGP

September 5, 2017 David Belson

The term “break[ing] the Internet” has taken hold over the last few years – it sounds significant, and given the role that the Internet has come to play in our daily lives, even a little scary. A Google search for “break the Internet” returns 14.6 million results, while “broke the Internet” returns just under a half million results.

Interestingly, Google Trends shows a spike in searches for the term in November 2014 (arguably representing its entry into mainstream usage), coincident with Kim Kardashian’s appearance in Paper Magazine, and on the magazine’s Web site. (Warning: NSFW) Time Magazine explained the usage this way: “But in the context of viral media content, ‘breaking the Internet’ means engineering one story to dominate Facebook and Twitter at the expense of more newsworthy things.” Presumably in celebration of those efforts, there is now even a “Break the Internet” Webby Award.

“Breaking the Internet” in this context represents, at best, the failure of a Web site to do sufficient capacity planning, such as using a content delivery network (CDN) to improve the site’s scalability and performance when the viral spread of a story sends a flash crowd its way. (Full disclosure: I spent 18 years at Akamai, a leading CDN service provider, before joining Oracle in July 2017.)

However, the definition and award are arguably based on a common misconception – that the Internet and the World Wide Web are the same thing. To be clear, they aren’t – the Web, as we know it, is based on a set of application-layer protocols (including HTTP and DNS) that make use of core Internet protocols (such as TCP, IP, and UDP) and infrastructure (like terrestrial and submarine cables and the routing hardware that connects networks).  In short, the Web rides on top of the Internet.  (In addition, the Internet can trace its beginnings back to the ARPANET in 1969, while Tim Berners-Lee’s Web efforts at CERN followed 20 years later.)

These points begin to get at the true cause of a broken Internet – the underlying infrastructure, and perhaps more importantly, the configuration of that infrastructure.

Sometimes, problems result from issues with infrastructure that is used by a large number of popular Web sites and applications.  For example, in late February 2017, Amazon’s Simple Storage Service (S3) experienced a service disruption in the Northern Virginia (US-EAST-1) Region as the result of an incorrect input to a command intended to remove a small number of servers for one of the subsystems that is used by its billing process.  According to published reports, popular Web sites and tools including Slack, Venmo, Trello, GitHub, Quora, ChartBeat, and Imgur, among others (including other AWS services), experienced availability issues for several hours.

In many cases, though, it is the plumbing of the Internet that breaks – that is, problems arise with the underlying networks and the connections between them. Interestingly, these problems happen more often than you might realize, but their impact usually isn’t very widespread. However, extensive monitoring done by Oracle Dyn enables us to see these problems as they occur, and to measure their impact via hundreds of millions of traceroutes each day from hundreds of global vantage points, in addition to collecting routing information in over 700 locations around the world. This gives us an incredibly detailed, near-real-time view of how the networks that comprise the Internet are interconnected, what changes occur, and the geographic scope and duration of those changes.

Based on our data analysis and associated observations, the Internet breaks multiple times a day, every day. Most of these problems tend to be short-lived and rather localized. Some, however, are significant enough, with sufficiently broad impact, to warrant more detailed examination. Examples include:

  • Routing leaks: These occur when a provider announces routes for blocks of IP addresses that don’t originate from its own (or a downstream) network – these routes are often learned from peered networks. Such announcements are usually inadvertent, and are often the result of filter misconfigurations. A leak can cause some Internet traffic to take a multi-continent detour, as happened recently with Google in Japan, or with several different providers nearly two years ago.

  • Route hijacks: These occur when a provider claims to have a more specific route to a set of IP addresses normally controlled by another provider. Hijacks may be malicious in nature (to effect a man-in-the-middle attack or send spam, for instance), may be politically motivated, or may be due to mistakes (such as transposed digits). Earlier this year, Iran’s state telecom (TIC, AS12880) was observed to be hijacking address space belonging to Web sites containing adult content, as well as Apple’s iTunes services, ostensibly with the intent of censoring content found on those sites/services. Nearly a decade ago, Pakistan Telecom advertised a small part of YouTube’s assigned address space, causing some Internet users to attempt to reach YouTube via Pakistan Telecom’s network, effectively creating a black hole for that traffic.

  • Lack of connection diversity: Most countries have gradually moved towards increased diversity in their Internet infrastructure over the last decade, especially as it concerns international connectivity to the global Internet. However, some countries remain at severe risk of Internet disconnection, with only one or two providers at their “international frontier”. This minimal diversity is often maintained for political purposes, making it easier to disable international Internet connectivity if deemed necessary, as we have seen in a number of countries in the Middle East. Blog posts published in 2012 and 2014 explored the risk of Internet disconnection for countries around the world, and we plan to publish an updated overview later this year, exploring if and how things have changed in the five years since the original post.
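The “more specific route” mechanism mentioned under route hijacks can be illustrated with a small sketch. Routers generally forward traffic along the longest (most specific) prefix that matches the destination address, so a narrower announcement wins over the legitimate, broader one. The Python below is only a toy model of that selection step, not how any real router or Oracle Dyn’s tooling works: the function name best_route is made up for illustration, the prefixes come from the range reserved for documentation, and the AS numbers are likewise reserved documentation values rather than real networks.

```python
import ipaddress


def best_route(table, destination):
    """Return the (prefix, origin) pair with the longest matching prefix.

    `table` maps ipaddress networks to an origin AS label. This mimics only
    the longest-prefix-match step; real BGP route selection involves many
    more attributes and policies.
    """
    addr = ipaddress.ip_address(destination)
    matches = [(net, origin) for net, origin in table.items() if addr in net]
    if not matches:
        return None
    # The most specific route is the one with the largest prefix length.
    return max(matches, key=lambda entry: entry[0].prefixlen)


# Hypothetical legitimate announcement (documentation prefix and ASN).
legitimate = {ipaddress.ip_network("192.0.2.0/24"): "AS64500"}

# Same table after a hypothetical hijacker announces a more specific /25.
hijacked = dict(legitimate)
hijacked[ipaddress.ip_network("192.0.2.128/25")] = "AS64511"

for label, table in (("before", legitimate), ("after", hijacked)):
    prefix, origin = best_route(table, "192.0.2.200")
    print(f"{label} hijack: 192.0.2.200 -> {prefix} via {origin}")
```

In this toy table, traffic to 192.0.2.200 follows the /24 before the second announcement and the /25 afterwards, while addresses outside the /25 (such as 192.0.2.10) are unaffected. That is why advertising only a small part of someone else’s address space is enough to pull in the traffic destined for it.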

To summarize, there are multiple ways to “break the Internet”. However, they are more likely to be related to routing problems (intentional and accidental), government-directed service disruptions, and physical damage to fiber-optic cables – more mundane than the latest celebrity scandal going viral on social media, for sure, but also arguably more important. And of course, when something as important as the Internet breaks, it needs to be fixed. (To paraphrase a colleague, there is a lot of work that goes into putting Humpty Dumpty back together again on a daily basis.)

Oracle Dyn’s monitoring enables us to quickly identify and actively observe problems related to Internet connectivity, and we either notify affected parties that something needs to be fixed, or help fix it directly, collaborating with the larger Internet infrastructure community. We are proud to be an active participant in that community, working together (often behind the scenes) to address these issues, making the Internet a better, more performant, and safer place to work and play for customers and Internet users around the world.
