Routing Redundancy: How much is enough?

August 15, 2009 Earl Zmijewski

Internet connectivity is a good thing. Many of us depend on it for everything from our livelihoods to our entertainment. However, the Internet is fragile, and even The New York Times is worried about it. But the Times is primarily concerned with overloads that can occur when everyone on the planet does the same thing at roughly the same time, such as surfing for news about Michael Jackson. Unfortunately, we will never avoid all such scenarios. Physical systems are designed around average and typical peak loads, not around extremely high loads associated with very unlikely events. Who would pay for that?

And this applies to other complex systems besides the Internet. I was in India during 9/11 and, for two days, I could not make a traditional phone call to the US. Why? Everyone in India knows someone in NYC, and they all picked up the phone at the same time to check in on them. The circuits were so overloaded, I couldn’t even get the friendly “Your call cannot be completed as dialed” message.

No system is ever going to be engineered for insanely high loads. If everyone in your town decides to take a shortcut through your neighborhood to avoid an accident on the highway, you are going to have trouble getting out of your driveway. But rather than give up and wait it out, there is something you can do in advance and at reasonable cost: build a second driveway to a different street on the other side of your house, one that isn’t fed by the same access roads from the highway. This blog is about building such redundancy into your Internet connectivity, so you aren’t disconnected by a single failure. And while it’s good that the New York Times and various governments are watching the problem, if your business depends on the Internet, you’re largely on your own to audit and verify that you are buying a sufficient level of redundancy for your budget. A lot of fragility problems could be solved by more informed consumers performing the necessary due diligence.

To plan for inevitable disruptions in service, you must find and then eliminate single points of failure. A small business with a single provider, for example, has a single point of failure because if that provider goes offline, so does the business. Larger organizations often combat this problem by having multiple providers, but that still may not be enough, in particular if all of their providers ultimately rely on the same still-further-upstream provider. This kind of indirect single point of failure, illustrated in the first graphic below, is usually more difficult to identify. Other kinds of hidden single points of failure are also important, such as when an organization's two providers both run their cables through the same physical conduit. If one errant backhoe severs the conduit, this “redundancy” is rendered useless. For this discussion, though, we’ll focus on organization-level redundancy.



Single points of failure: direct (left) and indirect (center and right)

Planning for direct provider redundancy is generally straightforward. But avoiding indirect upstream single points of failure requires an awareness of the ever-evolving business relationships found on the Internet. To understand these relationships, we first have to introduce some terminology. As readers of this blog will know, an organization with multiple Internet providers tends to have two things under its control: one or more blocks of IP addresses (prefixes) and an identifying number, the Autonomous System Number (ASN). The network identified by an ASN is called an autonomous system (AS). Renesys currently observes over 110,000 relationships between pairs of ASes on the Internet that have decided to interconnect. By carefully studying routing data from numerous vantage points around the globe, we are able to classify these relationships and determine both the direct and indirect dependencies between providers. Once you map out these relationships, you can then look for the single points of failure.
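To make that last step concrete, here is a minimal Python sketch of one way to look for single points of failure once the relationships are mapped. It is an illustration under simplifying assumptions, not the analysis Renesys performs: the graph contains only customer-to-provider links, the AS numbers and the set of "core" (provider-free) ASes are made up, and an AS counts as a single point of failure if removing it leaves the origin with no path up to any core AS.

```python
from collections import deque


def reaches_core(providers, origin, core, removed=None):
    """Return True if `origin` can climb provider links to any core AS.

    `providers` maps an AS to the set of its providers (hypothetical data).
    `removed` is an AS treated as failed and therefore never traversed.
    """
    seen = {origin}
    queue = deque([origin])
    while queue:
        asn = queue.popleft()
        if asn in core:
            return True
        for upstream in providers.get(asn, ()):
            if upstream != removed and upstream not in seen:
                seen.add(upstream)
                queue.append(upstream)
    return False


def single_points_of_failure(providers, origin, core):
    """List every AS whose failure alone cuts `origin` off from the core."""
    candidates = set(providers)
    for upstreams in providers.values():
        candidates.update(upstreams)
    candidates.discard(origin)
    return sorted(
        asn for asn in candidates
        if not reaches_core(providers, origin, core, removed=asn)
    )


# Three made-up topologies for AS 1, mirroring the cases discussed here.
direct = {1: {10}, 10: {100, 200}}
indirect = {1: {10, 20}, 10: {30}, 20: {30}, 30: {100, 200}}
diverse = {1: {10, 20, 30}, 10: {100, 200}, 20: {100, 200}, 30: {200, 300}}

print(single_points_of_failure(direct, 1, {100, 200}))        # [10]
print(single_points_of_failure(indirect, 1, {100, 200}))      # [30]
print(single_points_of_failure(diverse, 1, {100, 200, 300}))  # []
```

Removing one AS at a time and re-running a breadth-first search is quadratic in the worst case, which is fine for a sketch. A real analysis would also have to account for peering links and for the fact that core providers can stop exchanging traffic with one another.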

True diversity — complete freedom from single points of failure — is illustrated in the following graph, where AS 1 has three providers, each of which is richly connected to the rest of the Internet. One or even two of AS 1’s providers can have technical problems and AS 1 will still be able to maintain its Internet presence. This organization has planned well for eventual failures.


Diversity: no single points of failure

So securing direct redundancy is fairly easy, but beyond that it gets complicated. Large ASes typically encompass many geographic locations and might even have offices all over the world. If AS 1 has three providers, does it have them at all of its worldwide locations? If not, perhaps it has an internal network that connects its offices, so that an office whose local Internet connection fails might still be able to reach the Internet via one of the other offices. In other words, you cannot view an AS as a single entity, as it may in fact consist of many independent locations with their own providers and private connectivity to other parts of the organization. As if sorting through all this weren’t enough, there is a final complication. Mergers, buyouts, spin-offs and other aspects of modern capitalism can result in a single organization consisting of many ASes. Finding them all is quite complex, as we discussed in a previous blog.

To truly measure diversity, we cannot merely look at ASes and their relationships. We have to look at organizations as a whole and how their associated prefixes are routed. Does this particular prefix show diversity in how it is routed or is it solely dependent on one provider? Nor is it sufficient to look at one instant in time. Rather we have to track the prefix over longer periods in order to uncover seldom-used backup routes. If the prefix really has multiple ways to reach it, you want to account for that and give credit where it is due.
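The idea of tracking a prefix over time can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the observation format (a prefix paired with an AS path whose last element is the origin) is hypothetical, and the AS numbers and prefixes come from the ranges reserved for documentation. The function accumulates, for each prefix, every immediate upstream ever seen next to the origin, so a rarely used backup route still gets credit.

```python
from collections import defaultdict


def upstreams_by_prefix(observations):
    """Collect the distinct immediate upstream ASes seen for each prefix.

    `observations` is an iterable of (prefix, as_path) pairs gathered over
    the whole measurement window. Each as_path is a list of AS numbers
    ending at the origin AS (an assumed input format).
    """
    seen = defaultdict(set)
    for prefix, as_path in observations:
        if not as_path:
            continue
        upstreams = seen[prefix]  # ensures the prefix is recorded
        origin = as_path[-1]
        # Walk back from the origin, skipping repeats caused by prepending.
        for asn in reversed(as_path):
            if asn != origin:
                upstreams.add(asn)
                break
    return dict(seen)


observations = [
    ("192.0.2.0/24", [64500, 64501, 64496]),
    ("192.0.2.0/24", [64502, 64501, 64496, 64496]),  # origin prepended
    ("192.0.2.0/24", [64500, 64503, 64496]),         # rarely seen backup
    ("198.51.100.0/24", [64500, 64501, 64496]),
]

for prefix, upstreams in upstreams_by_prefix(observations).items():
    print(prefix, sorted(upstreams))
```

Here the first prefix ends up with two upstreams (64501 and 64503) while the second depends on 64501 alone, which is exactly the kind of difference a single snapshot could miss.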

With these ideas in mind, we set about trying to measure Internet diversity for each organization. Every associated prefix is given a score based on the amount of diversity observed over a period of two months. Then the individual prefix scores are combined to give the organization an overall score in the range of 0 to 100. A score of zero represents no diversity at all (like the left side of the first graphic), and a score of 100 represents very rich diversity (as in the second graphic). We leave the details of the algorithm to a follow-up piece, which will also include data for the Internet as a whole. For now let’s look at a few examples of individual organizations.
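Since the actual algorithm is deferred to a follow-up piece, the Python sketch below is purely a toy stand-in to show the shape of the computation: score each prefix, then combine. Everything specific in it is an assumption made for illustration, including the rule that three or more distinct direct upstreams earn full credit, the constant `FULL_CREDIT_UPSTREAMS`, and the use of a plain unweighted mean. It also counts only direct upstreams and so ignores the indirect dependencies discussed earlier.

```python
from statistics import mean

# Hypothetical threshold: this many distinct upstreams earns a full 100.
FULL_CREDIT_UPSTREAMS = 3


def prefix_score(upstreams):
    """Toy per-prefix score: 0 for one upstream, rising linearly to 100."""
    extra = min(len(upstreams), FULL_CREDIT_UPSTREAMS) - 1
    return 100.0 * max(extra, 0) / (FULL_CREDIT_UPSTREAMS - 1)


def organization_score(upstreams_by_prefix):
    """Combine prefix scores with an unweighted mean (an assumption)."""
    if not upstreams_by_prefix:
        raise ValueError("organization has no prefixes to score")
    return mean(prefix_score(u) for u in upstreams_by_prefix.values())


# Made-up organization: two well-connected prefixes, one single-homed.
example = {
    "192.0.2.0/24": {64501, 64502, 64503},
    "198.51.100.0/24": {64501, 64502, 64503, 64504},
    "203.0.113.0/24": {64501},
}
print(organization_score(example))
```

Even this crude rule reproduces the qualitative behavior described above: a single poorly connected prefix pulls an otherwise well-provisioned organization below a perfect score, and an organization seen only through one provider scores zero.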

Amazon is a well-known and hugely successful commercial retailer with a worldwide presence. They have around a half dozen ASes and almost three dozen prefixes. At an AS level, the Renesys Market Intelligence tool shows Amazon with a rich set of top-tier providers, such as Level 3, NTT, Tinet (formerly Tiscali), Qwest and others. But are all Amazon prefixes seen via such a diverse set of providers, at least once in a while? In fairness, Amazon might only care about diversity for some of its prefixes, but since we can’t know which ones matter, we treat all prefixes the same and score them all based on their observed diversity. Then we combine the results to get an overall Amazon score.

And the answer is … (drum roll please) … 99.16 for Amazon’s total score. They miss out on a perfect score of 100 only because of a few less diverse prefixes, one of which geo-locates to Hong Kong and is only seen via NTT. As you might expect, Amazon has very rich global Internet diversity and generally has no single points of failure with respect to its logical connectivity.

In total, out of over 75,000 organizations that we have identified, not all of which have their own ASNs, over 9,500 score a perfect 100. But most of these are rather small or geographically constrained, so they perhaps have an easier time of implementing diversity once they realize the importance. An example of a larger organization achieving perfection is Vietnam Posts and Telecommunications (AS 7643). Their hundreds of prefixes are all seen via numerous large providers ranging from AT&T to China Telecom.

At the other end of the spectrum, we have the United States Social Security Administration. Their 18 prefixes show no diversity at all, as they are all seen only via Verizon (AS 701). No matter how many physical connections Social Security has to Verizon and no matter how reliable Verizon has been for them in the past, no service provider can guarantee their customers uninterrupted service to the entire Internet. Through no fault of Social Security or even necessarily Verizon, if another top-tier Internet provider were to de-peer Verizon, then Social Security would lose access to parts of the Internet. Since we saw two such top-tier de-peerings in 2008 alone, between Cogent and TeliaSonera and between Cogent and Sprint, this is not a far-fetched concern. To ensure their connectivity and improve their score, Social Security would need to pick up additional providers, ones not solely dependent on Verizon. But they are not alone: over 26,000 other organizations also have zero diversity.

Now let’s return to the New York Times, whose article motivated us to write up this work in the first place. Should the Times worry about its own Internet diversity? Looking at its approximately half-dozen prefixes, the company receives a respectable score of 85.37. But if we disregard two of its smaller prefixes, keeping just the larger ones and the one in which www.nytimes.com lives, its score increases to 93.5. So while the Times has some room for improvement, someone in the organization is clearly paying attention to the issue of redundancy when provisioning for its most important networks.

For our more technically inclined readers, a subsequent blog will provide more of the details around our diversity algorithm and give results for the Internet as a whole. Not surprisingly, as our examples suggest, the results vary widely.

The post Routing Redundancy: How much is enough? appeared first on Dyn Research.

About the Author

Earl leads a peerless team of data scientists who are committed to analyzing Dyn’s vast Internet Performance data resources and applying their expertise to continually improve upon Dyn’s products and services.
