Ever since the Internet was properly commercialised in the 1990s, businesses have taken a one-sided view of how to handle their connections. If their website or web-based services are knocked offline, they will in most cases blame their ISP, or whoever directly delivers their Internet connection, rather than looking further into where the outage actually happened.
They’ll more than likely be looking at the situation from their business out, constantly thinking about how they can protect themselves and use their firewall to deflect attacks. They’ll be on top of their security and performance, ensuring nothing can attack from the outside in.
Businesses are also pretty good at protecting their traffic from the inside out, at least the first step of the chain. However, the Internet then becomes a complicated mesh of different network companies and layers of communication that work together to transport requests across the Internet to their end destinations.
But what happens if something goes wrong further down the line, where a business has no direct visibility of the network, near the end destination or even halfway across the Atlantic?
If the DNS lookup (which translates the name of an online destination into the IP address a request should be sent to) is disrupted along the way, the result will appear to the user as an outage.
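To make that concrete, a minimal DNS health probe can be sketched as a lookup wrapped in a pass/fail check; a resolution failure here is exactly what the end user experiences as "the service is down". This is an illustrative sketch using the system resolver, not any particular monitoring product:

```python
import socket

def dns_probe(hostname):
    """Attempt to resolve a hostname via the system resolver.

    Returns (resolved, addresses); a failure here is what a user
    would experience as the whole service being unavailable.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
        # Collect the unique IP addresses the name resolved to
        return True, sorted({info[4][0] for info in infos})
    except socket.gaierror:
        # Resolution failed: to the user this looks like a full outage
        return False, []
```

Resolving `localhost` exercises the probe without leaving the machine; pointing it at a real service name, and running it from several vantage points, is what turns this into monitoring.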
A major streaming media company runs its service on AWS and streams its data from its home-grown VPN to users around the world.
Last year, the service went down in major markets, in Europe, Boston, Dallas, Seattle and LA, at different times of the year, for an hour at a time. In every case, Amazon said its dashboards showed everything up and running the entire time, and the ISPs reported that they weren't down either.
Because users couldn't get to their administrator dashboard, they couldn't start playing a movie, view or change their billing, or search for content; they couldn't do most of the things people do with a service like this, apart from continue watching content that was already streaming.
It transpired over time that one of the network companies further along the route was suffering an outage at the exact moments customers were reporting the service as unavailable.
When the network company went down, so did the streaming service. Because the streaming media company wasn’t monitoring the entire journey away from its infrastructure, it just had to wait until those network companies came back up before service was restored.
If the company had been monitoring its DNS, it could have shifted the user to another administrator area running from a different AWS location, for instance.
An alternative would have been to shift to another set of IPs, update the DNS records accordingly, and take a different path to the original site.
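The failover described above boils down to a simple selection loop: probe each candidate endpoint and serve users the first healthy one. A sketch of that logic, with hypothetical endpoint names and a simulated health check standing in for live probes (real failover would live in the DNS provider or a traffic manager, not in application code):

```python
from typing import Callable, List, Optional

def pick_endpoint(candidates: List[str],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate endpoint that passes its health check.

    In a real deployment the chosen endpoint's IPs would be pushed into
    the DNS records (with a short TTL) so users shift over automatically.
    """
    for endpoint in candidates:
        if is_healthy(endpoint):
            return endpoint
    return None  # nothing healthy: a genuine total outage

# Hypothetical admin endpoints in two different AWS regions
candidates = ["admin-us-east.example.com", "admin-eu-west.example.com"]

# Simulated probe results: the primary region is down
status = {"admin-us-east.example.com": False,
          "admin-eu-west.example.com": True}

chosen = pick_endpoint(candidates, lambda host: status[host])
```

With the primary marked down, users would be steered to the EU endpoint instead of seeing an outage.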
On a much larger scale, a major hosting provider suffered a major DNS outage at the beginning of the year, meaning many of its European customers were unable to access their websites and email. Like the problems at the streaming media company, all the company’s status indicators were saying the service was up and running as expected, but all was not so rosy.
Sites hosted around Europe were dropping out, and although some in the US were still up and running, it was becoming apparent that the problem was serious, costing businesses thousands because they couldn't serve their customers.
Only after the service had been down for six hours did the company's chief information security officer reveal that it had been subject to a DDoS attack.
“Earlier this morning, we experienced a DDoS attack on some of our DNS servers that impacted a small number of customers’ domains for approximately six hours,” he said at the time. “We apologize for the inconvenience. The issue has been resolved and service has been restored.”
Although a DDoS attack wasn’t the fault of the hosting company directly, the way the company dealt with the issue could have been better. Many of the company’s customers were offline for hours as the company tried to find the root of the problem. Some of its customers were reporting outages for hours after the problem was supposedly fixed too.
Had the company been monitoring its DNS servers properly, it could have re-routed those affected to alternative locations across the world, instead of losing them thousands of dollars.
Following the incident, users reported they had made the decision to add a backup DNS with a failover solution included (in many of these cases, Dyn) to their hosting package just in case it happened again.
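A backup DNS arrangement is the same failover pattern one layer down: query the primary provider first and fall back to the secondary when it doesn't answer. A sketch with resolver callables standing in for queries to two hypothetical providers:

```python
from typing import Callable, List, Optional

def resolve_with_fallback(hostname: str,
                          resolvers: List[Callable[[str], Optional[str]]]) -> Optional[str]:
    """Try each resolver in order; return the first answer, or None."""
    for resolver in resolvers:
        try:
            answer = resolver(hostname)
        except OSError:
            continue  # provider unreachable (e.g. under DDoS): try the next
        if answer:
            return answer
    return None

# Stand-ins for a primary provider that is down and a healthy backup
def primary(name):
    raise OSError("primary DNS provider not responding")

def backup(name):
    # 192.0.2.0/24 is a documentation range, used here as a placeholder
    return {"www.example.com": "192.0.2.10"}.get(name)

answer = resolve_with_fallback("www.example.com", [primary, backup])
```

Even with the primary completely unresponsive, the name still resolves, which is the whole point of paying for a second provider.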
Major search engine
In August, some of the biggest ISPs in Japan went down because of a major mishap by a major search engine. The company mistakenly announced a huge volume of IP prefixes to the rest of the Internet, drawing other networks' traffic through its own, an incident known as a BGP route leak.
What it didn't realise was that many of these prefixes belonged to Japanese ISPs, including KDDI and NTT, which provide Internet connections to millions of people around the country.
However, by announcing those routes to the rest of the world, the company diverted traffic meant for the ISPs' customers onto paths that couldn't keep up, and the connections unsurprisingly crumbled under the weight. Worse, the company isn't an ISP and doesn't have the facilities to carry transit traffic, so many requests simply ended up being directed into a void and never resolved.
"We set wrong information for the network and, as a result, problems occurred," the company said at the time. "We modified the information to the correct one within eight minutes. We apologize for causing inconvenience and anxieties (among Internet users)."
An American telecommunications company was also involved in the mix-up, because it didn't detect any issues despite the search engine substantially increasing the number of prefixes it usually sends the company's way, from around 50 on a regular day to a massive 160,000. A jump like that should have been detected on the telecommunications company's end and triggered an automated response, but it wasn't, allowing the leaked routes to pass through the company's equipment in Chicago and overload the Internet in Japan.
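The automated response alluded to here is commonly implemented as a maximum-prefix limit on a BGP session: if a peer suddenly announces vastly more prefixes than normal, the router tears the session down rather than propagating the leak. A simplified sketch of that guard; the threshold multiplier and the numbers are illustrative, though the 50 vs 160,000 counts come from the incident described above:

```python
def check_prefix_limit(announced: int, expected: int, slack: float = 10.0) -> str:
    """Decide what to do with a BGP session based on announced prefix count.

    `expected` is the peer's normal daily announcement size; any count
    more than `slack` times that is treated as a likely route leak.
    A real router would tear down the session and raise an alarm.
    """
    if announced > expected * slack:
        return "shutdown"
    return "accept"

# A normal day: ~50 prefixes from this peer
normal_day = check_prefix_limit(announced=50, expected=50)

# The leak: 160,000 prefixes arrive on a session that usually carries 50
leak_day = check_prefix_limit(announced=160_000, expected=50)
```

Had a limit like this been enforced, the session would have dropped within seconds of the leak starting, containing the damage instead of exporting it to Japan.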
What's important to note from these examples is that no one really owns the path or the overall experience across the Internet; it's typically owned in pieces by whoever is transporting the data at that specific moment.
Although it's all too easy to blame the ISP for not having enough bandwidth, or the destination website for not having enough throughput to handle the load, in reality it's the network companies in between, the ones linking users to enterprises, website servers and applications, that are more often than not suffering the outages.
Without looking at the connections that bring the experience together, a business will have no idea how to resolve the problem. And that is the challenge at the heart of many of the Internet instability and inconsistency problems that websites suffer all over the world, no matter what platform they're on.
The solution to all this is to restructure a business's thinking to include the entire Internet journey, quickly identifying problems further away from the enterprise. The fault could be in the infrastructure on your end, in a network you use without realising how dependent you are on it, or at the end destination itself.
So, when a problem occurs, if you are only monitoring what you own, how do you know whether the fault lies outside your network? If you knew about the issue, you could re-route your connection over another path that you know can handle your load, resolving the problem very quickly. This way, you could keep running as an organization with minimal disruption and, in fact, lower operating costs by running across the Internet.
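Monitoring beyond what you own reduces to probing each segment of the path and locating the first one that fails, which tells you whether to fix your own infrastructure or route around someone else's. A toy sketch with simulated probe results; the segment names are hypothetical, and in practice each check would be an active measurement from your monitoring points:

```python
from typing import Dict, List, Optional

def first_failing_segment(path: List[str],
                          probe: Dict[str, bool]) -> Optional[str]:
    """Walk the path hop by hop and return the first segment that is down.

    `probe` maps each segment to the result of a reachability check;
    an unknown segment is treated as down rather than silently healthy.
    """
    for segment in path:
        if not probe.get(segment, False):
            return segment
    return None  # every segment answered: the path is healthy

# Hypothetical path from the enterprise out to the destination
path = ["own-edge", "isp", "transit-provider", "destination-cdn"]

# Simulated results: everything you own is fine, a transit network is not
probes = {"own-edge": True, "isp": True,
          "transit-provider": False, "destination-cdn": True}

culprit = first_failing_segment(path, probes)
```

Here the monitor would point at the transit provider, exactly the kind of third-party failure the streaming company above couldn't see, and the signal to re-route rather than wait.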
It's not the responsibility of a cloud company, not the responsibility of the ISP, and, frankly, it's not even the responsibility of the individual transit network companies across the Internet to make sure the service is running smoothly. It's really up to the enterprise running the app or service to make sure everything is running as well as possible.