Every now and then, even the Internet breaks…
We are nearing the end of the year, and the usual compilations of the best and worst things that happened are starting to appear.
This one, from Royal Pingdom, lists the most significant Internet malfunctions of 2010.
Moral of the story: in a globalized world that relies more and more on applications and websites for all sorts of everyday tasks, incidents like the one caused by China Telecom (see below), or even moderately prolonged service outages like Google’s (or Skype’s last night), can be a real pain in the backside for a lot of people.
The list is in the spoiler:
Wikipedia’s failover fail
Wikipedia has become so ubiquitous that it can’t go down for a minute without people noticing. According to Google Trends for Websites, the site has roughly 50 million visitors per day.
In March, when servers in Wikimedia’s European data center overheated and shut down, the service was supposed to fail over to a US data center. Unfortunately, the failover mechanism didn’t work properly and broke the DNS lookups for all of Wikipedia. This effectively rendered the site unreachable worldwide. It took several hours before everyone could access the site again.
WordPress.com’s big-blog crash
WordPress.com got a pretty bad start this year when a network issue caused the biggest outage the service had seen in four years. The outage became extra noticeable not just because of the sheer number of blogs it hosts (at the time 10 million, now many more), but also because so many high-profile blogs use it. The WordPress.com outage took down blogs such as TechCrunch, GigaOM and the Wired blogs for almost two hours in February.
Gmail’s multiple outages
Gmail is one of the world’s most popular email services, and is an integral part of Google Apps. Unfortunately, it’s had several notable outages this year. These issues haven’t always affected Gmail’s entire user base, but enough of it to make headlines in the news.
In February, routine maintenance caused a disruption that cascaded from data center to data center, knocking out Gmail worldwide for about 2.5 hours. In March, Gmail had an issue that lasted as long as 36 hours for some users. Another incident happened early in September, when overloaded routers made the service completely unavailable for almost two hours.
China reroutes the Internet
In April, China Telecom spread incorrect traffic routes to the rest of the Internet. In this specific case it meant that for 18 minutes, potentially as much as 15% of the traffic on the Internet was sent via China because routers believed it was the most effective route to take.
Similar incidents have happened before, for example when YouTube was hijacked globally by a small Pakistani ISP two years ago. Normally this results in a crash since the ISP can’t handle the traffic. However, China Telecom was able to handle the traffic, so most people never noticed this. At most they noticed increased latency as traffic to the affected networks took a very long and awkward route across the Internet (via China).
Even though no serious outage happened as a result of this, we think it’s such a fascinating disruption of the traffic flow that we felt it was worth including here. This is an inherent weakness of today’s Internet infrastructure, which largely relies on the honor system. Renesys has a more in-depth explanation of this incident and how it could happen. We should state that it wasn’t necessarily an intentional hijacking.
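To make the "routers believed it was the most effective route" point concrete, here is a minimal sketch of one well-known way a wrong announcement can win: longest-prefix matching, where a router prefers the most specific prefix that covers a destination. This is an illustration only. The source does not say which mechanism was at work in the China Telecom incident, and the prefixes (reserved documentation ranges), the next-hop labels, and the `best_route` helper are all hypothetical.

```python
import ipaddress


def best_route(routes, destination):
    """Pick the route whose prefix most specifically covers the destination.

    `routes` maps a prefix string to a next-hop label (hypothetical names).
    Returns a (network, next_hop) tuple, or None if no prefix matches.
    """
    addr = ipaddress.ip_address(destination)
    matching = []
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matching.append((network, next_hop))
    if not matching:
        return None
    # Longest-prefix match: the larger the prefix length, the more specific.
    return max(matching, key=lambda item: item[0].prefixlen)


# Routing table before the bad announcement (documentation prefix 203.0.113.0/24).
before = {"203.0.113.0/24": "legitimate-origin"}

# Same table after some other network announces a more specific /25.
after = dict(before)
after["203.0.113.0/25"] = "unexpected-detour"

print(best_route(before, "203.0.113.10"))
print(best_route(after, "203.0.113.10"))
```

In this toy table nothing checks whether the announcer of the /25 is entitled to it, which is the "honor system" weakness described above: the more specific route simply wins, and traffic for that range follows the detour.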
Twitter’s World Cup woes
Twitter seemed like the ideal companion to the World Cup (soccer to you Americans, football to the rest of the world, John Cleese explains it best). Tweeting about the World Cup proved so popular that it slowed down or broke Twitter several times during the weeks of the event. The upside is that this effectively load tested Twitter’s infrastructure, revealing potential weaknesses. As a result, Twitter’s service today is most likely more stable than it might otherwise have been.
Facebook’s feedback loop
Facebook has become a true juggernaut with more than 500 million users. That hasn’t changed its development philosophy, “don’t be afraid to break things.” This aggressive approach to speedy development has been key to Facebook’s success, but, well, sometimes it will break things.
Facebook’s worst outage in four years came in September when a seemingly innocent update to Facebook’s backend code caused a feedback loop that completely overloaded its databases. The only way for Facebook to recover was to take down the entire site and remove the bad code before taking the site back online. Facebook was offline for approximately 2.5 hours.
Foursquare’s double whammy
Foursquare’s location-based social network has been a resounding success and has in little time gathered a following of millions, so when the service went down for roughly 11 hours early in October, people of course noticed. The culprit was an overloaded database. And as if to add insult to injury, almost exactly the same thing happened the day after, taking the site down for an additional six hours.
Paypal’s payment problems
When Paypal stumbles, so do the many thousands of merchants that rely on Paypal to handle payments, not to mention the millions of regular consumers who use Paypal for their online payments. You can imagine the effect, and sales lost, if Paypal stops working for hours on end. Which was exactly what happened in October when a problem with Paypal’s network equipment crippled the service for as much as 4.5 hours. At its peak the issue affected all of Paypal’s members worldwide for 1.5 hours.
Tumblr was (and still is) one of the great social media successes of 2010, but with rapid growth comes scalability challenges. This has become increasingly noticeable, and culminated with a 24-hour outage early in December when all of Tumblr’s 11 million blogs were offline due to a broken database cluster.
The Wikileaks drama
If you’ve missed this you must have been hiding under a rock, which in turn was buried below a mountain of rocks. The site issues that Wikileaks experienced during the so-called Cablegate were significant. First the site was the victim of a large-scale distributed denial-of-service attack which forced Wikileaks to switch to a different web host. After Wikileaks moved to Amazon EC2 to better handle the increased traffic, Amazon soon shut them down. In addition to this, several countries blocked access to the Wikileaks site. And then possibly the largest blow came when the DNS provider for the official Wikileaks.org domain, EveryDNS, shut down the domain itself.
Without a working domain name in place, Wikileaks could for a time only be reached by its IP address. Since then, Wikileaks has spread itself out, mirroring the content over hundreds of sites and different domain names, including a new main site at Wikileaks.ch.
As if this wasn’t enough drama, you have to add the reactions from some of Wikileaks’ supporters (not from Wikileaks itself). The services that cut off Wikileaks in various ways (Paypal, VISA, Mastercard, Amazon, EveryDNS, etc.) were subjected to distributed denial-of-service attacks from upset supporters across the world, which resulted in even more downtime. There was also collateral damage, when some attackers mistook the DNS provider EasyDNS for EveryDNS, aiming their attacks at the wrong target.
The Wikileaks drama is without a doubt the Internet incident of the year.
The events listed above really are just a small sample of everything that has happened in 2010. Even without Wikileaks, it’s been a very eventful year on the Internet. That said, this is something we find ourselves saying every year. The truth is that the Internet is not quite as stable and solid as most of us would like to believe. It’s a complex system, like a living organism, and things do break from time to time. Sometimes it’s small-scale enough that nobody notices, and sometimes hundreds of millions of people are affected.