
Thoughts on the Cloud Outage at Amazon – Part I

Hello everyone,

What can we learn from the Amazon cloud outage over Easter?

Over the Easter weekend of 2011, there was a major outage in one of Amazon’s data centers, which provides virtual machines as part of the Elastic Compute Cloud (EC2) (http://aws.amazon.com/message/65648/). Many servers were unavailable for more than 24 hours, and for a small fraction of them data was also irretrievably lost (http://www.heise.de/newsticker/meldung/Wolkenbruch-bei-Amazon-Datenverlust-in-der-Cloud-1234444.html).

The technical causes of the outage have since been analyzed, and Amazon will certainly take action; enough has already been written about that. In short, a failure in the network infrastructure disrupted the connection between the storage nodes, which then began copying huge amounts of data to failover systems, crippling the network even further. In addition, the volume of data was probably too large for the available reserve systems.

What I find missing in the discussion, however, is a broader look at the consequences for the IT landscape and at what individual businesses can do to protect themselves.

Cloud data centers are highly complex and not infallible

The published post-mortem of the outage paints a picture of the highly complex redundancy and automation mechanisms with which Amazon has equipped its cloud services. An error triggered automatic corrective actions that were designed for the failure of individual systems, but produced a senseless flood of data transfers once larger structures failed.

A human operator would never have started such a mass of replication jobs at the same time!
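
To make this concrete: here is a minimal sketch (purely illustrative, not Amazon’s actual mechanism; all names are hypothetical) of how an automated repair process can be forced to respect a hard cap on concurrent replication jobs, so that a large-scale failure cannot escalate into a re-mirroring storm.

from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical sketch: a hard cap on parallel replication jobs, so that a mass
# failure cannot turn into a network-crippling re-mirroring storm.
MAX_PARALLEL_REPLICATIONS = 4

def replicate_volume(volume_id: str) -> None:
    """Placeholder for copying one volume to a spare storage node."""
    print(f"re-mirroring {volume_id} ...")
    time.sleep(1)  # stands in for the actual data transfer

def handle_failures(volume_ids: list[str]) -> None:
    # Even if thousands of volumes lose their mirror at the same time,
    # at most MAX_PARALLEL_REPLICATIONS copies run concurrently.
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_REPLICATIONS) as pool:
        list(pool.map(replicate_volume, volume_ids))

if __name__ == "__main__":
    handle_failures([f"vol-{i:04d}" for i in range(20)])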

Complexity and availability are difficult to reconcile

At WorNet, the “KISS principle” (keep it simple, stupid) applies: everything is designed so that the engineer on duty, called out of bed at 2 a.m., can troubleshoot safely even when short on sleep and under stress.

Only trivial things are allowed to happen automatically, and it must always be possible to switch this automation off safely. This means more frequent small failures, which automation would probably have absorbed, but the big outages are rarer and shorter, because no automated process running amok can make the situation even worse.

For example, a problem at the DECIX in Frankfurt affected one of our routers in Munich last year. The on-duty engineer could simply take this router out of service by cutting its power supply and then read through the log files in peace. Operations stabilized immediately and the number of affected systems stayed very small. Only once the incident was understood was the router put back into service.
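
To illustrate the “switch it off safely” idea above: a minimal sketch (our own illustration, not WorNet production code; the file path is a hypothetical example) of an automated task that checks a manual kill-switch file before acting, so the on-call engineer can halt all automation with a single command.

import os
import sys

# Hypothetical kill switch: if this file exists, all automation stands still.
# The on-call engineer can halt everything with:  touch /etc/automation.disabled
KILL_SWITCH = "/etc/automation.disabled"

def automation_enabled() -> bool:
    return not os.path.exists(KILL_SWITCH)

def run_automatic_repair() -> None:
    if not automation_enabled():
        print("automation disabled by kill switch - doing nothing", file=sys.stderr)
        return
    # Only trivial, well-understood corrective actions belong here.
    print("restarting failed service ...")

if __name__ == "__main__":
    run_automatic_repair()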

Thanks to the “KISS principle”, we have never had to cope with more than four hours of downtime.

Part II will look at the importance of individual data centers and at how customers can protect themselves against outages.

IT remains exciting,

Yours, Christian Eich
