
Thoughts on the cloud outage at Amazon – Part II
Hello everyone
Failures have an unpleasant characteristic: they do not obey those who try to prevent them. Rather, they obey Murphy’s law: “If anything can go wrong, it will.”
So how can you protect yourself against the annoyance of an outage?
- Through service level agreements? They don’t prevent the outage, but at least you feel better if the worst comes to the worst.
- By investing in maximum availability? This (hopefully) reduces the likelihood of failure, but it also increases complexity and costs.
- By resigning yourself to the inevitable? Yes, why not?
If failures happen anyway, you can also protect yourself by taking the horror out of them, by making them part of your culture. I love this approach because it’s deeply pragmatic and puts the focus on the result instead of the number of 9s in an availability statistic.

That’s exactly what Netflix, the successful American film distributor, did when it invented the “Chaos Monkey”. The Chaos Monkey is a process that randomly terminates parts of the Netflix software. And this process runs constantly in the production system!
Why would anyone do something like that in a production environment? Because failures happen anyway, even without the Chaos Monkey (just less often). The Chaos Monkey pushes developers to handle errors in a meaningful way, to avoid wrong assumptions, and to always deliver a meaningful result, no matter what goes wrong. No one thinks “this won’t go wrong” anymore; instead, every developer can be certain that the Chaos Monkey will find their code. At the same time, all the automatic mechanisms that correct errors, restart processes, and so on are continuously tested.
In this way, the Chaos Monkey improves both the software quality and the robustness of the system. And that helped Netflix survive the Amazon outage unscathed, even though many of their servers were affected.
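To make the idea a bit more tangible, here is a small toy simulation in Python. It is only a sketch of the principle and certainly not how Netflix implemented it: the names `Service`, `chaos_monkey`, `supervisor` and `recommend` are made up for this illustration, and "terminating" a service just flips a flag in memory.

```python
import random


class Service:
    """A hypothetical service instance that is either up or down."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def chaos_monkey(services, rng):
    """Terminate one randomly chosen running service (illustrative only)."""
    running = [s for s in services if s.alive]
    if not running:
        return None
    victim = rng.choice(running)
    victim.alive = False
    return victim


def supervisor(services):
    """Automatic recovery: restart every service that is currently down."""
    restarted = [s for s in services if not s.alive]
    for service in restarted:
        service.alive = True
    return restarted


def recommend(service, user):
    """A caller that expects failure and still returns a meaningful result."""
    try:
        return service.handle(f"recommendations for {user}")
    except ConnectionError:
        # Assumed fallback: a generic answer instead of an error page.
        return "popular titles (fallback)"


def run_demo(rounds=5, seed=0):
    """Each round: the monkey strikes, callers are served, recovery runs."""
    rng = random.Random(seed)
    services = [Service(f"recs-{i}") for i in range(3)]
    results = []
    for _ in range(rounds):
        chaos_monkey(services, rng)
        results.extend(recommend(s, "alice") for s in services)
        supervisor(services)
    return results
```

In every round exactly one service is knocked out, yet every caller still gets an answer, and the recovery path in `supervisor` is exercised each time instead of only on the day of a real outage.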

This reminds me a bit of the judo principle of “winning by giving in”. Instead of fortifying your processes against failure like fortresses, you let the failure happen, deal with it as well as you can while it lasts, and are quickly ready for action again. Because at the end of the day it is not the failure statistics that count, but the overall result!
IT remains exciting,
Christian Eich