Wednesday, February 26, 2014

Chaos Gorilla High Availability Tests on IBM Cloud with NetflixOSS

At IBM Pulse, the IBM cloud conference held this week, I presented on how we are using the combination of IBM Cloud (SoftLayer), IBM middleware, and the Netflix Open Source cloud platform technologies to operationalize IBM's own public cloud services. I focused on high availability, automatic recovery, elasticity and web scale, and continuous-delivery DevOps. At Dev@Pulse, I gave a very quick overview of the entire platform. You can grab the charts on SlideShare.

The coolest part of the talk was that it included a live demo of Chaos Gorilla testing. Chaos Gorilla is a type of chaos testing that emulates an entire datacenter (or availability zone) going down. While our deployment of our test application (Acme Air) and runtime technologies was already set up to survive such a test, this was the first time we had actually run one. It was very interesting to see how our system reacted, both the workload itself and the alerting/monitoring systems. Knowing how this type of failure manifests will help us as we roll this platform and other IBM hosted cloud services into production. The goal of chaos testing is to prove to yourself that you can survive a failure before the failure occurs. Knowing how the system operates in this degraded state of capacity is truly valuable as well.

To be fair when comparing this to Netflix (who pioneered Chaos Gorilla and other more complicated testing), so far this test only killed all instances of the mid-tier services of Acme Air. It did not kill the data tier, which would have more interesting stateful implications. With a properly partitioned, memory-replicated data tier that includes failure recovery automation, I believe the data tier would also have survived such an attack, but that is work remaining within each of our services today and will be the focus of follow-on testing.

Also, as noted in a quick review by Christos Kalantzis from Netflix, this was more targeted destruction. The true Netflix Chaos Gorilla is automated such that it randomly decides which datacenter to kill. Because we have not yet automated Chaos Gorilla testing, we had to pick a specific datacenter to kill. The application and deployment approach demonstrated is architected in a way that should have worked for any datacenter going down. Dallas 05 was chosen arbitrarily for this targeted test, and will remain a manual choice until we have more advanced automation.
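To make the difference between targeted and randomized destruction concrete, here is a minimal sketch of what the random selection step could look like. It is not Netflix's Chaos Gorilla and not our tooling: the inventory, the instance IDs, the datacenter names other than Dallas 05, and the `terminate` callback are all hypothetical placeholders for whatever the real cloud API would provide.

```python
import random

# Hypothetical inventory: datacenter name -> mid-tier instance IDs.
# Only "Dallas 05" comes from the post; the other names are placeholders.
INVENTORY = {
    "Dallas 05": ["auth-1", "web-1"],
    "dc-b": ["auth-2", "web-2"],
    "dc-c": ["auth-3", "web-3"],
}


def pick_datacenter(inventory, rng):
    """Randomly choose the datacenter to fail instead of a human picking it."""
    return rng.choice(sorted(inventory))


def kill_datacenter(inventory, target, terminate):
    """Terminate every instance in the target datacenter; return the survivors."""
    for instance_id in inventory[target]:
        terminate(instance_id)  # assumed callback into the real cloud API
    return {dc: ids for dc, ids in inventory.items() if dc != target}


def run_chaos_gorilla(inventory, terminate, seed=None):
    """Pick one datacenter at random and fail all of its instances."""
    rng = random.Random(seed)
    target = pick_datacenter(inventory, rng)
    survivors = kill_datacenter(inventory, target, terminate)
    return target, survivors
```

Passing a `seed` makes a run repeatable for debugging, while leaving it unset gives the unpredictable choice that the real test depends on.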

Finally, we need to take this to the next level beyond basic automation. Failing an entire datacenter is impressive, but it is mostly a "clean failure". By clean I mean the datacenter goes from 100% available to 0% available. There are more interesting Netflix chaos tests, like Split Brain Monkey and Latency Monkey, that present cases where availability is worse than in a perfectly healthy system but the failure is not as clean as completely gone (0%). These are also places where we want to continue to test our systems going forward. You can read more about the entire suite of chaos testing on the Netflix Tech Blog.
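As a rough illustration of a failure that is not "clean", the sketch below wraps a service call so that a chosen fraction of requests is delayed rather than dropped. This is only an assumption about the general idea of latency injection, not how Netflix's Latency Monkey is implemented; the function name and its parameters are hypothetical.

```python
import random
import time


def with_injected_latency(call, delay_seconds, fraction, rng=None, sleep=time.sleep):
    """Wrap `call` so that roughly `fraction` of invocations are delayed."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < fraction:
            sleep(delay_seconds)  # degraded, but the call still succeeds
        return call(*args, **kwargs)

    return wrapped
```

With `fraction` set to 0.1 and a delay of a few seconds, callers would see a service that is up but intermittently slow, which is the kind of in-between state that timeouts and alerting need to handle.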

Take a look at the following video, which is a recording of the Chaos Gorilla testing.

Direct Link (HD Version)