Wednesday, February 26, 2014

Chaos Gorilla High Availability Tests on IBM Cloud with NetflixOSS

At the IBM cloud conference this week, IBM Pulse, I presented on how we are using the combination of IBM Cloud (SoftLayer), IBM middleware, and the Netflix Open Source cloud platform technologies to operationalize IBM's own public cloud services.  I focused on high availability, automatic recovery, elastic and web scale, and continuous delivery devops.  At Dev@Pulse, I gave a very quick overview of the entire platform.  You can grab the charts on slideshare.

The coolest part of the talk was the fact that it included a live demo of Chaos Gorilla testing.  Chaos Gorilla is a type of chaos testing that emulates an entire datacenter (or availability zone) going down.  While our deployment of our test application (Acme Air) and runtime technologies was already setup to survive such a test, it was our first time doing such a test.  It was very interesting to see how our system reacted (the workload itself and the alerting/monitoring systems).  Knowing how this type of failure manifests will help us as we roll this platform and other IBM hosted cloud services into production.  The goal of doing such Chaos testing is to prove to yourself that you can survive failure before the failure occurs.  However, knowing how the system operates in this degraded state of capacity is truly valuable as well.

To be fair when compared to Netflix (who pioneered Chaos Gorilla and other more complicated testing), so far this was only killing all instances of the mid tier services of Acme Air.  It was not killing the data tier, which would have more interesting stateful implications.  With a proper partitioned and memory replicated data tier that includes failure recovery automation, I believe the data tier would also have survived such of an attack, but that is work remaining within each of our services today and will be the focus of follow-on testing.

Also, as noted in a quick review by Christos Kalantzis from Netflix, this was more targeted destruction.  The true Netflix Chaos Gorilla is automated such that it randomly decides what datacenter to kill.  Until we automate Chaos Gorilla testing, we had to pick a specific datacenter to kill.  The application and deployment approach demonstrated is architected in a way that should have worked for any datacenter going down.  Dallas 05 was chosen arbitrarily in targeted testing until we have more advanced automation.

Finally, we need to take this to the next level beyond basic automation.  Failing an entire datacenter is impressive, but it is mostly a "clean failure".  By clean I mean the datacenter availability goes from 100% available to 0% availability.  There are more interesting Netflix chaos testing like Split Brian Monkey and Latency Monkey that would present cases of availability that are worse than perfect systems but not as clean as gone (0%).  These are also places where we want to continue to test our systems going forward.  You can read more about the entire suite of Chaos testing on the Netflix techblog.

Take a look at the following video, which is a recording of the Chaos Gorilla testing.

Direct Link (HD Version)

Monday, February 10, 2014

Acme Air Cloud Native Code Details - Part 1

Previously, I blogged at a high level of what changes were required to get Acme Air ported to be cloud native using IBM middleware, the IBM cloud, and NetflixOSS technology.  I have gotten the question of how to see these code changes with more specifics.  In my next few blog posts I'll try to give you code pointers to help you copy the approach. Note that you could always just go to the two source trees and do a more detailed compare. The github project for the original is here. The github project for the NetflixOSS enabled version is here.

First, even before we started on NetflixOSS enabling Acme Air, we worked to show that it could be run at Web Scale. As compared to other desktop and mobile applications, there were two basic things we did to make this possible.

Stateless Design

We focused on making sure every request was independent from the server perspective.  That would allow us handle any request on any server.  As classic JEE developers, this was pretty easy except for two commonly stateful aspect of JEE, storing application specific transient state and authentication state stored in session objects.

As a performance guy, the first was pretty easy to avoid.  I've spent years advising clients on avoiding storing large application specific transient state in JEE sessions.  While app servers have spent considerable time optimizing memory to memory session state technologies, nothing can help performance when developers start overusing session in their applications.  App server memory to memory replication is problematic when you want to move to a web scale implementation as it is usually server specific vs. user specific and requires semi-static specific configuration across a small cluster.   Database backed replication is problematic at web scale as it requires a database that can web scale which wasn't the case when session replication was designed in most application servers.  Web 2.0 and AJAX has helped application specific transient state.  It is now rather straight forward to offload such state storage to the web browser or mobile application.  All you need to do is send to the client what they want the store and let the client send back what is pertinent to each specific request.  Of course, like server side session data, be careful to keep the size of such transient state to the bare minimum to keep request latency in check.  In the case of state that doesn't belong on the client, consider a distributed data cache keyed on the user with time to live expiration.  Unlike server side session data, be sure to validate all data sent back to the server from the client.

The second was a bit harder to avoid.  The state of authentication to the services being offered is certainly something we cannot trust the client to supply. We implemented an approach similar to what we've seen from other online web scale properties.  When a client logs into the application, we generate a token and send that back as a cookie to the client while storing this token in a web scale data tier designed for low latency - specifically WebSphere eXtreme Scale.  Later on all REST calls to that service we expect the client to send back the token as a cookie and before we call any business logic we check that the token is valid (that we created it and it hasn't timed out).  In JEE we implemented this check as a ServletFilter to ensure it was run before any business logic in the called REST service.  We implemented this with a security naive implementation, but recommended authentication protocols would follow a similar approach.

public class RESTCookieSessionFilter implements Filter {
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain) throws IOException, ServletException {

    Cookie cookies[] = request.getCookies();
    Cookie sessionCookie = null;
    if (cookies != null) {
      for (Cookie c : cookies) {
        if (c.getName().equals(LoginREST.SESSIONID_COOKIE_NAME)) {
          sessionCookie = c;

      String sessionId = "";
      if (sessionCookie != null)
        sessionId = sessionCookie.getValue().trim();

      CustomerSession cs = customerService.validateSession(sessionId);
      if (cs != null) {
        request.setAttribute(LOGIN_USER, cs.getCustomerid());
        // session is valid, let the request continue to flow
        chain.doFilter(req, resp);
      else {

You can find this code in and see it added to the REST chain in web.xml.

As we'll see in follow-on articles the change to have a stateless service tier was important in enabling elastic scaling with auto recovery. Once you free your service from having to remember state, any server instance that exists (or even ones that didn't exist when the first request was handled) can handle the next request.

Partitioned and Scale Out Data Tier Design

In an application we used before Acme Air for working on web scale experiments, we learned how to design our data storage to be well partitioned.  The previous application worked against a monolithic relational database and we found that the definition of the data access services wasn't ready for a partitioned NoSQL storage engine.  We learned to fix this you need to more carefully consider your data model and what impacts that has on your data access service interfaces in your application.

The previous application had three aspects and we learned that it was easy to convert two of the tree aspects to singly keyed set of related data tables.  In these two aspects we further found cases where secondary indexes were used, but really were not needed.  Specifically we found places where we relied on a secondary index when the primary key was known but not passed down to the data service due to a data service tier that was designed assuming index scans and joins were possible and efficient when running on a single database system.  We were able to remedy this problem by redesigning our data access services to always require callers provide the primary key.  You can see this design reflected in Acme Air's data service interfaces.  Note that you will always see the partition key for each data type passed on all data service calls.

For the third aspect of the previous application, there was data that needed to be queried by two different keys (consider bookings that need to be queried by passenger id as well as flight instance id).  In partitioned data storage there are two ways to deal with this.  You can store the data once by one key and implement a user maintained index for the second key.  Alternatively you can duplicate the data (denormalizing based on which type of access is required).  This approach (as compared to a user maintained index) trades off storage with ensuring a guaranteed single network access for the data regardless of which key type is being queried.

In Acme Air currently, we have implemented carefully keyed partitioned data storage.  We haven't yet added operations that would key the same data differently.  If we added a view of the booking data per flight as required by gate agents (so far bookings are purely accessed by passenger), it would require we implement user maintained indexes or duplicate data.

Much like there was side benefits of doing stateless mid tier design, there is an elastic scaling and auto recovery benefit of using a purely partitioned and non-monolithic instanced data tier.  Once the data service holding your data is partitioned NoSQL technologies can easily replicate and move that data around within a data cluster as required when the instances fail or need to scale out.

In the next post, I'll move on from this foundation of stateless mid tier services and partitioned / scale out data tier to cover topics such as breaking the application into micro-services and containers that make highly available instance registration and cross instance routing easier.