Wednesday, February 26, 2014

Chaos Gorilla High Availability Tests on IBM Cloud with NetflixOSS

At the IBM cloud conference this week, IBM Pulse, I presented on how we are using the combination of IBM Cloud (SoftLayer), IBM middleware, and the Netflix Open Source cloud platform technologies to operationalize IBM's own public cloud services.  I focused on high availability, automatic recovery, elastic and web scale, and continuous delivery devops.  At Dev@Pulse, I gave a very quick overview of the entire platform.  You can grab the charts on slideshare.

The coolest part of the talk was that it included a live demo of Chaos Gorilla testing.  Chaos Gorilla is a type of chaos testing that emulates an entire datacenter (or availability zone) going down.  While our deployment of our test application (Acme Air) and runtime technologies was already set up to survive such a test, this was our first time actually running one.  It was very interesting to see how our system reacted (both the workload itself and the alerting/monitoring systems).  Knowing how this type of failure manifests will help us as we roll this platform and other IBM hosted cloud services into production.  The goal of chaos testing is to prove to yourself that you can survive failure before the failure occurs.  However, knowing how the system operates in this degraded state of capacity is truly valuable as well.

To be fair, compared to Netflix (who pioneered Chaos Gorilla and other more complicated testing), so far we were only killing all instances of the mid tier services of Acme Air.  We were not killing the data tier, which would have more interesting stateful implications.  With a properly partitioned and memory replicated data tier that includes failure recovery automation, I believe the data tier would also have survived such an attack, but that is work remaining within each of our services today and will be the focus of follow-on testing.

Also, as noted in a quick review by Christos Kalantzis from Netflix, this was more targeted destruction.  The true Netflix Chaos Gorilla is automated such that it randomly decides which datacenter to kill.  Until we automate our Chaos Gorilla testing, we have to pick a specific datacenter to kill.  The application and deployment approach demonstrated is architected in a way that should work for any datacenter going down; Dallas 05 was chosen arbitrarily for this targeted testing until we have more advanced automation.

Finally, we need to take this to the next level beyond basic automation.  Failing an entire datacenter is impressive, but it is mostly a "clean failure".  By clean I mean the datacenter goes from 100% available to 0% available.  There are more interesting Netflix chaos tests, like Split Brain Monkey and Latency Monkey, that present availability states worse than a perfect system but not as clean as gone (0%).  These are also places where we want to continue to test our systems going forward.  You can read more about the entire suite of Chaos testing on the Netflix techblog.

Take a look at the following video, which is a recording of the Chaos Gorilla testing.

Direct Link (HD Version)

Monday, February 10, 2014

Acme Air Cloud Native Code Details - Part 1

Previously, I blogged at a high level about the changes required to get Acme Air ported to be cloud native using IBM middleware, the IBM cloud, and NetflixOSS technology.  I have gotten questions about how to see these code changes in more detail.  In my next few blog posts I'll try to give you code pointers to help you copy the approach. Note that you could always just go to the two source trees and do a more detailed compare. The github project for the original is here. The github project for the NetflixOSS enabled version is here.

First, even before we started on NetflixOSS enabling Acme Air, we worked to show that it could be run at Web Scale. As compared to other desktop and mobile applications, there were two basic things we did to make this possible.

Stateless Design

We focused on making sure every request was independent from the server's perspective.  That would allow us to handle any request on any server.  As classic JEE developers, this was pretty easy except for two commonly stateful aspects of JEE: application specific transient state and authentication state, both typically stored in session objects.

As a performance guy, the first was pretty easy to avoid.  I've spent years advising clients to avoid storing large application specific transient state in JEE sessions.  While app servers have spent considerable time optimizing memory to memory session state technologies, nothing can help performance when developers start overusing session in their applications.  App server memory to memory replication is problematic when you want to move to a web scale implementation, as it is usually server specific vs. user specific and requires semi-static configuration across a small cluster.  Database backed replication is problematic at web scale as it requires a database that can web scale, which wasn't the case when session replication was designed into most application servers.

Web 2.0 and AJAX have helped with application specific transient state.  It is now rather straightforward to offload such state storage to the web browser or mobile application.  All you need to do is send the client what it wants to store and let the client send back what is pertinent to each specific request.  Of course, like server side session data, be careful to keep the size of such transient state to the bare minimum to keep request latency in check.  In the case of state that doesn't belong on the client, consider a distributed data cache keyed on the user with time to live expiration.  Unlike server side session data, be sure to validate all data sent back to the server from the client.
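As a sketch of that last idea, a user-keyed cache with time to live expiration might look like the following.  This is a minimal in-memory stand-in; in practice this would be a distributed cache such as a data grid, and all names here are illustrative rather than from the Acme Air code base.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a user-keyed cache with time to live expiration.
// In practice this would be a distributed cache, not a local map; the
// names here are illustrative, not from Acme Air.
class UserStateCache {
  private static final long TTL_MILLIS = 30 * 60 * 1000; // 30 minute expiry

  private static final class Entry {
    final String value;
    final long expiresAt;
    Entry(String value, long expiresAt) {
      this.value = value;
      this.expiresAt = expiresAt;
    }
  }

  private final Map<String, Entry> cache = new ConcurrentHashMap<>();

  public void put(String userId, String state) {
    cache.put(userId, new Entry(state, System.currentTimeMillis() + TTL_MILLIS));
  }

  // Returns null when the entry is missing or past its time to live
  public String get(String userId) {
    Entry e = cache.get(userId);
    if (e == null || System.currentTimeMillis() > e.expiresAt) {
      cache.remove(userId);
      return null;
    }
    return e.value;
  }
}
```

The expiration check on read keeps the sketch simple; a real distributed cache would evict expired entries on its own.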

The second was a bit harder to avoid.  The state of authentication to the services being offered is certainly something we cannot trust the client to supply.  We implemented an approach similar to what we've seen from other online web scale properties.  When a client logs into the application, we generate a token, send it back as a cookie to the client, and store the token in a web scale data tier designed for low latency - specifically WebSphere eXtreme Scale.  On all later REST calls to that service, we expect the client to send back the token as a cookie, and before we call any business logic we check that the token is valid (that we created it and it hasn't timed out).  In JEE we implemented this check as a ServletFilter to ensure it runs before any business logic in the called REST service.  Our implementation is security naive, but recommended authentication protocols would follow a similar approach.

public class RESTCookieSessionFilter implements Filter {
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain) throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    HttpServletResponse response = (HttpServletResponse) resp;

    Cookie cookies[] = request.getCookies();
    Cookie sessionCookie = null;
    if (cookies != null) {
      for (Cookie c : cookies) {
        if (c.getName().equals(LoginREST.SESSIONID_COOKIE_NAME)) {
          sessionCookie = c;
          break;
        }
      }
    }

    String sessionId = "";
    if (sessionCookie != null)
      sessionId = sessionCookie.getValue().trim();

    CustomerSession cs = customerService.validateSession(sessionId);
    if (cs != null) {
      request.setAttribute(LOGIN_USER, cs.getCustomerid());
      // session is valid, let the request continue to flow
      chain.doFilter(req, resp);
    } else {
      // session is not valid, reject the call before any business logic runs
      response.sendError(HttpServletResponse.SC_FORBIDDEN);
    }
  }
}
You can find this code in RESTCookieSessionFilter.java and see it added to the REST chain in web.xml.

As we'll see in follow-on articles, the change to a stateless service tier was important in enabling elastic scaling with auto recovery. Once you free your service from having to remember state, any server instance that exists (or even ones that didn't exist when the first request was handled) can handle the next request.
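To make that point concrete, here is a toy sketch.  All names are hypothetical, and the shared token store stands in for a low latency data tier such as WebSphere eXtreme Scale: because no instance holds private session state, a token created through one instance validates on any other, including an instance created after login.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy illustration (hypothetical names): session tokens live in a shared
// store, so any server instance -- even one created after login -- can
// serve the next request.
class StatelessTierDemo {
  // Stands in for the shared low latency data tier
  static final Map<String, String> tokenStore = new ConcurrentHashMap<>();

  static class ServerInstance {
    final String name;
    ServerInstance(String name) { this.name = name; }

    String login(String user) {
      String token = "tok-" + user; // toy token generation
      tokenStore.put(token, user);
      return token;
    }

    // Any instance can validate a token created by any other instance
    String handleRequest(String token) {
      String user = tokenStore.get(token);
      return user == null ? null : name + " served " + user;
    }
  }
}
```

An instance spun up by an auto scaler after the login happened can serve the request just as well as the instance that created the token.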

Partitioned and Scale Out Data Tier Design

In an application we used before Acme Air for working on web scale experiments, we learned how to design our data storage to be well partitioned.  That previous application worked against a monolithic relational database, and we found that the definition of its data access services wasn't ready for a partitioned NoSQL storage engine.  We learned that to fix this you need to more carefully consider your data model and the impact it has on the data access service interfaces in your application.

The previous application had three aspects, and we learned that it was easy to convert two of the three to a singly keyed set of related data tables.  In these two aspects we further found cases where secondary indexes were used but really were not needed.  Specifically, we found places where we relied on a secondary index even though the primary key was known but not passed down to the data service, because the data service tier was designed assuming index scans and joins were possible and efficient when running on a single database system.  We remedied this by redesigning our data access services to always require that callers provide the primary key.  You can see this design reflected in Acme Air's data service interfaces.  Note that you will always see the partition key for each data type passed on all data service calls.
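As a hypothetical sketch of that convention (these are not the actual Acme Air interfaces), note how the partition key accompanies every call, so a lookup never needs a secondary index or cross-partition scan:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a data service that always requires the partition
// key (customerId). The backing store is an in-memory map partitioned by
// customer, standing in for a partitioned NoSQL engine.
class BookingDataService {
  // customerId -> (bookingId -> flightId)
  private final Map<String, Map<String, String>> byCustomer = new HashMap<>();

  void addBooking(String customerId, String bookingId, String flightId) {
    byCustomer.computeIfAbsent(customerId, k -> new HashMap<>())
              .put(bookingId, flightId);
  }

  // Callers must supply the partition key even when bookingId alone looks
  // unique; that assumption only holds on a monolithic database with
  // global indexes.
  String getBooking(String customerId, String bookingId) {
    Map<String, String> bookings = byCustomer.get(customerId);
    return bookings == null ? null : bookings.get(bookingId);
  }

  Collection<String> getBookingsForCustomer(String customerId) {
    return byCustomer.getOrDefault(customerId, Collections.emptyMap()).values();
  }
}
```

Requiring the key at the interface level keeps callers honest: there is no method signature through which an index-scan-shaped query can leak in.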

For the third aspect of the previous application, there was data that needed to be queried by two different keys (consider bookings that need to be queried by passenger id as well as flight instance id).  In partitioned data storage there are two ways to deal with this.  You can store the data once by one key and implement a user maintained index for the second key.  Alternatively, you can duplicate the data (denormalizing based on which type of access is required).  The duplication approach (as compared to a user maintained index) trades off storage for a guaranteed single network access for the data regardless of which key is being queried.
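The duplication option can be sketched as follows (hypothetical names, with in-memory maps standing in for two partitioned tables): one logical write fans out to both access paths, so either query is a single partition lookup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the denormalization option: the booking id is
// written twice, once per access path, trading storage for a guaranteed
// single lookup by either key. Maps stand in for partitioned tables.
class DenormalizedBookingStore {
  private final Map<String, List<String>> bookingsByPassenger = new HashMap<>();
  private final Map<String, List<String>> bookingsByFlight = new HashMap<>();

  // One logical write fans out to both partitions
  void addBooking(String passengerId, String flightInstanceId, String bookingId) {
    bookingsByPassenger.computeIfAbsent(passengerId, k -> new ArrayList<>()).add(bookingId);
    bookingsByFlight.computeIfAbsent(flightInstanceId, k -> new ArrayList<>()).add(bookingId);
  }

  // Each read touches exactly one partition, regardless of which key is used
  List<String> byPassenger(String passengerId) {
    return bookingsByPassenger.getOrDefault(passengerId, Collections.emptyList());
  }

  List<String> byFlight(String flightInstanceId) {
    return bookingsByFlight.getOrDefault(flightInstanceId, Collections.emptyList());
  }
}
```

A user maintained index would instead store the booking once and keep a second table mapping flight instance id to booking ids, costing a second network hop on the indexed read path.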

In Acme Air currently, we have implemented carefully keyed partitioned data storage.  We haven't yet added operations that would key the same data differently.  If we added a view of the booking data per flight as required by gate agents (so far bookings are purely accessed by passenger), it would require us to implement user maintained indexes or duplicate the data.

Much like there were side benefits to the stateless mid tier design, there is an elastic scaling and auto recovery benefit to using a purely partitioned, non-monolithic data tier.  Once the data service holding your data is partitioned, NoSQL technologies can easily replicate and move that data around within a data cluster as required when instances fail or need to scale out.

In the next post, I'll move on from this foundation of stateless mid tier services and partitioned / scale out data tier to cover topics such as breaking the application into micro-services and containers that make highly available instance registration and cross instance routing easier.

Wednesday, January 8, 2014

Acme Air and Cassandra Part Two

As I mentioned in my last blog post, I had a multi-faceted project over the holidays to try out continuous delivery, Cassandra, and Docker in the context of Acme Air.  In this post, I wanted to focus on the Cassandra aspect.

First, why is this called part two if there is no part one?  There is a part one of this story, but I never blogged about it.  When Jonathan Bond and I were working on the Netflix Cloud Prize sample work, Jonathan implemented a Cassandra set of services for Acme Air using the NetflixOSS Astyanax Cassandra client and Astyanax's JPA like entity persister support.  You can see his cassandra-cli DDL script for the column families, some samples of how he did queries and what the entities looked like by reading through github.

I started by looking at what Jonathan did and decided to recode it for the following reasons.  Jonathan was working under a time crunch for the cloud prize work, and we had decided to keep the application as portable as possible across the tested data services.  In the past we have implemented back ends for WebSphere eXtreme Scale and MongoDB.  The reality is that each back end required specific additions to the entity model of the data returned to the web applications.  WebSphere eXtreme Scale requires the use of PartitionableKey interfaces on primary key classes to co-locate related data in the grid (Users and User Bookings, for example).  Mongo required, when using the Spring Data MongoDB support, specific annotations to help the mapping and, when using Morphia, specific deserializers for things like BigDecimal.  Jonathan's implementation tried to come up with a common interface for entities and then implementations specific to each back end, but I think the entity code became the least common denominator vs. the best demonstration of how to use Astyanax/Cassandra.

Next, Jonathan worked the problem top down, starting with an entity model and allowing the Astyanax persister support to create the columns dynamically.  It wasn't clear to us, especially since we never performance or scale tested the final code, if this top down approach functioned well under load.  Finally, I'm the kind of guy who wants to start with the most code possible and then move to higher level abstractions, proving that they add value without sacrificing things like performance or maintainability of the tables in production.  In the relational world, I have seen problems come from pure top down mappings of relational data to Java objects that were solved with more JDBC like approaches or more sophisticated meet in the middle mappings.

This all said, I backed up and looked for a way to start with a more (in the relational world) JDBC/SQL like approach.  Given I was new to Cassandra, I got myself bootstrapped in two ways.  First, I grabbed Nicolas Favre-Felix's Docker Cassandra automation (more on that in my last blog post).  Second, I went through the DataStax Java Developer online course in order to learn Cassandra and how to code to Cassandra in Java.  I highly recommend the course, but be aware that it will lead you to quickly embrace Cassandra 2.0.x and CQL3 (vs. Cassandra 1.2.x and Thrift), which eventually caused me problems.  I didn't actually work through the sample application that accompanies the course, but instead planned to apply my learning to Acme Air.  The course took three days off and on, and I ended up passing the final exam to get my certification (yeah!).

My education complete, I took the current Acme Air NetflixOSS enabled codebase and branched it to an "Astyanax" branch.  I coded up a static CQL3 DDL script and a bit of the data loader program and ran it.  I quickly ran into the dreaded "InvalidRequestException(why:Not enough bytes to read value of component 0)" error.  I found others confused by this on the Netflix Astyanax Google Group and found that I could add "COMPACT STORAGE" to my table definitions and get limping along again.  Later, Michael Oczkowski responded to the forum post with great links I should have found earlier that explained why COMPACT STORAGE was needed and what issues I was likely up against using Astyanax (which is based on the Thrift client to Cassandra) against a data model created with the assumption of using the CQL3 protocol.

As I wanted to use Astyanax, I decided to move back to Thrift and cassandra-cli created tables.  You can see the "new" Thrift based DDL in github that I created at that point.  You can see the Astyanax code I used to persist to these column families in older code on github.  Note that both of these older versions of the code aren't complete or up to date with the final version of the DataStax Java Driver code, as I eventually gave up on them.  I was able to get some of the less interesting entities and queries working, but started to run into issues when I needed the equivalent of "composite keys" in CQL3.  An example was Flight, which in CQL3 I define with a primary key of (text flight_segment_id, Date scheduled_departure_time) and a secondary index on text flight_id.  I believe either CQL3 is easier for SQL-historical guys like me to understand, or there is less documentation for Thrift for these composite types of keys (or both).

I played around with various Thrift definitions, looking at the resulting CQL3 view, but could never replicate such a composite key.  I tried to simplify to a case where I was trying to define a primary key with a text partition key and two clustering columns.  I could force it to be something like PRIMARY KEY((String, String) String) or PRIMARY KEY (String, (String, String)), but never PRIMARY KEY(String, String, String).  For what it's worth, I gave up vs. pushing my way through to total understanding here due to the lack of clear documentation on these scenarios for Thrift.  I also later realized I was somewhat confused about how composite keyed rows were stored.  I assumed that both the partitioning and the clustering columns defined the hash of the node for storage.  After reading this article, I realized that my key/value + SQL model was wrong.  Also, I tried to define annotated composite keys in Astyanax, as you can see commented in the latest code.  Finally, I really wanted to use the CQL3 syntactic sugar for collections.  I understand such collections are possible with coding approaches in Thrift, but the sugar seemed more palatable with CQL3.
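For reference, the CQL3 shape of the Flight table described above would look something like this.  This is a sketch reconstructed from the column names mentioned in the text; the actual DDL in the repo may differ.

```sql
-- Sketch of the composite-key table described above; column names come
-- from the text, the exact DDL in the repo may differ.
CREATE TABLE flight (
  flight_segment_id text,
  scheduled_departure_time timestamp,
  flight_id text,
  PRIMARY KEY (flight_segment_id, scheduled_departure_time)
);

-- The secondary index on flight_id mentioned above
CREATE INDEX flight_by_id ON flight (flight_id);
```

Here flight_segment_id is the partition key and scheduled_departure_time is a clustering column, which is exactly the PRIMARY KEY(String, String) shape that was hard to express through Thrift.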

All of the above was really more of a commentary on Thrift vs. CQL3.  However, I did have a few issues unique to Astyanax.

First, there is very little end to end documentation on Astyanax.  The wiki cites both the Netflix RSS Recipe and Flux Capacitor.  Unfortunately both are very simple single column family examples of Cassandra usage, with little complexity in the primary keys and data modeling.  If you look at the source code, both are really simple keyed rows with at most two columns, with add, query by primary key, and remove.  Additionally, for composite keys there is a wiki example of how to query, but not how to store or update.  There is a slightly better example in the Astyanax serializer tests, but the table definition isn't very easy to understand as it was written as a test case vs. a sample application.  I'm guessing (but haven't verified) that the example code in the sample that goes along with the DataStax Java education course would be far more end to end, as it would include more complex data modeling with composite keys and secondary indexes in already working code.

Next, the Astyanax API mostly abstracts the query language into something familiar to Java programmers.  With my previous JDBC/SQL knowledge, I found the DataStax driver required less learning, as I was already pretty familiar with the concepts of getSession, prepareStatement (against a known QL string), bindStatement, execute, and walking rows.  Being able to work more directly with the queries as Strings vs. building them through a Java API fit better with my experience.  I need to get back to Astyanax to see if this sort of pattern is supported (the entity persister has a query language that can be executed with find).

Finally, two parts of the Astyanax API left me with unanswered questions.  I couldn't see how to pass a BigDecimal to ColumnListMutations, which was easy to do with the DataStax Java Driver via bind(BigDecimal) and Row.getDecimal().  Also, it seemed like I needed to use MutationBatch to update (or initially create) multiple columns in the same row.  It wasn't clear to me whether this would perform poorly (the DataStax online education mentioned batch and how it was slower) and whether the simpler "INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?)" approach of the DataStax Java Driver was more efficient for this scenario.

The above is more of a point in time level of understanding of Thrift and Astyanax vs. CQL3 and the DataStax driver and Cassandra in general.  As mentioned in Netflix's blog, they are working on Astyanax over the DataStax Java Driver and CQL3.  I wonder how much of what I've been focusing on will be handled by this Astyanax update.  Also, as I start to performance test, scale, and operationalize the tables I created, it is very likely that I'll learn there are still some concepts in Cassandra I don't fully understand.  I certainly know I'm not yet using Cassandra enough in its dynamic column and sparse row aspects, so my understanding of de-normalization and optimal Cassandra data modeling might not yet be complete.

Based on skills I gathered during the DataStax Java online education and due to some of the aforementioned issues, I decided to switch horses to use CQL3 fully and the DataStax Java driver.  You can see in the latest github code the CQL3 DDL and the service implementation that makes calls through the DataStax driver.  I have the application fully functional and tested against a three node Cassandra cluster.  I have not yet scaled the data up, nor performance or scale tested.  With my experience with Java/JDBC plus the DataStax Java education, I was able to code this in about two days of focused effort.

I think it would be good to have a part three of this adventure into Cassandra with Acme Air.  It would be good to take the new (or even old) Astyanax driver and complete a Thrift as well as CQL3 version based on this completed CQL3/DataStax Java Driver implementation.  I would additionally like to see usability and performance/scale comparisons.  I think the Netflix tech blog does a good job of showing primitive performance comparisons, but it would be interesting to see how the more complex data and query model of Acme Air compares.

As is likely obvious, I was new to Cassandra in this work.  Therefore comments, as always, are welcome.  I'm not sure when I'll get back to this work, but if I do, pointers on Thrift/Astyanax documentation would be welcomed.  Finally, if you have the time and knowledge of Astyanax, I would welcome an OSS port of the working code.

Tuesday, January 7, 2014

Experiments With Docker For Acme Air Dev

Over the holidays, I decided to do a side project to learn a few things:

1) Strategies and technology for continuous delivery
2) Cassandra and Astyanax
3) Docker for local laptop development

I did all of these projects together with the goal of having a locally developed and tested "cloud" version of Acme Air (NetflixOSS enabled) that upon code commit produced wars that could be put into an Aminator-like build or chef scripts and immediately deployed into "production". I had wanted to learn Cassandra for some time, and it was good to do in tandem with continuous delivery when using CloudBees for continuous build, as the branch for Cassandra could use all openly available dependencies (OSS on Maven Central). For this blog post, I'll be focusing on Docker local laptop cloud development, but I thought it would be good for you to understand that I did these all together.  If you're interested in picking up from this work, the CloudBees CI builds are here.

I decided to make the simplest configuration of Acme Air which would be a single front end web app tier (serving Web 2.0 requests to browsers/mobile) connecting to a single back end micro service for user session creation/validation/management (the "auth" service). Then I connected it to a pretty simple Cassandra ring of three nodes.

Here you can see the final overall configuration as run on my Macbook Pro:

This is all running on my Macbook Pro.  The laptop runs a virtual machine under Vagrant, and that virtual machine runs Docker containers for the actual cluster nodes.  In a testing setup, I usually have five Docker containers (three Cassandra nodes, one node to run the data loader part of Acme Air and ad hoc Cassandra cqlsh queries, one node to run the auth service micro service web app, and one node to run the front end web app).  Starting up this configuration takes about three minutes and about ten commands.  I could likely automate this all down to one command, and I would suggest anyone following this for their own development shop perform such automation.  If automated, the startup time would be less than 30 seconds.

As a base Docker image, I went with Nicolas Favre-Felix's Docker Cassandra automation.  He put together a pretty complete system to allow experimentation with Cassandra 1.2.x and 2.0.x.  In making this work, I think he created a pretty general purpose networking configuration for Docker that I'll explain.  Nicolas used pipework and dnsmasq to provide a Docker cluster with well known static hostnames and IP addresses.  When any Docker container is started, he forced it to have a hostname of "cassX", with X being between 1 and 254 (using the docker run -h option).  He did that so he could have the Cassandra ring always start at "cass1" and have all other nodes (and clients) know that the "cass1" hostname is the first (seed) Cassandra node.  Then he used pipework to add an interface to each node with the IP address of 192.168.100.X.  In order to make these hostnames resolve to these addresses across all nodes, he used dnsmasq with every 192.168.100.X address mapped to the corresponding cassX hostname.  Further, to make non-Docker hostnames resolvable, dnsmasq was configured to resolve other hostnames from the well known Google nameservers, and each container was configured to use the dnsmasq nameserver locally (using the docker run -dns option).  With all of this working, it is easy to start any container on any IP address/hostname with all other containers being able to address that hostname statically.

I'd like to eventually generalize this setup with hostnames of "host1", "host2", etc. vs. "cass1", "cass2".  In fact, I already extended Nicolas' images for my application server instances, knowing that I'd always start the auth service on "cass252" and the web app front end on "cass251".  That meant that when the front end web app connected to the auth service, I hardcoded Ribbon REST calls to http://cass252/rest/api/... and I knew it would resolve to 192.168.100.252.  Eventually I'd like to start up a Eureka Docker container on a well known hostname, which would allow me to remove the REST call hardcoding (I'd just hardcode the Eureka configuration for this environment).  Further, I can imagine a pretty simple configuration driven wrapper to such a setup that says start n node types on hosts one through ten, m node types on hosts eleven through twenty, etc.  This would allow me to have full scale out testing in my local development/testing.

This networking setup gave me a "VLAN" accessible between all nodes in the cluster of 192.168.100.X that was only accessible inside of the Vagrant virtual machine.  To complete the networking and allow testing, I needed to expose the front end web app to my laptop browser.  I did this by using port mapping in Docker and host only networking in Vagrant.  To get the port exposed from the Docker container to Vagrant I used the docker run -p option:
docker run -p 8080:8080
At this point I could curl http://localhost:8080 on the Vagrant virtual machine.  To get a browser to work from my laptop's desktop (OS X), I needed to add a "host only" network to the Vagrantfile configuration:
Vagrant.configure("2") do |config|
  config.vm.network "private_network", ip: ""
end
At that point, I could load the full web application by browsing locally to the host only address.  I could also map other ports if I wanted (the auth service micro service, a JDWP debug port, etc.).  Being able to map the Java remote debug port to a "local cloud" with local latency is a game changer.  Even with really good connectivity, "remote cloud" debugging is still a challenge due to latency.

So, returning to the diagram above, the summary of networking is (see blue and grey lines in the diagram starting at bottom right):

1) Browser to the Vagrant host only address, which forwards to
2) Vagrant port 8080, which forwards to
3) VLAN address 192.168.100.251, which is
4) the Docker container for the front end web app listening on port 8080, which connects to
5) VLAN address 192.168.100.252, which is
6) the Docker container for the auth service listening on port 8080

Both #4 and #6 connect to VLAN addresses 192.168.100.1 through 192.168.100.3, which are

7) the Cassandra Docker nodes listening on port 9042.

I also wanted to have easier access to my Macbook Pro filesystem from all Docker containers.  Specifically, I wanted to be able to work with the code on my desktop with Eclipse and compile within an OS X console, but then easily access the wars, Cassandra loader Java code, and Cassandra DDL scripts from within the containers.  I was able to "forward" the filesystem through each of the layers as follows (see black lines in the diagram starting at middle bottom):

When starting Vagrant, I used:
Vagrant.configure("2") do |config|
  config.vm.synced_folder "/Users/aspyker/work/", "/vagrant/work"
end
When starting each Docker container, I used:
docker run ... -v /vagrant/work:/root/work ...
Once configured, any Docker instance has a shared view of my laptop filesystem at /root/work.  "Copying" war files to Docker containers was instantaneous (30 meg file copies completed immediately).  Also, any change on my local system is immediately reflected in every container.  Again, this is game changing compared to working with a remote cloud.  Remote cloud file copies are limited by bandwidth, and most virtual instances do not allow shared filesystems, so copies need to be done many times for the same files (or rsync'ed).

In the end, this left me with a very close to "production" cloud environment locally on my laptop, with local latency for debugging and file manipulation regardless of my networking speed (think coffee shop and/or airplane hacking).  With a bit more automation, I could extend this environment to every team member around me, ensuring a common development environment that is source controlled under git.

I have seen a few blogs that mention setting up "local cloud" development environments like this for their public cloud services.  I hope this blog post showed some tricks to make this possible.  I am now considering how to take this Acme Air sample and extend it to the IBM public cloud services we are working on at IBM.  I expect that by doing this, development cloud spend will decrease and development will speed up, as it allows developers to work locally with faster turnaround on code/compile/test cycles.  I do wonder how easy the transition will be to integration tests in the public cloud given the Docker environment won't be exactly the same as the public cloud, but I think the environments are likely close enough that the trade-off and risk is justified.

I need to find a way to share these Docker images and Dockerfile scripts.  Until then, if you have questions feel free to ask.

Tuesday, December 17, 2013

Zookeeper as a cloud native service registry

I have been working with IBM teams to deploy IBM public cloud services using large parts of the Netflix OSS cloud platform.  I realize that there are multiple ways to skin every cat, and there are parts of the Netflix approach that aren't well accepted across the majority of cloud native architectures in the industry (as cloud native is still an architecture being re-invented and improved across the industry).  Due to this current state of the world, I get asked questions along the lines of "why should we use Netflix technology or approach foobar vs. some other well adopted cloud approach baz that I saw at a meetup".  Sometimes the answer is easy; for example, things that Netflix might do for web scale reasons that aren't initially sensible at smaller than web scale.  Sometimes the answers are far harder, and the answer likely lies within experience Netflix has gained that isn't evident until a service that didn't adopt the technology gets hit with a service disruption that makes the reasoning clear.  My services haven't yet been through anywhere near this battle hardening.  Therefore, I admit that personally I sometimes lack the full insight required to fully answer the questions.

One "foobar" that has bothered me for some time is the use of Eureka for service registration and discovery.  The equivalent "baz" has been zookeeper.  You can read Netflix's explanation of why they use Eureka for a service registry (vs. zookeeper) on their wiki, and more discussion on their mailing list.  The explanations seem reasonable, but it kept bothering me as I see "baz" approaches left and right.  Some examples are Parse's recent re:Invent 2013 session and Airbnb's SmartStack.  SmartStack has shown most clearly how they are using zookeeper by releasing Nerve and Synapse as open source, and Charity Majors was pretty clear architecturally in her talk.  Unfortunately, both seem to lack clarity on how zookeeper handles cloud native high availability (how it handles known cloud failure modes, and how zookeeper itself is deployed and bootstrapped).  For what it's worth, Airbnb already states publicly that "The achilles heel of SmartStack is zookeeper".  I do think there are non-zookeeper aspects of SmartStack that are very interesting (more on that later).

To try to see if I understood the issues I wanted to try a quick experiment around high-availability.  The goal was to understand what happens to clients during a zookeeper node failure or network partition event between availability zones.

I decided to download the Docker container from Scott Clasen that has zookeeper and Netflix's Exhibitor already installed and ready to play with.  Docker is useful as it allows you to stitch things together quickly on a local machine without the need for full-blown connectivity to the cloud.  I was able to hack together an ensemble running across three zookeeper containers using a mix of docker hostnames, but I wasn't comfortable that this was a fair test given the DNS wasn't "standard".  Exhibitor also allowed me to get moving quickly with zookeeper given its web based UI.  Once I hit the issue of Docker not letting me edit /etc/hosts, I trusted even less that I would be able to simulate network partitions (for which I planned to use a Jepsen-style iptables DROP approach).

Instead I decided to move the files Scott set up in the docker image over to cloud instances (hello, where is docker import support on clouds?!).  I started one zookeeper/exhibitor instance per availability zone within a region.  Pretty quickly I was able to define (with real hostnames) and run a three node zookeeper ensemble across the availability zones.  I could then do the typical service registration using ephemeral nodes:

create -e /hostnames/{hostname or other id} {ipaddress or other service metadata}

Under normal circumstances, I saw the new hosts across availability zones immediately.  Also, after a client went away (for example, by exiting the client session), the other nodes would see the host disappear.  This is the foundation for service registration and ephemeral de-registration.  Using zookeeper for this ephemeral registration, automatic de-registration, and client watches is rather straightforward.  However, this was under normal circumstances.
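To make the register/de-register flow concrete, here is a minimal in-memory model of zookeeper's ephemeral-node semantics: a node created by a session disappears automatically when that session ends, and watchers are notified of membership changes.  The `Registry` and `Session` names are illustrative, not a zookeeper API; a real client would use a library such as kazoo or Curator against a live ensemble.

```python
# Illustrative model of ephemeral znodes; not a real zookeeper client.

class Registry:
    def __init__(self):
        self.nodes = {}      # path -> (owning session, data)
        self.watchers = []   # callbacks fired on any membership change

    def watch(self, callback):
        self.watchers.append(callback)

    def _notify(self):
        hosts = sorted(self.nodes)
        for cb in self.watchers:
            cb(hosts)

    def create_ephemeral(self, session, path, data):
        self.nodes[path] = (session, data)
        self._notify()

    def session_closed(self, session):
        # Ephemeral nodes vanish when their owning session ends --
        # this is what gives "free" automatic de-registration.
        self.nodes = {p: (s, d) for p, (s, d) in self.nodes.items()
                      if s is not session}
        self._notify()


class Session:
    def __init__(self, registry):
        self.registry = registry

    def register(self, hostname, ip):
        self.registry.create_ephemeral(self, "/hostnames/" + hostname, ip)

    def close(self):
        self.registry.session_closed(self)


registry = Registry()
seen = []
registry.watch(lambda hosts: seen.append(hosts))

s1, s2 = Session(registry), Session(registry)
s1.register("web-a", "10.0.1.5")
s2.register("web-b", "10.0.2.7")
s1.close()  # client goes away; its ephemeral node is removed for it
print(seen[-1])  # ['/hostnames/web-b']
```

The watcher sees each membership change in order: the first registration, then both hosts, then only the survivor after the first session closes.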

I then used exhibitor to stop one of the three instances.  For clients connected to the remaining two instances, I could still interact with the zookeeper ensemble.  This is expected, as zookeeper keeps working as long as a majority of the nodes are still in service.  I would also expect clients connected to the failed instance to keep working, either through a more advanced client connectivity configuration than the one I used or through a discovery client's caching of last known results.  Also, in a realistic configuration you'd likely run more than one instance of zookeeper per availability zone, so a single failing node isn't all that interesting.  I didn't spend too much time on this part of the experiment, as I wanted to focus on network partition events.
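The "caching of last known results" idea can be sketched in a few lines: a discovery client keeps the last successful answer from the registry and falls back to it (possibly stale) when the registry node it talks to becomes unreachable.  The class and function names below are hypothetical, not from any particular discovery library.

```python
# Hypothetical sketch of a discovery client that serves stale results
# when the registry is unreachable, rather than failing lookups.

class RegistryUnavailable(Exception):
    pass


class CachingDiscoveryClient:
    def __init__(self, fetch):
        self.fetch = fetch      # callable that queries the live registry
        self.last_known = None  # last successful answer

    def lookup(self):
        try:
            self.last_known = self.fetch()
        except RegistryUnavailable:
            if self.last_known is None:
                raise           # no earlier answer to fall back on
        return self.last_known


# Simulate a registry node that later fails.
healthy = {"up": True}

def fetch_from_registry():
    if not healthy["up"]:
        raise RegistryUnavailable()
    return ["10.0.1.5", "10.0.2.7"]


client = CachingDiscoveryClient(fetch_from_registry)
live = client.lookup()       # live answer from the registry
healthy["up"] = False        # the zookeeper node fails
stale = client.lookup()      # same answer, served from the cache
print(live == stale)         # True
```

The trade-off is that cached entries may point at hosts that have since gone away, so consumers still need retries or client-side health checks.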

For network partition events, I followed a Jepsen-like approach to simulating partitions.  I went into one of the instances and did an iptables DROP on all packets coming from the other two instances.  This simulates an availability zone continuing to function but losing network connectivity to the other availability zones.  What I saw was that the two other instances noticed the first server "going away" but continued to function, as they still saw a majority (66%).  More interestingly, the first instance noticed the other two servers "going away", dropping its view of ensemble availability to 33%.  This caused the first server to stop serving requests to clients (not only writes, but also reads).

You can see this happening “live” via the following animated GIF.  Note the colors of the nodes in the exhibitor UI’s as I separate the left node from the middle and right node:

To me this seems like a concern, as a network partition should be considered an event that must be survived.  In this case (with this specific configuration of zookeeper), no new clients in the partitioned availability zone would be able to register themselves, and consumers within that same availability zone could not discover them.  Adding more zookeeper instances to the ensemble wouldn't help given a balanced deployment, as the split would always be majority (66%) and non-majority (33%).

It is worth noting that if your application cannot survive a network partition event itself, then you likely aren't going to worry so much about your service registry and what I have shown here.  That said, applications that can handle network partitions well are likely either completely stateless or have relaxed eventual consistency in their data tier.

I didn’t simulate the worst-case scenario, which would be to separate all three availability zones from each other.  In that case, all availability of zookeeper would go to zero, as each zone would represent a minority (33% for each zone).
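The quorum arithmetic behind both observations is simple enough to write down: a zookeeper ensemble serves requests only from a partition holding a strict majority of voting members, so a balanced three-zone deployment always splits 66%/33% when one zone is cut off, and a three-way split leaves every zone in a 33% minority regardless of ensemble size.

```python
# Majority-quorum arithmetic for a zookeeper ensemble spread evenly
# across three availability zones.

def has_quorum(visible, ensemble_size):
    """A partition serves requests only if it sees a strict majority."""
    return visible > ensemble_size // 2


# One zone partitioned away from the other two: the 2-zone side keeps
# quorum, the 1-zone side loses it.  Scaling the ensemble evenly
# (1, 2, or 3 nodes per zone) never changes the ratio.
splits = [(has_quorum(2 * n, 3 * n), has_quorum(1 * n, 3 * n))
          for n in (1, 2, 3)]
print(splits)  # [(True, False), (True, False), (True, False)]

# Worst case: all three zones separated from each other.  Every zone
# sees only a third of the ensemble, so availability goes to zero.
worst_case_dead = all(not has_quorum(n, 3 * n) for n in (1, 2, 3))
print(worst_case_dead)  # True
```

This is the core of the argument: with CP semantics, no balanced multi-zone layout of zookeeper lets a minority zone keep registering and discovering services on its own.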

One other thing I didn't cover here is the bootstrapping of the service registry location (ensuring that finding the service registry is itself highly available).  With Eureka, there is a pretty cool trick that depends only on DNS (specifically DNS TXT records) on any client, so the actual location of Eureka servers is not burned into client images.  I'm not sure how to approximate this with zookeeper currently (or whether there are zookeeper client libraries that keep this from being an issue), but you could certainly follow a similar trick.

I believe most of this comes down to the fact that zookeeper isn’t really designed to be a distributed cloud native service registry.  It is true that zookeeper can make implementing service registration, ephemeral de-registration, and client notifications easy, but I think its focus on consistency makes it a poor choice for a service registry in a cloud environment with an application that is designed to handle network partitions.  That’s not to say there are not use cases where zookeeper is far better than Eureka.  As Netflix has said, for cases that have high consistency requirements (leader election, ordered updates, and distributed synchronization) zookeeper is a better choice.

I am still learning here myself, so I'd welcome comments from Parse and Airbnb on whether I missed something important about their use of zookeeper for service registration.  For now, my experiment shows, at the very least, that I wouldn't want to use zookeeper for service registration without a far more advanced approach to handling network partitions.