Tuesday, December 17, 2013

Zookeeper as a cloud native service registry

I have been working with IBM teams to deploy IBM public cloud services using large parts of the Netflix OSS cloud platform.  I realize that there are multiple ways to skin every cat, and that there are parts of the Netflix approach that aren’t well accepted across the majority of cloud native architectures in the industry (cloud native is, after all, an architecture still being re-invented and improved across the industry).  Due to this current state of the world, I get asked questions along the lines of “why should we use Netflix technology or approach foobar vs. some other well adopted cloud approach baz that I saw at a meetup?”.  Sometimes the answer is easy.  For example, there are things Netflix might do for web scale reasons that aren’t initially sensible at smaller than web scale.  Sometimes the answer is far harder, and the reasoning lies in experience Netflix has gained that isn’t evident until a service that didn’t adopt the technology gets hit with a disruption that makes the reasoning clear.  My services haven’t yet been through anywhere near that battle hardening, so I admit that I personally sometimes lack the full insight required to answer these questions completely.

One “foobar” that has bothered me for some time is the use of Eureka for service registration and discovery.  The equivalent “baz” has been zookeeper.  You can read Netflix’s explanation of why they use Eureka for a service registry (vs. zookeeper) on their wiki, and more discussion on their mailing list.  The explanations seem reasonable, but it kept bothering me as I see “baz” approaches left and right.  Some examples are Parse’s recent re:Invent 2013 session and Airbnb’s SmartStack.  SmartStack has shown more clearly how they are using zookeeper by releasing Nerve and Synapse as open source, but Charity Majors was also pretty clear architecturally in her talk.  Unfortunately, both seem to lack clarity on how zookeeper handles cloud native high availability (how it behaves under known cloud failure modes, and how zookeeper itself is deployed and bootstrapped).  For what it’s worth, Airbnb already states publicly that “The achilles heel of SmartStack is zookeeper”.  I do think there are non-zookeeper aspects of SmartStack that are very interesting (more on that later).

To see if I understood the issues, I wanted to run a quick experiment around high availability.  The goal was to understand what happens to clients during a zookeeper node failure or a network partition between availability zones.

I decided to download the docker container from Scott Clasen that has zookeeper and Netflix’s exhibitor already installed and ready to play with.  Docker is useful here as it lets you stitch things together quickly on a local machine without the need for full-blown connectivity to the cloud.  I was able to hack together three containers (a three node zookeeper ensemble) using a mix of docker hostnames and xip.io, but I wasn’t comfortable that this was a fair test given the DNS wasn’t “standard”.  Exhibitor also allowed me to get moving quickly with zookeeper given its web-based UI.  Once I saw the issue with docker not being able to edit /etc/hosts, I was even less confident that I would be able to simulate network partitions (for which I planned to use Jepsen’s iptables DROP approach).
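For reference, getting three local containers up was just a few docker run commands; the image name and host port mappings below are illustrative rather than the exact ones I used:

# Three zookeeper/exhibitor containers on one docker host (illustrative image name).
# Exhibitor's UI listens on 8080 and zookeeper's client port is 2181 inside each container.
docker run -d -p 8081:8080 -p 2181:2181 someuser/zookeeper-exhibitor
docker run -d -p 8082:8080 -p 2182:2181 someuser/zookeeper-exhibitor
docker run -d -p 8083:8080 -p 2183:2181 someuser/zookeeper-exhibitor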

Instead I decided to move the files Scott set up in the docker image over to cloud instances (hello, where is docker import support on clouds?!).  I started one zookeeper/exhibitor instance per availability zone within a region.  Pretty quickly I was able to define (with real hostnames) and run a three node zookeeper ensemble across the availability zones (a sketch of what that configuration boils down to is below).
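For anyone who hasn’t hand-built an ensemble, the configuration Exhibitor manages boils down to something like the following.  This is a minimal sketch; the hostnames and paths are placeholders for the instance in each availability zone, and each server needs a myid file matching its server.N entry:

# Illustrative zoo.cfg for a three node ensemble, one node per availability zone.
cat > /opt/zookeeper/conf/zoo.cfg <<'EOF'
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk-us-east-1a.example.com:2888:3888
server.2=zk-us-east-1b.example.com:2888:3888
server.3=zk-us-east-1c.example.com:2888:3888
EOF
# Each instance identifies itself via myid (2 and 3 on the other two instances).
echo 1 > /var/lib/zookeeper/myid

With the ensemble running, I could then do the typical service registration (using zkCli.sh) with ephemeral nodes: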

create -e /hostnames/{hostname or other id} {ipaddress or other service metadata}
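On the consumer side, a second zkCli.sh session (connected to an ensemble member in a different availability zone) can list and watch those registrations.  A minimal sketch, using the same placeholder hostnames and the /hostnames convention from above:

# Consumer session connected to an ensemble member in another availability zone.
bin/zkCli.sh -server zk-us-east-1b.example.com:2181

# Inside the zkCli prompt: list current registrations and set a watch (3.4 syntax),
# then read the metadata a service stored when it registered.
ls /hostnames true
get /hostnames/i-0a1b2c3d

# When the registering client's session ends (clean exit or session timeout),
# its ephemeral znode is removed and the watch fires on the consumer.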

Under normal circumstances, I saw the new hosts across availability zones immediately.  Also, after a client went away (exiting zkCli.sh), the other nodes would see that host disappear.  This is the foundation for service registration and ephemeral de-registration.  Using zookeeper for this ephemeral registration, automatic de-registration, and client watches is rather straightforward.  However, this was under normal circumstances.

I then used exhibitor to stop one of the three instances.  For clients connected to the remaining two instances, I could still interact with the zookeeper ensemble.  This is expected, as zookeeper keeps working as long as a majority of the nodes are still in service.  I would also expect clients connected to the failed instance to keep working, either through a more advanced client connection configuration than I used with zkCli.sh or through a discovery client’s caching of last known results.  Also, in a realistic configuration you’d likely run more than one instance of zookeeper per availability zone, so a single failing node isn’t all that interesting.  I didn’t spend too much time on this part of the experiment, as I wanted to instead focus on network partition events.
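Incidentally, a quick way to check what each server thinks its own state is (both here and during the partition test below) is zookeeper’s four-letter-word commands, which report the same state Exhibitor shows in its UI:

# Ask each server whether its process is up and what mode it is in.
echo ruok | nc zk-us-east-1a.example.com 2181   # replies "imok" if the process is running
echo srvr | nc zk-us-east-1b.example.com 2181   # output includes a "Mode: leader" or "Mode: follower" line
echo srvr | nc zk-us-east-1c.example.com 2181   # a node that has lost quorum reports it is not serving requests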

For network partition events, I followed a Jepsen-like approach to simulating partitions.  I went into one of the instances and did an iptables DROP on all packets coming from the other two instances.  This simulates an availability zone that continues to function, but has lost network connectivity to the other availability zones.  What I saw was that the two other instances noticed the first server “going away”, but continued to function as they still saw a majority (66%).  More interestingly, the first instance noticed the other two servers “going away”, dropping its view of ensemble availability to 33%.  This caused the first server to stop serving requests to clients (not only writes, but also reads).
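The partition itself was nothing more than a couple of iptables rules on the first instance (the IP addresses below are placeholders for the other two zookeeper nodes); deleting the same rules heals the partition:

# On the "partitioned" instance: drop all traffic from the other two ensemble members.
iptables -A INPUT -s 10.0.2.15 -j DROP
iptables -A INPUT -s 10.0.3.27 -j DROP

# Heal the partition by removing the rules.
iptables -D INPUT -s 10.0.2.15 -j DROP
iptables -D INPUT -s 10.0.3.27 -j DROP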

You can see this happening “live” via the following animated GIF.  Note the colors of the nodes in the exhibitor UIs as I separate the left node from the middle and right nodes:


To me this seems like a concern, as a network partition should be considered an event that can be survived.  In this case (with this specific configuration of zookeeper), no new clients in the partitioned availability zone would be able to register themselves, so consumers within that same zone could not discover them.  Adding more zookeeper instances to the ensemble wouldn’t help: quorum requires a majority of the whole ensemble, so with a balanced deployment across three zones an isolated zone always holds the minority (33%) while the other side holds the majority (66%).

It is worth noting that if your application cannot survive a network partition event itself, then you likely aren't going to worry so much about your service registry and what I have shown here.  That said, applications that can handle network partitions well are likely either completely stateless or have relaxed eventual consistency in their data tier.

I didn’t simulate the worst-case scenario, which would be to separate all three availability zones from each other.  In that case, all availability of zookeeper would go to zero, as each zone would represent a minority (33% for each zone).

One other thing I didn’t cover here is the bootstrapping of the service registry location (ensuring that finding the service registry is itself highly available).  With Eureka, there is a pretty cool trick that depends only on DNS (specifically DNS TXT records) on any client, so the actual location of the Eureka servers is not burned into client images.  I’m not sure how to approximate this with zookeeper currently (or whether there are zookeeper client libraries that keep this from being an issue), but you could certainly follow a similar trick.
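To make the Eureka trick concrete: as I understand it from the Eureka documentation, clients resolve a region-level DNS TXT record that names per-zone TXT records, each of which lists the Eureka servers for that zone.  The record and host names below are illustrative:

# Region-level TXT record names the per-zone TXT records.
dig +short TXT txt.us-east-1.mydomain.com
#   "txt.us-east-1a.mydomain.com" "txt.us-east-1b.mydomain.com" "txt.us-east-1c.mydomain.com"

# Each zone-level TXT record lists the registry hosts for that zone.
dig +short TXT txt.us-east-1a.mydomain.com
#   "eureka1.us-east-1a.mydomain.com" "eureka2.us-east-1a.mydomain.com"

You could publish a zookeeper ensemble’s hostnames the same way, which would at least keep the registry location out of baked client images.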

I believe most of this comes down to the fact that zookeeper isn’t really designed to be a distributed cloud native service registry.  It is true that zookeeper makes service registration, ephemeral de-registration, and client notifications easy to implement, but I think its focus on consistency makes it a poor choice for a service registry in a cloud environment when the application itself is designed to handle network partitions.  That’s not to say there are no use cases where zookeeper is far better than Eureka.  As Netflix has said, for cases with strong consistency requirements (leader election, ordered updates, and distributed synchronization), zookeeper is the better choice.


I am still learning here myself, so I’d welcome comments from Parse and Airbnb on whether I missed something important about their use of zookeeper for service registration.  For now, my experiment shows, at the very least, that I wouldn’t want to use zookeeper for service registration without a far more advanced approach to handling network partitions.