I have been working with IBM teams to deploy IBM public
cloud services using large parts of the Netflix OSS cloud platform. I realize that there are multiple ways to skin every cat, and there are parts of the Netflix approach that aren’t yet well accepted across the majority of cloud native architectures in the industry (cloud native is still an architecture being re-invented and improved). Given this state of the world, I get asked questions along the lines of “why should we use Netflix technology or approach foobar vs. some other well-adopted cloud approach baz that I saw at a meetup?”
Sometimes the answer is easy. For example, there are things that Netflix does for web-scale reasons that aren’t initially sensible below web scale. Sometimes the answer is far harder, and it likely lies in experience Netflix has gained that isn’t evident until a service that didn’t adopt the technology gets hit with a disruption that makes the reasoning clear. My services haven’t yet been through anywhere near this battle hardening. Therefore, I admit that I personally sometimes lack the full insight required to fully answer the questions.
One “foobar” that has bothered me for some time is the use of Eureka for service registration and discovery. The equivalent “baz” has been zookeeper. You can read Netflix’s explanation of why they use Eureka (vs. zookeeper) for a service registry on their wiki, and there is more discussion on their mailing list. The explanations seem reasonable, but it kept bothering me as I see “baz” approaches left and right. Some examples are Parse’s recent re:Invent 2013 session and Airbnb’s SmartStack. SmartStack shows most clearly how Airbnb is using zookeeper, since Nerve and Synapse have been released as open source, but Charity Majors was also pretty clear architecturally in her talk. Unfortunately, both seem to lack clarity on how zookeeper handles cloud native high availability (how it handles known cloud failures and how zookeeper is deployed and bootstrapped). For what it’s worth, Airbnb already states publicly that “the Achilles heel of SmartStack is zookeeper”. I do think there are non-zookeeper aspects of
SmartStack that are very interesting (more on that later).
To see if I understood the issues, I wanted to run a quick experiment around high availability. The goal was to understand what happens to clients during a zookeeper node failure or a network partition event between availability zones.
I decided to download the docker container from Scott Clasen that has zookeeper and Netflix’s exhibitor already installed and ready to play with. Docker is useful as it allows you to stitch things together quickly on a local machine without the need for full-blown connectivity to the cloud. I was able to hack this together running three containers (as a zookeeper ensemble) by using a mix of docker hostnames and xip.io (sketched below), but I wasn’t comfortable that this was a fair test given the DNS wasn’t “standard”. Exhibitor also allowed me to get moving quickly with zookeeper given its web-based UI.
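To give a flavor of the local attempt, it amounted to something like the following; the image name, ports, and xip.io hostnames are illustrative placeholders rather than the exact commands I ran:

# three local containers, one per pretend “zone”; the image name is a placeholder
docker run -d -h zk1.10.0.0.1.xip.io -p 2181:2181 -p 8081:8080 someuser/zk-exhibitor
docker run -d -h zk2.10.0.0.1.xip.io -p 2182:2181 -p 8082:8080 someuser/zk-exhibitor
docker run -d -h zk3.10.0.0.1.xip.io -p 2183:2181 -p 8083:8080 someuser/zk-exhibitor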
Once I saw the issue with docker not being able to edit /etc/hosts, I was even less confident that I would be able to simulate network partitions (which I planned to do with a Jepsen-style iptables DROP approach).
Instead, I decided to move the files Scott set up in the docker image over to cloud instances (hello, where is docker import support on clouds?!). I started up one zookeeper/exhibitor instance per availability zone within a region.
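For reference, the ensemble definition that exhibitor ends up managing boils down to a standard zoo.cfg along these lines; the hostnames here are placeholders for the real per-zone hostnames I used:

# one ensemble participant per availability zone (placeholder hostnames)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk-us-east-1a.example.com:2888:3888
server.2=zk-us-east-1b.example.com:2888:3888
server.3=zk-us-east-1c.example.com:2888:3888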
Pretty quickly I had a three-node zookeeper ensemble defined (with real hostnames) and running across the availability zones. I could then do the typical service registration (using zkCli.sh) with ephemeral nodes:

create -e /hostnames/{hostname or other id} {ipaddress or other service metadata}
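Concretely, with made-up hostnames and addresses, the producer and consumer sides look something like this in a 3.4-era zkCli.sh session (newer CLIs use ls -w for the watch):

# producer: connect to any ensemble member, create the parent znode once,
# then register this instance as an ephemeral child
./zkCli.sh -server zk-us-east-1a.example.com:2181
create /hostnames ""
create -e /hostnames/web-host-1 10.0.1.15

# consumer: list the registered instances and set a one-shot watch; when a
# producer exits (or its session times out) the ephemeral node is removed,
# a NodeChildrenChanged event fires, and the consumer re-reads and re-watches
ls /hostnames true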
Under normal circumstances, I saw the new hosts across availability zones immediately. Also, after a client went away (exiting zkCli.sh), the other nodes would see the host disappear. This is the foundation for service registration and ephemeral de-registration. Using zookeeper for this ephemeral registration, automatic de-registration, and client watches is rather straightforward. However, this was all under normal circumstances.
I then used exhibitor to stop one of the three instances. Clients connected to the remaining two instances could still interact with the zookeeper ensemble. This is expected, as zookeeper keeps working as long as a majority of the nodes are still in service. I would also expect clients connected to the failed instance to keep working, either through a more advanced client connection configuration than I used with zkCli.sh or through a discovery client’s caching of last known results. Also, in a realistic configuration you would likely run more than one instance of zookeeper per availability zone, so a single failing node isn’t all that interesting. I didn’t spend too much time on this part of the experiment, as I wanted to focus instead on network partition events.
For network partition events, I followed a Jepsen-like approach. I went into one of the instances and did an iptables DROP on all packets coming from the other two instances (the rules are sketched below). This simulates an availability zone that keeps functioning but loses network connectivity to the other availability zones. What I saw was that the two other instances noticed the first server “go away”, but they continued to function as they still saw a majority (66%). More interestingly, the first instance noticed the other two servers “go away”, dropping the ensemble availability it saw to 33%. This caused the first server to stop serving requests to clients (not only writes, but also reads).
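For reference, the Jepsen-style partition amounts to iptables rules like these on the instance being cut off (the addresses are placeholders for the other two ensemble members):

# drop everything arriving from the other two zookeeper servers
iptables -A INPUT -s 10.0.2.20 -j DROP
iptables -A INPUT -s 10.0.3.30 -j DROP

# heal the partition later by deleting the same rules
iptables -D INPUT -s 10.0.2.20 -j DROP
iptables -D INPUT -s 10.0.3.30 -j DROP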
You can see this happening “live” via the following animated GIF. Note the colors of the nodes in the exhibitor UIs as I separate the left node from the middle and right nodes:
To me this seems like a concern, as a network partition should be considered an event to be survived. In this case (with this specific configuration of zookeeper), no new clients in the partitioned availability zone would be able to register themselves, even for consumers within that same availability zone. Adding more zookeeper instances to the ensemble wouldn’t help: with a balanced deployment across three zones, the cut-off zone always holds a minority (33%) while the other two hold the majority (66%). For example, with nine nodes spread three per zone, a quorum needs five, and a single zone can never provide more than three.
It is worth noting that if your application cannot survive a network partition event itself, then you likely aren't going to worry so much about your service registry and what I have shown here. That said, applications that can handle network partitions well are likely either completely stateless or have relaxed eventual consistency in their data tier.
I didn’t simulate the worst-case scenario, which would be to separate all three availability zones from each other. In that case, zookeeper availability would go to zero everywhere, as each zone would hold only a minority (33%).
One other thing I didn’t cover here is the bootstrapping of the service registry location (ensuring that the process of locating the service registry is itself highly available). With Eureka, there is a pretty cool trick that depends only on DNS (specifically DNS TXT records) on any client, so the actual location of the Eureka servers is not burned into client images. I’m not sure how to approximate this with zookeeper currently (or whether there are zookeeper client libraries that keep this from being an issue), but you could certainly follow a similar trick.
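For context, the Eureka trick works roughly like this: clients are configured with only a domain name, a region-level TXT record lists the zone-level records, and each zone-level TXT record lists the Eureka servers in that zone. A sketch of what that can look like (the domain and hostnames are made up):

# region-level record pointing at the per-zone records
dig +short TXT txt.us-east-1.mydiscoverydomain.com
# "txt.us-east-1a.mydiscoverydomain.com" "txt.us-east-1b.mydiscoverydomain.com"

# zone-level record listing the Eureka servers in that zone
dig +short TXT txt.us-east-1a.mydiscoverydomain.com
# "eureka1.us-east-1a.mydiscoverydomain.com"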
I believe most of this comes down to the fact that zookeeper
isn’t really designed to be a distributed cloud native service registry. It is true that zookeeper can make
implementing service registration, ephemeral de-registration, and client
notifications easy, but I think its focus on consistency makes it a poor choice
for a service registry in a cloud environment with an application that is designed to handle network partitions.
That’s not to say there are not use cases where zookeeper is far better
than Eureka. As Netflix has said, for
cases that have high consistency requirements (leader election, ordered
updates, and distributed synchronization) zookeeper is a better choice.
I am still learning here myself, so I’d welcome comments from Parse and Airbnb on whether I missed something important about their use of zookeeper for service registration. For now, my experiment shows, at the very least, that I wouldn’t want to use zookeeper for service registration without a far more advanced approach to handling network partitions.