Friday, May 16, 2014

How intrusive do you want your service discovery to be?

In working on the Acme Air NetflixOSS Docker local implementation we ended up having two service discovery mechanisms (Eureka and SkyDNS). This gave me a concrete place to start to ponder issues that have come up inside of IBM relating to service discovery. Specifically the use of Eureka has been called "intrusive" on application design as it requires application changes to enable service registration and service query/location when load balancing. This blog post aims to start a discussion on the pros and cons of service discovery being "intrusive".

First (in the top of the picture), we had the NetflixOSS based Eureka. The back end microservice (the auth service) would, as part of its Karyon bootstrapping, make a call to register itself in the Eureka servers. Then, when the front end web app wanted to call the back end microservice via Ribbon with client side load balancing, it would do so based on information about service instances gained by querying the Eureka server (something Ribbon has native support for). This is how the NetflixOSS based service discovery has worked on Amazon, in our port to the IBM Cloud - SoftLayer, and in our Docker local port.

Next we had SkyDNS and Skydock (in the bottom of the picture). We used this to have DNS naming between containers. Interestingly we used SkyDNS to tell clients how to locate Eureka itself. We also used it to have clients locate services that weren't Eureka enabled (and therefore locatable) - such as Cassandra and our auto scaling service. Using Skydock, we were able to know that containers being started with an image name of "eureka" would easily resolve by other containers to a simple hostname "" (we used the "local" as the environment and "" as the domain name). Similarly cass images registered as Skydock works by registering with the Docker daemon's event API so it sees when containers start and stop (or die). Based on these events Skydock registers the container into SkyDNS on behalf of the container. Skydock also periodically queries for the running containers on a host and will update SkyDNS with a heartbeat to avoid the DNS entry from timing out.

Before I go into comparing these service discovery technologies, let me say that each did what it was intended to do well. Eureka gave us very good application level service discovery, while Skydock/SkyDNS gave us very good basic container location.

If you compare these approaches, roughly:
  1. Skydock and Eureka client (registration) are similar in both perform the registration and heartbeating for service instances
  2. SkyDNS and Eureka server are similar in that both host the information about the service instances
  3. DNS offered by SkyDNS and Eureka client (query) are similar in that both provide lookup to clients that can load balance across instances of a service
One of the biggest differences between these approaches is that Eureka is specifically included in the service instance (above the container or VM line in IaaS) and is up to the service instance to use as part of its implementation while Skydock is outside of the scope of the service instance (and application code).  To be fair to SkyDNS, it doesn't necessarily have to be called in a mode like Skydock does.  Someone could easily write code like Eureka client that stored its data in SkyDNS instead of Eureka without using Skydock.  However, the real comparison I'm trying to make is service registration that is "intrusive" (on instance) vs. "not intrusive" (off instance).

One interesting aspect of moving service registration out of the application code or below the container/VM boundary line is that there is no application knowledge at this layer.  As an example, Karyon is written to only call the Eureka registration for the auth service once all bootstrapping of the application is done and the application is ready to receive traffic.  In the case of Skydock, the registration with SkyDNS occurs as soon as the container reports that the process is started.  If there was any initialization required in the service, this initialization wouldn't be completed and clients could find out about the service and thus receive requests before the service was at the application level ready to handle requests.

Similar to initial service registration, a service registration client outside of the application code or below the container/VM boundary cannot know true instance health.  If the VM/container is running a servlet and the application is throwing massive errors, there is no way for Skydock to know this.  Therefore Skydock will happily keep sending heartbeats to SkyDNS which means requests will keep flowing to an unhealthy instance.  Alternatively with Eureka and Karyon's integrated health management, it can stop sending heartbeats as soon as the application code deems itself unhealthy regardless of if that container/VM is running or not.

Next let's focus on SkyDNS itself and its query and storage.  SkyDNS picked DNS for each of these to lessen the impact on client applications which is a good thing when your main concern is lack of "intrusive" changes in your client code.

SkyDNS helps you not have to recode your clients by exposing service queries through standard DNS.  While I think this is beneficial, DNS in my mind wasn't designed to support an ephemeral cloud environment.  It is true that SkyDock's DNS has TTL's and heartbeats can effectively control smaller than "internet" facing TTL's typical in standard DNS servers.  However, it is well known that there are clients that don't correctly time out TTL's in their caches.  Java is notorious for ignoring TTL's without changes to the JVM security properties as lower TTL's open you up to DNS spoofing attacks.  Eureka, on the other hand, forces the clients to use the Eureka re-querying and load balancing (either through custom code or through Ribbon abstractions) that is aware of Eureka environment and service registration timeouts.

Next, SkyDNS stores the information about service instances in DNS SRV records.  SkyDNS stores (using a combination of DNS SRV records and parts of the hostname and domain used in lookup) the following information - name of service, version of service, environment (prod/test/dev/etc), region of service, host, port and TTL.  While DNS SRV records are somewhat more service oriented (they add things to DNS that wouldn't typically be there for host records like services name, port) they do not cover all of the things that Eureka allows to be shared for a service.  In addition to the service attributes provided by SkyDNS, there are more in InstanceInfo.  Some examples are important urls (status page, health check, home page), secure vs. non-secure port, instance status (UP, DOWN, STARTING, OUT_OF_SERVICE), a metadata bag per app application, and datacenter info (image name, availability zone, etc.).  I think while SkyDNS does a good job of using DNS SRV records, it has to go pretty far into domain name paths to add as much information as required on top of DNS.  Also, the extended attributes not there that exist in Eureka provide for key functionality not yet possible in a SkyDNS environment.  Two specific examples would be the instance status and datacenter info.  Instance status is used in the NetflixOSS environment by Asgard in red/black deployments.  Asgard marks service instances as OUT_OF_SERVICE allowing older clusters to remain in the service registry, but not be stopped, so that roll backs to older clusters is possible.  The extended datacenter info is useful especially in SoftLayer as we can share very specific networking (VLAN's, routers, etc.) that can make routing significantly smarter.  In the end Eureka's custom service domain model fits with a more complete description of services than DNS.

One area where non-intrusive service discovery is cited as a benefit is support of multiple languages/runtimes.  The Skydock approach doesn't care what type of runtime is being registered into SkyDNS, so it automatically works across languages/runtimes.  While Eureka has REST based interfaces to interact with clients, it is far easier today to use the Eureka Java clients for registration and query (and using higher level load balancers like Zuul and Ribbon make it even easier).  These Java clients for using the Eureka REST API's are not implemented in other languages.  At IBM, we have enabled Eureka to manage non-Java services (C based servers and NodeJS servers).  We have taken two approaches to make this easier for non-Java services.  First we have implemented an on-instance (same container or VM) Eureka "sidecar" which provides some of the same benefits external to the main service process that Eureka and Karyon provide.  We have done this both for Eureka registration and query.  Second, we have started to see users who see value in the entire NetflixOSS (including Eureka) platform implement native Eureka clients for Python and NodeJS.  These native implementations aren't complete at this point, but they could be made more complete.  Between these two options the "sidecar" approach is a stopgap.  Separating the application from the "sidecar" has some of the same issues (not as bad, but still worse than in-process) mentioned above when considering on-instance service registration.  For instance, you have to be careful about bootstrap (initialization vs. service registration) and healthcheck.  Both become more complicated to be synchronized across the service process and side car.  Also, in Docker container based clouds, having a second side car process tends to break the single process model, so having the service registration/query in process just fits better.

One final note: This comparison used SkyDNS and Skydock as the non-intrusive off-instance service registration and query.  I believe this discussion applies to any service registration technology that isn't intrusive to the service implementation or instance.  Skydock is an example of a service registry that is designed to be managed below the container/VM level.  I believe the issues presented in this blog are the reason why an application centric service registry isn't offered by IaaS clouds today.  Until IaaS clouds have a much better way for applications to report their status in a standard way to the IaaS API's, I don't think non-intrusive service discovery will be possible with the full functionality of intrusive and application integrated service discovery.

Interesting Links:

  1. SkyDNS announcement blog post
  2. Eureka wiki
  3. Service discovery for Docker via DNS
  4. Open Source Service Discovery
I do admit I'm still learning in the space.  I am very interested in thoughts from those who have used less intrusive service discovery.

FWIW, I also avoided a discussion of high availability of the deployment of the service discovery server.  That is critically important as well and I have blogged on that topic before.

Monday, May 5, 2014

Cloud Services Fabric (and NetflixOSS) on Docker

At IBM Impact 2014 last week we showed the following demo:

Direct Link (HD Version)

The demo showed an end to end NetflixOSS based environment running on Docker on a laptop.  The components running in Docker containers shown included:
  1. Acme Air Web Application - The front end web application that is NetflixOSS enabled.  In fact, this was run as a set of containers within an auto scaling group.  The web application looks up (in Eureka) ephemeral instances of the auth service micro-service and performs on-instance load balancing via Netflix Ribbon.
  2. Acme Air Auth Service - The back end micro-service application that is NetflixOSS enabled.  In fact, this was run as a set of containers within an auto scaling group.
  3. Cassandra - This was the Acme Air Netflix port that runs against Cassandra.  We didn't do much with the data store in this demo, other than making it into a container.
  4. Eureka - The NetflixOSS open source service discovery server.  The ephemeral instances of both the web application and auth service automatically register with this Eureka service.
  5. Zuul - The NetflixOSS front end load balancer.  This load balancer looks up (in Eureka) ephemeral instances of the front end web application instances to route all incoming traffic across the rest of the topology.
  6. Asgard - The NetflixOSS devops console, which allows an application or micro-service implementer to configured versioned clusters of instances.  Asgard was ported to talk to the Docker remote API as well as the Auto scaler and recovery service API.
  7. Auto scaler and recovery service.  Each of the instances ran an agent that communicates via heartbeats to this service.  Asgard is responsible for calling API's on this Auto scaler to create clusters.  The auto scaler then called Docker API's to create instances of the correct cluster size.  Then if any instance died (stopped heartbeating), the auto scaler would create a replacement instance.  Finally, we went as far as implementing the idea of datacenters (or availability zones) when launching instances by tagging this information in a "user-data" environment variable (run -e) that had an "az_name" field.
You can see the actual setup in the following slides:

Docker Demo IBM Impact 2014 from aspyker

Once we had this setup, we can locally test "operational" scenarios on Docker including the following scenarios:
  1. Elastic scalability.  We can easily test if our services can scale out and automatically be discovered by the rest the environment and application.
  2. Chaos Monkey.  As shown in the demo, we can test if killing single instances impacted overall system availability and if the system auto recovered a replacement instance.
  3. Chaos Gorilla.  Given we have tagged the instances with their artificial data center/availability zone, we can kill all instances within 1/3 of the deployment emulating a datacenter going away.  We showed this in the public cloud SoftLayer back at dev@Pulse.
  4. Split Brain Monkey.  We can use the same datacenter/availability tagging to isolate instances via iptables based firewalling (similar to Jepsen).
We want to use this setup to a) help our Cloud Service Fabric users understand the Netflix based environment more quickly b) allow our users to do simple localized "operational" tests as listed above before moving to the cloud and c) use this in our continuous integration/delivery pipelines to do mock testing on a closer to production environment than possible on bare metal or memory hungry VM based setups.  More strategically, this work shows that if clouds supported containers and the Docker API we could move easily between a NeflixOSS powered virtual machine and container based approach.

Some details of the implementation:

The Open Source

Updated 2014/06/09 - This project is now completely open source.  For more details see the following blog entry.

The Auto Scaler and agent

The auto scaler and on instance agents talking to the auto scaler being used here are working prototypes from IBM research.  Right now we do not have plans to open source this auto scaler which makes open sourcing the entire solution impossible.  The work to implement an auto scaler is non-trivial and was a large piece of work.

The Asgard Port

In the past, we had already ported Asgard to talk to IBM's cloud (SoftLayer) and its auto scaler (RightScale).  We extended this porting work to instead talk to our Auto scaler and Docker's remote API.  The work was pretty similar and therefore easily achieved in a week or so of work.

The Dockerfiles and their containers

Other than the aforementioned auto scaler and our Asgard port, we were able to use the latest CloudBees binary releases of all of the NetflixOSS technologies and Acme Air.  If we could get the auto scaler and Asgard port moved to public open source, anyone in the world could replicate this demo themselve easily.  We have a script to compile all of our Docker files (15 in all, including some base images) and it takes about 15 minutes on a decent Macbook.  This time is spent mostly in download time and compile steps for our autoscaler and agent.

Creation of these Dockerfiles took about a week to get the basic functionality.  Making them work with the autoscaler and required agents took a bit longer.

We choose to run our containers as "fuller" OS's vs. single process.  On each node we ran the main process for the node, a ssh daemon (to allow more IaaS like access to the filesystem) and the auto scaling agent.  We used supervisord to allow for easy management of these processes inside of Ubuntu on Docker.

The Network

We used the Eureka based service location throughout with no changes to the Eureka registration client.  In order to make this easy to humans (hostnames vs. IP's) we used skydock and skydns to give each tier of the application it's own domain name using --dns and --name options when running containers to associate incremental names for each cluster.  For example, when starting two cassandra nodes, they would show up in skydns as and  We also used routing and bridging to make the entire environment easy to access from the guest laptop.

The Speed

The fact that I can start this all on a single laptop isn't the only impressive aspect.  I ran this with my virtual box being set to three gigs of memory for the boot2docker VM.  Running the demo spins the cooling fan as this required a good bit of CPU, but in terms of memory it was far lighter than I've seen in other environments.
The really impressive aspect is that I can in 90 seconds (including a 20 second sleep waiting for Cassandra to peer) restart an entire environment including two auto scaling clusters of two nodes each and the other five infrastructural services.  This includes all the staggered starts required for starting the database, loading it with data, starting service discovery and dns, starting an autoscaler and defining the clusters to the auto scaler and the final step of them all launching and interconnecting.

Setting this up in a traditional cloud would have taken at least 30 minutes based on my previous experience.

I hope this explanation will be of enough interest to you to consider future collaboration.  I also hope to get up to Dockercon in June in case you also want to talk about this in person.

The Team

I wanted to give credit where credit is due.  The team of folks working on this included folks across IBM developer and research including Takahiro Inaba, Paolo Dettori, and Seelam Seetharami.