Friday, May 16, 2014

How intrusive do you want your service discovery to be?

In working on the Acme Air NetflixOSS Docker local implementation, we ended up with two service discovery mechanisms (Eureka and SkyDNS). This gave me a concrete place to start pondering issues that have come up inside IBM relating to service discovery. Specifically, the use of Eureka has been called "intrusive" to application design, as it requires application changes to enable service registration and service query/location when load balancing. This blog post aims to start a discussion on the pros and cons of service discovery being "intrusive".



First (in the top of the picture), we had the NetflixOSS based Eureka. The back end microservice (the auth service) would, as part of its Karyon bootstrapping, make a call to register itself with the Eureka servers. Then, when the front end web app wanted to call the back end microservice via Ribbon with client side load balancing, it would do so based on information about service instances gained by querying the Eureka server (something Ribbon has native support for). This is how NetflixOSS based service discovery has worked on Amazon, in our port to the IBM Cloud - SoftLayer, and in our Docker local port.
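To make the registration step concrete: the Eureka client effectively POSTs an InstanceInfo-style document describing the instance to the Eureka server's REST API, then heartbeats to keep it alive. The sketch below is illustrative only; the field set is a simplified subset of Eureka's InstanceInfo, not the complete schema your Eureka version expects:

```python
def build_registration(app, hostname, ip_addr, port):
    """Build a simplified Eureka-style instance registration document.

    Roughly what Karyon sends on the auth service's behalf at
    bootstrap; a sketch, not the full InstanceInfo schema.
    """
    return {
        "instance": {
            "app": app.upper(),   # Eureka app names are uppercased
            "hostName": hostname,
            "ipAddr": ip_addr,
            "port": port,
            "status": "UP",       # advertise readiness to clients
        }
    }

doc = build_registration("auth-service", "auth1.local", "10.0.0.5", 8080)
# A real client would POST this document as JSON to the Eureka server
# and then heartbeat periodically to keep the registration alive.
print(doc["instance"]["app"])  # AUTH-SERVICE
```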

Next we had SkyDNS and Skydock (in the bottom of the picture). We used these to provide DNS naming between containers. Interestingly, we used SkyDNS to tell clients how to locate Eureka itself. We also used it to let clients locate services that weren't Eureka enabled (and therefore not locatable through Eureka), such as Cassandra and our auto scaling service. Using Skydock, containers started from an image named "eureka" would resolve for other containers at the simple hostname "eureka.local.flyacmeair.net" (we used "local" as the environment and "flyacmeair.net" as the domain name). Similarly, Cassandra images registered as cass.local.flyacmeair.net. Skydock works by registering with the Docker daemon's event API, so it sees when containers start and stop (or die). Based on these events, Skydock registers the container into SkyDNS on behalf of the container. Skydock also periodically queries for the running containers on a host and updates SkyDNS with a heartbeat to keep the DNS entries from timing out.
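The naming convention we relied on can be summarized in a couple of lines. This is a sketch of Skydock's behavior as we configured it, not Skydock's actual code:

```python
def skydock_hostname(image, environment="local", domain="flyacmeair.net"):
    # Skydock derives the DNS name from the Docker image name plus the
    # environment and domain it was configured with, so any container
    # started from a given image resolves at a predictable hostname.
    return f"{image}.{environment}.{domain}"

print(skydock_hostname("eureka"))  # eureka.local.flyacmeair.net
print(skydock_hostname("cass"))    # cass.local.flyacmeair.net
```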

Before I go into comparing these service discovery technologies, let me say that each did what it was intended to do well. Eureka gave us very good application level service discovery, while Skydock/SkyDNS gave us very good basic container location.

If you compare these approaches, roughly:
  1. Skydock and the Eureka client (registration) are similar in that both perform registration and heartbeating for service instances
  2. SkyDNS and the Eureka server are similar in that both host the information about service instances
  3. The DNS interface offered by SkyDNS and the Eureka client (query) are similar in that both provide lookup to clients that can load balance across instances of a service
One of the biggest differences between these approaches is that Eureka is specifically included in the service instance (above the container or VM line in IaaS), and it is up to the service instance to use it as part of its implementation, while Skydock is outside the scope of the service instance (and application code).  To be fair to SkyDNS, it doesn't necessarily have to be used in the mode Skydock uses.  Someone could easily write code like the Eureka client that stored its data in SkyDNS instead of Eureka, without using Skydock.  However, the real comparison I'm trying to make is service registration that is "intrusive" (on instance) vs. "not intrusive" (off instance).

One interesting aspect of moving service registration out of the application code, or below the container/VM boundary line, is that there is no application knowledge at that layer.  As an example, Karyon is written to call the Eureka registration for the auth service only once all bootstrapping of the application is done and the application is ready to receive traffic.  In the case of Skydock, the registration with SkyDNS occurs as soon as the container reports that the process has started.  If the service required any initialization, that initialization wouldn't be complete, and clients could discover the service, and thus send it requests, before the service was ready at the application level to handle them.
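The ordering difference can be made concrete with a tiny sketch (the event names here are hypothetical, not a real Karyon or Skydock API):

```python
def start_karyon_style(events):
    # Karyon-style: finish all application bootstrap work first, and
    # only then register, so being discoverable implies being ready.
    events.append("bootstrap")
    events.append("register")

def start_skydock_style(events):
    # Skydock-style: registration happens as soon as the container
    # starts, before application initialization completes, so clients
    # can discover (and call) an instance that isn't ready yet.
    events.append("register")
    events.append("bootstrap")

karyon, skydock = [], []
start_karyon_style(karyon)
start_skydock_style(skydock)
print(karyon)   # ['bootstrap', 'register']
print(skydock)  # ['register', 'bootstrap']
```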

Similar to initial service registration, a service registration client outside of the application code, or below the container/VM boundary, cannot know true instance health.  If the VM/container is running a servlet and the application is throwing massive errors, there is no way for Skydock to know this.  Therefore Skydock will happily keep sending heartbeats to SkyDNS, which means requests will keep flowing to an unhealthy instance.  Alternatively, with Eureka and Karyon's integrated health management, the instance can stop sending heartbeats as soon as the application code deems itself unhealthy, regardless of whether the container/VM is still running.
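Put as predicates, the two heartbeat decisions differ only in what they can observe. A sketch, where `app_is_healthy` stands in for Karyon's healthcheck integration:

```python
def skydock_should_heartbeat(container_running):
    # Skydock only sees the Docker daemon's view: if the container is
    # up, it keeps heartbeating SkyDNS, healthy or not.
    return container_running

def eureka_should_heartbeat(container_running, app_is_healthy):
    # With Karyon's health integration, the application can veto its
    # own heartbeat, so an unhealthy instance drops out of Eureka even
    # though its container is still running.
    return container_running and app_is_healthy()

servlet_throwing_errors = lambda: False
print(skydock_should_heartbeat(True))                          # True
print(eureka_should_heartbeat(True, servlet_throwing_errors))  # False
```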

Next let's focus on SkyDNS itself and its query and storage.  SkyDNS picked DNS for both of these to lessen the impact on client applications, which is a good thing when your main concern is avoiding "intrusive" changes to your client code.

SkyDNS helps you avoid recoding your clients by exposing service queries through standard DNS.  While I think this is beneficial, DNS in my mind wasn't designed to support an ephemeral cloud environment.  It is true that SkyDNS supports TTLs, and heartbeats can effectively maintain TTLs much smaller than the "internet"-facing TTLs typical of standard DNS servers.  However, it is well known that there are clients that don't correctly time out TTLs in their caches.  Java is notorious for ignoring TTLs unless you change the JVM security properties, since honoring lower TTLs opens you up to DNS spoofing attacks.  Eureka, on the other hand, forces clients to use the Eureka re-querying and load balancing (either through custom code or through Ribbon abstractions), which are aware of the Eureka environment and service registration timeouts.
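For Java clients specifically, this cache behavior is controlled by the `networkaddress.cache.ttl` security property; with a security manager installed, successful lookups are cached forever by default, so honoring SkyDNS's short TTLs requires an explicit change along these lines (the 30-second value is just an example):

```properties
# $JAVA_HOME/jre/lib/security/java.security
# Cache successful DNS lookups for only 30 seconds instead of forever.
networkaddress.cache.ttl=30
```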

Next, SkyDNS stores the information about service instances in DNS SRV records.  SkyDNS stores (using a combination of DNS SRV records and parts of the hostname and domain used in lookup) the following information: name of service, version of service, environment (prod/test/dev/etc.), region of service, host, port, and TTL.  While DNS SRV records are somewhat more service oriented (they add things to DNS that wouldn't typically be there for host records, like service name and port), they do not cover all of the things that Eureka allows to be shared for a service.  In addition to the service attributes provided by SkyDNS, there are more in Eureka's InstanceInfo.  Some examples are important URLs (status page, health check, home page), secure vs. non-secure port, instance status (UP, DOWN, STARTING, OUT_OF_SERVICE), a metadata bag per application, and datacenter info (image name, availability zone, etc.).  While SkyDNS does a good job of using DNS SRV records, it has to go pretty far into domain name paths to encode as much information as it does on top of DNS.  Also, the extended attributes that exist in Eureka but not in SkyDNS enable key functionality not yet possible in a SkyDNS environment.  Two specific examples are instance status and datacenter info.  Instance status is used in the NetflixOSS environment by Asgard in red/black deployments: Asgard marks service instances as OUT_OF_SERVICE, allowing older clusters to remain in the service registry without being stopped, so that rolling back to an older cluster is possible.  The extended datacenter info is especially useful in SoftLayer, as we can share very specific networking details (VLANs, routers, etc.) that can make routing significantly smarter.  In the end, Eureka's custom service domain model allows for a more complete description of services than DNS.
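The attribute gap, using just the fields listed above, comes out to roughly this (attribute names are my own labels for the fields described in this post):

```python
# What SkyDNS can express via SRV records plus the lookup name.
skydns_attrs = {"service_name", "version", "environment", "region",
                "host", "port", "ttl"}

# What Eureka's InstanceInfo adds on top of those.
eureka_extra = {"status_page_url", "health_check_url", "home_page_url",
                "secure_port", "instance_status", "metadata",
                "datacenter_info"}

eureka_attrs = skydns_attrs | eureka_extra
print(len(skydns_attrs), len(eureka_attrs))  # 7 14
```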

One area where non-intrusive service discovery is cited as a benefit is support for multiple languages/runtimes.  The Skydock approach doesn't care what type of runtime is being registered into SkyDNS, so it automatically works across languages/runtimes.  While Eureka has REST-based interfaces for clients, it is far easier today to use the Eureka Java clients for registration and query (and higher level load balancers like Zuul and Ribbon make it even easier).  Comparable clients for the Eureka REST APIs are not implemented in other languages.  At IBM, we have enabled Eureka to manage non-Java services (C-based servers and NodeJS servers).  We have taken two approaches to make this easier for non-Java services.  First, we have implemented an on-instance (same container or VM) Eureka "sidecar", which provides, external to the main service process, some of the same benefits that Eureka and Karyon provide.  We have done this both for Eureka registration and query.  Second, we have started to see users who see value in the entire NetflixOSS platform (including Eureka) implement native Eureka clients for Python and NodeJS.  These native implementations aren't complete at this point, but they could be made more complete.  Between these two options, the "sidecar" approach is a stopgap.  Separating the application from the sidecar has some of the same issues mentioned above for off-instance service registration (not as bad, but still worse than in-process).  For instance, you have to be careful about bootstrap (initialization vs. service registration) and healthcheck; both become more complicated to synchronize between the service process and the sidecar.  Also, in Docker container based clouds, a second sidecar process tends to break the single-process model, so having service registration/query in process just fits better.
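One tick of such a sidecar's loop can be sketched as follows; the three callables are hypothetical stand-ins for polling the local service's healthcheck URL and calling Eureka's heartbeat/cancel operations:

```python
def sidecar_tick(check_app_health, send_heartbeat, cancel_lease):
    # One iteration of an on-instance sidecar loop, run on behalf of a
    # non-Java service process. Because the sidecar sits outside that
    # process, check_app_health is the best proxy it has for true
    # application health.
    if check_app_health():
        send_heartbeat()   # keep the Eureka lease alive
    else:
        cancel_lease()     # let the instance fall out of discovery

calls = []
sidecar_tick(lambda: True,  lambda: calls.append("heartbeat"),
             lambda: calls.append("cancel"))
sidecar_tick(lambda: False, lambda: calls.append("heartbeat"),
             lambda: calls.append("cancel"))
print(calls)  # ['heartbeat', 'cancel']
```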

One final note: this comparison used SkyDNS and Skydock as the non-intrusive, off-instance service registration and query.  I believe this discussion applies to any service registration technology that isn't intrusive to the service implementation or instance.  Skydock is an example of a service registry designed to be managed below the container/VM level.  I believe the issues presented in this blog are the reason why an application-centric service registry isn't offered by IaaS clouds today.  Until IaaS clouds have a much better way for applications to report their status in a standard way to the IaaS APIs, I don't think non-intrusive service discovery will be possible with the full functionality of intrusive, application-integrated service discovery.

Interesting Links:

  1. SkyDNS announcement blog post
  2. Eureka wiki
  3. Service discovery for Docker via DNS
  4. Open Source Service Discovery

I do admit I'm still learning in this space.  I am very interested in thoughts from those who have used less intrusive service discovery.

FWIW, I also avoided a discussion of the high availability of the service discovery server deployment.  That is critically important as well, and I have blogged on that topic before.

5 comments:

  1. Andrew, this is an excellent writeup!

    Yes, both models have good specific use cases. The charm of DNS-based querying is that it's the standard and hence can be used across a spectrum of existing frameworks.
    Using Eureka (currently) requires tighter coupling.

    If going with Eureka, the "sidecar" approach is a good model and probably the choice if an ecosystem has many variations of stacks running (LAMP, Ruby, Python/Django etc.) as it will not be practical to have first class implementations of direct (intrusive, in-application) registrations/callbacks/querying.

    What's important in a high-throughput system is the speed with which service instance availability events can be tracked. Eureka in combination with Ribbon's smart client-side load balancers enables a much faster turnaround in detection of troubled nodes. Furthermore, this model enables maintaining a "local view" of service availability as opposed to the "global view" that DNS or proxy load balancers provide.
    This is important in the ephemeral Cloud ecosystem where network behavior varies widely across the deployment.
    Although not fully fleshed out yet, we are looking into enhancing the Eureka client to support interest (subscription) based faster callbacks as opposed to the current polling model.

    And as you observed, the granularity of the metadata that can be stored as part of Eureka (configurable/extensible) is useful as well, especially in a multiple-version, staged deployment (i.e. rolling pushes, red-black pushes).
    Shoehorning this into DNS SRV records is not ideal.

    In summary, both models have their pros/cons and will surely evolve further to support different use cases.

    -Sudhir Tonse (Netflix)

  2. @Sudhir, the global vs. local view is a key point I think I missed calling out directly but alluded to. Thanks for that valuable addition.

    We also were thinking about what it would take to speed up the propagation-across-nodes issue you mentioned. FWIW, new users to Eureka, during service development, see the current wait to register and the wait for n timeouts as longer than they expect. I'm guessing that until you are running in production, this slower propagation isn't appreciated for the stability it provides.

    Likely we should team up on both of these areas.

  3. Andrew, nice writeup touching upon the two predominant approaches to service registry!

    IMHO, the critical difference between these two approaches is essentially the integration with the application, the predominant feature being the system health that you touched upon. Unless an application requires such integration/insight, both approaches can be used. Inside Netflix, we leverage healthchecks to a large extent, ranging from transmitting this status to Eureka (to be used by Ribbon) to monitoring to identify failed instances. So an intrusive registry is critical for us.

    The changes that Sudhir mentioned w.r.t. faster status propagation and interest-based subscriptions are not yet formalized. Once we structure our thoughts on the features, we will publish them on GitHub and engage with you on how we can collaborate.

  4. One of the thoughts some time ago I had was:

    Some people are already running supervisord; why can't supervisord connect to the process it is monitoring and report to SkyDNS whether it is functioning?

    I also looked at Consul, but I really like the ambassador model a lot.

    So Synapse/Nerve with etcd, as bobtfish uses it, seems useful.

    I have a feeling it fits a lot better in the radial / 12factor app model too.

    So far I can't decide. :-)

    I'll have to go try them all I guess.
