Monday, August 18, 2014

A NetflixOSS sidecar in support of non-Java services

In working on supporting our next round of IBM Cloud Service Fabric service tenants, we found that the service implementers came from very different backgrounds.  Some were skilled in Java, some in Ruby, and others were C/C++ focused, so their service implementations were just as diverse.  Given the size of the teams behind the services we're on-boarding and the timeframe for going public, recoding all of these services to use the NetflixOSS Java libraries that bring operational excellence (like Archaius, Karyon, Eureka, etc.) seemed pretty unlikely.

For what it is worth, we faced a similar challenge with earlier services (mostly due to existing C/C++ applications) and we created what was called a "sidecar".  By sidecar, I mean a second process on each node/instance that performs Cloud Service Fabric operations on behalf of the main process (the side-managed process).  Unfortunately, each of those teams went off and created a one-off sidecar for their particular service.  In this post, I'll describe a more general sidecar that doesn't force users into these one-offs.

Sidenote:  For those not familiar with sidecars, think of the motorcycle sidecar below.  Snoopy would be the main process with Woodstock being the sidecar process.  The main work on the instance would be the motorcycle (say serving your users' REST requests).  The operational control is the sidecar (say serving health checks and management plane requests of the operational platform).


Before we get started, we need to note that there are two main types of sidecars.  The first type manages durable and/or storage tiers.  These sidecars need to manage things that other sidecars do not (like joining a stateful ring of servers, joining a set of slaves and discovering masters, or backup and recovery of data).  Some sidecars that exist in this space are Priam (for Cassandra) and Exhibitor (for Zookeeper).  The other type is for managing stateless mid-tier services like microservices.  An example of this is AirBNB's Synapse and Nerve.  You'll see in the announcement of Synapse and Nerve on AirBNB's blog that they are trying to solve some (but not all) of the issues I will mention in this blog post.

What are some things that a microservice sidecar could do for a microservice?

1. Service discovery registration and heartbeat

Registration with service discovery should happen only after the sidecar detects that the side-managed process is ready to receive requests.  This isn't necessarily the same as whether the instance is "healthy", as an instance might be healthy well before it is ready to handle requests (consider an instance that needs to pre-warm caches, etc.).  Also, all dynamic configuration of this function (whether and where to register) should be considered.
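To make this concrete, here is a rough sketch (not code from the actual sidecar) of the registration flow, assuming the Eureka client's ApplicationInfoManager API; the readiness probe below is a made-up placeholder:

import com.netflix.appinfo.ApplicationInfoManager;
import com.netflix.appinfo.InstanceInfo.InstanceStatus;

public class RegistrationManager {
    // Stay in STARTING until the side-managed process can actually take traffic,
    // then flip the Eureka status to UP so clients start sending requests.
    public void waitAndRegister() throws InterruptedException {
        ApplicationInfoManager infoManager = ApplicationInfoManager.getInstance();
        infoManager.setInstanceStatus(InstanceStatus.STARTING);
        while (!isSideManagedProcessReady()) {
            Thread.sleep(5000);
        }
        infoManager.setInstanceStatus(InstanceStatus.UP);
    }

    private boolean isSideManagedProcessReady() {
        // hypothetical probe, e.g. curl a readiness URL on localhost
        return false;
    }
}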

2.  Health check URL

Every instance should have a health check URL that can communicate the health of the instance out of band.  The sidecar would need to query the health of the side-managed process and expose this URL on its behalf.  Various systems (like auto scaling groups, front end load balancers, and service discovery queries) would query this URL and take sick instances out of rotation.
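As a sketch of the idea (the real sidecar described later in this post builds on Karyon's health check plumbing instead), the sidecar could expose a health check URL with nothing more than the JDK's built-in HTTP server; the port and probe here are made up:

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class HealthCheckEndpoint {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8078), 0);
        server.createContext("/healthcheck", (HttpExchange exchange) -> {
            // the sidecar answers on behalf of the side-managed process
            boolean healthy = checkSideManagedProcess();
            byte[] body = (healthy ? "OK" : "SICK").getBytes();
            exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }

    private static boolean checkSideManagedProcess() {
        return true; // placeholder for a real probe of the main process
    }
}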

3.  Service dependency load balancing

In a NetflixOSS based microservice, routing can be done intelligently based upon information from service discovery (Eureka) via smart client side load balancing (Ribbon).  If you move this function out of the microservice implementation, as AirBNB also noted, it is likely unnecessary (and in some cases problematic) to go all the way back to centralized load balancing.  Therefore it would be nice if the sidecar would perform load balancing on behalf of the side-managed process.  Note that Zuul (on instance, in the sidecar) could fill this role in NetflixOSS.  In AirBNB's stack, the combination of service discovery and this item is done through Synapse.  Also, all dynamic configuration of this function (states of routes, timeouts, retry strategy, etc.) should be considered.

One other area to consider here (especially in the NetflixOSS space) is whether the sidecar should provide advanced devops filters in load balancing that go beyond basic round robin load balancing.  Netflix has talked about the advantages of Zuul for this in the front/edge tier, but we could consider doing something in between microservices.
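To show the essence of what I mean (explicitly not Ribbon or Zuul, just the core idea), the sidecar would choose a target instance from discovery and forward the request on behalf of the side-managed process; a trivial round-robin chooser looks like this:

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinChooser {
    private final AtomicInteger position = new AtomicInteger();

    // instances would come from Eureka (see item 5); here they are simply host:port strings
    public String choose(List<String> instances) {
        if (instances.isEmpty()) {
            throw new IllegalStateException("no instances registered for this dependency");
        }
        int index = Math.abs(position.getAndIncrement() % instances.size());
        return instances.get(index);
    }
}

A smarter chooser (zone affinity, error-based weighting, canary routing) is exactly where the Zuul/Ribbon style filters mentioned above would come in.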

4.  Microservice latency/health metrics

Being able to have operational visibility into the error rates on calls to dependent services, as well as latency and the overall state of dependencies, is important for knowing how to operate the side-managed process.  In NetflixOSS, by using the Hystrix pattern and API, you can get such visibility through the exported Hystrix streams.  Again, Zuul (on instance, in the sidecar) can provide this functionality.

5.  Eureka discovery

We have found service implementations in IBM that already have their own client side load balancing or cluster technologies.  Also, Netflix has talked about other OSS systems in this category, such as Elasticsearch.  For these systems it would be nice if the sidecar could expose Eureka discovery information outside of load balancing.  The client could then ingest the discovery information and use it however it sees fit.  Also, all dynamic configuration of this function should be considered.
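A sketch of what that lookup might look like inside the sidecar, using the 2014-era DiscoveryManager/DiscoveryClient API (how the sidecar then hands this to the side-managed process is up for debate):

import com.netflix.appinfo.InstanceInfo;
import com.netflix.discovery.DiscoveryManager;
import com.netflix.discovery.shared.Application;
import java.util.ArrayList;
import java.util.List;

public class DiscoveryLookup {
    // Return host:port strings for UP instances of an application so the
    // side-managed process can consume them however it sees fit.
    public List<String> instancesFor(String appName) {
        List<String> result = new ArrayList<String>();
        Application app =
            DiscoveryManager.getInstance().getDiscoveryClient().getApplication(appName);
        if (app == null) {
            return result;
        }
        for (InstanceInfo instance : app.getInstances()) {
            if (instance.getStatus() == InstanceInfo.InstanceStatus.UP) {
                result.add(instance.getHostName() + ":" + instance.getPort());
            }
        }
        return result;
    }
}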

6.  Dynamic configuration management

It would be nice if the sidecar could expose dynamic configuration to the side-managed process.  While I have mentioned the need to have the previous sidecar functions dynamically configured, it is important that the side-managed process's own configuration be considered as well.  Consider the case where you want the side-managed process to use a common dynamic configuration management system but all it can do is read from property files.  In NetflixOSS this is handled via Archaius, but that requires using the NetflixOSS libraries.
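For reference, this is what a Java process gets natively from Archaius, and what the sidecar would have to approximate for everyone else; the property name is made up:

import com.netflix.config.DynamicPropertyFactory;
import com.netflix.config.DynamicStringProperty;

public class ArchaiusExample {
    // A dynamic property with a default value; the callback fires whenever the
    // value changes at runtime.
    private static final DynamicStringProperty DB_HOST =
        DynamicPropertyFactory.getInstance().getStringProperty("myservice.db.host", "localhost");

    public static void main(String[] args) {
        DB_HOST.addCallback(new Runnable() {
            public void run() {
                System.out.println("db host changed to " + DB_HOST.get());
            }
        });
        System.out.println("current db host: " + DB_HOST.get());
    }
}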

7.  Circuit breaking for fault tolerance to dependencies

It would be nice if the sidecar could provide an approximation of circuit breaking.  I believe this is impossible to do as "cleanly" as using NetflixOSS Hystrix natively (a generic sidecar can't have the user write business-specific fallback logic to handle failures and reduce calls to the dependency), but it might be nice to have some level of guarantee of fast failure for the load balancing scenarios in #3.  Also, all dynamic configuration of this function (timeouts, etc.) should be considered.
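For contrast, here is the native in-process pattern the sidecar would be approximating: a Hystrix command with a business-specific fallback (the command and dependency names are made up):

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class GetRecommendationsCommand extends HystrixCommand<String> {

    public GetRecommendationsCommand() {
        super(HystrixCommandGroupKey.Factory.asKey("RecommendationService"));
    }

    @Override
    protected String run() throws Exception {
        // hypothetical call to the dependency (over Ribbon, plain HTTP, etc.)
        return callRecommendationService();
    }

    @Override
    protected String getFallback() {
        // business-specific degraded behavior -- the part a generic sidecar
        // cannot supply on its own
        return "default-recommendations";
    }

    private String callRecommendationService() throws Exception {
        throw new Exception("dependency unavailable");
    }
}

Callers run it with new GetRecommendationsCommand().execute(), and Hystrix handles the timeouts, thread isolation, opening the circuit on repeated failures, and the metrics stream mentioned in #4.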

8.  Application level metrics

It would be nice if the sidecar could allow the side-managed process to more easily publish application-specific metrics to the metrics pipeline.  While every language likely already has a nice binding to systems like statsd/collectd, it might be worth making the interface to these systems common through the sidecar.  For NetflixOSS, this is done through Servo.
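To make that concrete, the in-process Servo side is roughly this (the metric name is made up); the open question is what minimal localhost interface the sidecar should expose so a Ruby or C/C++ side-managed process can feed the same pipeline:

import com.netflix.servo.DefaultMonitorRegistry;
import com.netflix.servo.monitor.BasicCounter;
import com.netflix.servo.monitor.Counter;
import com.netflix.servo.monitor.MonitorConfig;

public class OrdersPlacedMetric {
    // Register a counter with Servo; whatever poller/publisher the platform uses
    // picks up everything in the registry.
    private static final Counter ORDERS_PLACED =
        new BasicCounter(MonitorConfig.builder("ordersPlaced").build());

    static {
        DefaultMonitorRegistry.getInstance().register(ORDERS_PLACED);
    }

    public static void recordOrderPlaced() {
        ORDERS_PLACED.increment();
    }
}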

9. Manual GUI and programmatic control

We have found the need to sometimes quickly dive into a specific instance with human eyes.  Having a private web-based UI is far easier than loading up ssh.  Also, if you want to script access to the functions and data collected by the sidecar, a REST or even JMX interface to the control offered by the sidecar would be useful.

All this said, I started a quick project last week to create a sidecar that does some of these functions using NetflixOSS so it integrates cleanly into our existing IBM Cloud Service Fabric environment.  I decided to do it on GitHub so others can contribute.

By using Karyon as a base for the sidecar, I was able to get a few of the items on the list automatically (specifically #1, #2 partially and #9).  I started with the most basic sidecar in the trunk project.  Then I added two more things:


Consul-style health checks:


In work leading up to this, Spencer Gibb pointed me to the agent checks that Consul uses (which they said they based on Nagios).  I based a similar set of checks for my sidecar on them.  You can see in this Archaius config file how you'd configure them:

com.ibm.ibmcsf.sidecar.externalhealthcheck.enabled=true
com.ibm.ibmcsf.sidecar.externalhealthcheck.numchecks=2

com.ibm.ibmcsf.sidecar.externalhealthcheck.1.id=local-ping-healthcheckurl
com.ibm.ibmcsf.sidecar.externalhealthcheck.1.description=Runs a script that curls the healthcheck url of the sidemanaged process
com.ibm.ibmcsf.sidecar.externalhealthcheck.1.interval=10000
com.ibm.ibmcsf.sidecar.externalhealthcheck.1.script=/opt/sidecars/curllocalhost.sh 8080 /
com.ibm.ibmcsf.sidecar.externalhealthcheck.1.workingdir=/tmp

com.ibm.ibmcsf.sidecar.externalhealthcheck.2.id=local-killswitch
com.ibm.ibmcsf.sidecar.externalhealthcheck.2.description=Runs a script that tests if /opt/sidecarscripts/killswitch.txt exists
com.ibm.ibmcsf.sidecar.externalhealthcheck.2.interval=30000
com.ibm.ibmcsf.sidecar.externalhealthcheck.2.script=/opt/sidecars/checkKillswitch.sh

Specifically you define a check as an external script that the sidecar executes and if the script returns a code of 0, the check is marked as healthy (1 = warning, otherwise unhealthy).  If all checks defined come back as healthy for greater than three iterations, the instance is healthy.  I have coded up some basic shell scripts that we'll likely give to all of our users (like curllocalhost.sh and checkkillswitchtxtfile.sh).  Once I had these checks being executed by the sidecar, it was pretty easy to change the Karyon/Eureka HealthCheckHandler class to query the CheckManager logic I added.
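The check execution itself is straightforward; roughly (this is a simplified sketch, not the exact code in the project) each configured check boils down to:

import java.io.File;

public class ExternalCheckRunner {
    public enum Status { HEALTHY, WARNING, UNHEALTHY }

    // Run the configured script from its working directory on each interval and
    // map the exit code: 0 = healthy, 1 = warning, anything else = unhealthy.
    public Status runCheck(String script, String workingDir) {
        try {
            ProcessBuilder builder = new ProcessBuilder(script.split(" "));
            builder.directory(new File(workingDir));
            builder.redirectErrorStream(true);
            int exitCode = builder.start().waitFor();
            if (exitCode == 0) {
                return Status.HEALTHY;
            }
            return exitCode == 1 ? Status.WARNING : Status.UNHEALTHY;
        } catch (Exception e) {
            return Status.UNHEALTHY; // failing to even run the script counts as sick
        }
    }
}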


Integration with Dynamic Configuration Management


We believe most languages can easily register events based on files changing and can easily read properties files.  Based on this, I added another feature, configured via this Archaius config file:

com.ibm.ibmcsf.sidecar.dynamicpropertywriter.enabled=true
com.ibm.ibmcsf.sidecar.dynamicpropertywriter.file.template=/opt/sidecars/appspecific.properties.template
com.ibm.ibmcsf.sidecar.dynamicpropertywriter.file=/opt/sidecars/appspecific.properties

What this says is that a user of the sidecar puts all of the properties they care about in the file.template properties file.  Then, as configuration is dynamically updated in Archaius, the sidecar sees the change and writes out a copy to the main properties file with the values filled in.
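A made-up example (the exact template format is an implementation detail): the service team ships a template listing the properties it cares about, and the sidecar keeps the real file current as Archaius values change.

/opt/sidecars/appspecific.properties.template (owned by the service team):
myservice.db.host=
myservice.featurex.enabled=

/opt/sidecars/appspecific.properties (re-written by the sidecar as values change):
myservice.db.host=db1.dal.example.com
myservice.featurex.enabled=true

The side-managed process just watches the file for changes (inotify, polling the mtime, etc.) and re-reads it, which nearly every language can do.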

With these changes, I think we now have a pretty solid story for #1, #2, #6 and #9.  I'd like to focus next on #3, #4, and #7 by adding a Zuul- and Hystrix-based sidecar process, but I don't have users (yet) pushing for these functions.  Also, I should note that the code is a proof of concept and needs to be hardened, as it was just a side project for me.

PS.  I do want to make it clear that while this sidecar approach could be used for Java services (as opposed to languages that don't have NetflixOSS bindings), I do not advocate moving these functions outside of your Java implementation.  There are places where offering this function in a sidecar isn't as "excellent" operationally and is closer to "good enough".  I'll leave it to the reader to weigh these tradeoffs.  However, I hope that work in this microservice sidecar space leads to easier NetflixOSS adoption in non-Java environments.

PPS.  This sidecar might be even more useful in the container space, at the host level.  Taking the sidecar and making it work across multiple single-process instances on a host would be an interesting extension of this work.


Wednesday, August 6, 2014

Sidecars and service registration

I have been having internal conversations about sidecars to manage microservices.  By sidecar, I mean a separate process on each instance node that performs things on behalf of the microservice instance, like service registration, service location (for dependencies), dynamic configuration management, service routing (for dependencies), etc.  I have been arguing that an in-process (vs. sidecar) approach to providing these functions, while intrusive (it requires every microservice to code to or implement a certain framework), is better.  I believe it is hard for folks to understand why one approach is "better" without actually running into the nasty things that happen in real-world production scenarios.

Today I decided to simulate a real world scenario.  I decided to play with Karyon which is the NetflixOSS in process technology to manage bootstrap and lifecycle of microservices.  I did the following:

  1. I disabled registry queries, which Karyon does by default for the application on the assumption it might need to look up dependencies (eureka.shouldFetchRegistry=false).  I did this just to simplify the timing of pure service registration.
  2. I "disabled" heartbeats for service registration (eureka.client.refresh.interval=60).  Again, I did this just to simplify the timing of initial service registration.
  3. I shortened the time for the initial service registration to one second (eureka.appinfo.initial.replicate.time=1).  I did this to be able to force the registration to happen immediately.
  4. I added a "sleep" to my microservice initialization (@Application initialize() { .. Thread.sleep(1000*60*10) } ).  I did this to simulate a microservice that takes some time to "start up" (a fuller sketch of this follows the list).
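For reference, #4 looked roughly like the sketch below.  This assumes Karyon 1.x's @Application annotation with a @PostConstruct-style initializer; the exact lifecycle hook may differ depending on your Karyon version:

import com.netflix.karyon.spi.Application;
import javax.annotation.PostConstruct;

@Application
public class SlowStartingService {
    @PostConstruct
    public void initialize() {
        try {
            // simulate a service that takes ten minutes to warm up before it
            // can really take traffic
            Thread.sleep(1000 * 60 * 10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}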
Once I did this, I saw the following:

The service started up and immediately called initialize, but of course this stalled.  The service also then immediately registered itself into the Eureka service discovery server.  At this point, a query of the service instance in the service registry returns a status of "STARTING".  After 10 minutes, the initialization finishes.  At this later point, the query of the service instance returns a status of "UP".  Pretty sensible, no?

I then started to think about whether a sidecar could somehow get this level of knowledge by poking its side-managed process.  If you look at AirBNB's Nerve (a purely sidecar-based approach), it does exactly this.  I could envision a Eureka sidecar, similar to Nerve, that pinged the "healthcheck URL" already exposed by Karyon.

This got me thinking about whether a health check URL returning 200 (OK) would be a sufficient replacement for deciding on service registration status.  Specifically, if the health check returns OK for three or so consecutive checks, have the sidecar put the service into service discovery as "up"; similarly, take it out if three or so consecutive checks return something other than 200.
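Concretely, the rule could be as simple as this hypothetical sketch, polling the health check URL and only flipping the registration after three consecutive agreeing results:

import java.net.HttpURLConnection;
import java.net.URL;

public class ConsecutiveCheckPolicy {
    private int consecutiveUp = 0;
    private int consecutiveDown = 0;
    private boolean registeredUp = false;

    // Called on every polling interval with the side-managed process's health check URL.
    public void poll(String healthCheckUrl) {
        if (isOk(healthCheckUrl)) {
            consecutiveUp++;
            consecutiveDown = 0;
        } else {
            consecutiveDown++;
            consecutiveUp = 0;
        }
        if (!registeredUp && consecutiveUp >= 3) {
            registeredUp = true;   // here the sidecar would mark the instance "up" in discovery
        } else if (registeredUp && consecutiveDown >= 3) {
            registeredUp = false;  // and here it would mark it down / remove it
        }
    }

    private boolean isOk(String healthCheckUrl) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(healthCheckUrl).openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            return conn.getResponseCode() == 200;
        } catch (Exception e) {
            return false;
        }
    }
}

As the next paragraph shows, though, "returns 200" is not the same thing as "ready for traffic".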

I posed the question on Twitter and received great feedback from Spencer Gibb.  His example was a service that needs to do a database migration before starting up.  In that case, even though the service is healthy, it shouldn't tell others it is ready to handle requests until the migration is done.  This is especially true if the health manager of your cluster kills off instances that aren't "healthy", so you can't solve the issue by just reporting "unhealthy" until the service is ready to handle requests.

This said, if a sidecar is to decide when a service should be marked as ready to handle traffic, it stands to reason that every side-managed process needs a separate URL (distinct from the health check and/or the main microservice interface) for the boot state of the service.  Also, this implies the side-managed process likely needs a framework to consistently decide on the state to be exposed by that URL.  In NetflixOSS that framework is Karyon.

I will keep thinking about this, but I find it hard to understand how a pure sidecar-based approach, with zero changes to a microservice (and no framework embedded into the side-managed process), can know when a service is really "UP" and ready to handle requests vs. "STARTING" vs. "SHUTTING DOWN", etc.  I wonder if AirBNB asks its service developers to define a "READYFORREQUESTS" URL and that is what they pass as configuration to Nerve?