Wednesday, August 6, 2014

Sidecars and service registration

I have been having internal conversations about using sidecars to manage microservices.  By sidecar, I mean a separate process on each instance node that performs functions on behalf of the microservice instance, such as service registration, service location (for dependencies), dynamic configuration management, service routing (for dependencies), etc.  I have been arguing that an in-process (vs. sidecar) approach to providing these functions, while intrusive (it requires every microservice to code to or implement a certain framework), is better.  I believe it is hard for folks to understand why things are "better" without actually running into the nasty things that happen in real-world production scenarios.

Today I decided to simulate a real-world scenario.  I decided to play with Karyon, which is the NetflixOSS in-process technology for managing the bootstrap and lifecycle of microservices.  I did the following:

  1. I disabled registry queries, which Karyon does by default for the application on the assumption that it might need to look up dependencies (eureka.shouldFetchRegistry=false).  I did this just to simplify the timing of pure service registration.
  2. I "disabled" heartbeats for service registration (eureka.client.refresh.interval=60).  Again, I did this just to simplify the timing of initial service registration.
  3. I shortened the time for the initial service registration to one second (eureka.appinfo.initial.replicate.time=1).  I did this to be able to force the registration to happen immediately.
  4. I added a "sleep" to my microservice initialization (@Application initialize() { .. Thread.sleep(1000*60*10) } ).  I did this to simulate a microservice that takes some time to start up.

Once I did this, I saw the following:

The service started up and immediately called initialize, but of course this stalled.  The service also then immediately registered itself into the Eureka service discovery server.  At this point, a query of the service instance in the service registry returns a status of "STARTING".  After 10 minutes, the initialization finishes.  At this later point, the query of the service instance returns a status "UP".  Pretty sensible, no?
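
The behavior above can be modeled in a few lines.  This is only a minimal sketch: LifecycleTrackedService and its Status enum are hypothetical names made up for illustration, not the Karyon or Eureka API.  It shows why an in-process framework can report "STARTING" accurately: it owns the call to initialize, so it knows exactly when that call returns.

```java
// Hypothetical stand-in for an in-process lifecycle framework.
// This is NOT the Karyon or Eureka API; all names here are illustrative.
class LifecycleTrackedService {
    enum Status { STARTING, UP }

    // The status is visible (and would be published to the registry)
    // from the moment the instance exists.
    private volatile Status status = Status.STARTING;

    // Runs the application's (possibly slow) initialization and only then
    // flips the status to UP. If slowWork throws, the status stays STARTING.
    void initialize(Runnable slowWork) {
        slowWork.run();
        status = Status.UP;
    }

    Status status() {
        return status;
    }
}
```

Any query made while slowWork is still running (for example, during a ten minute sleep) sees STARTING, and any query made after initialize returns sees UP.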

I then started to wonder whether a sidecar could somehow get this level of knowledge by poking its side-managed process.  If you look at Airbnb Nerve (a totally sidecar-based approach), it does exactly this.  I could envision a Eureka sidecar, similar to Nerve, that pinged the "healthcheck URL" already exposed by Karyon.

This got me wondering whether a health check URL returning 200 (OK) would be a sufficient replacement for deciding on service registration status.  Specifically, if the healthcheck returns OK for three or so checks, have the sidecar put the service into service discovery as "up".  Similarly, if three or so checks return != 200, have the sidecar mark the service as no longer "up".
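
The threshold idea could be sketched roughly as follows.  HealthCheckTracker is a hypothetical class, not part of Nerve or Eureka, and two details are my own assumptions: the checks must be consecutive, and the failure case publishes a "DOWN" status.

```java
// Hypothetical sidecar-side logic; NOT the Nerve or Eureka API.
// Assumption: "three or so checks" means consecutive checks.
class HealthCheckTracker {
    enum Status { STARTING, UP, DOWN }

    private final int threshold;
    private Status status = Status.STARTING;
    private int consecutiveOk = 0;
    private int consecutiveFail = 0;

    HealthCheckTracker(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    // Feed in the HTTP status code of one health check.
    // Returns the status the sidecar would publish to service discovery.
    Status record(int httpStatus) {
        if (httpStatus == 200) {
            consecutiveOk++;
            consecutiveFail = 0;
            if (consecutiveOk >= threshold) {
                status = Status.UP;
            }
        } else {
            consecutiveFail++;
            consecutiveOk = 0;
            if (consecutiveFail >= threshold) {
                status = Status.DOWN;
            }
        }
        return status;
    }
}
```

With a threshold of 3, the published status only changes on the third consecutive matching result, so a single flaky check does not flip the registration.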

I started a Twitter thread on this idea and received great feedback from Spencer Gibb.  His example was a service that needs to do a database migration before starting up.  In that case, while the service is healthy, until it is up it shouldn't tell others that it is ready to handle requests.  This is especially true if the health manager of your cluster is killing off instances that aren't "healthy", so you can't solve the issue by just reporting "unhealthy" until the service is ready to handle requests.

That said, if a sidecar is to decide when a service should be marked as ready to handle traffic, it would stand to reason that every side-managed process needs a separate URL (separate from the health check and/or the main microservice interface) for the boot state of the service.  Also, this would imply the side-managed process likely needs a framework to consistently decide on the state to be exposed by that URL.  In NetflixOSS that framework is Karyon.
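
To make the "separate URL" idea concrete, here is a minimal sketch of what the side-managed process might expose.  The paths /healthcheck and /bootstatus, the 503 response code, and the BootStatusResource class are all assumptions made for illustration; this is not how Karyon actually does it.  The HTTP server wiring is left out so that only the decision logic is shown.

```java
// Hypothetical sketch; the paths and names are illustrative, NOT Karyon's API.
class BootStatusResource {
    enum BootState { STARTING, UP, SHUTTING_DOWN }

    // Minimal response holder: an HTTP status code plus a body.
    static final class Response {
        final int code;
        final String body;

        Response(int code, String body) {
            this.code = code;
            this.body = body;
        }
    }

    private volatile BootState state = BootState.STARTING;

    // Called by the embedded framework as the service moves through its lifecycle.
    void setState(BootState newState) {
        state = newState;
    }

    Response handle(String path) {
        if ("/healthcheck".equals(path)) {
            // Health: the process is alive, so a cluster health manager
            // should not kill it, even while it is still booting.
            return new Response(200, "OK");
        }
        if ("/bootstatus".equals(path)) {
            // Boot state: only UP means "ready to handle requests".
            BootState current = state;
            int code = (current == BootState.UP) ? 200 : 503;
            return new Response(code, current.name());
        }
        return new Response(404, "Not Found");
    }
}
```

A sidecar would then poll /bootstatus to decide on registration, while the cluster's health manager keeps polling /healthcheck.  A service in the middle of a long database migration would answer 200 on the first and 503 STARTING on the second.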

I will keep thinking about this, but I find it hard to understand how a pure sidecar-based approach with zero changes to a microservice (without a framework embedded into the side-managed process) can know when a service is really "UP" and ready to handle requests vs. "STARTING" vs. "SHUTTING DOWN", etc.  I wonder if Airbnb asks its service developers to define a "READYFORREQUESTS" URL, and that is what they pass as configuration to Nerve?