Wednesday, August 6, 2014

Sidecars and service registration

I have been having internal conversations about sidecars to manage microservices.  By sidecar, I mean a separate process on each instance node that handles things on behalf of the microservice instance, like service registration, service location (for dependencies), dynamic configuration management, service routing (for dependencies), etc.  I have been arguing that an in-process (vs. sidecar) approach to providing these functions, while intrusive (it requires every microservice to code to or implement a certain framework), is better.  I believe it is hard for folks to understand why something is "better" without actually running into the nasty things that happen in real-world production scenarios.

Today I decided to simulate a real-world scenario.  I decided to play with Karyon, which is the NetflixOSS in-process technology to manage the bootstrap and lifecycle of microservices.  I did the following:

  1. I disabled registry queries, which Karyon does by default on the assumption that the application might need to look up dependencies (eureka.shouldFetchRegistry=false).  I did this just to simplify the timing of pure service registration.
  2. I "disabled" heartbeats for service registration (eureka.client.refresh.interval=60).  Again, I did this just to simplify the timing of initial service registration.
  3. I shortened the time for the initial service registration to one second (eureka.appinfo.initial.replicate.time=1).  I did this to be able to force the registration to happen immediately.
  4. I added a "sleep" to my microservice initialization (@Application initialize() { .. Thread.sleep(1000*60*10) .. }).  I did this to simulate a microservice that takes some time to "start up".
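Taken together, the overrides above would live in the properties file Archaius loads for the application.  The property names are the ones from the steps above; the comments are mine:

```properties
# Don't fetch the registry; this app only registers itself (step 1)
eureka.shouldFetchRegistry=false

# Stretch the refresh interval to take heartbeats out of the picture (step 2)
eureka.client.refresh.interval=60

# Replicate instance info (the initial registration) after one second (step 3)
eureka.appinfo.initial.replicate.time=1
```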
Once I did this, I saw the following:

The service started up and immediately called initialize, which of course stalled.  The service also immediately registered itself into the Eureka service discovery server.  At this point, a query of the service instance in the service registry returns a status of "STARTING".  After 10 minutes, the initialization finishes.  At that point, the same query of the service instance returns a status of "UP".  Pretty sensible, no?

I then started to think about whether a sidecar could somehow get this level of knowledge by poking its side-managed process.  If you look at Airbnb's Nerve (a purely sidecar-based approach), it does exactly this.  I could envision a Eureka sidecar, similar to Nerve, that pinged the "healthcheck URL" already exposed by Karyon.

This got me wondering whether a health check URL returning 200 (OK) would be a sufficient replacement for deciding on service registration status.  Specifically: if the health check returns 200 for three or so consecutive checks, have the sidecar mark the service as "up" in service discovery; similarly, mark it down if three or so consecutive checks return something other than 200.
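That debouncing logic is small enough to sketch.  The class and names below are mine (not Nerve's or Eureka's): count consecutive 200 / non-200 health-check results and only flip the decision after three in a row, leaving the last decision in place otherwise.

```java
// Sketch of sidecar status debouncing: three consecutive successes mark the
// instance UP, three consecutive failures mark it DOWN; a mixed run resets
// the opposite counter and keeps the last decision. Illustrative only.
public class HealthDebouncer {
    public enum Status { STARTING, UP, DOWN }

    private static final int THRESHOLD = 3;
    private int successes = 0;
    private int failures = 0;
    private Status status = Status.STARTING;

    /** Feed one health-check result (true == HTTP 200); returns the current decision. */
    public Status record(boolean healthy) {
        if (healthy) {
            successes++;
            failures = 0;
            if (successes >= THRESHOLD) status = Status.UP;
        } else {
            failures++;
            successes = 0;
            if (failures >= THRESHOLD) status = Status.DOWN;
        }
        return status;
    }
}
```

A sidecar loop would call record() once per poll interval and push any status change to the registry.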

I posted a question on Twitter about this idea and received great feedback from Spencer Gibb.  His example was a service that needed to do a database migration before starting up.  In that case, even while the service is healthy, it shouldn't tell others it is ready to handle requests until it actually is.  This is especially true if the health manager of your cluster kills off instances that aren't "healthy", so you can't solve the issue by just reporting "unhealthy" until the service is ready to handle requests.

This said, if a sidecar is to decide when a service should be marked as ready to handle traffic, it would stand to reason that every side-managed process needs a separate URL (distinct from the health check and/or the main microservice interface) for the boot state of the service.  Also, this implies the side-managed process likely needs a framework to consistently decide on the state to be exposed by that URL.  In NetflixOSS, that framework is Karyon.
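The contract behind such a boot-state URL could be as small as an enum plus a state holder that the framework flips at lifecycle transitions.  Everything here (the state names beyond Eureka's own, the /bootstate path in the comment) is my invention, not part of Karyon:

```java
// Hypothetical boot-state contract a sidecar could poll, separate from the
// health check. The framework embedded in the side-managed process (Karyon,
// in the NetflixOSS case) would be responsible for the transitions.
public class BootState {
    public enum State { STARTING, UP, SHUTTING_DOWN }

    private volatile State state = State.STARTING;

    /** Called by the bootstrap framework once initialize() has completed. */
    public void markUp() { state = State.UP; }

    /** Called from a shutdown hook before the service stops taking traffic. */
    public void markShuttingDown() { state = State.SHUTTING_DOWN; }

    /** Body a sidecar would read from a boot-state URL, e.g. GET /bootstate. */
    public String asResponseBody() { return state.name(); }
}
```

The sidecar then maps that string onto the registry's status values instead of inferring readiness from health-check responses alone.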

I will keep thinking about this, but I find it hard to see how a pure sidecar-based approach with zero changes to a microservice (without a framework embedded into the side-managed process) can determine when a service is really "UP" and ready to handle requests vs. "STARTING" vs. "SHUTTING DOWN", etc.  I wonder if Airbnb asks its service developers to define a "READYFORREQUESTS" URL and that is what they pass as configuration to Nerve?


8 comments:

  1. The sidecar approach is how http://www.consul.io views things. If you use some good abstractions, all configuration and service registration and discovery are gathered from a local sidecar. It uses exit codes for health status: http://www.consul.io/docs/agent/checks.html Pretty powerful. Might be interesting to have a eureka based sidecar. That would potentially eliminate the java-centric nature of eureka.

    ReplyDelete
  2. at airbnb, we definitely ask that developers provide an endpoint that allows nerve to determine if the service is healthy. we ask that it (a) lives at a canonical endpoint (/health) (b) return a 200 OK if things are healthy and (c) that it do real work to determine if the service is capable of responding to normal requests. for example, see https://www.airbnb.com/health

    we take this pretty seriously. for instance, default dropwizard settings put the health check on a separate port with a separate web service process. we've had to reconfigure dropwizard to put the health endpoint on the same web server that serves real service requests to make the check more indicative.

    ReplyDelete
  3. to further address your question, i think you're overthinking things. there's a binary world. in that world, your service instance is either available to serve requests, or it is not. in your initial example, nerve would continue to detect your service as down until such time as it became up, whether that was 10 minutes later or 10 hours later. that's the only thing a service registration framework needs to do.

    if your cluster manager is killing off your processes before they ever become ready, that's a problem with your cluster manager. it's like the old "doctor, it hurts when i do this" saw.

    ReplyDelete
  4. Adding back in some conversation from Twitter from Netflix (Sudhir and Nitesh):

    @stonse - Apps have a lifecycle and varying degrees of readiness

    @NiteshKant- karyon integrates with eureka healthcheck, so HC==200 => UP. healthy & !UP does not seem correct.

    @NiteshKant - one can be STARTING in which case that's the status in eureka with health status being a custom value (we use 204)

    @NiteshKant - if discovery is the only medium to find an instance; healthy but !UP, will not help as the instance isn't discoverable.

    ReplyDelete
  5. @Spencer

    Can you expand on your comments on Consul. A quick look seems to suggest that I could provide checks that I could define as /v1/agent/check/passed/UPpassed and /v1/agent/check/passed/STARTINGpassed.

    I'm not sure how you would define a "status" using this check model.

    What would be the general approach to marking non-healthcheck status for state of application instance boot like I described in Eureka?

    ReplyDelete
  6. @Andrew

    In general my mentioning consul was that the sidecar approach is becoming more popular. A service only needs to know how to connect to the local sidecar while the sidecar deals with bootstrapping (in this case joining the consul cluster).

    As far as the consul checks, consul still just has up or down, so 'starting' would still be considered down.

    Sessions could be where you implement some state that is more than just up or down.
    http://www.consul.io/docs/internals/sessions.html
    "Sessions act as a binding layer between nodes, health checks, and key/value data"

    ReplyDelete
    Replies
    1. @Spencer, I just started a general NetflixOSS (Eureka) sidecar on github and I have started to play with adding consul style check definitions via Archaius configuration.

      Check it out:

      https://github.com/aspyker/netflixoss-sidecar/blob/addchecks/src/main/resources/SIMPLESIDECAR_CHANGETHIS.properties

      Delete