Thursday, July 24, 2014

Multitenancy models (and frats/sororities, toilets, and kids)

I have had more than a few discussions lately with various IBM teams as we move forward with some of our internal cloud technologies leveraging NetflixOSS technology.  I have found that one of the hardest conversations to work through is multitenancy.

Let's set the stage with the definition of multitenancy and how this affects cloud computing.

Wikipedia definition of multitenancy:

"Multitenancy refers to a principle in software architecture where a single instance of the software runs on a server, serving multiple client-organizations (tenants). Multitenancy contrasts with multi-instance architectures where separate software instances (or hardware systems) operate on behalf of different client organizations. With a multitenant architecture, a software application is designed to virtually partition its data and configuration, and each client organization works with a customized virtual application."

Wikipedia further explains multitenancy in the context of cloud computing:

Multitenancy enables sharing of resources and costs across a large pool of users, thus allowing for:

- centralization of infrastructure in locations with lower costs (such as real estate, electricity, etc.),
- peak-load capacity increases (users need not engineer for highest possible load-levels),
- utilisation and efficiency improvements for systems that are often only 10–20% utilised.

It seems like everyone comes with their own definition of multitenancy, and from what I can see they are all shades of the same definition.  Specifically, as you'll see below, the differences are in what it means to be "designed to virtually partition its data and configuration" and to what extent that partitioning can be affected by other users.

In an effort to have more meaningful conversations, I propose the following poor analogies based on humans inhabiting space.

The "Big Room"

Consider a really big room with one door.  That door is where all people living in the room enter and exit.  Also, there is a single toilet in the middle of the room.  Now consider that we allow anyone to come in, and one of the inhabitants is a bit crazy and loves to run around the room at full speed, randomly bouncing off the walls.  While this "big room" is multi-tenant (more than one human could live there), it doesn't partition or protect inhabitants from each other well.  Also, the use of the toilet (a common resource) could be more than a bit embarrassing, to say the least.  I think most people I have talked to would consider this environment to lack even the weakest definitions of multi-tenancy.

The "Doorless Single Family Home"

Consider a typical North American single family home, but remove all the internal doors.  In this new analogy, we might still have crazy inhabitants (I have two and they are called kids).  We can start to partition them from the rest of us by putting them in a room, and they only once in a while escape into the areas affecting others.  Now the toilet and other shared resources are easier to use safely, but still not as safe as you'd want.  One other big change is the type of inhabitants and their ability to share.  In a family, the inhabitants likely all have semi-common goals; one won't destroy common resources (the toilet), and if they do, the family works to ensure that it doesn't happen again.  Finally, there is a benefit to this family living together, as they likely share their services freely.

The improved "Single Family Home with Doors"

Consider the previous example, but add back in the typical doors - doors that can be opened and closed, but likely not locked.  Now our private moments are improved.  Also, the crazy kids won't bounce out of their rooms as frequently.  The doors are there to help prevent accidental bad interactions, but they can be opened freely to help achieve family goals more quickly than with the doors closed.

The "Fraternity/Sorority house"

Continuing the poor analogy, what if we make the inhabitants have similar, but more divergent, goals than a single family?  Of course all of the inhabitants of a fraternity or sorority house want to graduate college, and they might be working on similar subjects where they could share information and learning amongst themselves, but sometimes you really don't want your co-inhabitant to enter your part of the house.  When that co-inhabitant is drunk (never happens in college, right?), you really would like a locked door between you and them.  The co-inhabitant isn't really meaning to cause you harm, but they could cause you harm nonetheless, so you probably added a locked door just in case.  Ok, I'll admit I have now totally lost the toilet part of the analogy, but likely there are still shared resources that the house works to protect and share responsibly.

The "Apartment Building"

Now we finish the analogies with what I think most people I talk to about multitenancy consider from the start.  Consider an apartment building where every tenant gets his or her own lockable front door.  Also, all of their toilets and other important resources are private to just them.  The inhabitants don't have any common shared goals, so these living conditions make sense.  However, these living conditions can be problematic in two ways.  First, this is a more costly way to live and operate, both for each inhabitant and for their non-shared resources.  Second, if any of the inhabitants do have shared goals, the lack of internal doors between them means a much slower communication channel, and forward progress will be slower.

The Wrap-Up

Now, going back to the non-analogy world.  Many of the NetflixOSS projects (Eureka/Asgard/etc.) come from a model that I think is best described by the "Doorless Single Family Home".  There is nothing wrong with that for the type of organization Netflix is, and likely when deployed inside of Netflix there are more doors added beyond the public OSS.  At IBM, in our own usage, I believe we need at least the "Single Family Home with Doors", mostly to add some doors that protect new users of the cloud technology from accidentally impacting others.  Some have argued that we need the "Fraternity/Sorority house", adding locked doors until people are confident that others won't impact them even through unlockable doors.  Locking the doors means things like adding owner-writable-only namespaces to Eureka, locking down which clusters can be changed in Asgard, providing segmented network VLANs, etc.  Finally, if we ever looked to run this fabric across multiple IBM customers (say Coke and Pepsi), it is hard to argue that we wouldn't need the full "Apartment Building" approach.

I hope this helps others in discussing multitenancy.  I hope my own team won't get tired of these new analogies.

Friday, June 27, 2014

How is a multi-host container service different from a multi-host VM service?

Warning:  I am writing this blog post without knowing the answer to the question I am asking in the title.  I am writing this post to force myself to articulate a question I've personally been struggling with as we move towards what we all want - containers with standard formats changing how we handle many cases in the cloud.  Also, I know there are folks that have thought about this for FAR longer than myself and I hope they comment or write alternative blogs so we can all learn together.

That said, I have seen throughout the time leading up to Dockercon and since what seems to be divergent thoughts that when I step back aren't so divergent.  Or maybe they are?  Let's see.

On one hand, we have existing systems on IaaS clouds using virtual machines that have everything controlled by API's with cloud infrastructural services that help build up an IaaS++ environment.  I have specifically avoided using the word PaaS as I define PaaS as something that tends to abstract IaaS to a point where IaaS concepts can't be directly seen and controlled.  I know that everyone doesn't accept such a definition of PaaS, but I use it as a means to help explain my thoughts (please don't just comment exclusively on this definition as it's not the main point of this blog post).  By IaaS++ I mean an environment that adds to IaaS offering services like continuous delivery workflows, high availability fault domains/automatic recovery, cross instance networking with software defined networking security, and operational visibility through monitoring.  And by not calling it PaaS, I suggest that the level of visibility into this environment includes IaaS concepts such as (VM) instances through ssh or other commonly used *nix tools, full TCP network stack access, full OS's with process and file system control, etc.

On the other hand, we have systems growing around resource management systems and schedulers using "The Datacenter as a Computer" that are predominantly tied to containers.  I'll admit that I'm only partially through the book on the subject (now in 2nd edition).  Some of the systems in open source to implement such datacenter as the computer/warehouse scale machines are Yarn (for Hadoop), CoreOS/Fleet, Mesos/Marathon and Google Kubernetes.

At Dockercon, IBM (and yours truly) demoed a Docker container deployment option for the IBM SoftLayer cloud.  We used our cloud services fabric (partially powered by NetflixOSS technologies) on top of this deployment option as the IaaS++ layer.  Given that IBM SoftLayer and its current API don't support containers as a deployment option, we worked to implement some of the ties to the IaaS technologies as part of the demo, reusing the Docker API.  Specifically, we showcased an autoscaling service for automatic recovery, cross availability zone placement, and SLA based scaling.  Next we used the Docker private registry alongside the Dockerhub public index for image management.  Finally we did specific work to natively integrate the networking from containers into the SoftLayer network.  Doing this networking work was important as it allowed us to leverage existing IaaS provided networking constructs such as load balancers and firewalls.

Last night I watched the Kubernetes demo at Google I/O by Brendan Burns and Craig McLuckie.  The talk kicks off with an overview of the Google Compute Engine VM optimized for containers and then covers the Kubernetes container cluster management open source project which includes a scheduler for long running processes, a labeling system that is important for operational management, a replication controller to scale and auto recover labeled processes, and a service abstraction across labeled processes.

I encourage you to watch the two demo videos before proceeding, as I don't want to force you into thinking only from my conclusions.  Ok, so now that you've watched the videos yourself, let me use the two videos to look at comparison points for each use case (the links jump to the similar place in each video):

Fast development and deployment at scale

Brendan demonstrated rolling updates on the cloud.  In the IBM demo, we showed the same, but as an initial deployment on a laptop.  As you see later in the demo, due to the use of Docker, running on the cloud is exactly the same as on the laptop.  The IBM cloud services fabric devops console - NetflixOSS Asgard - also has the concept of rolling updates as well as the demonstrated initial deployment.  Due to Docker, both demos use essentially the same approach to image creation/baking.

Automatic recovery

I like how Brendan showed the failure and recovery through a nice UI, as compared to me watching log files of the health manager.  Other than presentation, the use case and functionality were the same.  The system discovered a failed instance and recovered it.

Service registration

Brendan talked about how Kubernetes offers the concept of services based on tagging.  Under the covers this is implemented by a process that performs selects against the tagged containers and updates an etcd service registry.  In the cloud services fabric demo we talked about how this was done with NetflixOSS Eureka in a more intrusive (but maybe more app-centric and valuable) way.  I have also hinted about how important it is to consider availability in your service discovery system.

Service discovery and load balancing across service implementations

Brendan talked about how, in Kubernetes, this is currently handled by a basic round robin load balancer.  Under the covers, each Kubernetes node starts this load balancer, and any defined service gets started on the load balancer across the cluster, with information being passed to client containers via two environment variables: one for the address of the Kubernetes local node load balancer, and one for the port assigned to a specific service.  In the cloud services fabric this is handled by Eureka-enabled clients (for example NetflixOSS Ribbon for REST), which do not require a separate load balancer and are more direct, and/or by the similar NetflixOSS Zuul load balancer in cases where the existing clients can't be used.
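To make the contrast concrete, here is a minimal sketch of the client-side style, where the client itself rotates through an instance list it obtained from a registry rather than calling through a per-node proxy load balancer. This is not Ribbon's actual API; the class name and instance addresses are made up for illustration.

```python
import itertools

class ClientSideLoadBalancer:
    """Toy Ribbon-style client-side load balancer (illustrative only).

    The client holds the instance list from a registry query and rotates
    through it itself; there is no separate load balancer process in the
    request path, unlike the per-node proxy approach described above.
    """

    def __init__(self, instances):
        self._cycle = itertools.cycle(instances)

    def choose(self):
        # Plain round robin; real client-side load balancers also support
        # weighted rules and filtering out unavailable instances.
        return next(self._cycle)

# Instance list as it might come back from a service registry.
lb = ClientSideLoadBalancer(["10.0.0.5:8080", "10.0.0.6:8080"])
picks = [lb.choose() for _ in range(4)]
print(picks)  # alternates between the two instances
```

The design trade-off is visible even in the sketch: the client-side approach removes a network hop but pushes registry awareness into every client, which is exactly why it reads as more "intrusive".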

FWIW, I haven't seen end-to-end service registration/discovery/load balancing specifically supported in non-Kubernetes resource managers/schedulers.  I'm sure you could build something similar on top of Mesos/Marathon (or people already have) and CoreOS/etcd, but I think Kubernetes' concepts of labels and services (much like Eureka) are right in starting to integrate the concept of services into the platform, as they are so critical in microservices-based devops.

I could continue to draw comparison points for other IaaS++ features like application centric metrics, container level metrics, dynamic configuration management, other devops workflows, remote logging, service interaction monitoring, etc., but I'll leave that to the reader.  My belief is that many of these concepts will be implemented in both approaches, as they are required to run an operationally competent system.

Also, I think we need to consider tougher points like how this approach scales (in both demos, under the covers, networking was implemented via a subnet per Docker host, which wouldn't necessarily scale well), the approach to cross-host image propagation (again, both demos used a less than optimal way to push images across every node), and integration with other important IaaS networking concepts (such as external load balancers and firewalls).
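A quick back-of-envelope calculation shows why subnet-per-host addressing worries me. The numbers here are assumed for illustration (neither demo published its exact addressing plan): if each host gets a /24 carved out of one shared /16, the host count is capped regardless of how lightly each host's addresses are used.

```python
import ipaddress

# Hypothetical plan: one shared /16 for the cluster, one /24 per Docker host.
cluster = ipaddress.ip_network("10.1.0.0/16")
per_host_subnets = list(cluster.subnets(new_prefix=24))

max_hosts = len(per_host_subnets)                      # hard cap on hosts
usable_per_host = per_host_subnets[0].num_addresses - 2  # minus network/broadcast

print(max_hosts)        # 256 hosts maximum, even if each runs one container
print(usable_per_host)  # 254 container addresses per host
```

So with these assumed numbers the cluster tops out at 256 hosts, and most of the 65k addresses sit stranded on lightly loaded hosts; that fragmentation is the scaling concern, separate from the routing question.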

What is different?

The key difference that I see in these systems is terminology and implementation.

In the IBM demo, we based the concept of a cluster on what Asgard defines as a cluster.  That cluster definition and state is based on multiple separate auto scaling groups, connected by version naming.  It is then the autoscaler that decides placement based not only on "resource availability", but also on high availability (spreading deployments across distinct failure domains) and locality policies.  Most everyone is familiar with the concept of high availability in these policies in existing IaaS - in SoftLayer we use datacenters or pods; in other clouds the concept is called "availability zones".  Also, in public clouds, the policy for co-location is usually called "placement groups".
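The "spread across distinct failure domains" part of such a placement policy can be sketched in a few lines. This is a toy version only (the zone names look SoftLayer-datacenter-like but are made up, and a real autoscaler weighs many more inputs than instance counts):

```python
from collections import Counter

def choose_zone(zones, current_zones):
    """Pick the availability zone with the fewest running instances.

    `current_zones` lists the zone of each already-running instance;
    spreading new instances this way keeps any single failure domain
    from holding most of the cluster.
    """
    counts = Counter({z: 0 for z in zones})
    counts.update(current_zones)
    # Sorting by (count, name) keeps the choice deterministic on ties.
    return min(zones, key=lambda z: (counts[z], z))

zones = ["dal05", "dal06", "dal09"]
running = ["dal05", "dal05", "dal06"]  # zones of existing instances
print(choose_zone(zones, running))     # the empty zone wins: "dal09"
```

A locality ("placement group") policy would simply invert the comparison, preferring the fullest zone instead of the emptiest one.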

Marathon (a long running scheduler on top of the Mesos resource manager) offers these same concepts through the concept of constraints.  Kubernetes doesn't seem, today, to offer these concepts, likely due to its initial focus on smaller scenarios.  Given its roots in Google Omega/Borg, I'm sure there is no reason why Kubernetes couldn't eventually expose the same policy concepts within its replication controller.  In fact, at the end of the Kubernetes talk, there is a question from the crowd on how to make Kubernetes scale across multiple Kubernetes configurations, which could equally have been asked from a high-availability perspective.

So to me, the concept of an autoscaler and its underlying implementation seems very similar to the concept of a resource manager and scheduler.  I wonder, if public cloud autoscalers were open sourced, whether they would be called resource managers and long running schedulers?

The reason I ask all of this is that, as we move forward with containers, I think we might be tempted to build another cloud within our existing clouds.  I also think the Mesos and Kubernetes technologies will have people building clouds within clouds until cloud providers natively support containers as a deployment option.  At that point, will we have duplication of resource management and scheduling if we don't combine the concepts?  Also, what will people do to integrate these new container deployments with other IaaS features like load balancers, security groups, etc.?

I think others are asking the same question as well.  As shown in the IBM Cloud demo, we are thinking through this right now.  We have also experimented internally with OpenStack deployments of Docker containers as the IaaS layer under a similar IaaS++ layer.  The experiments led to a similar cloud container IaaS deployment option leveraging existing OpenStack approaches for resource management and scheduling, as compared to creating a new layer on top of OpenStack.  Also, there is a public cloud that likely considered this a long time ago - Joyent.  Joyent has long had SmartOS zones, which are similar to containers, under its IaaS API, without the need to expose the formal concepts of resource management and scheduling to its users.  Also, right at the end of the Kubernetes demo, someone in the crowd asks the same question.  I took this question to be asking when the compute engine will support container deployment this way without having a user set up their own private set of Kubernetes systems (and possibly without having to consider resource management/scheduling with anything more than policy).

As I said in the intro, I'm still learning here.  What are your thoughts?

Friday, June 20, 2014

Quick notes on a day of playing with Acme Air / NetflixOSS on Kubernetes

I took Friday to play with the Kubernetes project open sourced by Google at Dockercon.

I was able to get a basic multi-tier Acme Air (NetflixOSS enabled) application working. I was able to reuse (for the most part) containers we built for Docker local (laptop) from the IBM open sourced Docker port. By basic, I mean the front end Acme Air web app, the back end Acme Air authentication micro service, a Cassandra node and the Acme Air data loader, and the NetflixOSS Eureka service discovery server. I ran a single instance of each, but I believe I could have pretty easily scaled up each tier of the Acme Air application itself.

I pushed the containers to Dockerhub (as Kubernetes by default pulls all container images from there). This was pretty easy using these steps:

1. Download and build locally the IBM Acme Air NetflixOSS Docker containers
2. Log in to Dockerhub (needed before I could push) via 'docker login'
3. Tag the images - docker tag [imageid] aspyker/acmeair-containername
4. Push the containers to Dockerhub - docker push aspyker/acmeair-containername

I started each container as a single instance via the cloudcfg script:

cluster/ -p 8080:80 run aspyker/acmeair-webapp 1 webapp

I started with "using it wrong" (TM, Andrew 2014) with regards to networking. For example, when Cassandra starts, it needs to know what seed and peer nodes exist, and Cassandra wants to know the IP addresses of these other nodes. For a single Cassandra node, that meant I needed to update the seed list in the Cassandra container's config file to point to the container itself. Given our containers already listen on ssh and run supervisord to run the container function (Cassandra in this case), I was able to log in to the container, stop Cassandra, update the config file with the container's IP address (obtained via docker inspect [containerid] | grep ddr), and restart Cassandra. Similarly, I needed to update links between containers (for how the application/micro service found the Cassandra container, as well as how the application/micro service found Eureka). I could ssh into those containers and update routing information that exists in NetflixOSS Archaius config files inside of the applications.

This didn't work perfectly, as the routing in NetflixOSS powered by Ribbon and Eureka uses hostnames by default. The hostnames currently assigned to containers in Kubernetes are not resolvable by all other containers (so when the web app tried to route to the auth service based on the hostname registered and discovered in Eureka, it failed with UnknownHostException). We hit this in our SoftLayer runs as well and had patched the Eureka client to never register the hostname.  I had asked about this previously on the Eureka mailing list and discovered this is something that Netflix fixes internally in Ribbon. I ended up writing a patch for Ribbon to just use IP addresses and patched the ribbon-eureka module in Acme Air.
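The shape of the fix can be sketched like this. This is not the actual Ribbon patch (which is Java and lives in the ribbon-eureka module); it is a toy model of the decision the patch forces, with the hostname and IP made up, and the resolver made injectable so the "peer that cannot resolve the name" case can be simulated:

```python
import socket

def registration_address(hostname, ip, resolve=socket.gethostbyname):
    """Decide what a service instance should publish in service discovery.

    If peers cannot resolve the hostname (the UnknownHostException case
    above), registering the IP address sidesteps cluster DNS entirely,
    which is effectively what the Ribbon patch forces.
    """
    try:
        resolve(hostname)
        return hostname
    except socket.gaierror:
        return ip

def unresolvable(name):
    # Simulates a peer whose resolver does not know the
    # Kubernetes-assigned container hostnames.
    raise socket.gaierror(f"cannot resolve {name}")

print(registration_address("container-3f9a", "10.240.1.7", resolve=unresolvable))
# -> "10.240.1.7"
```

Registering IPs trades away DNS indirection, which is fine here because the registry itself (Eureka) already handles instance churn.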

At this point, I could map the front end web app instance to the Kubernetes minion host via the cloudcfg run -p 8080:80 port specification and access Acme Air from the Internet in my browser.

My next steps are to look at running replicationControllers around the various tiers of the application, as well as making them services so I can use the Kubernetes built-in service location and routing.  I can see how to do this via the guestbook example.  In running that example I can see how, if I "bake" into my images an idea of a port for each service, I can locate the port via environment variables.  Kubernetes will ensure that this port is routing traffic to the right service implementations on each Kubernetes host via a load balancer.  That means I can start to route all Eureka traffic to port 10000, all web app traffic to port 10001, all Cassandra traffic to port 10002, and all auth micro service traffic to port 10003, for example.  This approach sounds pretty similar to an approach used at Netflix with Zuul.
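From the client container's point of view, that environment-variable scheme looks roughly like the sketch below. The variable naming convention here is an assumption for illustration (I haven't confirmed the exact names Kubernetes injects), as are the addresses and ports:

```python
import os

# What the platform might inject into a client container: the address of
# the local node load balancer plus a fixed port per named service.
os.environ["AUTH_SERVICE_HOST"] = "10.240.0.4"  # hypothetical node LB address
os.environ["AUTH_SERVICE_PORT"] = "10003"       # baked-in port for the auth tier

def locate(service):
    """Resolve host:port for a named service purely from the environment,
    the way a client container would under the scheme described above."""
    host = os.environ[f"{service.upper()}_SERVICE_HOST"]
    port = os.environ[f"{service.upper()}_SERVICE_PORT"]
    return f"{host}:{port}"

print(locate("auth"))  # -> "10.240.0.4:10003"
```

The appeal is that the application needs no registry client at all; the cost is that ports become part of the image contract, which is the "bake in an idea of a port" step above.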

Beyond that I'll need to consider additional items like:

1.  Application data and more advanced routing in the service registration/location

2.  How available the service discovery is, especially as we consider adding availability zones/fault domains.

3.  How do I link this into front facing (public internet) load balancers?

4.  How would I link in the concept of security groups?  Or is the port exposure enough?

5.  How I could start to do chaos testing to see how well the recovery and multiple fault domains work.

I do want to thank the folks at Google that helped me get through the newbie GCE and Kubernetes issues (Brendan, Joe and Daniel).

Tuesday, June 10, 2014

Docker SoftLayer Cloud Talk at Dockercon 2014

The overall concept

Today at Dockercon, Jerry Cuomo went over the concept of borderless cloud and how it relates to IBM's strategy.  He talked about how Docker is one of the erasers of the lines between various clouds with regards to openness.  He talked about how, regardless of vendor, deployment option and location, we need to focus on the following things:


Especially in the age of devops and continuous delivery, lack of speed is a killer.  Even worse, and frankly unforgivable, manual steps that introduce errors are no longer acceptable.  Docker helps with this by having layered file systems that allow just the updates to be pushed and loaded.  Also, with its process model, a container starts as fast as you'd expect your applications to start.  Finally, Docker helps by having a transparent (all the way to source) description model for images, which guarantees you run what you coded, not some mismatch between dev and ops.


Optimized means not only price/performance but also optimization of location of workloads.  In the price/performance area IBM technologies (like our IBM Java read-only memory class sharing) can provide for much faster application startup and less memory when similar applications are run on a single node.  Also, getting the hypervisor out of the way can help I/O performance significantly (still a large challenge in VM based approaches) which will help data oriented applications like Hadoop and databases.


Openness of cloud is very important to IBM, just like it was for Java and Unix/Linux.   Docker can provide the same write once, run anywhere experience for cloud workloads.  It is interesting how this openness combined with the fast/small also allows for advances in devops not possible before with VMs.  It is now possible to run production-like workload configurations on premise (and on developers' laptops) in almost the exact same way as deployed in production, due to the reduction in overhead vs. running a full virtual machine.


Moving fast isn't enough.  You have to move fast with responsibility.  Specifically, you need to make sure you don't ignore security, high availability, and operational visibility when moving so fast.  With the automated and repeatable deployment possible with Docker (and related scheduling systems), combined with micro-service application design, high availability and automatic recovery become easier.  Also, enterprise deployments of Docker will start to add to the security and operational visibility capabilities.

The demo - SoftLayer cloud running Docker

After Jerry covered these areas, I followed up with a live demo.

On Monday, I showed how the technology we've been building to host IBM public cloud services, the Cloud Services Fabric (CSF), works on top of Docker.  We showed how the kernel of the CSF, based in part on NetflixOSS, and powered by IBM technologies was fully open source and easily run on a developer's laptop.  I talked about how this can even allow developers to Chaos Gorilla test their micro-service implementations.

I showed how building the sample application and its microservice was extremely fast.  Building an update to the war file took more time than containerizing the same war for deployment.  Both were done in seconds.  While we haven't done it yet, I could imagine eventually optimizing this so that container generation happens as part of an IDE auto-compile.

In the demo today, I followed this up by showcasing how we could take the exact same environment and marry it with the IBM SoftLayer public cloud.  I took the exact same sample application container image and, instead of loading it locally, pushed it through a Docker registry to the SoftLayer cloud.  The power of this portability (and openness) is very valuable to our teams, as it will allow local testing to mirror production deployment more closely.

Finally, I demonstrated how adding SoftLayer to Docker added to the operational excellence.  Specifically, I showed how, once we told Docker to use a non-default bridge (one assigned a SoftLayer portable subnet attached to the host private interface), I could have Docker assign IPs out of a routable subnet within the SoftLayer network.  This networking configuration means that the containers spun up would work in the same networks as SoftLayer bare metal and virtual machine instances, transparently, around the global SoftLayer cloud.  Also, advanced SoftLayer networking features such as load balancers and firewalls would work just as well with the containers.  I also talked about how we deployed this across multiple hosts in multiple datacenters (availability zones), further adding to the high availability options for deployment.  To prove this, I unleashed targeted Chaos Army-like testing.  I showed how I could emulate a failure of a container (by doing a docker rm -f) and how the overall CSF system would auto-recover by replacing the container with a new container.

Some links

You can see the slides from Jerry's talk on slideshare.

The video:

Direct Link (HD Version)

Saturday, June 7, 2014

Open Source Release of IBM Acme Air / NetflixOSS on Docker

In a previous blog, I discussed the Docker "local" (on laptop) IBM Cloud Services Fabric powered in part by NetflixOSS prototype.

One big question on twitter and my blog went unanswered.  The question was ... How can someone else run this environment?  In the previous blog post, I mentioned how there was no plan to make key components open source at that point in time.

Today, I am pleased to announce that all of the components to build this environment are now open source and anyone can reproduce this run of IBM Acme Air / NetflixOSS on Docker.  All it takes is about an hour, a decent internet connection, and a laptop with VirtualBox (or boot2docker, or vagrant) installed.

Specifically, the aspects that we have added to open source are:

  1. Microscaler - a small scale instance health manager and auto recovery/scaling agent that works against the Docker remote API.  Specifically we have released the Microscaler service (that implements a REST service), a CLI to make calling Microscaler easier, and a Microscaler agent that is designed to manage clusters of Docker nodes.
  2. The Docker port of the NetflixOSS Asgard devops console.  Specifically we ported Asgard to work against the Docker API for managing IaaS objects such as images and instances as well as the Microscaler API for clusters.  The port handles some of the most basic CRUD operations in Asgard.  Some scenarios (like canary testing, red/black deployment) are yet to be fully implemented.
  3. The Dockerfiles and build scripts that enable anyone to build all of the containers required to run this environment.  The Dockerfiles build containers of the Microscaler, the NetflixOSS infrastructural servers (Asgard, Eureka and Zuul), as well as the full microservices sample application Acme Air (web app, microservice and Cassandra data tier).  The build scripts help you build the containers and give easy commands to do the end to end deployment and common administration tasks.
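The core of Microscaler's health manager and auto-recovery behavior can be sketched as a reconciliation loop. This is a toy model, not Microscaler's actual code or API (the function names, container ids, and callbacks are made up); in the real agent the start/remove callbacks would go against the Docker remote API:

```python
def reconcile(desired, running, start, remove):
    """One pass of a health-manager-style reconciliation loop.

    `running` maps container id -> healthy?.  Unhealthy members are
    removed, then the cluster is topped up to the desired size, which is
    the auto-recovery behavior shown in the demo video.
    """
    for cid, healthy in list(running.items()):
        if not healthy:
            remove(cid)   # e.g. a DELETE against the Docker remote API
            del running[cid]
    for _ in range(desired - len(running)):
        start()           # e.g. create/start a replacement container

started, removed = [], []
running = {"c1": True, "c2": False, "c3": True}  # c2 failed its health check
reconcile(3, running, start=lambda: started.append("new"), remove=removed.append)
print(removed, len(started))  # ['c2'] 1
```

Running such a pass on a timer is what makes a docker rm -f (chaos-style container kill) heal automatically: the next pass sees the missing member and starts a replacement.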
If you want to understand what this runtime showcases, please refer to the previous blog entry.  There is a video that shows the Acme Air application and basic chaos testing that proves the operational excellence of the environment.

Interesting compare:

It is interesting to note that the scope of what we released (the core of the NetflixOSS cloud platform + the Acme Air cloud sample/benchmark application) is similar to what we previously released back at the Netflix Cloud Prize in the form of Amazon EC2 AMIs.  I think it is interesting to consider the difference of using Docker as our portable image format in this release.  Using Docker, I was able to easily release the automation of building the images (Dockerfiles) in source form, which makes the images far more transparent than an AMI in the Amazon marketplace.  Also, the containers built can be deployed anywhere that Docker containers can be hosted.  Therefore, this project is going to be valuable to far more than a single cloud provider -- likely more on that later as Dockercon 2014 happens next week.

If you want to learn how to run this yourself, check out the following video.  It shows building the containers for open source, starting an initial minimal environment and starting to operate the environment.  After that go back to the previous blog post and see how to perform advanced operations.

Direct Link (HD Version)

Friday, May 16, 2014

How intrusive do you want your service discovery to be?

In working on the Acme Air NetflixOSS Docker local implementation we ended up having two service discovery mechanisms (Eureka and SkyDNS). This gave me a concrete place to start to ponder issues that have come up inside of IBM relating to service discovery. Specifically the use of Eureka has been called "intrusive" on application design as it requires application changes to enable service registration and service query/location when load balancing. This blog post aims to start a discussion on the pros and cons of service discovery being "intrusive".

First (in the top of the picture), we had the NetflixOSS based Eureka. The back end microservice (the auth service) would, as part of its Karyon bootstrapping, make a call to register itself in the Eureka servers. Then, when the front end web app wanted to call the back end microservice via Ribbon with client side load balancing, it would do so based on information about service instances gained by querying the Eureka server (something Ribbon has native support for). This is how the NetflixOSS based service discovery has worked on Amazon, in our port to the IBM Cloud - SoftLayer, and in our Docker local port.

Next, we had SkyDNS and Skydock (in the bottom of the picture). We used this pair to provide DNS naming between containers. Interestingly, we used SkyDNS to tell clients how to locate Eureka itself. We also used it to let clients locate services that weren't Eureka enabled (and therefore not locatable through Eureka), such as Cassandra and our auto scaling service. Using Skydock, containers started from an image named "eureka" would automatically resolve from other containers under a simple hostname (we used "local" as the environment in the domain name), and the Cassandra images registered similarly. Skydock works by subscribing to the Docker daemon's event API, so it sees when containers start and stop (or die). Based on these events, Skydock registers the container into SkyDNS on the container's behalf. Skydock also periodically queries for the running containers on a host and updates SkyDNS with a heartbeat to keep the DNS entries from timing out.
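The event-driven, off-instance pattern Skydock uses could be sketched like this (a toy Python model of the idea only; Skydock itself is written in Go and talks to the real Docker event API and SkyDNS):

```python
import time

# Stand-in DNS store: name -> (address, expiry timestamp).
dns_records = {}
TTL = 30  # seconds before an un-refreshed entry expires

def on_docker_event(event, now=None):
    """React to container lifecycle events from the Docker daemon's event API."""
    now = now if now is not None else time.time()
    name = f"{event['image']}.local"
    if event["status"] == "start":
        # Register the container in DNS on its behalf -- the container
        # itself never knows this is happening.
        dns_records[name] = (event["ip"], now + TTL)
    elif event["status"] in ("stop", "die"):
        dns_records.pop(name, None)

def heartbeat(now=None):
    """Periodic re-registration so live containers never expire out of DNS."""
    now = now if now is not None else time.time()
    for name, (ip, _) in list(dns_records.items()):
        dns_records[name] = (ip, now + TTL)
```

Notice that nothing here inspects the application inside the container; the only signals are process start, stop, and die.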

Before I go into comparing these service discovery technologies, let me say that each did what it was intended to do well. Eureka gave us very good application level service discovery, while Skydock/SkyDNS gave us very good basic container location.

If you compare these approaches, roughly:
  1. Skydock and the Eureka client (registration) are similar in that both perform registration and heartbeating for service instances
  2. SkyDNS and the Eureka server are similar in that both host the information about service instances
  3. The DNS interface offered by SkyDNS and the Eureka client (query) are similar in that both provide lookup to clients, which can then load balance across instances of a service
One of the biggest differences between these approaches is that Eureka is specifically included in the service instance (above the container or VM line in IaaS) and it is up to the service instance to use it as part of its implementation, while Skydock is outside the scope of the service instance (and the application code).  To be fair to SkyDNS, it doesn't have to be driven the way Skydock drives it.  Someone could easily write code like the Eureka client that stored its data in SkyDNS instead of Eureka, without using Skydock.  However, the real comparison I'm trying to make is service registration that is "intrusive" (on instance) vs. "not intrusive" (off instance).

One interesting aspect of moving service registration out of the application code, or below the container/VM boundary line, is that there is no application knowledge at that layer.  As an example, Karyon is written to call Eureka registration for the auth service only once all bootstrapping of the application is done and the application is ready to receive traffic.  In the case of Skydock, the registration with SkyDNS occurs as soon as the container reports that its process has started.  If the service required any initialization, clients could discover the service, and start sending it requests, before that initialization completed and the service was ready at the application level to handle them.

Similar to initial service registration, a service registration client outside the application code, or below the container/VM boundary, cannot know true instance health.  If the VM/container is running a servlet and the application is throwing massive errors, there is no way for Skydock to know this.  Therefore Skydock will happily keep sending heartbeats to SkyDNS, which means requests will keep flowing to the unhealthy instance.  With Eureka and Karyon's integrated health management, by contrast, the client can stop sending heartbeats as soon as the application code deems itself unhealthy, regardless of whether the container/VM is still running.
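The difference boils down to which signals each approach can see when deciding whether to heartbeat. A tiny sketch of the two decisions side by side (my own simplification, not code from either project):

```python
def next_heartbeat(app_is_healthy, process_is_running):
    """Compare the two heartbeat decisions for one interval.

    Off-instance (Skydock-style): heartbeats as long as the container's
    process runs, regardless of application-level health.
    On-instance (Eureka/Karyon-style): heartbeats only while the
    application itself also reports healthy.
    """
    off_instance = process_is_running
    on_instance = process_is_running and app_is_healthy
    return off_instance, on_instance
```

The interesting case is a process that is running but unhealthy: the off-instance registrar keeps heartbeating, while the on-instance one stops and takes the instance out of rotation.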

Next, let's focus on SkyDNS itself and its query and storage models.  SkyDNS picked DNS for both in order to lessen the impact on client applications, which is a good thing when your main concern is avoiding "intrusive" changes to your client code.

SkyDNS helps you avoid recoding your clients by exposing service queries through standard DNS.  While I think this is beneficial, DNS in my mind wasn't designed to support an ephemeral cloud environment.  It is true that SkyDNS supports TTL's, and its heartbeats effectively allow TTL's much smaller than the "internet"-facing TTL's typical of standard DNS servers.  However, it is well known that there are clients that don't correctly expire TTL's in their caches.  Java is notorious for ignoring TTL's unless you change the JVM security properties, as lower TTL's open you up to DNS spoofing attacks.  Eureka, on the other hand, forces clients to use Eureka re-querying and load balancing (either through custom code or through the Ribbon abstractions) that is aware of the Eureka environment and service registration timeouts.
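For a client to behave well against SkyDNS it has to actually expire entries on TTL. A toy sketch of the per-entry expiry logic such a client needs (real resolvers and the Eureka client each implement their own variant of this):

```python
import time

class TTLCache:
    """DNS-style cache that re-resolves a name once its TTL has elapsed."""

    def __init__(self, resolve):
        self._resolve = resolve   # function: name -> (value, ttl_seconds)
        self._entries = {}        # name -> (value, expiry timestamp)

    def lookup(self, name, now=None):
        now = now if now is not None else time.time()
        entry = self._entries.get(name)
        if entry is None or now >= entry[1]:
            # Entry missing or expired: go back to the authoritative source.
            value, ttl = self._resolve(name)
            self._entries[name] = (value, now + ttl)
            return value
        return entry[0]
```

Clients that cache a resolved address forever (the default Java behavior without tuned security properties) skip exactly this expiry check, which is why they keep routing to instances that have long since died.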

Next, SkyDNS stores the information about service instances in DNS SRV records.  Using a combination of DNS SRV records and parts of the hostname and domain used in lookup, SkyDNS stores the following information: name of the service, version of the service, environment (prod/test/dev/etc.), region of the service, host, port, and TTL.  While DNS SRV records are somewhat more service oriented (they add things to DNS that wouldn't typically be there for host records, like service name and port), they do not cover all of the things that Eureka allows to be shared for a service.  Beyond the service attributes provided by SkyDNS, there are more in Eureka's InstanceInfo.  Some examples are important URLs (status page, health check, home page), secure vs. non-secure port, instance status (UP, DOWN, STARTING, OUT_OF_SERVICE), a metadata bag per application, and datacenter info (image name, availability zone, etc.).  I think that while SkyDNS does a good job of using DNS SRV records, it has to go pretty far into domain name paths to layer as much information as it needs on top of DNS.  Also, the extended attributes that exist in Eureka but not in SkyDNS provide key functionality not yet possible in a SkyDNS environment.  Two specific examples are the instance status and the datacenter info.  Instance status is used in the NetflixOSS environment by Asgard in red/black deployments: Asgard marks service instances as OUT_OF_SERVICE, allowing older clusters to remain in the service registry, but not be stopped, so that rollbacks to older clusters are possible.  The extended datacenter info is useful especially in SoftLayer, as we can share very specific networking details (VLAN's, routers, etc.) that can make routing significantly smarter.  In the end, Eureka's custom service domain model allows a more complete description of services than DNS.
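To make the comparison concrete, here is the rough shape of the two data models side by side (illustrative values throughout, and the field names paraphrase Eureka's InstanceInfo rather than quote it exactly):

```python
# Roughly the attributes a SkyDNS SRV record (plus its name path) can carry:
srv_record = {
    "service": "authservice", "version": "1.0", "environment": "prod",
    "region": "us-south", "host": "10.0.0.5", "port": 8080, "ttl": 30,
}

# Eureka's InstanceInfo carries everything above plus richer attributes:
instance_info = {
    **srv_record,
    "status": "OUT_OF_SERVICE",         # used by Asgard for red/black rollback
    "secure_port": 8443,
    "status_page_url": "http://10.0.0.5:8080/Status",
    "health_check_url": "http://10.0.0.5:8080/healthcheck",
    "home_page_url": "http://10.0.0.5:8080/",
    "metadata": {"vlan": "vlan-1234"},  # free-form per-application metadata
    "datacenter_info": {"image": "authservice-1.0",
                        "availability_zone": "zone-1"},
}
```

Everything in the second dict beyond the SRV fields is information a load balancer or devops console can act on but that plain DNS has no natural place for.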

One area where non-intrusive service discovery is cited as a benefit is support for multiple languages/runtimes.  The Skydock approach doesn't care what type of runtime is being registered into SkyDNS, so it automatically works across languages/runtimes.  While Eureka has REST based interfaces for clients, it is far easier today to use the Eureka Java clients for registration and query (and higher level load balancers like Zuul and Ribbon make it even easier).  These Java clients for the Eureka REST API's are not implemented in other languages.  At IBM, we have enabled Eureka to manage non-Java services (C based servers and NodeJS servers).  We have taken two approaches to make this easier for non-Java services.  First, we have implemented an on-instance (same container or VM) Eureka "sidecar" which provides, external to the main service process, some of the same benefits that Eureka and Karyon provide.  We have done this both for Eureka registration and for query.  Second, we have started to see users who see value in the entire NetflixOSS (including Eureka) platform implement native Eureka clients for Python and NodeJS.  These native implementations aren't complete at this point, but they could be made more complete.  Of these two options, the "sidecar" approach is a stopgap.  Separating the application from the "sidecar" has some of the same issues mentioned above for off-instance service registration (not as bad, but still worse than in-process).  For instance, you have to be careful about bootstrap (initialization vs. service registration) and health checking; both become more complicated to synchronize across the service process and the sidecar.  Also, in Docker container based clouds, having a second sidecar process tends to break the single-process model, so having service registration/query in process just fits better.
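The sidecar amounts to a small co-located process that bridges the service to Eureka. A hypothetical sketch of its per-interval loop follows (our actual sidecar differs; `check_health` and the `eureka` object here are stand-ins, and probing the service's own health endpoint is exactly the synchronization burden mentioned above):

```python
class EurekaSidecar:
    """Registers and heartbeats on behalf of a co-located non-Java service."""

    def __init__(self, app_name, check_health, eureka):
        self.app_name = app_name
        self.check_health = check_health  # probes the service's health endpoint
        self.eureka = eureka              # stand-in Eureka registration client
        self.registered = False

    def tick(self):
        """Run once per heartbeat interval."""
        if not self.registered:
            # Don't register until the service reports it is ready,
            # mirroring Karyon's register-after-bootstrap behavior.
            if self.check_health():
                self.eureka.register(self.app_name)
                self.registered = True
        elif self.check_health():
            self.eureka.heartbeat(self.app_name)
        else:
            # Unhealthy: take the instance out of rotation.
            self.eureka.deregister(self.app_name)
            self.registered = False
```

This recovers most of the on-instance behavior, but only to the extent the service process exposes an accurate health endpoint for the sidecar to poll.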

One final note: this comparison used SkyDNS and Skydock as the non-intrusive, off-instance service registration and query mechanism.  I believe the discussion applies to any service registration technology that isn't intrusive to the service implementation or instance; Skydock is just one example of a service registry designed to be managed below the container/VM level.  I believe the issues presented in this blog are the reason why an application centric service registry isn't offered by IaaS clouds today.  Until IaaS clouds have a much better way for applications to report their status in a standard way to the IaaS API's, I don't think non-intrusive service discovery will be possible with the full functionality of intrusive, application integrated service discovery.

Interesting Links:

  1. SkyDNS announcement blog post
  2. Eureka wiki
  3. Service discovery for Docker via DNS
  4. Open Source Service Discovery
I do admit I'm still learning in this space.  I am very interested in thoughts from those who have used less intrusive service discovery.

FWIW, I also avoided a discussion of high availability of the deployment of the service discovery server.  That is critically important as well and I have blogged on that topic before.

Monday, May 5, 2014

Cloud Services Fabric (and NetflixOSS) on Docker

At IBM Impact 2014 last week we showed the following demo:

Direct Link (HD Version)

The demo showed an end to end NetflixOSS based environment running on Docker on a laptop.  The components running in Docker containers shown included:
  1. Acme Air Web Application - The front end web application that is NetflixOSS enabled.  In fact, this was run as a set of containers within an auto scaling group.  The web application looks up (in Eureka) ephemeral instances of the auth service micro-service and performs on-instance load balancing via Netflix Ribbon.
  2. Acme Air Auth Service - The back end micro-service application that is NetflixOSS enabled.  In fact, this was run as a set of containers within an auto scaling group.
  3. Cassandra - This was the Acme Air Netflix port that runs against Cassandra.  We didn't do much with the data store in this demo, other than making it into a container.
  4. Eureka - The NetflixOSS open source service discovery server.  The ephemeral instances of both the web application and auth service automatically register with this Eureka service.
  5. Zuul - The NetflixOSS front end load balancer.  This load balancer looks up (in Eureka) ephemeral instances of the front end web application instances to route all incoming traffic across the rest of the topology.
  6. Asgard - The NetflixOSS devops console, which allows an application or micro-service implementer to configure versioned clusters of instances.  Asgard was ported to talk to the Docker remote API as well as the auto scaler and recovery service API.
  7. Auto scaler and recovery service - Each of the instances ran an agent that communicates via heartbeats to this service.  Asgard is responsible for calling API's on this auto scaler to create clusters.  The auto scaler then called Docker API's to create instances up to the configured cluster size.  Then, if any instance died (stopped heartbeating), the auto scaler would create a replacement instance.  Finally, we went as far as implementing the idea of datacenters (or availability zones) when launching instances by tagging this information in a "user-data" environment variable (run -e) that had an "az_name" field.
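The heartbeat-driven recovery behavior described above can be sketched roughly as follows (a simplified Python model of the idea only; the IBM Research prototype is not open source and certainly differs in detail):

```python
import time

class AutoScaler:
    """Replace cluster members whose agents stop heartbeating."""

    def __init__(self, desired_size, launch, timeout=30):
        self.desired_size = desired_size
        self.launch = launch        # function that starts one new container
        self.timeout = timeout      # seconds of silence before declaring dead
        self.last_seen = {}         # instance id -> last heartbeat timestamp

    def heartbeat(self, instance_id, now=None):
        """Called by the on-instance agent every interval."""
        self.last_seen[instance_id] = now if now is not None else time.time()

    def reconcile(self, now=None):
        """Drop stale instances and launch replacements up to desired size."""
        now = now if now is not None else time.time()
        self.last_seen = {i: t for i, t in self.last_seen.items()
                          if now - t < self.timeout}
        for _ in range(self.desired_size - len(self.last_seen)):
            self.launch()
```

Killing a container (Chaos Monkey style) simply stops its heartbeats, and the next reconcile pass launches a replacement.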
You can see the actual setup in the following slides:

Docker Demo IBM Impact 2014 from aspyker

Once we had this setup, we can locally test "operational" scenarios on Docker including the following scenarios:
  1. Elastic scalability.  We can easily test whether our services can scale out and automatically be discovered by the rest of the environment and application.
  2. Chaos Monkey.  As shown in the demo, we can test if killing single instances impacted overall system availability and if the system auto recovered a replacement instance.
  3. Chaos Gorilla.  Given we have tagged the instances with their artificial data center/availability zone, we can kill all instances within 1/3 of the deployment emulating a datacenter going away.  We showed this in the public cloud SoftLayer back at dev@Pulse.
  4. Split Brain Monkey.  We can use the same datacenter/availability tagging to isolate instances via iptables based firewalling (similar to Jepsen).
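The availability zone tagging is what makes scenarios like Chaos Gorilla easy to script: a zone kill reduces to selecting instances by tag (an illustrative sketch only; the real tooling calls the Docker and iptables commands directly):

```python
def instances_in_zone(instances, zone):
    """Select instances whose user-data tags them into the given zone."""
    return [i for i in instances if i["user_data"].get("az_name") == zone]

def chaos_gorilla(instances, zone, kill):
    """Kill every instance in one artificial datacenter/availability zone."""
    for inst in instances_in_zone(instances, zone):
        kill(inst["id"])
```

Split Brain Monkey is the same selection with a different action: instead of killing the selected instances, firewall them off from the rest of the deployment.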
We want to use this setup to a) help our Cloud Services Fabric users understand the Netflix based environment more quickly, b) allow our users to do simple localized "operational" tests as listed above before moving to the cloud, and c) use this in our continuous integration/delivery pipelines to do mock testing on an environment closer to production than is possible with bare metal or memory hungry VM based setups.  More strategically, this work shows that if clouds supported containers and the Docker API, we could move easily between a NetflixOSS powered virtual machine approach and a container based approach.

Some details of the implementation:

The Open Source

Updated 2014/06/09 - This project is now completely open source.  For more details see the following blog entry.

The Auto Scaler and agent

The auto scaler, and the on-instance agents that talk to it, are working prototypes from IBM Research.  Right now we do not have plans to open source this auto scaler, which makes open sourcing the entire solution impossible.  The work to implement an auto scaler is non-trivial and was a large piece of this effort.

The Asgard Port

In the past, we had already ported Asgard to talk to IBM's cloud (SoftLayer) and its auto scaler (RightScale).  We extended that porting work to instead talk to our own auto scaler and Docker's remote API.  The work was quite similar and therefore achievable in a week or so.

The Dockerfiles and their containers

Other than the aforementioned auto scaler and our Asgard port, we were able to use the latest CloudBees binary releases of all of the NetflixOSS technologies and Acme Air.  If we could get the auto scaler and the Asgard port moved to public open source, anyone in the world could easily replicate this demo.  We have a script to build all of our Docker images (15 in all, including some base images), and it takes about 15 minutes on a decent MacBook.  This time is spent mostly on downloads and on the compile steps for our auto scaler and agent.

Creating these Dockerfiles took about a week for the basic functionality.  Making them work with the auto scaler and its required agents took a bit longer.

We chose to run our containers as "fuller" OS's rather than single processes.  On each node we ran the main process for the node, an ssh daemon (to allow more IaaS-like access to the filesystem), and the auto scaling agent.  We used supervisord to allow easy management of these processes inside Ubuntu on Docker.

The Network

We used the Eureka based service location throughout, with no changes to the Eureka registration client.  In order to make this easy for humans (hostnames vs. IP's), we used Skydock and SkyDNS to give each tier of the application its own domain name, using the --dns and --name options when running containers to associate incremental names with each cluster.  For example, when starting two Cassandra nodes, they would show up in SkyDNS under their own incremental hostnames.  We also used routing and bridging to make the entire environment easy to access from the laptop.

The Speed

The fact that I can start all of this on a single laptop isn't the only impressive aspect.  I ran this with VirtualBox set to three gigabytes of memory for the boot2docker VM.  Running the demo spins up the cooling fan, as it requires a good bit of CPU, but in terms of memory it was far lighter than other environments I've seen.
The really impressive aspect is that I can restart an entire environment in 90 seconds (including a 20 second sleep waiting for Cassandra to peer), including two auto scaling clusters of two nodes each plus the other five infrastructural services.  This covers all the staggered starts required: starting the database, loading it with data, starting service discovery and DNS, starting the auto scaler and defining the clusters to it, and the final step of everything launching and interconnecting.

Setting this up in a traditional cloud would have taken at least 30 minutes based on my previous experience.

I hope this explanation is of enough interest to you to consider future collaboration.  I also hope to get up to Dockercon in June in case you want to talk about this in person.

The Team

I want to give credit where credit is due.  The team working on this included folks across IBM development and research, including Takahiro Inaba, Paolo Dettori, and Seelam Seetharami.