Friday, June 27, 2014

How is a multi-host container service different from a multi-host VM service?

Warning:  I am writing this blog post without knowing the answer to the question I am asking in the title.  I am writing this post to force myself to articulate a question I've personally been struggling with as we move towards what we all want - containers with standard formats changing how we handle many cases in the cloud.  Also, I know there are folks that have thought about this for FAR longer than myself and I hope they comment or write alternative blogs so we can all learn together.

That said, I have seen throughout the time leading up to Dockercon and since what seems to be divergent thoughts that when I step back aren't so divergent.  Or maybe they are?  Let's see.

On one hand, we have existing systems on IaaS clouds using virtual machines that have everything controlled by API's with cloud infrastructural services that help build up a IaaS++ environment.  I have specifically avoided using the word PaaS as I define PaaS as something that tends to abstract IaaS to a point where IaaS concepts can't be directly seen and controlled.  I know that everyone doesn't accept such a definition of PaaS, but I use it as a means to help explain my thoughts (please don't just comment exclusively on this definition as it's not the main point of this blog post).  By IaaS++ I mean an environment that adds to IaaS offering services like continuous delivery workflows, high availability fault domains/automatic recovery, cross instance networking with software defined networking security, and operational visibility through monitoring.  And by not calling it PaaS, I suggest that the level of visibility into this environment includes IaaS concepts such as (VM) instances through ssh or other commonly used *nix tools, full TCP network stack access, full OS's with process and file system control, etc.

On the other hand, we have systems growing around resource management systems and schedulers using "The Datacenter as a Computer" that are predominantly tied to containers.  I'll admit that I'm only partially through the book on the subject (now in 2nd edition).  Some of the systems in open source to implement such datacenter as the computer/warehouse scale machines are Yarn (for Hadoop), CoreOS/Fleet, Mesos/Marathon and Google Kubernetes.

At Dockercon, IBM (and yours truly) demoed a Docker container deployment option for the IBM SoftLayer cloud.  We used our cloud services fabric (partially powered by NetflixOSS technologies) on top of this deployment option as the IaaS++ layer.  Given IBM SoftLayer and its current API doesn't support containers as a deployment option, we worked to implement some of ties to the IaaS technologies as part of the demo reusing the Docker API.  Specifically, we showcased an autoscaling service for automatic recovery, cross availability zone placement, and SLA based scaling.  Next we used the Docker private registry along side the Dockerhub public index for image management.  Finally we did specific work to natively integrate the networking from containers into the SoftLayer network.  Doing this networking work was important as it allowed us to leverage existing IaaS provided networking constructs such as load balancers and firewalls.

Last night I watched the Kubernetes demo at Google I/O by Brendan Burns and Craig McLuckie.  The talk kicks off with an overview of the Google Compute Engine VM optimized for containers and then covers the Kubernetes container cluster management open source project which includes a scheduler for long running processes, a labeling system that is important for operational management, a replication controller to scale and auto recover labeled processes, and a service abstraction across labeled processes.

I encourage you to watch the two demo videos before proceeding, as I don't want to force you into thinking only from my conclusions.  Ok, so now that you've watched the videos yourself, let me use the two videos to look at use case comparison points (the links now jump to the right place in each video that are similar):

Fast development and deployment at scale

Brendan demonstrated rolling updates on the cloud.  In the IBM demo, we showed the same, but as an initial deployment on a laptop.  As you see later in the demo, due to the user of Docker, running on the cloud is exactly the same as the laptop.  Also, the IBM cloud services fabric devops console - NetflixOSS Asgard also has the concept of rolling updates as well as the demonstrated initial deployment.  Due to Docker, both demos use essentially the same approach to image creation/baking.

Automatic recovery

I like how Brendan showed through a nice UI the failure and recovery as compared to me watching log files of the health manager.  Other than presentation, the use case and functionality was the same.  The system discovered a failed instance and recovered it.

Service registration

Brendan talked about how Kubernetes offers the concept of services based on tagging.  Under the covers this is implemented by a process that does selects against the tagged containers updating an etcd service registry.  In the cloud services fabric demo we talked about how this was done with NetflixOSS Eureka in a more intrusive (but maybe more app centric valuable) way.  I also have hinted about how important it is to consider availability in your service discovery system.

Service discovery and load balancing across service implementations

Brenda talked about in Kubernetes how this is handled by, currently, a basic round robin load balancer.  Under the covers each Kubernetes node starts this load balancer and any defined service gets started on the load balancer across the cluster with information being passed to client containers via two environment variables, one for the address for the Kubernetes local node load balancer, and one for the port assigned to a specific service.  In the cloud services fabric this is handled by Eureka enabled clients (for example NetflixOSS Ribbon for REST), which does not require a separate load balancer and is more direct and/or the similar NetflixOSS Zuul load balancer in cases where the existing clients can't be used.

FWIW, I haven't seen specifically supported end to end service registration/discovery/load balancing in non-Kubernetes resource managers/schedulers.  I'm sure you could build something similar on top of Mesos/Marathon (or people already have) and CoreOS/etcd, but I think Kubernetes concept of labels and services (much like Eureka) are right in starting to integrate the concept of services into the platform as they are so critical in microservices based devops.

I could continue to draw comparison points for other IaaS++ features like application centric metrics, container level metrics, dynamic configuration management, other devops workflows, remote logging, service interaction monitoring, etc, but I'll let that to the reader.  My belief is that many of these concepts will be implemented in both approaches, as they are required to run an operationally competent system.

Also, I think we need to consider tougher points like how this approach scales (in both demos, under the covers networking was implemented via a subnet per Docker host, which wouldn't necessarily scale well), approach to cross host image propagation (again, both demos used a less than optimal way to push images across every node), and integration with other important IaaS networking concepts (such as external load balancers and firewalls).

What is different?

The key difference that I see in these systems is terminology and implementation.

In the IBM demo, we based the concept of a cluster on what Asgard defines as a cluster.  That cluster definition and state is based on multiple separate, but connected by version naming, auto scaling groups.  It is then, the autoscaler that decides placement based on not only "resource availability", but also high availability (spread deployments across distinct failure domains) and locality policies.  Most everyone is available with the concept of high availability in these policies in existing IaaS - in SoftLayer we use Datacenters or pods, in other clouds the concept is called "availability zones".  Also, in public clouds, the policy for co-location is usually called "placement groups".

Marathon (a long running scheduler on top of the Mesos resource manager), offers these same concepts through the concept of constraints.  Kubernetes today doesn't seem, today, to offer these concepts likely due to its initial focus on smaller scenarios.  Given its roots in Google Omega/Borg, I'm sure there is no reason why Kubernetes couldn't eventually expose the same policy concepts within its replication controller.  In fact, at the end of the Kubernetes talk, there is a question from the crowd on how to make Kubernetes scale across multiple Kubernetes configurations which could have been asked from a more high-availability.

So to me, the concept of an autoscaler and its underlying implementation seems very similar to the concept of a resource manager and scheduler.  I wonder if public cloud auto scalers were open sourced if they would be called resource managers and long running schedulers?

The reason why I ask all of this is as we move forward with containers, I think we might be tempted to build another cloud within our existing clouds.  I also think the Mesos and Kubernetes technologies will have people building clouds within clouds until cloud providers natively support containers as a deployment option.  At that point, will we have duplication of resource management and scheduling if we don't combine the concepts?  Also, what will people do to integrate these new container deployments with other IaaS features like load balancers, security groups, etc?

I think others are asking the same question as well.  As shown in the IBM Cloud demo, we are thinking through this right now.  We have also experimented internally with OpenStack deployments of Docker containers as the IaaS layer under a similar IaaS++ layer.  The experiments led to a similar cloud container IaaS deployment option leveraging existing OpenStack approaches for resource management and scheduling as compared to creating a new layer on top of OpenStack.  Also, there is a public cloud that has likely considered this a long time ago - Joyent.  Joyent has had SmartOS zones which are similar to containers under its IaaS API for a long time without the need to expose the formal concepts of resource management and scheduling to its users.  Also, right at the end of the Kubernetes demo, someone in the crowd asks the same question.  I took this question to ask, when will the compute engine support container deployment this way without having a user setup their own private set of Kubernetes systems (and possibly not have to consider resource management/scheduling with anything more than policy).

As I said in the intro, I'm still learning here.  What are your thoughts?

Friday, June 20, 2014

Quick notes on a day of playing with Acme Air / NetflixOSS on Kubernetes

I took Friday to play with the Kubernetes project open sourced by Google at Dockercon.

I was able to get a basic multi-tier Acme Air (NetflixOSS enabled) application working. I was able to reuse (for the most part) containers we built for Docker local (laptop) from the IBM open sourced docker port. By basic, I mean the front end Acme Air web app, back end Acme Air authentication micro service, Cassandra node and Acme Air data loader, and the NetflixOSS Eureka service discovery server. I ran a single instance of each, but I believe I could have pretty easily scaled up each instance of the Acme Air application itself easily.

I pushed the containers to Dockerhub (as Kubernetes by default pulls all container images from there). This was as pretty easy using these steps:

1. Download and build locally the IBM Acme Air NetflixOSS Docker containers
2. Login to dockerhub (needed once I did a push) via 'docker login'
3. Tag the images - docker tag [imageid] aspyker/acmeair-containername
4. Push the containers to Dockerhub - docker push aspyker/acmeair-containername

I started each container as a single instance via the cloudcfg script:

cluster/ -p 8080:80 run aspyker/acmeair-webapp 1 webapp

I started with "using it wrong" (TM, Andrew 2014) with regards to networking. For example, when Cassandra starts, it needs to know about what seed and peer nodes exist and Cassandra wants to know what IP addresses these other nodes are at. For a single Cassandra node, that means I needed to update the seed list to the IP address of the Cassandra container's config file to itself. Given our containers already listen on ssh and run supervisord to run the container function (Cassandra in this case), I was able to login to the container, stop Cassandra, update the config file with the container's IP address (obtained via docker inspect [containerid] | grep ddr), and restart Cassandra. Similarly I needed to update links between containers (for how the application/micro service found the Cassandra container as well how the application/micro service found Eureka). I could ssh into those containers and update routing information that exists in NetflixOSS Archaius config files inside of the applications.

This didn't perfectly work as the routing in NetflixOSS powered by Ribbon and Eureka use hostnames by default. The hostnames currently assigned to containers in Kubernetes are not resolvable by all other containers (so when the web app tried to route to the auth service based on the hostname registered and discovered in Eureka, it failed with UnknownHostException). We hit this in our SoftLayer runs as well and had patched Eureka client to never register the hostname.  I had asked about this previously on the Eureka mailing list and discovered this is something that Netflix fixes internally in Ribbon. I ended up writing a patch for this for Ribbon to just use IP addresses and patched the ribbon-eureka module in Acme Air.

At this point, I could map the front end web app instance to the Kubernetes minion host via cloudcfg run -p 8080:80 port specification and access Acme Air from the Internet in my browser.

My next steps are to look are running replicationControllers around the various tiers of the application as well as making them services so I can use the Kubernetes built in service location and routing.  I can see how to do this via the guestbook example.  In running that example I can see how if I "bake" into my images an idea of a port for each service, I can locate the port via environment variables.  Kubernetes will ensure that this port is routing traffic to the right service implementations on each Kubernetes host via a load balancer.  That will mean that I can start to route all eureka traffic to port 10000, all web app traffic to port 10001, all Cassandra traffic to port 10002, all auth micro service traffic to port 10003 for example.  This approach sounds pretty similar to an approach used at Netflix with Zuul.

Beyond that I'll need to consider additional items like:

1.  Application data and more advanced routing in the service registration/location

2.  How available the service discovery is, especially as we consider adding availability zones/fault domains.

3.  How do I link this into front facing (public internet) load balancers?

4.  How would I link in the concept of security groups?  Or is the port exposure enough?

5.  How I could start to do chaos testing to see how well the recovery and multiple fault domains works.

I do want to thank the folks at Google that helped me get through the newbie GCE and Kubernetes issues (Brendan, Joe and Daniel).

Tuesday, June 10, 2014

Docker SoftLayer Cloud Talk at Dockercon 2014

The overall concept

Today at Dockercon, Jerry Cuomo went over the concept of borderless cloud and how it relates to IBM's strategy.  He talked about how Docker is one of the erasers of the lines between various clouds with regards to openness.  He talked about how, regardless of vendor, deployment option and location, we need to focus on the following things:


Especially in the age of devops and continuous delivery how lack of speed is a killer.  Even worse, actually unforgivable, having manual steps that introduce error is not acceptable any longer.  Docker helps with this by having layered file systems that allow for just updates to be pushed and loaded.  Also, with its process model it starts as fast as you'd expect your applications to start.  Finally, Docker helps by having a transparent (all the way to source) description model for images which guarantees you run what you coded, not some mismatch between dev and ops.


Optimized means not only price/performance but also optimization of location of workloads.  In the price/performance area IBM technologies (like our IBM Java read-only memory class sharing) can provide for much faster application startup and less memory when similar applications are run on a single node.  Also, getting the hypervisor out of the way can help I/O performance significantly (still a large challenge in VM based approaches) which will help data oriented applications like Hadoop and databases.


Openness of cloud is very important to IBM, just like it was for Java and Unix/Linux.   Docker can provide the same write once, run anywhere experience for cloud workloads.  It is interesting how this openness combined with the fast/small also allows for advances in devops not possible before with VM's.  It is now possible to now run production like workload configurations on premise (and on developer's laptops) in almost the exact same way as deployed in production due to the reduction in overhead vs. running a full virtual machine.


Moving fast isn't enough.  You have to most fast with responsibility.  Specifically you need to make sure you don't ignore security, high availability, and operational visibility when moving so fast.  With the automated and repeatable deployment possible with Docker (and related scheduling systems) combined with micro-service application design high availability and automatic recovery becomes easier.  Also, enterprise deployments of Docker will start to add to the security and operational visibility capabilities.

The demo - SoftLayer cloud running Docker

After Jerry covered these areas, I followed up with a live demo.

On Monday, I showed how the technology we've been building to host IBM public cloud services, the Cloud Services Fabric (CSF), works on top of Docker.  We showed how the kernel of the CSF, based in part on NetflixOSS, and powered by IBM technologies was fully open source and easily run on a developer's laptop.  I talked about how this can even allow developers to Chaos Gorilla test their micro-service implementations.

I showed how building the sample application and its microservice was extremely fast.  Building an update to the war file took more time than containerizing the same war for deployment.  Both were done in seconds.  While we haven't done it yet, I could imagine eventually optimizing this to container generation as part of an IDE auto compile.

In the demo today, I followed this up with showcasing how we could take the exact same environment and marry it with the IBM SoftLayer public cloud.  I took the exact same sample application container image and instead of loading locally, pushing through a Docker registry to the SoftLayer cloud.  The power of this portability (and openness) is very valuable to our teams as it will allow for local testing to mirror more closely production deployment.

Finally, I demonstrated how adding SoftLayer to Docker added to the operational excellence.  Specifically I showed how once we told docker to use a non-default bridge (that was assigned a SoftLayer portable subnet attached to the host private interface), I could have Docker assign IP's out of a routable subnet within the SoftLayer network.  This networking configuration means that the containers spun up would work in the same networks as SoftLayer bare metal and virtual machine instances transparently around the global SoftLayer cloud.  Also, advanced SoftLayer networking features such as load balancers and firewalls would work just as well with the containers.  I also talked about how we deployed this across multiple hosts in multiple datacenters (availability zones) further adding to the high availability options for deployment.  To prove this, I unleashed targeted chaos army like testing.  I showed how I could emulate a failure of a container (by doing a docker rm -f) and how the overall CSF system would auto recover by replacing the container with a new container.

Some links

You can see the slides from Jerry's talk on slideshare.

The video:

Direct Link (HD Version)

Saturday, June 7, 2014

Open Source Release of IBM Acme Air / NetflixOSS on Docker

In a previous blog, I discussed the Docker "local" (on laptop) IBM Cloud Services Fabric powered in part by NetflixOSS prototype.

One big question on twitter and my blog went unanswered.  The question was ... How can someone else run this environment?  In the previous blog post, I mentioned how there was no plan to make key components open source at that point in time.

Today, I am pleased to announce that all of the components to build this environment are now open source and anyone can reproduce this run of IBM Acme Air / NetflixOSS on Docker.  All it takes is about an hour, a decent internet connection, and a laptop with VirtualBox (or boot2docker, or vagrant) installed.

Specifically, the aspects that we have added to open source are:

  1. Microscaler - a small scale instance health manager and auto recovery/scaling agent that works against the Docker remote API.  Specifically we have released the Microscaler service (that implements a REST service), a CLI to make calling Microscaler easier, and a Microscaler agent that is designed to manage clusters of Docker nodes.
  2. The Docker port of the NetflixOSS Asgard devops console.  Specifically we ported Asgard to work against the Docker API for managing IaaS objects such as images and instances as well as the Microscaler API for clusters.  The port handles some of the most basic CRUD operations in Asgard.  Some scenarios (like canary testing, red/black deployment) are yet to be fully implemented.
  3. The Dockerfiles and build scripts that enable anyone to build all of the containers required to run this environment.  The Dockerfiles build containers of the Microscaler, the NetflixOSS infrastructural servers (Asgard, Eureka and Zuul), as well as the full microservices sample application Acme Air (web app, microservice and cassandra data tier).  The build scripts help you build the containers and give easy commands to do the end to end deployment and common administration tasks.
If you want to understand what this runtime showcases, please refer to the previous blog entry.  There is a video that shows the Acme Air application and basic chaos testing that proves the operational excellence of the environment.

Interesting compare:

It is interesting to note that the scope of what we released (the core of the NetflixOSS cloud platform + the Acme Air cloud sample/benchmark application) is similar to we previously released back at the Netflix Cloud Prize in the form of Amazon EC2 AMI's.  I think it is interesting to consider the difference when using Docker in this release as our portable image format.  Using Docker, I was able to easily release the automation of building the images (Dockerfiles) in source form which makes the images far more transparent than an AMI in the Amazon marketplace.  Also, the containers built can be deployed anywhere that Docker containers can be hosted.  Therefore, this project is going to be valuable to far more than a single cloud provider -- likely more on that later as Dockercon 2014 happens next week.

If you want to learn how to run this yourself, check out the following video.  It shows building the containers for open source, starting an initial minimal environment and starting to operate the environment.  After that go back to the previous blog post and see how to perform advanced operations.

Direct Link (HD Version)