Friday, October 17, 2014

Kafka Part 1

Just a blog post to document some things I learned as I started to play with Kafka.  Hopefully others are doing similar work and can google-able benefit ...

I wanted trying to repeat (as a starting point) the performance blog post on Kafka.  Here are the issues I ran into and how I fixed them.

I first started to play with Kafka on a simple instance and choose m3.medium.  While this gave me a place to play with Kafka functionally if I was going to push 75M/sec to the Kafka logs, given the on-instance storage is 4GB, I was going to run out of storage in 4*1024/75 = 55 seconds.  I decided to move up to i2.xlarge which gives me 800G which is 800*1024/75 = 11000 seconds = 3 hours.

After I got basic functional testing, I started to look for the performance testing tools.  The fore-mentioned blog shows a new performance test client available in > 0.8.1 Kafka.  Unfortunately 0.8.2 isn't available pre-compiled yet and the performance client uses the new Java client.  So off I went to compile Kafka from source (using trunk).

Kafka builds using gradle -- something I'm used to.  Weirdly, Kafka doesn't have the gradle wrapper checked in.  This means you have to pay attention to the readme which says clearly you need to install gradle separately and run gradle once to get the gradle wrapper.  Unfortunately, I tried to "fix" the lack of a wrapper without reading the docs (doh!).  I just added another version of gradle wrapper to kafka/gradle/wrapper.  This let me build, but I immediately, got the error:

Execution failed for task ':core:compileScala'.
> com.typesafe.zinc.Setup.create(Lcom/typesafe/zinc/ScalaLocation;Lcom/typesafe/zinc/SbtJars;Ljava/io/File;)Lcom/typesafe/zinc/Setup;

Running gradle with --info showed me that there was a class mismatch error launching the zinc plugin.  I was able to get rid of the error by changing the dependency in build.gradle for the latest zinc compiler.  Doing this made me think if there was a dependency issue between gradle and the zinc plugin.  Once I realized this, I re-read the readme where it asks you to install gradle first and then run gradle once to get the correct wrapper.  Oops.  After following the directions (TM), trunk compiles worked.

I then did some basic single host testing and I was up and running on Kafka trunk.  I then tried to create multiple servers and separated the client from the server and created replicated topics.  This fails horribly if you have issues with hostnames.  Specifically, you will see upon creating a replicated topic the following error hundreds of times before a server crashes and then the next will do the same thing.

[ReplicaFetcherThread-0-0], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 4; ClientId: ReplicaFetcherThread-0-0; ReplicaId: 2; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [my-replicated-topic,0] -> PartitionFetchInfo(0,1048576). Possible cause: java.nio.channels.ClosedChannelException (kafka.server.ReplicaFetcherThread)

Upon looking at the more detailed logs, I noticed it was trying to connect to the other servers by the hostname they knew themselves (vs. DNS).  I'm guessing that the server upon startup does a inetaddress.getlocalhost().gethostname() and then registers that name in Zookeeper.  When the other servers then try to connect unless that hostname is resolvable (mine wasn't), it just won't work.  You'll see similar issues with the client to server connection saying certain servers aren't resolvable.  For now, I have solved this by adding the hostname mapping to ip address in all clients and servers /etc/hosts.  I'll certainly need to see how this can be fixed automatically without the need for DNS (hopefully leveraging Eureka discovery eventually).

The fore-mentioned blog says there was no tuning done addition to "out of the box", but upon inspection of the, it seems like there are either changes that were trivial or have changed in trunk since blog was written.  Something to come back to eventually, but for not I decided to just use the out-of-box so I could get a baseline and then start tweaking.

I was then able to spin up five c3.xlarge instances as clients which should have enough CPU to push the servers.

I was able to get the following results:

Single producer thread, no replication - 657431 records/sec @ 62.70 MB/sec.

This is 80% of the original blog.  I'm guessing I'll be able to tune a bit to get there as I've seen initial peaks hit 710K.  First performance results are always low or wrong.

Then I see Three producer threads, no replication:

client 1 - 679412 records/sec @ 64.79 MB/sec
client 2 - 678398 records/sec @ 64.70 MB/sec
client 3 - 678150 records/sec @ 64.67 MB/sec
total ~ 2035000 records/sec @ 94MB/sec

This is about 3X the previous result as expected.  I also did vmstat and netstat -a -n |grep 9092 to confirm what was going on to help me understand the clustering technology.

At five clients, things dropped off in an unexpected way.  I've hit some bottleneck.  I basically get lower results per client that in total add up to the 3 node result.

I have also played with replicated topics, but won't post those numbers yet.

Things I'm thinking I need to investigate (in order of priority):

1.  Networking speed.  It looks like I am limited by network (or disk, but doubt it) at this point.  I am new to Amazon, so when the table says moderate network performance, I'm not sure how much that means.  I'm more used to 100M or 1G, etc.

2.  Enhanced networking performance.  While I launched HVM images, I'm not sure yet in the base image I used has the additional HVM drivers.

3.  Tuning of the settings for Kafka.

4.  Java tuning.

5.  Different instance types.  HS1 would give me a way to split the traffic between volumes.

6.  Kafka code.

All this said, I need to also play with message sizes (larger and mixed) and look at some of the expected application access patterns - expanding into consumers as well.

More in part 2 coming soon.

Sunday, October 12, 2014

Performance Testing via "Microservice Tee" Pattern

In my last blog post, I warned that I might be too busy to blog with the new job.  I received feedback from many folks that they valued my blog and wanted me to keep blogging in the new job.  Ask and you shall receive.

This blog post actually started with a cool story I heard in Gene Kim's talk at Dockercon.  He discussed how Facebook used existing user's sessions to send test messages to their new messenger service well before it's actual release.  It's well worth listening to that part of the video as it gives you some idea of what I'm about to discuss.  Net net, the approach described here is the only way Facebook could have been confident in releasing the new messenger service without risk of performance being a problem on day one.

I had actually heard something somewhat similar when Netflix open sourced Zuul.  I heard stories of how you could use Zuul to direct a copy (or in Unix terms "tee") of all requests to a parallel cluster that is running active profiling.  The real request goes down the main path to the main production cluster, but at the same time a copy goes down a duplicate path and performance instrumented cluster.  The request to the non-critical performance cluster would be asynchronous and the response should be ignored (fire and forget basically).  As a performance engineer that has helped many IBM customers I started to salivate about such an approach.  I was excited given how this would mean that you could then do performance analysis on services with exact production traffic with very minimal impact on the upstream system.

Many folks I've helped in the past needed such an approach.  To see why, let's discuss what happens typically without following this approach.  As a good performance engineer I suggest getting in early while code is being developed.  What that means is creating load testing scripts with virtual users and running these while the code is being developed, focusing on what function is available at the time.  The problem here is rooted in the word "virtual".  Not only are the users in this approach artificially modeled, but also the timing between requests, data sent, etc is at best estimated.  While this approach works with careful consideration most of the time, many times I've seen a service go into production that fails in performance.  Why?  Simply put -- bad assumptions in the artificial load testing failed to model something that wasn't obvious about how users were going to interact with the service.

So far, none of this relates to what I've seen since I joined Netflix.  Interestingly, in the four days I've worked thus far, I've heard a similar approach for two major projects at Netflix.  Two technologies that would have a large impact on Netflix are following the same approach.  The idea is to deploy a new version of a technology and make sure initially all requests, while serviced by the older version, get sent in duplicate to the new technology.  That way, Netflix can assure the new technology is behaving as expected with exactly the production traffic it will eventually receive.  You can use such an approach to do basic functional testing of your new code.  You can then do performance testing much like the performance instrumented tee described above.  Finally, you can even take this beyond performance testing by doing chaos testing (high availability) and scale up/down (capacity and elastic scaling) testing on the new technology implementation without fear of what it would do to the parallel production path.

I'm not sure if there is already a name for this approach.  If you find other descriptions of this formally, let me know.  For now, I think I'll call it the "Performance testing via microservices tee" pattern.

Ok.  Time to go back to going deep at Netflix.

PS.  I have now heard the type of testing called "shadow traffic".  Doesn't describe how to achieve it, but still a good word.

Monday, October 6, 2014

Red Is the New Blue

I am excited to announce that I will be starting a new step in my career tomorrow.  After over 15 years with IBM, I will be joining Netflix (hence the Orange is the New Black homage in the title) starting Tuesday.

In my time at IBM I've had the honor to work with incredible colleagues, on amazing projects, and through exciting times.  The work on IBM Host On-Demand (the first commercial Java product in the world), IBM WebSphere Application Server function and performance (starting with V4), WebSphere's SOA (WS-*, Business Process Management and Enterprise Service Bus) and IBM's XML (XSLT2/XQuery) technology over the years has been challenging, interesting, and most importantly fun.  More recently in the time spent in the IBM's Emerging Technology Institute I was able to work on performance and then scale and then cloud and then on the creation of the Cloud Services Fabric (CSF).  This focus on cloud was very exciting to be a part of and I believe the technology I was a part of will continue at IBM for some time, being part of what helps IBM public cloud services have rock solid operational capabilities empowering development to deliver function quickly to customers with the insight needed to adjust quickly to opportunities that present themselves.  I'd like to personally thank all of those people who have worked beside me over the years.  Unfortunately with the length of my career at IBM, I cannot thank each individual person here.  Just know that every team I worked on is near and dear to my heart.

This said, it was time in my career to make a change.  I will be joining the cloud platform engineering team at Netflix is Los Gatos, California.   I will further explain the role as appropriate in the near future. I have had the opportunity, through NetflixOSS, to work with many of my soon to be colleagues already.  The technology, teams, and challenges ahead of me are all very exciting.  Below are some points of comparison that all factored into my decision to make a change in my career:
  1. Big company to less big company
    • Except for a short stints in college and in the Raleigh/Durham startup scene in 2000, I have worked for one of the largest companies in the world for the rest of my career.  It is amazing how many people I know across the world and across the various business units of IBM.  The scope of projects that can be done with such resources is great.  I hope to replace the same scope at Netflix with the intense technology culture and collaboration with similar companies in the valley.
    • Some stats for the compare:  Employees (~400,000 IBM, ~ 2,000 Netflix), Market cap ($189B IBM, $27B Netflix), Countries served (~170 IBM, ~50 Netflix - and growing every day)
  2. Enterprise to consumer
    • Enterprise is a challenging market to succeed in.  The expectations are high and the legacy is strong.  Keeping the world's governments, banks, insurance companies, hospitals, and manufacturers running is no easy task.  I am proud to have been a part of powering many of those critical businesses.
    • In moving to the consumer side, I expect to learn a new view on computing which was tough to focus on in the enterprise market.  I think the size and scale of the Netflix platform and service will be a huge change and learning experience for me personally.  I doubt Netflix is stopping to 50 million subscribers and it is clear that they will continue to help drive Internet bandwidth policy discussions across the globe.
    • I believe technology to power the consumer offerings is contributing significantly to technology innovation.  You don't have to look far beyond the cloud vendors to see the consumer companies and system of engagement applications that drove the initial creation of cloud.  The same holds true for mobile.  Other aspects of technology like NoSQL, containers, and large scale distributed computing also were born out of the necessity to implement solutions to consumer company problems only due to their mobile scale and always on requirements.
  3. Research Triangle Park (North Carolina) to Bay Area (California)
    • This was not only the hardest aspect of my decision to make a change, but was a key motivating factor to make a change.
    • In my career it is easy for me to make personal decisions.  The decision to move to the Bay area seems easy to someone that wants to create technology.  In the past, I have considered areas like Boston, New York City, or Texas.  All of these areas create more innovative technology than Research Triangle Park these days.  In the last year or so, I have witnessed the amazing collaboration that exists between companies in the Bay Area.  I believe the collaboration possible in the Bay area exceeds that of other the tech areas.  In the end, I came to the decision that there is no better place in my personal career in technology.
    • In my personal and family life moving from the east to the west coast is a huge change.  Our family is well established on the Atlantic coast.  We have amazing friends and neighbors in the Raleigh/Durham area.  Without going into personal details, I do think while the change will be hard, there are still very good reasons for a family to settle in the Bay area.  With kids that are interested in technology I know of Bay area opportunities to share in learning with them.  Also, I hope the access to great California outdoor space with moderate temperatures will be a reason to spend more time with the family enjoying the surroundings.
So today I am sitting on a plane with the first one way ticket I've had in a long time.  I've got my new Netflix red shoes on (see below) ready to take a first step on a huge new journey.  I'm sure I'll be swamped for the first month or so, so don't expect too many blog updates.  Once I get my feet firmly back on the ground, I'll let you know more about what I am doing.

Thanks!  If you get to the San Jose area, look me up!

Monday, August 18, 2014

A NetflixOSS sidecar in support of non-Java services

In working on supporting our next round of IBM Cloud Service Fabric service tenants, we found that the service implementers came from very different backgrounds.  Some were skilled in Java, some Ruby and others were C/C++ focused and therefore their service implementations were just as diverse.  Given the size of the team of the services we're on-boarding and timeframe for going public, recoding all of these services to use NetflixOSS Java libraries that bring the operational excellence (like Archaius, Karyon, Eureka, etc) seemed pretty unlikely.

For what it is worth, we faced a similar challenge in earlier services (mostly due to existing C/C++ applications) and we created what was called a "sidecar".  By sidecar, what I mean is a second process on each node/instance that did Cloud Service Fabric operations on behalf of the main process (the side-managed process).  Unfortunately those sidecars all went off and created one-offs for their particular service.  In this post, I'll describe a more general sidecar that doesn't force users to have these one-offs.

Sidenote:  For those not familiar with sidecars, think of the motorcycle sidecar below.  Snoopy would be the main process with Woodstock being the sidecar process.  The main work on the instance would be the motorcycle (say serving your users' REST requests).  The operational control is the sidecar (say serving health checks and management plane requests of the operational platform).

Before we get started, we need to note there are multiple types of sidecars.  Predominantly there are two main types of sidecars.  There are sidecars that manage durable and or storage tiers.  These sidecars need to manage things that other sidecars do not (like joining a stateful ring of servers, or joining a set of slaves and discovering masters, or backup and recovery of data).  Some sidecars that exist in this space are Priam (for Cassandra) and Exhibitor (for Zookeeper).  The other type is for managing stateless mid-tier services like microservices.  An example of this is AirBNB's Synapse and Nerve.  You'll see that in the announcement of Synapse and Nerve on AirBNB's blog that they are trying to solve some (but not all) of the issues I will mention in this blog post.

What are some things that a microservice sidecar could do for a microservice?

1. Service discovery registration and heartbeat

This registration with service discovery would have to happen only after the sidecar detects the side-managed process as ready to receive requests.  This isn't necessarily the same as if the instance is "healthy" as an instance might be healthy well before it is ready to handle requests (consider an instance that needs to pre-warm caches, etc.).  Also, all dynamic configuration of this function (where and if to register) should be considered.

2.  Health check URL

Every instance should have a health check url that can communicate out of band the health of an instance.  The sidecar would need to query the health of the side-managed process and expose this url on behalf of the side-managed process.  Various systems (like auto scaling groups, front end load balancers, and service discovery queries) would query this URL and take sick instances out of rotation.

3.  Service dependency load balancing

In a NetflixOSS based microservice, routing can be done intelligently based upon information from service discovery (Eureka) via smart client side load balancing (Ribbon).  Once you move this function out of the microservice implementation, as AirBNB noted as well, it is likely unneeded and problematic in some cases to move back to centralized load balancing.  Therefore it would be nice if the sidecar would perform load balancing on behalf of the side-managed process.  Note that Zuul (on instance in the sidecar) could fill this role in NetflixOSS.  In AirBNB's stack, the combination of service discovery and this item is done through Synapse.  Also, all dynamic configuration of this function (states of routes, timeouts, retry strategy, etc) should be considered.

One other area to consider here (especially in the NetflixOSS space) would be if the sidecar should provide for advanced devops filters in load balancing that go beyond basic round robin load balancing.  Netflix has talked about the advantages of Zuul for this in the front/edge tier, but we could consider doing something in between microservices.

4.  Microservice latency/health metrics

Being able to have operational visibility into the error rates on calls to dependent services as well as latency and overall state of dependencies is important to knowing how to operate the side-managed process.  In NetflixOSS by using the Hystrix pattern and API, you can get such visibility through the exported Hystrix streams.  Again, Zuul (on instance in the sidecar) can provide this functionality.

5.  Eureka discovery

We have found service implementation in IBM that already have their own client side load balancing or cluster technologies.  Also, Netflix has talked about other OSS systems such as Elastic Search.  For these systems it would be nice if the sidecar could provide a way to expose Eureka discovery outside of load balancing.  Then the client could ingest the discovery information and use it however it felt necessary.  Also, all dynamic configuration of this function should be considered.

6.  Dynamic configuration management

It would nice if the sidecar could expose to the side-managed process dynamic configuration.  While I have mentioned the need to have previous sidecar functions items dynamically configured, it is important that the side-managed process configuration to be considered as well.  Consider the case where you want the side-managed process to use a common dynamic configuration management system but all it can do is read from property files.  In NetflixOSS this is managed via Archaius but this requires using the NetflixOSS libraries.

7.  Circuit breaking for fault tolerance to dependencies

It would nice if the sidecar could provide an approximation of circuit breaking.  I believe this is impossible to do as "cleanly" as using NetflixOSS Hystrix natively (as this wouldn't require the user to write specific business logic to handle failures that reduce calls to the dependency), but it might be nice to have some level of guarantee of fast failure of scenarios using #3.  Also, all dynamic configuration of this function (timeouts, etc) should be considered.

8.  Application level metrics

It would be nice if the sidecar provided could allow the side-managed process to more easily publish application specific metrics to the metrics pipeline.  While every language likely already has a nice binding to systems like statsd/collectd, it might be worth making the interface to these systems common through the sidecar.  For NetflixOSS, this is done through Servo.

9. Manual GUI and programmatic control

We have found the need to sometimes quickly dive into a specific instance with human eyes.  Having a private web based UI is far easier than loading up ssh.  Also, if you want to script access to the functions and data collected by the sidecar, we would like a REST or even JMX interface to the control offered in the sidecar.

This all said, I started a quick project last week to create a sidecar that does some of these functions using NetflixOSS so it integrated cleanly into our existing IBM Cloud Services Fabric environment.  I decided to do it in github, so others can contribute.

By using Karyon as a base for the sidecar, I was able to get a few of the items on the list automatically (specifically #1, #2 partially and #9).  I started with the most basic sidecar in the trunk project.  Then I added two more things:

Consul style health checks:

In work leading up to this work Spencer Gibb pointed me to the sidecar agents checks that Consul uses (which they said they based on Nagios).  I based a similar set of checks for my sidecar.  You can see in this archaius config file how you'd configure them: a script that curls the healthcheck url of the sidemanaged process 8080 / a script that tests if /opt/sidecarscripts/killswitch.txt exists

Specifically you define a check as an external script that the sidecar executes and if the script returns a code of 0, the check is marked as healthy (1 = warning, otherwise unhealthy).  If all checks defined come back as healthy for greater than three iterations, the instance is healthy.  I have coded up some basic shell scripts that we'll likely give to all of our users (like and  Once I had these checks being executed by the sidecar, it was pretty easy to change the Karyon/Eureka HealthCheckHandler class to query the CheckManager logic I added.

Integration with Dynamic Configuration Management

We believe most languages can easily register events based on files changing and can easily read properties files.  Based on this, I added another feature configured this archiaus config file:

What this says is that a user of the sidecar puts all of the properties they care about in the file.template properties file and then as configuration is dynamically updated in Archaius the sidecar sees this and writes out a copy to the main properties file with the values filled in.

With these changes, I think we now have a pretty solid story for #1, #2, #6 and #9.  I'd like to next focus on #3, #4, and #7 adding a Zuul and Hystrix based sidecar process but I don't have users (yet) pushing for these functions.  Also, I should note that the code is a proof of concept and needs to be hardened as it was just a side project for me.

PS.  I do want to make it clear that while this sidecar approach could be used for Java services (as opposed to languages that don't have NetflixOSS bindings), I do not advocate moving these functions to external to your Java implementation.  There are places where offering this function in a side-car isn't as "excellent" operationally and more close to "good enough".  I'll let it to the reader to understand these tradeoffs.  However, I hope that work in this microservice sidecar space leads to easier NetflixOSS adoption in non-Java environments.

PPS.  This sidecar might be more useful in the container space as well at a host level.  Taking the sidecar and making it work across multiple single process instances on a host would be an interesting extension of this work.

Wednesday, August 6, 2014

Sidecars and service registration

I have been having internal conversations on sidecars to manage microservices.  By sidecar, I mean a separate process on each instance node that performs things on behalf of the microservice instance like service registration, service location (for dependencies), dynamic configuration management, service routing (for dependencies), etc.  I have been talking about how an in-process (vs. sidecar) approach to provide these functions while intrusive (requires every microservice to code to or implement a certain framework) is better.  I believe it is hard for folks to understand why things are "better" without actually running into nasty things that happen in real world production scenarios.

Today I decided to simulate a real world scenario.  I decided to play with Karyon which is the NetflixOSS in process technology to manage bootstrap and lifecycle of microservices.  I did the following:

  1. I disabled registry queries which Karyon by default does for the application assuming it might need to look up dependencies (eureka.shouldFetchRegistry=false).  I did this just to simply the timing of pure service registration.
  2. I "disabled" heartbeats for service registration (eureka.client.refresh.interval=60).  Again, I did this just to simplify the timing of initial service registration.
  3. I shortened the time for the initial service registration to one second (eureka.appinfo.initial.replicate.time=1).  I did this to be able to force the registration to happen immediately.
  4. I added a "sleep" to my microservice registration (@Application initialize() { .. Thread.sleep(1000*60*10) } ).  I did this to simulate a microservice that takes some time to "startup".
Once I did this, I saw the following:

The service started up and immediately called initialize, but of course this stalled.  The service also then immediately registered itself into the Eureka service discovery server.  At this point, a query of the service instance in the service registry returns a status of "STARTING".  After 10 minutes, the initialization finishes.  At this later point, the query of the service instance returns a status "UP".  Pretty sensible, no?

I then started to think if a sidecar could somehow get this level of knowledge by poking it's side-managed process.  If you look at Airbnb Nerve (a total sidecar based approach) it does exactly this.  I could envision a Eureka sidecar that was similar to Nerve that pinged the "healthcheck URL" already exposed by Karyon.

This got me thinking of if a health check URL returning 200 (OK) would be a sufficient replacement for deciding on service registration status.  Specifically if healthcheck returns OK for three or so checks, have the sidecar put the service into service discovery as "up".  Similarly if three or so checks return != 200.

I started up a twitter question on this idea and received great feedback from Spencer Gibb.  His example was a service that needed to do database migration before starting up.  In that case, while the service is healthy, until the service is up it shouldn't tell others that it was ready to handle requests.  This is especially true if the health manager of your cluster is killing off instances that aren't "healthy", so you can't solve the issue as just reporting "unhealthy" until the service is ready to handle requests.

This said, if a sidecar is to decide on when a service should be marked as ready to handle traffic, it would seem to reason that every side managed process needs a separate URL (from health check and/or the main microservice interface) for state of boot of the service.  Also, this would imply the side managed process likely needs a framework to consistently decide on the state to be exposed by that url.  In NetflixOSS that framework is Karyon.

I will keep thinking about this, but I find it hard to understand how a pure sidecar based approach with zero changes to a microservice (without a framework embedded into the side managed process) when a service is really "UP" and ready to handle requests vs. "STARTING" vs. "SHUTTING DOWN", etc.  I wonder if AirBNB asks its service developers to define a "READYFORREQUESTS" url's and that is what they pass as configuration to Nerve?

Thursday, July 24, 2014

Multitenancy models (and frats/sororities, toilets, and kids)

I have had more than a few discussions lately with various IBM teams as we move forward with some of our internal cloud technologies leveraging NetflixOSS technology.  I have found that one of the conversations that is hard to talk through is multitenancy.

Let's set the stage with the definition of multitenancy and how this affects cloud computing.

Wikipedia definition of multitenancy:

"Multitenancy refers to a principle in software architecture where a single instance of the software runs on a server, serving multiple client-organizations (tenants). Multitenancy contrasts with multi-instance architectures where separate software instances (or hardware systems) operate on behalf of different client organizations. With a multitenant architecture, a software application is designed to virtually partition its data and configuration, and each client organization works with a customized virtual application."

Wikipedia further explains multitenancy in the context of cloud computing:

Multitenancy enables sharing of resources and costs across a large pool of users thus allowing for:
centralization of infrastructure in locations with lower costs (such as real estate, electricity, etc.),
peak-load capacity increases (users need not engineer for highest possible load-levels),

utilisation and efficiency improvements for systems that are often only 10–20% utilised.

It seems like everyone comes with their own definition of multitenancy and from what I can see they are all shades of the same definition.  Specifically as you see below the differences are in the "designed to virtually partition its data and configuration" and to what extend that partitioning is possible to be affected by other users.

In an effort to have more meaningful conversations, I propose the following poor analogies based on humans inhabiting space.

The "Big Room"

Consider a really big room with one door.  That door is where all people living in the room enter and exit.  Also, there is a single toilet in the middle of the room.  Also consider that we allow anyone to come in and one of the inhabitants is a bit crazy and loves to run around the room at full speed randomly bouncing off the walls.  While this "big room" is multi-tenant (more than one human could live there) it doesn't well partition or protect inhabitants from each other.  Also, the use of the toilet (a common resource) might be a bit more than embarrassing to say the least.  I think most people I have talked to would consider this environment to lack even the weakest definitions of multi-tenancy.

The "Doorless Single Family Home"

Consider a typical North American single family home but remove all the internal doors.  In this new analogy, we might still have crazy inhabitants (I have two and they are called kids).  We can start to partition them from the rest of us by putting then in a door and they only once and a while escape into the areas affecting others.  Now the toilet and other shared resources are easier to use safely, but still not as safe as you'd want.  One other big change is the type of inhabitants and their ability to share.  In a family, likely they all have semi-common goals and one won't destroy common resources (the toilet) and if they do the family works to ensure that it doesn't happen again.  Finally, there is benefit of this family of living together as they likely share their services freely.

The improved "Single Family Home with Doors".

Consider the previous example, but add back in the typical doors - doors that can be opened and closed, but likely not locked.  Now our private moments are improved.  Also, the crazy kids won't bounce out of their rooms as frequently.  The doors are there to help mistaken bad interactions, but the doors can be opened freely to help achieve family goals more quickly than with the doors closed.

The "Fraternity/Sorority house"

Continuing the poor analogy, what if we make the inhabitants have similar, but more divergent goals than a single family.  Of course all of the inhabitants of a fraternity or sorority house want to graduate college and they might be working on similar subjects that they could share information and learning amongst themselves, but sometimes you really don't want your co-inhabitant to enter your part of the house.  When that co-inhabitant is drunk (never happens in college, right?), you really would like a locked door between you and them.  The co-inhabitant isn't really meaning to cause you harm, but they could cause you harm none-the-less, so you probably added a locked door just in case.  Ok, I'll admit I now have totally lose the part of the analogy of the toilet, but likely there are still shared resources that the house works to protect and shared responsibly.

The "Apartment Building"

Now we finish the analogies with what I think most people I talk about multitenancy consider from the start.  Consider an apartment where every tenant gets his or her own lockable front door.  Also, all of their toilets or important resources are protected to just them.  The inhabitants don't have any common shared goals.  Therefore, the apartment living conditions make sense.  However, these living conditions can be problematic in two ways.  First, this is a more costly way to live and operate both for each inhabitant and their non-shared resources.  Second, if any of the inhabitants have any shared goals, their lack of internal doors means a much slower communication channel and forward progress will be slower.

The Wrap-Up

Now going back to non-analogy aspect.  Many of the NetflixOSS projects (Eureka/Asgard/etc) come from a model that I think best is described by "Doorless Single Family Home".  There is nothing wrong with that for the type of organization Netflix is and likely when deployed inside of Netflix there are more doors added beyond the public OSS.  At IBM, in our own usage I believe we need at least "Single Family Home with Doors" mostly to add some doors that protect us from new users of the cloud technology from accidentally impacting others.  Some have argued that we need the "Fraternity/Sorority house" adding in locked doors until people are confident that people won't even with unlockable doors impact others.  Adding locking of doors means things like adding owner writable only namespaces to Eureka, locking down which clusters can be changed in Asgard, providing segmented network VLAN's, etc.  Finally, if we ever looked to run this fabric across multiple IBM customers (say Coke and Pepsi), it is hard to argue that we wouldn't need the full "Apartment Building" approach.

I hope this helps others in discussing multitenancy.  I hope my own team won't get tired of these new analogies.

Friday, June 27, 2014

How is a multi-host container service different from a multi-host VM service?

Warning:  I am writing this blog post without knowing the answer to the question I am asking in the title.  I am writing this post to force myself to articulate a question I've personally been struggling with as we move towards what we all want - containers with standard formats changing how we handle many cases in the cloud.  Also, I know there are folks that have thought about this for FAR longer than myself and I hope they comment or write alternative blogs so we can all learn together.

That said, I have seen throughout the time leading up to Dockercon and since what seems to be divergent thoughts that when I step back aren't so divergent.  Or maybe they are?  Let's see.

On one hand, we have existing systems on IaaS clouds using virtual machines that have everything controlled by API's with cloud infrastructural services that help build up a IaaS++ environment.  I have specifically avoided using the word PaaS as I define PaaS as something that tends to abstract IaaS to a point where IaaS concepts can't be directly seen and controlled.  I know that everyone doesn't accept such a definition of PaaS, but I use it as a means to help explain my thoughts (please don't just comment exclusively on this definition as it's not the main point of this blog post).  By IaaS++ I mean an environment that adds to IaaS offering services like continuous delivery workflows, high availability fault domains/automatic recovery, cross instance networking with software defined networking security, and operational visibility through monitoring.  And by not calling it PaaS, I suggest that the level of visibility into this environment includes IaaS concepts such as (VM) instances through ssh or other commonly used *nix tools, full TCP network stack access, full OS's with process and file system control, etc.

On the other hand, we have systems growing around resource management systems and schedulers using "The Datacenter as a Computer" that are predominantly tied to containers.  I'll admit that I'm only partially through the book on the subject (now in 2nd edition).  Some of the systems in open source to implement such datacenter as the computer/warehouse scale machines are Yarn (for Hadoop), CoreOS/Fleet, Mesos/Marathon and Google Kubernetes.

At Dockercon, IBM (and yours truly) demoed a Docker container deployment option for the IBM SoftLayer cloud.  We used our cloud services fabric (partially powered by NetflixOSS technologies) on top of this deployment option as the IaaS++ layer.  Given IBM SoftLayer and its current API doesn't support containers as a deployment option, we worked to implement some of ties to the IaaS technologies as part of the demo reusing the Docker API.  Specifically, we showcased an autoscaling service for automatic recovery, cross availability zone placement, and SLA based scaling.  Next we used the Docker private registry along side the Dockerhub public index for image management.  Finally we did specific work to natively integrate the networking from containers into the SoftLayer network.  Doing this networking work was important as it allowed us to leverage existing IaaS provided networking constructs such as load balancers and firewalls.

Last night I watched the Kubernetes demo at Google I/O by Brendan Burns and Craig McLuckie.  The talk kicks off with an overview of the Google Compute Engine VM optimized for containers and then covers the Kubernetes container cluster management open source project which includes a scheduler for long running processes, a labeling system that is important for operational management, a replication controller to scale and auto recover labeled processes, and a service abstraction across labeled processes.

I encourage you to watch the two demo videos before proceeding, as I don't want to force you into thinking only from my conclusions.  Ok, so now that you've watched the videos yourself, let me use the two videos to look at use case comparison points (the links now jump to the right place in each video that are similar):

Fast development and deployment at scale

Brendan demonstrated rolling updates on the cloud.  In the IBM demo, we showed the same, but as an initial deployment on a laptop.  As you see later in the demo, due to the user of Docker, running on the cloud is exactly the same as the laptop.  Also, the IBM cloud services fabric devops console - NetflixOSS Asgard also has the concept of rolling updates as well as the demonstrated initial deployment.  Due to Docker, both demos use essentially the same approach to image creation/baking.

Automatic recovery

I like how Brendan showed through a nice UI the failure and recovery as compared to me watching log files of the health manager.  Other than presentation, the use case and functionality was the same.  The system discovered a failed instance and recovered it.

Service registration

Brendan talked about how Kubernetes offers the concept of services based on tagging.  Under the covers this is implemented by a process that does selects against the tagged containers updating an etcd service registry.  In the cloud services fabric demo we talked about how this was done with NetflixOSS Eureka in a more intrusive (but maybe more app centric valuable) way.  I also have hinted about how important it is to consider availability in your service discovery system.

Service discovery and load balancing across service implementations

Brenda talked about in Kubernetes how this is handled by, currently, a basic round robin load balancer.  Under the covers each Kubernetes node starts this load balancer and any defined service gets started on the load balancer across the cluster with information being passed to client containers via two environment variables, one for the address for the Kubernetes local node load balancer, and one for the port assigned to a specific service.  In the cloud services fabric this is handled by Eureka enabled clients (for example NetflixOSS Ribbon for REST), which does not require a separate load balancer and is more direct and/or the similar NetflixOSS Zuul load balancer in cases where the existing clients can't be used.

FWIW, I haven't seen specifically supported end to end service registration/discovery/load balancing in non-Kubernetes resource managers/schedulers.  I'm sure you could build something similar on top of Mesos/Marathon (or people already have) and CoreOS/etcd, but I think Kubernetes concept of labels and services (much like Eureka) are right in starting to integrate the concept of services into the platform as they are so critical in microservices based devops.

I could continue to draw comparison points for other IaaS++ features like application centric metrics, container level metrics, dynamic configuration management, other devops workflows, remote logging, service interaction monitoring, etc, but I'll let that to the reader.  My belief is that many of these concepts will be implemented in both approaches, as they are required to run an operationally competent system.

Also, I think we need to consider tougher points like how this approach scales (in both demos, under the covers networking was implemented via a subnet per Docker host, which wouldn't necessarily scale well), approach to cross host image propagation (again, both demos used a less than optimal way to push images across every node), and integration with other important IaaS networking concepts (such as external load balancers and firewalls).

What is different?

The key difference that I see in these systems is terminology and implementation.

In the IBM demo, we based the concept of a cluster on what Asgard defines as a cluster.  That cluster definition and state is based on multiple separate, but connected by version naming, auto scaling groups.  It is then, the autoscaler that decides placement based on not only "resource availability", but also high availability (spread deployments across distinct failure domains) and locality policies.  Most everyone is available with the concept of high availability in these policies in existing IaaS - in SoftLayer we use Datacenters or pods, in other clouds the concept is called "availability zones".  Also, in public clouds, the policy for co-location is usually called "placement groups".

Marathon (a long running scheduler on top of the Mesos resource manager), offers these same concepts through the concept of constraints.  Kubernetes today doesn't seem, today, to offer these concepts likely due to its initial focus on smaller scenarios.  Given its roots in Google Omega/Borg, I'm sure there is no reason why Kubernetes couldn't eventually expose the same policy concepts within its replication controller.  In fact, at the end of the Kubernetes talk, there is a question from the crowd on how to make Kubernetes scale across multiple Kubernetes configurations which could have been asked from a more high-availability.

So to me, the concept of an autoscaler and its underlying implementation seems very similar to the concept of a resource manager and scheduler.  I wonder if public cloud auto scalers were open sourced if they would be called resource managers and long running schedulers?

The reason why I ask all of this is as we move forward with containers, I think we might be tempted to build another cloud within our existing clouds.  I also think the Mesos and Kubernetes technologies will have people building clouds within clouds until cloud providers natively support containers as a deployment option.  At that point, will we have duplication of resource management and scheduling if we don't combine the concepts?  Also, what will people do to integrate these new container deployments with other IaaS features like load balancers, security groups, etc?

I think others are asking the same question as well.  As shown in the IBM Cloud demo, we are thinking through this right now.  We have also experimented internally with OpenStack deployments of Docker containers as the IaaS layer under a similar IaaS++ layer.  The experiments led to a similar cloud container IaaS deployment option leveraging existing OpenStack approaches for resource management and scheduling as compared to creating a new layer on top of OpenStack.  Also, there is a public cloud that has likely considered this a long time ago - Joyent.  Joyent has had SmartOS zones which are similar to containers under its IaaS API for a long time without the need to expose the formal concepts of resource management and scheduling to its users.  Also, right at the end of the Kubernetes demo, someone in the crowd asks the same question.  I took this question to ask, when will the compute engine support container deployment this way without having a user setup their own private set of Kubernetes systems (and possibly not have to consider resource management/scheduling with anything more than policy).

As I said in the intro, I'm still learning here.  What are your thoughts?