Tuesday, June 10, 2014

Docker SoftLayer Cloud Talk at Dockercon 2014

The overall concept



Today at Dockercon, Jerry Cuomo went over the concept of borderless cloud and how it relates to IBM's strategy.  He talked about how Docker helps erase the lines between various clouds with regard to openness, and how, regardless of vendor, deployment option, and location, we need to focus on the following things:

Fast


In the age of devops and continuous delivery, lack of speed is a killer.  Even worse, and frankly unforgivable, is having manual steps that introduce error.  Docker helps here with layered file systems that allow only the changed layers to be pushed and loaded.  Also, with its process model, containers start as fast as you'd expect your applications to start.  Finally, Docker's image description model is transparent all the way to source, which guarantees you run what you coded, with no mismatch between dev and ops.
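
As a rough illustration of these points, here is a minimal sketch of the kind of image description and update flow being described (the file names and registry host are made up for this example, not taken from our project):

cat > Dockerfile <<'EOF'
FROM ubuntu:14.04
# only this layer changes when the application is rebuilt
ADD acmeair-webapp.war /opt/app/acmeair-webapp.war
CMD ["/opt/app/start.sh"]
EOF

docker build -t registry.example.com/acmeair-webapp .
# a rebuild pushes and pulls only the changed layers
docker push registry.example.com/acmeair-webapp

Because the Dockerfile is plain source, anyone can see exactly how the image that runs in production was produced.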

Optimized


Optimized means not only price/performance but also optimizing where workloads are placed.  In the price/performance area, IBM technologies (like IBM Java read-only memory class sharing) can provide much faster application startup and lower memory usage when similar applications are run on a single node.  Also, getting the hypervisor out of the way can help I/O performance significantly (still a large challenge in VM-based approaches), which benefits data-oriented workloads like Hadoop and databases.
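
As one concrete, hypothetical example of the class sharing point, IBM's J9 JVM can share read-only class data between JVMs on the same node through a named shared class cache; the application names below are illustrative, not a tested configuration:

# two similar Java applications on one node sharing a read-only class cache
java -Xshareclasses:name=acmeair -jar webapp.jar &
java -Xshareclasses:name=acmeair -jar authservice.jar &

The second JVM reuses the cache populated by the first, which lowers startup time and per-process memory for similar applications on the same node.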

Open


Openness of cloud is very important to IBM, just as it was for Java and Unix/Linux.  Docker can provide the same write once, run anywhere experience for cloud workloads.  It is interesting how this openness, combined with being fast and small, also allows for advances in devops not possible before with VMs.  It is now possible to run production-like workload configurations on premise (and on developers' laptops) in almost exactly the same way as they are deployed in production, due to the reduction in overhead vs. running a full virtual machine.

Responsible


Moving fast isn't enough.  You have to move fast with responsibility.  Specifically, you need to make sure you don't ignore security, high availability, and operational visibility while moving so fast.  With the automated and repeatable deployment possible with Docker (and related scheduling systems), combined with micro-service application design, high availability and automatic recovery become easier.  Also, enterprise deployments of Docker will start to add to the security and operational visibility capabilities.

The demo - SoftLayer cloud running Docker



After Jerry covered these areas, I followed up with a live demo.

On Monday, I showed how the technology we've been building to host IBM public cloud services, the Cloud Services Fabric (CSF), works on top of Docker.  We showed how the kernel of the CSF, based in part on NetflixOSS and powered by IBM technologies, is fully open source and easily runs on a developer's laptop.  I talked about how this even allows developers to Chaos Gorilla test their micro-service implementations.

I showed how building the sample application and its microservice was extremely fast.  Building an update to the war file took more time than containerizing the same war for deployment; both were done in seconds.  While we haven't done it yet, I could imagine eventually optimizing this so that container generation happens as part of an IDE auto-compile.


In the demo today, I followed this up by showcasing how we could take the exact same environment and marry it with the IBM SoftLayer public cloud.  I took the exact same sample application container image and, instead of loading it locally, pushed it through a Docker registry to the SoftLayer cloud.  The power of this portability (and openness) is very valuable to our teams, as it allows local testing to mirror production deployment more closely.
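
A hedged sketch of what "pushing through a Docker registry" looks like in practice (the registry hostname and image name here are placeholders, not our actual setup):

# on the laptop / build machine
docker tag acmeair/webapp registry.example.com:5000/acmeair/webapp
docker push registry.example.com:5000/acmeair/webapp

# on a SoftLayer Docker host
docker pull registry.example.com:5000/acmeair/webapp
docker run -d registry.example.com:5000/acmeair/webapp

The layers that run on SoftLayer are the same layers tested locally, which is what makes the local-mirrors-production claim credible.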

Finally, I demonstrated how adding SoftLayer to Docker adds to the operational excellence.  Specifically, I showed how, once we told Docker to use a non-default bridge (assigned a SoftLayer portable subnet attached to the host's private interface), Docker could assign IPs out of a routable subnet within the SoftLayer network.  This networking configuration means that the containers spun up work transparently in the same networks as SoftLayer bare metal and virtual machine instances around the global SoftLayer cloud.  Also, advanced SoftLayer networking features such as load balancers and firewalls work just as well with the containers.  I also talked about how we deployed this across multiple hosts in multiple datacenters (availability zones), further adding to the high availability options for deployment.  To prove this, I unleashed targeted, chaos-army-style testing: I emulated the failure of a container (by doing a docker rm -f) and showed how the overall CSF system auto-recovered by replacing it with a new container.
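
For those curious what the non-default bridge configuration looks like, here is a rough sketch under assumed values (the bridge name, subnet, and gateway address are illustrative; the real portable subnet comes from SoftLayer and is attached to the host's private interface):

# create a bridge and give it the gateway address of the portable subnet
sudo brctl addbr br0
sudo ip addr add 10.120.30.129/26 dev br0
sudo ip link set br0 up

# start the Docker daemon against that bridge instead of docker0
docker -d -b br0 &

# containers now receive routable IPs from the portable subnet
docker run -d acmeair/webapp

With that in place, other SoftLayer instances (bare metal or virtual) can reach the containers directly over the private network, and SoftLayer load balancers and firewalls can point at container IPs like any other host.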

Some links



You can see the slides from Jerry's talk on slideshare.

The video:

Direct Link (HD Version)

Saturday, June 7, 2014

Open Source Release of IBM Acme Air / NetflixOSS on Docker

In a previous blog, I discussed the Docker "local" (on laptop) IBM Cloud Services Fabric powered in part by NetflixOSS prototype.

One big question on twitter and my blog went unanswered.  The question was: how can someone else run this environment?  In the previous blog post, I mentioned that there was no plan to make key components open source at that point in time.

Today, I am pleased to announce that all of the components to build this environment are now open source and anyone can reproduce this run of IBM Acme Air / NetflixOSS on Docker.  All it takes is about an hour, a decent internet connection, and a laptop with VirtualBox (or boot2docker, or vagrant) installed.

Specifically, the aspects that we have added to open source are:

  1. Microscaler - a small-scale instance health manager and auto recovery/scaling agent that works against the Docker remote API.  Specifically, we have released the Microscaler service (which implements a REST API), a CLI to make calling Microscaler easier, and a Microscaler agent that is designed to manage clusters of Docker nodes.
  2. The Docker port of the NetflixOSS Asgard devops console.  Specifically we ported Asgard to work against the Docker API for managing IaaS objects such as images and instances as well as the Microscaler API for clusters.  The port handles some of the most basic CRUD operations in Asgard.  Some scenarios (like canary testing, red/black deployment) are yet to be fully implemented.
  3. The Dockerfiles and build scripts that enable anyone to build all of the containers required to run this environment.  The Dockerfiles build containers for Microscaler, the NetflixOSS infrastructural servers (Asgard, Eureka, and Zuul), and the full microservices sample application Acme Air (web app, microservice, and Cassandra data tier).  The build scripts help you build the containers and provide easy commands for end-to-end deployment and common administration tasks (a sketch of the build flow follows this list).
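
Purely as an illustration of what the build scripts automate, the flow amounts to something like the loop below; the directory layout and image names are invented for this sketch and are not the project's actual naming:

# build each image from its Dockerfile (names and paths are hypothetical)
for img in microscaler asgard eureka zuul webapp authservice cassandra; do
  docker build -t acmeair/$img dockerfiles/$img
done
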
If you want to understand what this runtime showcases, please refer to the previous blog entry.  There is a video that shows the Acme Air application and basic chaos testing that proves the operational excellence of the environment.

An interesting comparison:


It is interesting to note that the scope of what we released (the core of the NetflixOSS cloud platform plus the Acme Air cloud sample/benchmark application) is similar to what we previously released for the Netflix Cloud Prize in the form of Amazon EC2 AMIs.  I think it is interesting to consider the difference in using Docker as our portable image format in this release.  Using Docker, I was able to easily release the automation for building the images (the Dockerfiles) in source form, which makes the images far more transparent than an AMI in the Amazon marketplace.  Also, the containers built can be deployed anywhere that Docker containers can be hosted.  Therefore, this project is going to be valuable to far more than a single cloud provider -- likely more on that later as Dockercon 2014 happens next week.

If you want to learn how to run this yourself, check out the following video.  It shows building the containers for open source, starting an initial minimal environment and starting to operate the environment.  After that go back to the previous blog post and see how to perform advanced operations.


Direct Link (HD Version)





Friday, May 16, 2014

How intrusive do you want your service discovery to be?

In working on the Acme Air NetflixOSS Docker local implementation, we ended up having two service discovery mechanisms (Eureka and SkyDNS).  This gave me a concrete place to start pondering issues that have come up inside of IBM relating to service discovery.  Specifically, the use of Eureka has been called "intrusive" on application design, as it requires application changes to enable service registration and service query/location when load balancing.  This blog post aims to start a discussion on the pros and cons of service discovery being "intrusive".



First (at the top of the picture), we had the NetflixOSS-based Eureka.  The back-end microservice (the auth service) would, as part of its Karyon bootstrapping, make a call to register itself with the Eureka servers.  Then, when the front-end web app wanted to call the back-end microservice via Ribbon with client-side load balancing, it would do so based on information about service instances gained by querying the Eureka server (something Ribbon has native support for).  This is how NetflixOSS-based service discovery has worked on Amazon, in our port to the IBM Cloud (SoftLayer), and in our Docker local port.
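
To make the "intrusive" part concrete, the shape of the interaction is roughly the following, per the Eureka REST API described on the Eureka wiki (the hostname, app name, and payload file here are placeholders):

# the auth service registers itself at startup (Karyon does this for us in Java)
curl -X POST http://eureka.local.flyacmeair.net/eureka/v2/apps/AUTH-SERVICE \
     -H "Content-Type: application/json" -d @instance-info.json

# the web app (via Ribbon) asks Eureka for the current auth service instances
curl http://eureka.local.flyacmeair.net/eureka/v2/apps/AUTH-SERVICE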

Next, we had SkyDNS and Skydock (at the bottom of the picture).  We used these to provide DNS naming between containers.  Interestingly, we used SkyDNS to tell clients how to locate Eureka itself.  We also used it to have clients locate services that weren't Eureka enabled (and therefore not locatable through Eureka), such as Cassandra and our auto scaling service.  Using Skydock, containers started from an image named "eureka" would resolve from other containers via the simple hostname "eureka.local.flyacmeair.net" (we used "local" as the environment and "flyacmeair.net" as the domain name).  Similarly, cass images registered as cass.local.flyacmeair.net.  Skydock works by registering with the Docker daemon's event API, so it sees when containers start and stop (or die).  Based on these events, Skydock registers the container into SkyDNS on behalf of the container.  Skydock also periodically queries the running containers on a host and updates SkyDNS with a heartbeat to keep the DNS entries from timing out.
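
The non-intrusive side is just name resolution; any container (or the host) can resolve the registered names with ordinary DNS tooling, for example:

# resolve the Eureka and Cassandra containers registered by Skydock
nslookup eureka.local.flyacmeair.net
dig +short cass.local.flyacmeair.net

No application code changes are required for this, which is exactly the appeal.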

Before I go into comparing these service discovery technologies, let me say that each did what it was intended to do well. Eureka gave us very good application level service discovery, while Skydock/SkyDNS gave us very good basic container location.

If you compare these approaches, roughly:
  1. Skydock and the Eureka client (registration) are similar in that both perform the registration and heartbeating for service instances
  2. SkyDNS and Eureka server are similar in that both host the information about the service instances
  3. DNS offered by SkyDNS and Eureka client (query) are similar in that both provide lookup to clients that can load balance across instances of a service
One of the biggest differences between these approaches is that Eureka is specifically included in the service instance (above the container or VM line in IaaS) and it is up to the service instance to use it as part of its implementation, while Skydock is outside the scope of the service instance (and application code).  To be fair to SkyDNS, it doesn't necessarily have to be called in the way Skydock calls it.  Someone could easily write code like the Eureka client that stored its data in SkyDNS instead of Eureka, without using Skydock.  However, the real comparison I'm trying to make is service registration that is "intrusive" (on instance) vs. "not intrusive" (off instance).

One interesting aspect of moving service registration out of the application code or below the container/VM boundary line is that there is no application knowledge at this layer.  As an example, Karyon is written to only call the Eureka registration for the auth service once all bootstrapping of the application is done and the application is ready to receive traffic.  In the case of Skydock, the registration with SkyDNS occurs as soon as the container reports that the process has started.  If any initialization were required in the service, clients could find out about the service, and it could start receiving requests, before that initialization completed and the service was ready at the application level to handle them.

Similar to initial service registration, a service registration client outside of the application code or below the container/VM boundary cannot know true instance health.  If the VM/container is running a servlet and the application is throwing massive errors, there is no way for Skydock to know this.  Therefore Skydock will happily keep sending heartbeats to SkyDNS, which means requests will keep flowing to an unhealthy instance.  Alternatively, with Eureka and Karyon's integrated health management, the instance can stop sending heartbeats as soon as the application code deems itself unhealthy, regardless of whether that container/VM is still running.

Next let's focus on SkyDNS itself and its query and storage.  SkyDNS picked DNS for each of these to lessen the impact on client applications which is a good thing when your main concern is lack of "intrusive" changes in your client code.

SkyDNS helps you avoid recoding your clients by exposing service queries through standard DNS.  While I think this is beneficial, DNS to my mind wasn't designed to support an ephemeral cloud environment.  It is true that SkyDNS has TTLs, and heartbeats can effectively keep them much smaller than the "internet"-facing TTLs typical of standard DNS servers.  However, it is well known that there are clients that don't correctly expire TTLs in their caches.  Java is notorious for ignoring TTLs unless you change the JVM security properties, since lower TTLs open you up to DNS spoofing attacks.  Eureka, on the other hand, forces the clients to use Eureka re-querying and load balancing (either through custom code or through Ribbon abstractions) that is aware of the Eureka environment and service registration timeouts.
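
As a concrete example of the Java caching issue, the JDK's DNS cache is controlled by the networkaddress.cache.ttl security property (the property name is from the JDK networking documentation; the exact file path varies by JDK layout, and whether you should lower it is a security trade-off):

# example only: bound the JVM's positive DNS lookup cache to 30 seconds
echo "networkaddress.cache.ttl=30" >> $JAVA_HOME/jre/lib/security/java.security

Without a change like this, low SkyDNS TTLs have little effect on a long-running Java client.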

Next, SkyDNS stores the information about service instances in DNS SRV records.  Using a combination of SRV records and parts of the hostname and domain used in lookup, SkyDNS stores the following information: name of service, version of service, environment (prod/test/dev/etc.), region of service, host, port, and TTL.  While DNS SRV records are somewhat more service oriented (they add things to DNS that wouldn't typically be there for host records, like service name and port), they do not cover all of the things that Eureka allows to be shared for a service.  Eureka's InstanceInfo carries more than the service attributes provided by SkyDNS.  Some examples are important URLs (status page, health check, home page), secure vs. non-secure port, instance status (UP, DOWN, STARTING, OUT_OF_SERVICE), a metadata bag per application, and datacenter info (image name, availability zone, etc.).  While SkyDNS does a good job of using DNS SRV records, it has to go pretty far into domain name paths to add as much information as required on top of DNS.  Also, the extended attributes that exist in Eureka but not in SkyDNS provide key functionality not yet possible in a SkyDNS environment.  Two specific examples are instance status and datacenter info.  Instance status is used in the NetflixOSS environment by Asgard in red/black deployments: Asgard marks service instances as OUT_OF_SERVICE, allowing older clusters to remain in the service registry, but not be stopped, so that rollbacks to older clusters are possible.  The extended datacenter info is especially useful in SoftLayer, as we can share very specific networking details (VLANs, routers, etc.) that can make routing significantly smarter.  In the end, Eureka's custom service domain model provides a more complete description of services than DNS.
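
For reference, an SRV lookup returns the standard priority/weight/port/target tuple; the record name below is only illustrative of the service/environment path encoding described above, not SkyDNS's exact naming scheme:

dig SRV webapp.local.flyacmeair.net
# answer records carry only: priority, weight, port, and a target hostname (plus the record TTL)

Everything beyond name, port, and target has to be squeezed into those fields or into the owner name, which is the limitation being discussed.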

One area where non-intrusive service discovery is cited as a benefit is support of multiple languages/runtimes.  The Skydock approach doesn't care what type of runtime is being registered into SkyDNS, so it automatically works across languages/runtimes.  While Eureka has REST-based interfaces for clients, it is far easier today to use the Eureka Java clients for registration and query (and using higher level load balancers like Zuul and Ribbon makes it even easier).  Equivalent clients for the Eureka REST APIs are not implemented in other languages.  At IBM, we have enabled Eureka to manage non-Java services (C-based servers and NodeJS servers), and we have taken two approaches to make this easier.  First, we have implemented an on-instance (same container or VM) Eureka "sidecar" which provides, external to the main service process, some of the same benefits that Eureka and Karyon provide; we have done this both for Eureka registration and query.  Second, we have started to see users who see value in the entire NetflixOSS (including Eureka) platform implement native Eureka clients for Python and NodeJS.  These native implementations aren't complete at this point, but they could be made more complete.  Between these two options, the "sidecar" approach is a stopgap.  Separating the application from the "sidecar" has some of the same issues (not as bad, but still worse than in-process) mentioned above when considering service registration outside the application.  For instance, you have to be careful about bootstrap (initialization vs. service registration) and health checks; both become more complicated when they have to be synchronized across the service process and the sidecar.  Also, in Docker container based clouds, having a second sidecar process tends to break the single process model, so having the service registration/query in process just fits better.
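
A minimal sketch of what a registration/heartbeat "sidecar" for a non-Java process boils down to, assuming the Eureka REST API paths from its wiki (the service name, instance id, host, and payload file are placeholders):

# register once at startup, after the real service reports itself ready
curl -s -X POST http://eureka.local.flyacmeair.net/eureka/v2/apps/NODE-SERVICE \
     -H "Content-Type: application/json" -d @instance-info.json

# then heartbeat for as long as the service process stays healthy
while kill -0 $SERVICE_PID 2>/dev/null; do
  curl -s -X PUT http://eureka.local.flyacmeair.net/eureka/v2/apps/NODE-SERVICE/$INSTANCE_ID
  sleep 30
done

Keeping the "ready" and "healthy" checks in sync with the real service process is exactly the coordination burden described above, which is why in-process registration fits better.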

One final note: This comparison used SkyDNS and Skydock as the non-intrusive off-instance service registration and query.  I believe this discussion applies to any service registration technology that isn't intrusive to the service implementation or instance.  Skydock is an example of a service registry that is designed to be managed below the container/VM level.  I believe the issues presented in this blog are the reason why an application centric service registry isn't offered by IaaS clouds today.  Until IaaS clouds have a much better way for applications to report their status in a standard way to the IaaS API's, I don't think non-intrusive service discovery will be possible with the full functionality of intrusive and application integrated service discovery.

Interesting Links:

  1. SkyDNS announcement blog post
  2. Eureka wiki
  3. Service discovery for Docker via DNS
  4. Open Source Service Discovery
I do admit I'm still learning in this space.  I am very interested in thoughts from those who have used less intrusive service discovery.

FWIW, I also avoided a discussion of high availability of the deployment of the service discovery server.  That is critically important as well and I have blogged on that topic before.

Monday, May 5, 2014

Cloud Services Fabric (and NetflixOSS) on Docker

At IBM Impact 2014 last week we showed the following demo:

Direct Link (HD Version)


The demo showed an end to end NetflixOSS based environment running on Docker on a laptop.  The components running in Docker containers shown included:
  1. Acme Air Web Application - The front end web application that is NetflixOSS enabled.  In fact, this was run as a set of containers within an auto scaling group.  The web application looks up (in Eureka) ephemeral instances of the auth service micro-service and performs on-instance load balancing via Netflix Ribbon.
  2. Acme Air Auth Service - The back end micro-service application that is NetflixOSS enabled.  In fact, this was run as a set of containers within an auto scaling group.
  3. Cassandra - This was the Acme Air Netflix port that runs against Cassandra.  We didn't do much with the data store in this demo, other than making it into a container.
  4. Eureka - The NetflixOSS open source service discovery server.  The ephemeral instances of both the web application and auth service automatically register with this Eureka service.
  5. Zuul - The NetflixOSS front end load balancer.  This load balancer looks up (in Eureka) ephemeral instances of the front end web application instances to route all incoming traffic across the rest of the topology.
  6. Asgard - The NetflixOSS devops console, which allows an application or micro-service implementer to configure versioned clusters of instances.  Asgard was ported to talk to the Docker remote API as well as the auto scaler and recovery service API.
  7. Auto scaler and recovery service - Each of the instances ran an agent that communicates via heartbeats to this service.  Asgard is responsible for calling APIs on this auto scaler to create clusters.  The auto scaler then called Docker APIs to create instances of the correct cluster size.  Then, if any instance died (stopped heartbeating), the auto scaler would create a replacement instance.  Finally, we went as far as implementing the idea of datacenters (or availability zones) when launching instances by tagging this information in a "user-data" environment variable (run -e) that had an "az_name" field (see the sketch after this list).
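
As a sketch of the availability zone tagging in item 7, launching an instance looked roughly like this; the image name and zone value are made up, while the "user-data"/"az_name" convention is the one described above:

docker run -d -e 'user-data={"az_name": "dal05-az1"}' acmeair/webapp

The auto scaler reads that field so the rest of the tooling can treat it like an availability zone.
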
You can see the actual setup in the following slides:


Docker Demo IBM Impact 2014 from aspyker

Once we had this setup, we could locally test "operational" scenarios on Docker, including the following:
  1. Elastic scalability.  We can easily test whether our services can scale out and automatically be discovered by the rest of the environment and application.
  2. Chaos Monkey.  As shown in the demo, we can test whether killing single instances impacts overall system availability and whether the system auto-recovers a replacement instance.
  3. Chaos Gorilla.  Given we have tagged the instances with their artificial datacenter/availability zone, we can kill all instances within 1/3 of the deployment, emulating a datacenter going away.  We showed this in the public SoftLayer cloud back at dev@Pulse.
  4. Split Brain Monkey.  We can use the same datacenter/availability zone tagging to isolate instances via iptables-based firewalling (similar to Jepsen); a sketch of this kind of targeted chaos follows this list.
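
Here is the kind of targeted chaos referred to above, sketched with made-up container names and IP addresses:

# Chaos Monkey: kill a single instance and let the auto scaler replace it
docker rm -f acmeair_webapp_2

# Split Brain Monkey: isolate one "availability zone" by dropping its traffic
sudo iptables -I FORWARD -s 172.17.0.5 -j DROP
# ... observe the behavior, then remove the rule
sudo iptables -D FORWARD -s 172.17.0.5 -j DROP
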
We want to use this setup to a) help our Cloud Services Fabric users understand the Netflix-based environment more quickly, b) allow our users to do simple localized "operational" tests as listed above before moving to the cloud, and c) use this in our continuous integration/delivery pipelines to do mock testing on an environment closer to production than is possible with bare metal or memory-hungry VM-based setups.  More strategically, this work shows that if clouds supported containers and the Docker API, we could move easily between a NetflixOSS-powered virtual machine approach and a container-based approach.

Some details of the implementation:

The Open Source

Updated 2014/06/09 - This project is now completely open source.  For more details see the following blog entry.

The Auto Scaler and agent

The auto scaler and the on-instance agents that talk to it are working prototypes from IBM Research.  Right now we do not have plans to open source this auto scaler, which makes open sourcing the entire solution impossible.  The work to implement an auto scaler is non-trivial and was a large piece of work.

The Asgard Port

In the past, we had already ported Asgard to talk to IBM's cloud (SoftLayer) and its auto scaler (RightScale).  We extended that porting work to instead talk to our auto scaler and Docker's remote API.  The work was pretty similar and therefore easily achieved in a week or so.

The Dockerfiles and their containers

Other than the aforementioned auto scaler and our Asgard port, we were able to use the latest CloudBees binary releases of all of the NetflixOSS technologies and Acme Air.  If we could get the auto scaler and Asgard port moved to public open source, anyone in the world could replicate this demo themselves easily.  We have a script to build all of our Dockerfiles (15 in all, including some base images) and it takes about 15 minutes on a decent Macbook.  This time is spent mostly in download time and the compile steps for our auto scaler and agent.

Creation of these Dockerfiles took about a week to get the basic functionality.  Making them work with the autoscaler and required agents took a bit longer.

We chose to run our containers as "fuller" OSes vs. single process.  On each node we ran the main process for the node, an ssh daemon (to allow more IaaS-like access to the filesystem), and the auto scaling agent.  We used supervisord to allow for easy management of these processes inside of Ubuntu on Docker.
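
For reference, the supervisord arrangement amounts to a small config of roughly this shape (the paths and program names are illustrative, not our exact files):

cat > /etc/supervisord.conf <<'EOF'
[supervisord]
nodaemon=true

[program:sshd]
command=/usr/sbin/sshd -D

[program:main]
command=/opt/app/start.sh

[program:autoscaler-agent]
command=/opt/agent/run.sh
EOF

With the container's command set to supervisord itself, it becomes the single foreground process Docker watches while the real work happens in its children.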

The Network

We used Eureka-based service location throughout, with no changes to the Eureka registration client.  In order to make this easier for humans (hostnames vs. IPs), we used Skydock and SkyDNS to give each tier of the application its own domain name, using the --dns and --name options when running containers to associate incremental names for each cluster.  For example, when starting two Cassandra nodes, they would show up in SkyDNS as cass1.cassandra.dev.docker and cass2.cassandra.dev.docker.  We also used routing and bridging to make the entire environment easy to access from the hosting laptop.
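
Concretely, starting a tier looked something like this (the SkyDNS container IP and image name are illustrative):

docker run -d --dns 172.17.0.3 --name cass1 cassandra
docker run -d --dns 172.17.0.3 --name cass2 cassandra

With Skydock watching the Docker event stream, those two containers show up in SkyDNS as cass1.cassandra.dev.docker and cass2.cassandra.dev.docker without any registration code in the containers themselves.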

The Speed

The fact that I can start this all on a single laptop isn't the only impressive aspect.  I ran this with VirtualBox set to three gigabytes of memory for the boot2docker VM.  Running the demo spins up the cooling fan, as it requires a good bit of CPU, but in terms of memory it was far lighter than I've seen in other environments.

The really impressive aspect is that in 90 seconds (including a 20 second sleep waiting for Cassandra to peer) I can restart an entire environment, including two auto scaling clusters of two nodes each and the other five infrastructural services.  This includes all the staggered starts required for starting the database, loading it with data, starting service discovery and DNS, starting the auto scaler, defining the clusters to the auto scaler, and the final step of them all launching and interconnecting.

Setting this up in a traditional cloud would have taken at least 30 minutes based on my previous experience.

I hope this explanation will be of enough interest to you to consider future collaboration.  I also hope to get up to Dockercon in June in case you also want to talk about this in person.

The Team

I wanted to give credit where credit is due.  The team of folks working on this included folks across IBM developer and research including Takahiro Inaba, Paolo Dettori, and Seelam Seetharami.


Thursday, April 24, 2014

Accessing docker container private network easily from your boot2docker host

I wrote a blog post a while back about how I was running Acme Air in Docker along with Cassandra.  That setup has become far more complex and I wanted to stop doing port mapping specifically for every container.  Here are the simple steps to get all of the ports and IPs to route cleanly from a hosting system (Mac OS X, but Windows works the same) to all of your Docker containers.

So my setup is:

- Macbook Pro laptop running Mac OS X 10.9.2
- VirtualBox 4.3.10
- Boot2docker 0.8.0
- Docker 0.10.0

To help understand the concept, I'll communicate with a "server" in a container that is listening on a TCP port.  To demonstrate, I'll use the netcat tool listening on port 3333 on a base Ubuntu image.  The goal is to be able to telnet directly to that port from my laptop.  Using netcat is just an example; once this works, any server listening on any port should be just as easy to access.

To help understand the below terminal sessions, my laptop's hostname is "ispyker", my docker vm running on VirtualBox's hostname is "boot2docker" and containers usually have hostnames like "e79e432696f7".

First, let's go ahead and run the netcat/ubuntu container:

ispyker:~ aspyker$ ~/bin/boot2docker ssh
docker@localhost's password: 
                        ##        .
                  ## ## ##       ==
               ## ## ## ##      ===
           /""""""""""""""""\___/ ===
      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
           \______ o          __/
             \    \        __/
              \____\______/
 _                 _   ____     _            _
| |__   ___   ___ | |_|___ \ __| | ___   ___| | _____ _ __
| '_ \ / _ \ / _ \| __| __) / _` |/ _ \ / __| |/ / _ \ '__|
| |_) | (_) | (_) | |_ / __/ (_| | (_) | (__|   <  __/ |
|_.__/ \___/ \___/ \__|_____\__,_|\___/ \___|_|\_\___|_|
boot2docker: 0.8.0
docker@boot2docker:~$ docker run -t -i ubuntu /bin/bash
root@e79e432696f7:/# /sbin/ifconfig eth0 |grep addr
eth0      Link encap:Ethernet  HWaddr 7e:20:1b:29:bb:b6  
          inet addr:172.17.0.2  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::7c20:1bff:fe29:bbb6/64 Scope:Link
root@e79e432696f7:/# nc -l 3333

Now, on another Mac OS terminal:
ispyker:~ aspyker$ telnet 172.17.0.2 3333
Trying 172.17.0.2...
telnet: connect to address 172.17.0.2: Operation timed out
telnet: Unable to connect to remote host

Ok, so let's fix this ...
ispyker:~ aspyker$ ~/bin/boot2docker stop
[2014-04-24 13:19:16] Shutting down boot2docker-vm...

First, we need to open up the VirtualBox application from the Finder.  From the menu, select:

VirtualBox->Preferences->Network->Host-only Networks

Either edit an existing or create a network called "vboxnet0" with the following settings:

Under adapter:

IPv4 Address: 172.16.0.1
IPv4 Network Mask: 255.255.0.0
IPv6 Address: (blank)
IPv6 Network Mask: 0

Under DHCP server:

Uncheck "Enable Server"

Next, right click the "boot2docker-vm" and select:

Settings->Network

Create an Adapter 2 with the following settings:

Check Enable Network Adapter
Attached to: Host-only Adapter
Name: vboxnet0
Advanced:
Adapter Type: Intel Pro/1000 MT Desktop
Promiscuous Mode: Deny
MAC Address: (use the default)
Enable Cable Connected

Save all your settings and let's start back up that netcat/ubuntu container:
ispyker:~ aspyker$ ~/bin/boot2docker up
[2014-04-24 13:27:12] Starting boot2docker-vm...
[2014-04-24 13:27:32] Started.
ispyker:~ aspyker$ ~/bin/boot2docker ssh
docker@localhost's password: 
                        ##        .
                  ## ## ##       ==
               ## ## ## ##      ===
           /""""""""""""""""\___/ ===
      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
           \______ o          __/
             \    \        __/
              \____\______/
 _                 _   ____     _            _
| |__   ___   ___ | |_|___ \ __| | ___   ___| | _____ _ __
| '_ \ / _ \ / _ \| __| __) / _` |/ _ \ / __| |/ / _ \ '__|
| |_) | (_) | (_) | |_ / __/ (_| | (_) | (__|   <  __/ |
|_.__/ \___/ \___/ \__|_____\__,_|\___/ \___|_|\_\___|_|
boot2docker: 0.8.0
docker@boot2docker:~$ docker run -i -t ubuntu /bin/bash
root@1560f377bf4a:/# netcat -l 3333

At this point we still won't be able to "see" this port from Mac OS, as we haven't yet assigned an IP address to the boot2docker VM on the host-only network, nor have we created a route from Mac OS to the Docker container network.

Let's test that to be sure:
ispyker:~ aspyker$ telnet 172.17.0.2 3333
Trying 172.17.0.2...
telnet: connect to address 172.17.0.2: Operation timed out
telnet: Unable to connect to remote host

First, let's add an IP address to the host-only network for this new interface on the boot2docker VM:
ispyker:~ aspyker$ ~/bin/boot2docker ssh
docker@localhost's password: 
                        ##        .
                  ## ## ##       ==
               ## ## ## ##      ===
           /""""""""""""""""\___/ ===
      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
           \______ o          __/
             \    \        __/
              \____\______/
 _                 _   ____     _            _
| |__   ___   ___ | |_|___ \ __| | ___   ___| | _____ _ __
| '_ \ / _ \ / _ \| __| __) / _` |/ _ \ / __| |/ / _ \ '__|
| |_) | (_) | (_) | |_ / __/ (_| | (_) | (__|   <  __/ |
|_.__/ \___/ \___/ \__|_____\__,_|\___/ \___|_|\_\___|_|
boot2docker: 0.8.0
docker@boot2docker:~$ /sbin/ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 08:00:27:DC:5A:BA  
          inet6 addr: fe80::a00:27ff:fedc:5aba/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:11703 (11.4 KiB)

docker@boot2docker:~$ sudo ifconfig eth1 172.16.0.11
docker@boot2docker:~$ sudo ifconfig eth1 netmask 255.255.0.0
docker@boot2docker:~$ /sbin/ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 08:00:27:DC:5A:BA  
          inet addr:172.16.0.11  Bcast:172.16.255.255  Mask:255.255.0.0
          inet6 addr: fe80::a00:27ff:fedc:5aba/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:53 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:15723 (15.3 KiB)


At this point, you should be able to ping your boot2docker VM on its new IP address from your Mac:
ispyker:~ aspyker$ ping -c 1 172.16.0.11
PING 172.16.0.11 (172.16.0.11): 56 data bytes
64 bytes from 172.16.0.11: icmp_seq=0 ttl=64 time=0.349 ms

--- 172.16.0.11 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.349/0.349/0.349/0.000 ms

However, you still can't get to the netcat container port:
ispyker:~ aspyker$ telnet 172.17.0.2 3333
Trying 172.17.0.2...
telnet: connect to address 172.17.0.2: Operation timed out
telnet: Unable to connect to remote host

Now, we'll add the route to the hosting Mac OS:
ispyker:~ aspyker$ netstat -nr |grep 172\.17
ispyker:~ aspyker$ sudo route -n add 172.17.0.0/16 172.16.0.11
Password:
add net 172.17.0.0: gateway 172.16.0.11
ispyker:~ aspyker$ netstat -nr |grep 172\.17
172.17             172.16.0.11        UGSc            0        0 vboxnet
ispyker:~ aspyker$ telnet 172.17.0.2 3333
Trying 172.17.0.2...
Connected to 172.17.0.2.
Escape character is '^]'.
hello container world

If you followed along correctly and typed "hello container world" once telnet connected, "hello container world" should have been printed out in your ubuntu/netcat container.  At this point you should be able to access any container's IP address and ports.  You can get the IP address of any container by running docker inspect [containername] and looking for its 172.17.0.x address.
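
If you don't want to scan the full inspect output, the format option (a Go template) can pull the address directly; substitute your own container name or id for the one shown here:

docker inspect --format '{{ .NetworkSettings.IPAddress }}' e79e432696f7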

Welcome to your easier local host-only fully TCP accessible cloud.
Thanks to Takahiro Inaba for helping put this together.

Wednesday, February 26, 2014

Chaos Gorilla High Availability Tests on IBM Cloud with NetflixOSS

At the IBM cloud conference this week, IBM Pulse, I presented on how we are using the combination of IBM Cloud (SoftLayer), IBM middleware, and the Netflix Open Source cloud platform technologies to operationalize IBM's own public cloud services.  I focused on high availability, automatic recovery, elastic and web scale, and continuous delivery devops.  At Dev@Pulse, I gave a very quick overview of the entire platform.  You can grab the charts on slideshare.

The coolest part of the talk was the fact that it included a live demo of Chaos Gorilla testing.  Chaos Gorilla is a type of chaos testing that emulates an entire datacenter (or availability zone) going down.  While our deployment of our test application (Acme Air) and runtime technologies was already set up to survive such a test, it was our first time doing one.  It was very interesting to see how our system reacted (both the workload itself and the alerting/monitoring systems).  Knowing how this type of failure manifests will help us as we roll this platform and other IBM hosted cloud services into production.  The goal of doing such chaos testing is to prove to yourself that you can survive failure before the failure occurs.  However, knowing how the system operates in this degraded state of capacity is truly valuable as well.

To be fair when compared to Netflix (who pioneered Chaos Gorilla and other more complicated testing), so far this was only killing all instances of the mid-tier services of Acme Air.  It was not killing the data tier, which would have more interesting stateful implications.  With a properly partitioned and memory-replicated data tier that includes failure recovery automation, I believe the data tier would also have survived such an attack, but that is work remaining within each of our services today and will be the focus of follow-on testing.

Also, as noted in a quick review by Christos Kalantzis from Netflix, this was more targeted destruction.  The true Netflix Chaos Gorilla is automated such that it randomly decides which datacenter to kill.  Until we automate Chaos Gorilla testing, we have to pick a specific datacenter to kill.  The application and deployment approach demonstrated is architected in a way that should have worked for any datacenter going down; Dallas 05 was chosen arbitrarily for this targeted testing until we have more advanced automation.

Finally, we need to take this to the next level beyond basic automation.  Failing an entire datacenter is impressive, but it is mostly a "clean failure".  By clean I mean the datacenter goes from 100% available to 0% available.  There is more interesting Netflix chaos testing, like Split Brain Monkey and Latency Monkey, that presents cases of availability that are worse than a perfect system but not as clean as completely gone (0%).  These are also places where we want to continue to test our systems going forward.  You can read more about the entire suite of chaos testing on the Netflix techblog.

Take a look at the following video, which is a recording of the Chaos Gorilla testing.

Direct Link (HD Version)

Monday, February 10, 2014

Acme Air Cloud Native Code Details - Part 1

Previously, I blogged at a high level about what changes were required to get Acme Air ported to be cloud native using IBM middleware, the IBM cloud, and NetflixOSS technology.  I have gotten questions about how to see these code changes in more detail.  In my next few blog posts I'll try to give you code pointers to help you copy the approach.  Note that you could always just go to the two source trees and do a more detailed compare.  The github project for the original is here.  The github project for the NetflixOSS enabled version is here.

First, even before we started on NetflixOSS enabling Acme Air, we worked to show that it could be run at Web Scale. As compared to other desktop and mobile applications, there were two basic things we did to make this possible.

Stateless Design

We focused on making sure every request was independent from the server's perspective.  That would allow us to handle any request on any server.  As classic JEE developers, this was pretty easy except for two commonly stateful aspects of JEE: application-specific transient state and authentication state, both typically stored in session objects.

As a performance guy, the first was pretty easy to avoid.  I've spent years advising clients to avoid storing large application-specific transient state in JEE sessions.  While app servers have spent considerable time optimizing memory-to-memory session state technologies, nothing can help performance when developers start overusing session in their applications.  App server memory-to-memory replication is problematic when you want to move to a web scale implementation, as it is usually server specific vs. user specific and requires semi-static configuration across a small cluster.  Database-backed replication is problematic at web scale as it requires a database that can web scale, which wasn't the case when session replication was designed into most application servers.  Web 2.0 and AJAX have helped with application-specific transient state.  It is now rather straightforward to offload such state storage to the web browser or mobile application.  All you need to do is send the client what it needs to store and let the client send back what is pertinent to each specific request.  Of course, like server side session data, be careful to keep the size of such transient state to the bare minimum to keep request latency in check.  In the case of state that doesn't belong on the client, consider a distributed data cache keyed on the user with time-to-live expiration.  Unlike server side session data, be sure to validate all data sent back to the server from the client.

The second was a bit harder to avoid.  The state of authentication to the services being offered is certainly something we cannot trust the client to supply.  We implemented an approach similar to what we've seen from other online web scale properties.  When a client logs into the application, we generate a token and send it back as a cookie to the client, while storing the token in a web scale data tier designed for low latency - specifically WebSphere eXtreme Scale.  Later, on all REST calls to that service, we expect the client to send back the token as a cookie, and before we call any business logic we check that the token is valid (that we created it and it hasn't timed out).  In JEE we implemented this check as a ServletFilter to ensure it runs before any business logic in the called REST service.  Ours is a security-naive implementation, but recommended authentication protocols would follow a similar approach.

import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class RESTCookieSessionFilter implements Filter {
  // request attribute the downstream REST code reads to know who is logged in
  static final String LOGIN_USER = "acmeair.login_user";
  // session validation service backed by the low latency data tier (initialized elsewhere)
  private CustomerService customerService;

  public void init(FilterConfig config) throws ServletException { }
  public void destroy() { }

  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    HttpServletResponse response = (HttpServletResponse) resp;

    // find the session token cookie that LoginREST handed out at login time
    Cookie cookies[] = request.getCookies();
    Cookie sessionCookie = null;
    if (cookies != null) {
      for (Cookie c : cookies) {
        if (c.getName().equals(LoginREST.SESSIONID_COOKIE_NAME)) {
          sessionCookie = c;
          break;
        }
      }
    }

    String sessionId = "";
    if (sessionCookie != null)
      sessionId = sessionCookie.getValue().trim();

    // check that we created this token and that it hasn't timed out
    CustomerSession cs = customerService.validateSession(sessionId);
    if (cs != null) {
      // session is valid, let the request continue to the business logic
      request.setAttribute(LOGIN_USER, cs.getCustomerid());
      chain.doFilter(req, resp);
    }
    else {
      response.sendError(HttpServletResponse.SC_FORBIDDEN);
    }
  }
}


You can find this code in RESTCookieSessionFilter.java and see it added to the REST chain in web.xml.

As we'll see in follow-on articles the change to have a stateless service tier was important in enabling elastic scaling with auto recovery. Once you free your service from having to remember state, any server instance that exists (or even ones that didn't exist when the first request was handled) can handle the next request.

Partitioned and Scale Out Data Tier Design

In an application we used before Acme Air for web scale experiments, we learned how to design our data storage to be well partitioned.  The previous application worked against a monolithic relational database, and we found that the definition of its data access services wasn't ready for a partitioned NoSQL storage engine.  We learned that to fix this you need to more carefully consider your data model and the impact it has on the data access service interfaces in your application.

The previous application had three aspects, and we learned that it was easy to convert two of the three aspects to a singly keyed set of related data tables.  In these two aspects we further found cases where secondary indexes were used but really were not needed.  Specifically, we found places where we relied on a secondary index when the primary key was known but not passed down to the data service, because the data service tier was designed assuming index scans and joins were possible and efficient when running on a single database system.  We were able to remedy this problem by redesigning our data access services to always require callers to provide the primary key.  You can see this design reflected in Acme Air's data service interfaces.  Note that you will always see the partition key for each data type passed on all data service calls.

For the third aspect of the previous application, there was data that needed to be queried by two different keys (consider bookings that need to be queried by passenger id as well as by flight instance id).  In partitioned data storage there are two ways to deal with this.  You can store the data once by one key and implement a user-maintained index for the second key.  Alternatively, you can duplicate the data (denormalizing based on which type of access is required).  The duplication approach (as compared to a user-maintained index) trades extra storage for a guaranteed single network access for the data regardless of which key is being queried.

In Acme Air currently, we have implemented carefully keyed partitioned data storage.  We haven't yet added operations that would key the same data differently.  If we added a view of the booking data per flight as required by gate agents (so far bookings are purely accessed by passenger), it would require we implement user maintained indexes or duplicate data.

Much as there were side benefits to the stateless mid-tier design, there is an elastic scaling and auto recovery benefit to using a purely partitioned, non-monolithic data tier.  Once the data service holding your data is partitioned, NoSQL technologies can easily replicate and move that data around within a data cluster as required when instances fail or the tier needs to scale out.

In the next post, I'll move on from this foundation of stateless mid tier services and partitioned / scale out data tier to cover topics such as breaking the application into micro-services and containers that make highly available instance registration and cross instance routing easier.