Tuesday, August 14, 2012

Measuring performance of platform focused private clouds

Yesterday I had the opportunity to present to some of our partners what we're doing to measure the performance of our Pure Application System, and I wanted to write down what I presented.

Previously, Jerry Cuomo blogged on how we are looking at benchmarking clouds. He covered the concepts of time to deploy, density, elastic scale, resiliency, runtime performance and time to genesis. While I think all of these matter, what has become clear is that they matter in different ways when we're looking at the kinds of clouds that Pure Application Systems enables.

Due to being platform focused (vs. infrastructure focused)

When looking at the performance of setting up a typical web-to-database solution, you must consider the entire platform that is part of "deployment". If you consider a typical configuration of this sort of application that provides elastic scaling and high availability, the "application" could easily involve up to ten servers (multiple application servers, session state / caching offload servers, load balancers, databases, etc.). When one of these "applications" gets started, each of these servers needs to be configured, interlinked, and started in concert. It does no good to allow your application server to handle requests before the database or session caching server is known and interconnected. In measuring time to deploy, the end-to-end solution must be considered. We are looking at the time from when a user clicks deploy on their application until the *entire* set of servers is ready to handle the first request to the entire platform. In contrast, infrastructural benchmarks look only at the time to get a single virtual machine up and running. Typically such measurements don't consider the time for the servers on top of the VMs to be ready to handle a request, and certainly don't consider the time it takes for the group of servers in a more complex platform to be ready to work together.
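To make the difference between the two stopwatches concrete, here is a minimal sketch in Python. It is not how Pure Application System measures anything internally. The component names, the timings, and the `ComponentTimeline` type are all made up for illustration. It only shows that the platform-level number is driven by the last server to become ready, while a VM-only view stops the clock earlier.

```python
from dataclasses import dataclass


@dataclass
class ComponentTimeline:
    """Hypothetical timeline for one server in a deployed "application".

    Both values are seconds elapsed since the user clicked deploy.
    """
    name: str
    vm_running_s: float  # the virtual machine is up
    ready_s: float       # the server on that VM is configured, linked, and serving


def vm_only_time(components):
    """Infrastructure-style view: stop the clock when the last VM is running."""
    return max(c.vm_running_s for c in components)


def platform_time_to_deploy(components):
    """Platform-style view: stop the clock when *every* server is ready."""
    return max(c.ready_s for c in components)


# Illustrative numbers only, not measurements from any real system.
deployment = [
    ComponentTimeline("load-balancer", vm_running_s=60.0, ready_s=120.0),
    ComponentTimeline("app-server-1", vm_running_s=75.0, ready_s=300.0),
    ComponentTimeline("app-server-2", vm_running_s=80.0, ready_s=310.0),
    ComponentTimeline("session-cache", vm_running_s=70.0, ready_s=200.0),
    ComponentTimeline("database", vm_running_s=90.0, ready_s=410.0),
]

print("VM-only stopwatch:  ", vm_only_time(deployment), "s")
print("Platform stopwatch: ", platform_time_to_deploy(deployment), "s")
```

With these invented numbers the VM-only stopwatch reads 90 seconds, while the platform is not actually ready for its first request until 410 seconds, when the database comes online.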

The Pure Application System provides common multi-tenant services for things like session caching and elastic load balancing. Such services aren't usually considered by infrastructural performance measurements, yet they are just as critical as the non-shared platform components. Given that these shared services are multi-tenant, careful performance work must be done to ensure they scale across applications within the platform while providing resiliency and isolation.

Due to being private cloud focused (vs. public)

There exist many frameworks to measure the performance of public clouds (EC2, Rackspace, etc.). However, due to the opaqueness of using a public cloud, these frameworks can't focus on key aspects that can be analyzed in private cloud offerings.

Resiliency is one reason why people consider private vs. public clouds. With resiliency, you want to consider not only high availability/failover scenarios, but also scenarios that ensure performance is maintained in the face of "bad" applications that end up being co-located on a multi-tenant cloud. "Bad" applications could be intentionally bad applications (where isolation for security is key) or unintentionally bad applications (where isolation for common resource protection is key). On a public cloud, such resiliency would be hard to test, as it's impossible to force co-location of workloads. With a private cloud it's far easier to force scenarios that test resiliency. This isn't to say that resiliency and isolation aren't important to measure on public clouds, but it should be clear that they are far easier to ensure on private clouds.
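One simple way to put a number on this kind of co-location test is to compare a well-behaved application's throughput when it runs alone against its throughput when a "bad" application is deliberately placed next to it. The sketch below is only an illustration of that idea. The function name, the requests-per-second inputs, and the sample values are assumptions of mine, not a metric defined by Pure Application System or by Jerry's post.

```python
def throughput_retained(baseline_rps, colocated_rps):
    """Fraction of baseline throughput kept while a "bad" neighbor is running.

    baseline_rps:  requests/second for the application running alone
    colocated_rps: requests/second for the same application when a
                   misbehaving application is forced onto the same cloud
    A value near 1.0 suggests good isolation; lower values suggest the
    neighbor is taking shared resources.
    """
    if baseline_rps <= 0:
        raise ValueError("baseline_rps must be positive")
    if colocated_rps < 0:
        raise ValueError("colocated_rps cannot be negative")
    return colocated_rps / baseline_rps


# Illustrative numbers only, not measurements from any real system.
retained = throughput_retained(baseline_rps=1000.0, colocated_rps=900.0)
print(f"Throughput retained under co-location: {retained:.0%}")
```

The point of forcing co-location on a private cloud is that both runs can be repeated under the same placement, which is what makes a comparison like this meaningful.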

With private clouds, you are investing in infrastructure on site. Because of this, ensuring that the system is fully optimized for the platform being provided is critically important. In a public cloud you pay for an instance, and you can't tell how optimal that instance is during its own operations or when co-located with other instances. In a private cloud you'll know very clearly how many applications and management operations can be supported per unit of on-site infrastructure (density). Therefore, when comparing private cloud systems, it's very important to ensure you are getting the most optimized software and hardware stack, top to bottom. This has been one of the top focus areas of Pure Application Systems.

With private clouds, the owner is in charge of getting them ready to handle requests. Jerry talked about time to genesis in his blog. This measures the time to get the system from the power-on state to the point where the first deployment is available. Any operations that occur in this phase not only take time, but also introduce the possibility of mistakes that could negatively impact the system well into the operational life of the cloud. While this isn't a typical "performance measurement", it is very important in private cloud infrastructures. As Jerry stated, you have to be very careful about "when to stop the watch". With Pure Application Systems we stop the clock when the platform is ready for applications to be deployed. I have seen other measurements (again related to being platform focused vs. infrastructure focused) that stop the clock when the operating systems or virtual machines are started. You can see our timed genesis process already in the video on the Pure Application Systems homepage.