Monday, July 8, 2013

Web Scale with Acme Air (Project Scale) - Part Two

In the last blog post, I talked about why we created a new workload called Acme Air and the fundamental functional considerations behind the performance and scaling work of the benchmark we internally called Project Scale.

Now I want to share results.  You can view the summary report of one of the large scale runs we talked about at Impact 2013.  This report was created by a project called "acmeair-reporter" (not yet open sourced) that gathers the log files from all of the instances and processes them to calculate throughput, number of errors, latency, and per-node performance metrics.
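
Since acmeair-reporter itself is not yet open sourced, here is a minimal sketch of the kind of post-processing it does, assuming JMeter-style CSV result logs where each line carries a timestamp, an elapsed time, and a success flag.  The column layout, file locations, and class name are illustrative assumptions, not the actual tool.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Minimal sketch of a results post-processor.  It assumes JMeter-style CSV logs
// where each line is "timestampMillis,elapsedMillis,label,responseCode,success".
// The column layout, directory layout, and class name are illustrative only.
public class ResultsSummarizer {

    public static void main(String[] args) throws IOException {
        List<long[]> samples = new ArrayList<>();   // each entry: {timestamp, elapsed}
        long errors = 0;

        // Walk a directory of per-driver result files and merge all samples.
        try (Stream<Path> paths = Files.walk(Paths.get(args[0]))) {
            List<Path> csvFiles = paths.filter(p -> p.toString().endsWith(".csv"))
                                       .collect(Collectors.toList());
            for (Path file : csvFiles) {
                for (String line : Files.readAllLines(file)) {
                    String[] cols = line.split(",");
                    if (cols.length < 5 || cols[0].equals("timeStamp")) continue; // skip headers
                    samples.add(new long[] { Long.parseLong(cols[0]), Long.parseLong(cols[1]) });
                    if (!Boolean.parseBoolean(cols[4])) errors++;
                }
            }
        }

        // Throughput = total samples divided by the wall-clock span of the run.
        LongSummaryStatistics ts = samples.stream().mapToLong(s -> s[0]).summaryStatistics();
        double seconds = (ts.getMax() - ts.getMin()) / 1000.0;
        double throughput = samples.size() / seconds;

        // Rough 90th percentile latency from the sorted elapsed times.
        long[] elapsed = samples.stream().mapToLong(s -> s[1]).sorted().toArray();
        long p90 = elapsed[(int) Math.floor(elapsed.length * 0.9)];

        System.out.printf("requests=%d errors=%d throughput=%.0f req/s p90=%d ms%n",
                samples.size(), errors, throughput, p90);
    }
}
```

Running something like this over each tier's merged logs is enough to reproduce the throughput, error count, and latency percentile figures discussed below.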

Now, let me summarize the results.  The results were collected over a time span of 10 minutes.  While it would be better to run for an hour or two to ensure there is no variability in results over longer runs, we have proven internally that the throughput is sustainable and we wanted to simplify the report.  Also note that (a) the throughput over the 10 minutes is very constant (ignore the beginning and end, which are artifacts of the JMeter tool and our reporter) and (b) this was after thirty minutes of warm-up, and a second run of 10 minutes repeated the results.

As a top-line number, we achieved roughly 50,000 requests per second.  These are end-to-end mobile client and Web 2.0 browser requests, which means each request involves mobile enablement aspects, REST endpoints, business logic, and data tier interactions.  The lightest request requires at least one deep data tier read (to validate the user's session) along with a flight query cache query and all the REST enablement to expose this to a browser and/or mobile client.  Extrapolated to 24 hours, this is roughly 4.3 billion requests per day.  As mentioned in the previous blog post, this puts this benchmark run in the "Billionaire's Club".  This means that for every second of the benchmark run, the application was on average serving the same order of magnitude of traffic previously documented by Google, Facebook, and Netflix.
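
The extrapolation itself is just the sustained rate multiplied by the number of seconds in a day; a quick back-of-envelope check:

```java
// Back-of-envelope extrapolation of the sustained rate to a full day.
public class DailyThroughput {
    public static void main(String[] args) {
        long requestsPerSecond = 50_000L;     // observed steady-state throughput
        long secondsPerDay = 24L * 60 * 60;   // 86,400
        long requestsPerDay = requestsPerSecond * secondsPerDay;
        System.out.printf("%,d requests/day%n", requestsPerDay);  // 4,320,000,000 (~4.3 billion)
    }
}
```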

While many internet companies talk about the technology they use to serve such throughput at Web Scale, I don't believe there are any publicly available open source workloads that someone can run to repeat web scale results.  With the open source application code, the documented results should be repeatable by anyone willing to work with the application and to spend the operational expense of trying the workload on a cloud.  To be fair to the internet companies, their applications are likely far more complex.  That said, the Acme Air application is as complicated as most other traditional benchmarks and far more complete in breadth (especially in cloud first and mobile architectures).  We believe this balance between repeatable benchmark and real world sample application should be beneficial to the world and help us start to focus open community conversation around web scale and cloud workload benchmarking.

We released the results of the "Java" implementation, which is based on the WebSphere Liberty Profile application server (a lightweight and nimble app server well suited to cloud deployments), WebSphere eXtreme Scale (an advanced data grid product that is also well suited to cloud), IBM Worklight (which gave us not only the development and deployment platform for our natively deployed Android and iOS mobile applications, but also a server to mobile-enable our application services), and nginx (a popular cloud HTTP proxy with load balancing features).  We used 51 WebSphere Liberty Profile servers, 46 WebSphere eXtreme Scale data containers, 28 IBM Worklight servers, and 10 nginx servers.  We drove the 50,000 requests/sec using 49 JMeter load driver instances, all coordinated to run emulated mobile and desktop browser traffic.  We ran all of this on the IBM SmartCloud Enterprise cloud, mostly in the Germany cell, for a sum total of 185 "copper" instances.  While these numbers are likely far smaller than the number of servers used in any internet company, for a web scale enterprise application this was a significantly sized deployment.

Digging deeper into the results, it is worth noting that the report covers many aspects of system performance (not only overall throughput but also latency distribution, per-node system performance, error rate, and workload distribution).

Some cool numbers taken from the report:
  1. The run had 30 million requests.  Of those 30 million requests, only 4 reported errors.  That is roughly one hundred-thousandth of one percent (about 0.00001%) of all requests.
  2. The 90th percentile response time never exceeded 300 ms for these end-to-end requests.  The average response time was around 70 ms per request.
  3. The main workhorse in the benchmark (the application server) averaged 91% CPU utilization across all 51 application servers.  The actual utilization was likely higher, as some outliers at the run start/end are averaged in.  The other workhorse tiers (the data grid containers and the mobile servers) averaged 83% and 62% respectively, which means that with cluster resizing we could have used fewer servers (see the sizing sketch after this list).
  4. We ran both browser traffic and mobile traffic.  In this run we decided to focus more on the browser traffic, emulating an application that was just starting to take on mobile users.  On average the ratio of browser traffic to mobile traffic was roughly 5 to 1.
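
On the cluster resizing point in item 3, a rough way to estimate how many servers a tier actually needed is to scale the observed server count by the ratio of observed to target CPU utilization.  This is a simplistic sketch that assumes CPU-bound tiers and linear scaling, and the 90% target utilization is my own illustrative choice, not something we tuned to:

```java
// Rough tier-sizing estimate: servers needed = ceil(current * observedUtil / targetUtil).
// Assumes the tier is CPU-bound and scales roughly linearly; the 90% target is illustrative.
public class TierSizing {
    static int estimate(int currentServers, double observedUtil, double targetUtil) {
        return (int) Math.ceil(currentServers * observedUtil / targetUtil);
    }

    public static void main(String[] args) {
        System.out.println("app servers:     " + estimate(51, 0.91, 0.90)); // 52 (already at target)
        System.out.println("data containers: " + estimate(46, 0.83, 0.90)); // 43
        System.out.println("mobile servers:  " + estimate(28, 0.62, 0.90)); // 20
    }
}
```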

One last cool number - PRICE!


In the previous blog post I talked about the cost of some of the largest scale standardized benchmarks.  I mentioned how one of the largest documented standardized benchmark runs cost approximately thirty million dollars to procure the base system needed to run the workload.  That specific publication ran an order of magnitude more transactions per second than shown in this Acme Air performance result, but the transactions were purely database focused.  The standardized benchmark did not contain the end-to-end mobile app and browser enablement.  The standardized benchmark was also monolithic in nature, meaning the failure of the database would mean zero throughput (all or nothing high availability).  With its cloud architecture and breadth of features, the Acme Air application does not suffer from these legacy shortcomings.  In comparison to the $30 million for the existing standardized benchmarks, our cost was a few thousand dollars in cloud spend per month to produce this Acme Air performance number.  We didn't get an exact cost as we tore down the workload once the performance tests were over, but you can (and should) do your own math based on the number of instances we used running 24/7 for a month (a sketch of that math follows below).  We also achieved this with a team of five people at IBM.  I believe this dramatic reduction in expense and required human resources is significant, and it is the key reason why performance work needs to focus on cloud first deployments going forward.  The cloud and cloud first application architectures have fundamentally changed the world of computing.
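
For anyone who wants to do that math, here is its shape.  The hourly rate below is a placeholder, not the actual SmartCloud "copper" price; substitute whatever rate applies to the instances you use:

```java
// Back-of-envelope monthly cloud cost: instances * hours in a month * hourly rate.
// Pass the hourly rate on the command line; the default here is a placeholder,
// not the actual SmartCloud "copper" instance price.
public class CloudCostEstimate {
    public static void main(String[] args) {
        double hourlyRate = args.length > 0 ? Double.parseDouble(args[0]) : 0.10;
        int instances = 185;              // total instances used in the run
        int hoursPerMonth = 24 * 30;      // running 24/7 for a month
        double monthlyCost = instances * hoursPerMonth * hourlyRate;
        System.out.printf("%,d instance-hours -> ~$%,.0f/month at $%.2f/hour%n",
                instances * hoursPerMonth, monthlyCost, hourlyRate);
    }
}
```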

Next up ... some words about our availability and transactionality.  Stay tuned!