In a recent article, Derrick Harris talked about how crowdsourcing could be the future of benchmarking. As someone who participates deeply in standardized benchmarks (TPC and SPEC), I wanted to comment on some of the important messages in his blog.
Derrick talks about benchmarking within the context of Hadoop, but in general the article applies to benchmarking across multiple technologies. While SPEC and TPC benchmarks have incredible industry credibility, it's hard to ignore the fact that Hadoop, NoSQL, and many open source projects have long since played a different game. I read blogs all the time that describe simple developer-laptop performance tests. While these benchmarks (more realistically, performance experiments) aren't what would matter for enterprise application performance in a datacenter, after some review and adjustment they usually contain good bits of performance knowledge. I also see single-vendor performance results and claims that give very little information.
I have, in the past, talked about the value of standardized benchmarks. I talked about why doing such benchmarks at SPEC leads to unbiased and trusted results. I think the key reason is the rigor and openness with which the review is done and the focus on scenarios that matter within enterprise computing. SPEC also has years of benchmarking experience to leverage in avoiding common performance testing mistakes. It's impossible to compare a developer-laptop performance experiment to any SPEC benchmark result; the result from SPEC is likely far more credible. With SPEC, benchmark results are usually submitted by a large number of vendors, meaning the benchmark matters to the industry. With performance experiments, until there is community review and community participation, there is only one vendor, which leads to "one-off" tests that have less long-standing industry value. The scenario where I wrote about this - a Microsoft "benchmarketing" single-vendor result - is a very good example of how results from a single vendor don't have much value.
But there is a problem with some SPEC benchmarks - the community in which benchmarks are designed and results are disclosed is a closed community. It's great that SPEC is an unbiased third party to the vendors, but that doesn't mean the review community includes the consumers of the results. I think Derrick reflects on this by talking about how "big benchmarks" aren't running workloads anyone runs in production. I disagree, but I do believe that, due to the lack of an open community, it's harder for consumers to understand how the results compare to their own workloads. I personally will attest to how SPECjEnterprise2010 and its predecessors have improved Java-based application servers for all customer applications. While it might not be clear how that specific benchmark matters to a specific customer's use of a Java-based application server, it is not true that improvements shown via the benchmark don't benefit the customer's applications over the long haul. In contrast to Derrick's views, this is why customers benefit from vendors participating in such benchmarking themselves - I don't think these improvements would have occurred if all benchmarking had been done without vendor involvement.
BTW, full disclosure of the performance experiment and results is critical. You can see in the recent Oracle ad issue that the whole industry loses without such disclosure. Any performance data should be explained within the context of publicly available tests, methodology, tuning, etc.
I think if you put some of these views together (Derrick's and my own on standardized benchmarking), you'll start to see some common threads. Here, I think, are the key points:
1) We need open, community-based benchmarking in this day and age (crowdsourcing is one option; more open standardized benchmarking is another). By doing this, the results should be seen as not only trustworthy but also understandable.
2) Any benchmark, to have value, must have multiple participants actively engaged in publishing results and actively discussing those results and the technologies that led to them. By doing this, the benchmark will have long-standing industry value.
I hope this post generates discussion (positive and negative). I'd love to take action and start to figure out how the industry can move forward with open, community-based benchmarking.