Benchmark Testing

by Faulkner Staff

Docid: 00021367

Publication Date: 2012

Report Type: TUTORIAL


At its best, benchmarking cuts through confusing, contradictory, and
biased information to deliver objective data in support of IT decisions.
But good, relevant benchmarks are hard to define, and the results of these
tests can be confusing. To benchmark effectively, organizations must
identify the measurements that are most relevant to their needs and
clearly understand how to interpret their meanings.

Executive Summary

Many organizations struggle to determine how well their applications,
devices, and services are performing.

Benchmarking helps to solve this problem by providing quantified
measurements. For example, a benchmark might measure the data throughput
of a LAN in bits per second. An organization would first take measurements
while the network was in a “normal” state, and then it would take a set of
measurements under different conditions, such as after a potentially
troublesome application was installed.
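As a rough sketch of that before-and-after comparison (the transfer sizes and timings below are invented for illustration), throughput samples can be converted to bits per second and the baseline compared with the later runs:

```python
import statistics

def throughput_bps(num_bytes, seconds):
    """Convert one timed transfer into bits per second."""
    return (num_bytes * 8) / seconds

def relative_change(baseline_bps, changed_bps):
    """Mean change between two sets of throughput samples (-0.25 = 25% slower)."""
    base = statistics.mean(baseline_bps)
    after = statistics.mean(changed_bps)
    return (after - base) / base

# Hypothetical runs: 10 MB transfers timed before and after installing
# a potentially troublesome application.
baseline = [throughput_bps(10_000_000, t) for t in (0.80, 0.82, 0.79)]
after_install = [throughput_bps(10_000_000, t) for t in (1.10, 1.15, 1.08)]
print(f"throughput change: {relative_change(baseline, after_install):+.1%}")
```

Averaging several samples in each state, rather than trusting a single run, guards against one-off fluctuations in network conditions.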

But throughput is only one type of benchmark that can be measured on a
LAN, and data networks are just one of the IT assets that can be
benchmarked. Organizations can also measure databases, cloud applications,
mobile connections, and many other products or services. Not all of these
measurements are accurate and useful, however. Some mislead or provide
little in the way of meaningful information.

Benchmarking is not a simple, paint-by-numbers process. To do it
effectively, organizations must identify which of the many tests available
are most meaningful to them based on their particular business needs and
technology environment, and they must then carefully plan how to execute
these tests. This need for customization is particularly important in
user experience testing, an emerging practice that focuses on how users
interact with applications and Web sites.



The Varieties of Benchmarking

Benchmarking aims to objectively, quantitatively test hardware, software,
and IT services. The growing interest in benchmarking has spawned a
seemingly endless variety of measurement and monitoring tools, and thus
there are now more tests that can be run than any organization could
reasonably keep track of and make sense of. Many characteristics can
be tested, but practically speaking, organizations tend to focus
on speed and reliability.

Benchmarking is used in two basic ways:

  • For an enterprise to test the devices, applications, and services it
    uses in its own environment, based on how they operate within that
    environment.
  • For an organization to test a device, application, or service outside
    of a single environment. These putatively “absolute” measurements are
    taken for one of several reasons:

    • For organizations to benchmark the products and services sold by
      another company, either to provide that company with data or to help
      customers make purchasing decisions. An example of such
      third-party ratings is Bitpipe’s online Benchmark testing reports.1 
    • For organizations to test their own products and services, either
      for their own information or to demonstrate performance to customers.
      An example of this type of testing is Oracle’s Applications Standard
      Benchmark.2
    • For organizations to test the products and services of other
      companies to determine how well they integrate with their own
      products, such as testing a hardware platform based on how well it
      supports a particular piece of software. An example is SAP’s
      Standard Application Benchmarks, which aim to help SAP users “find
      the appropriate hardware configuration for their IT solutions.”3

Targets of Benchmarking

Databases

Enterprise databases provide a good example of a type of application
that is commonly benchmarked. Organizations can benchmark database
performance factors such as throughput measured in transactions per
second. And benchmarks can examine other factors such as how hardware is
configured or how the application’s code is written. “If the application
code has been written as efficiently as possible, additional performance
gains might be realized from tuning the database and database manager
configuration parameters,” says IBM.4 “You can even tune
application parameters to meet the requirements of the application.”
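To make the transactions-per-second idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for an enterprise database; the schema and workload are invented for illustration:

```python
import sqlite3
import time

def measure_tps(conn, num_txns=500):
    """Run a fixed number of small write transactions and report txns/sec."""
    conn.execute("CREATE TABLE IF NOT EXISTS bench (id INTEGER, val TEXT)")
    start = time.perf_counter()
    for i in range(num_txns):
        with conn:  # each 'with' block commits one transaction
            conn.execute("INSERT INTO bench VALUES (?, ?)", (i, "x"))
    elapsed = time.perf_counter() - start
    return num_txns / elapsed

conn = sqlite3.connect(":memory:")
print(f"{measure_tps(conn):,.0f} transactions/sec")
```

Re-running the same harness after a configuration change (or against a different deployment) yields the kind of like-for-like comparison the report describes.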

IBM explains that it is useful to benchmark databases to compare
different situations, such as when there are changes in the volume of
users or when a different version of software is installed. Similarly,
performance measurement company NetForecast describes its approach to
benchmarking as entailing the evaluation of “deployment alternatives
defined by variables such as geographic location, user uptake, application
functional changes, technical alternatives, user interface changes, etc.”5
Benchmarking can also replicate various implementation scenarios, such as
single-instance deployments and clusters.

The potential difficulties of benchmarking network applications are
described by NetForecast as follows: “Network application and user
behavior is in a never-ending state of flux as new network-based
applications and activities gain traction while old ones morph or fall by
the wayside, and network users are continually added and change locations.
This makes it hard for network managers to understand, predict, and
improve network performance to ensure that it meets the evolving needs of
the business.”6

When testing an application, it is helpful to also test the network on
which it operates, as discussed below.


Networks

The performance of a network can be benchmarked in many ways. One
approach is to assess networks with respect to overall performance, measured
in speed (bits per second); scalability, measured in terms of the traffic
volumes that the network can handle (e.g., the amount of data or the
number of transactions); or reliability, measured in uptime. Performance
is also sometimes measured in terms of latency, packet loss, or jitter.
(Jitter is especially important for networks that support voice over IP.)
Organizations can also benchmark individual components of networks, such
as routers. Such tests are often performed in a lab because once a device
is installed on a network, there may be too many variables to effectively
assess the device itself. 
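For example, the latency, jitter, and packet-loss figures mentioned above can all be derived from a series of round-trip-time probes. The sketch below uses invented samples and a simplified jitter definition (mean absolute difference between consecutive RTTs):

```python
import statistics

def summarize_probes(rtts_ms, sent):
    """Reduce round-trip-time samples to the common network benchmarks."""
    return {
        "latency_ms": statistics.mean(rtts_ms),
        "jitter_ms": statistics.mean(
            abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])
        ),
        "packet_loss": 1 - len(rtts_ms) / sent,
    }

# Hypothetical run: 10 probes sent, 8 answered.
print(summarize_probes([20.1, 22.3, 19.8, 25.0, 21.2, 20.5, 23.1, 19.9], sent=10))
```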

Wireless networks are often harder to benchmark than wired networks.
“[B]ecause WiFi networks are sensitive to RF interference from other
wireless devices, they are more difficult to troubleshoot and transient
changes in the local environment may affect their performance,” writes the
company Nuts about Nets, a small provider of products for measuring
network performance.7 “Tools that are typically used to
troubleshoot wireless networks report signal strengths of RF interference
or beacons from an access point in units of dBm or RSSI (relative signal
strength indication). But what do these really mean? How do these
translate in terms of the performance of your wireless network?” The
company argues that the ultimately meaningful measurement of network
performance is throughput as measured in the number of bytes per second
that can move between nodes on a network. “The dBm and RSSI numbers don’t
mean much if you can’t somehow relate them to a performance metric – the
most relevant being ‘bytes / sec,’” it says.

Mobile Services

There are several tools available for benchmarking mobile devices. A TechRadar
analysis looked at what it considers to be some of the best of them,
including the following:8

  • AnTuTu is a free tool for benchmarking Android phones. It examines
    several metrics and uses them to produce a single numeric indicator of
    overall performance.
  • Geekbench, from Primate Labs, works on both the Apple iOS and
    Android platforms. It takes measurements of several specific factors,
    and these measurements can also be distilled into a single overall
    score.

The boom in popularity of mobile applications has created another focus
for benchmarking. Testing mobile software requires a different approach
than testing software developed for desktops. “From an application
performance testing perspective, such mobile versions of a web page need
to be treated as separate applications, even though they might share some
components on the back-end,” says a report from Micro Focus (formerly
Borland).9 “This all comes down to the fact that a variety of
mobile devices are driven by a range of operating systems that include
Android, iOS, Windows Phone and Blackberry, thus your testing solution
must enable you to record test scripts from a PC, an emulator or a mobile
device. Moreover it must simulate the bandwidth limitations of mobile
network connections and support all existing and upcoming mobile phone
standards like GPRS, EDGE, UMTS, HSDPA, HSPA+, and LTE.”

Cloud Services

The benchmarking of cloud services typically focuses on measuring how
applications perform under heavy traffic loads. There are many tools now
available that enable developers to simulate Internet traffic, which would
be hard to otherwise test. Leading examples include Micro Focus Silk
Performer Cloudburst, Load Runner, and LoadStorm as well as Web-based
services from IBM. And the field is expanding: Google now offers the
PerfKit cloud benchmarking tool.10

Such products and services let users simulate the conditions under which an
application will operate. The factors that can be simulated include the
following:

  • The global region from which traffic is coming.
  • Whether the application is accessed over the Internet or an intranet.
  • What actions are performed, such as completing credit card
    transactions.
  • What browsers are used.
  • The platforms and languages in use.

Some load testing products offer usage reports or help developers
identify the root cause of performance problems. Many of the tools
available work over the Internet without requiring software or
hardware to be installed.
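The shape of such a load test can be sketched as follows; fake_request is a hypothetical stand-in for a real HTTP call, and the user counts and delays are invented:

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request():
    """Stand-in for an HTTP call; a real test would hit the application."""
    delay = random.uniform(0.005, 0.02)
    time.sleep(delay)
    return delay

def load_test(request_fn, users=20, requests_per_user=5):
    """Fire requests from many concurrent 'users' and summarize latencies."""
    def one_user(_):
        return [request_fn() for _ in range(requests_per_user)]
    with ThreadPoolExecutor(max_workers=users) as pool:
        latencies = [t for batch in pool.map(one_user, range(users)) for t in batch]
    latencies.sort()
    return {
        "requests": len(latencies),
        "mean_s": statistics.mean(latencies),
        "p95_s": latencies[int(len(latencies) * 0.95)],
    }

print(load_test(fake_request))
```

Commercial tools layer the simulated factors listed above (geography, browser mix, user actions) on top of this same basic pattern of concurrent workers and latency percentiles.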

Virtual Environments

Virtual environments present a new target for benchmarking. The leader of
the virtualization marketplace, VMware, offers a free tool for
benchmarking, called VMmark. “Traditional single-workload performance and
scalability benchmarks for non-virtualized environments were developed
with neither virtual machines nor server consolidation in mind,” says the
company.11 SolarWinds,
which also offers a product for benchmarking virtual environments, echoes
the idea that these environments are particularly hard to test.
“Performance monitoring with virtualization is much more complicated than
with traditional servers,” it says.12 “The virtualization layer
that is inserted between the physical hardware and the guest operating
systems of the virtual machines changes the way you monitor performance.
With virtualization, there is more to monitor, and interpreting the
results can be difficult.” But, SolarWinds points out, these
environments are very susceptible to performance problems: “Because
virtualization is a shared environment, small changes can have a big
ripple effect throughout the entire virtual infrastructure. Hosts have a
limited set of resources, and there are many VMs competing for them. If a
performance problem were to occur, all your VMs can potentially be
affected by it.”

An article by IT analyst Bernd Harzog notes that difficult-to-diagnose
performance problems often arise in virtual environments: “Almost every
enterprise that I have spoken to about their experiences in virtualizing
anything more than simple or tactical applications has come across one or
more that did not perform well once virtualized.”13 Through a
process of elimination, Harzog concludes that the roots of many of these
problems are “the storage networking and physical storage layers of the
virtual infrastructure.” He warns that “[t]here are a wide variety of
problems that can occur in these layers, all of which can create serious
performance problems for applications and users.”

User Experiences (UX)

User experience (UX) testing evaluates how well an application or Web site
meets customer needs. This is a broad category of benchmarking that includes
both objective measurements, like conversion rates (e.g., the percentage of
customers who buy something), and subjective evaluations, such as how users
rate a site’s appearance. The baselines against which these measurements are
made can include previous versions of a company’s own software or the
services of a competitor.
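An objective UX measurement of the kind described above can be as simple as a conversion-rate comparison against a previous release; all figures here are invented:

```python
def conversion_rate(visitors, conversions):
    """Fraction of visitors who completed the goal (e.g., made a purchase)."""
    return conversions / visitors

# Hypothetical baseline comparison: current release vs. the prior version.
current = conversion_rate(12_400, 310)
previous = conversion_rate(11_900, 262)
print(f"current {current:.2%} vs. previous {previous:.2%}")
```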

Tests are shaped around the features of a particular Web site or
application. They are not as standardized as most other types of
benchmarking. Because of the need for such customization, UX testing is
often performed by third-party consultants, sometimes those specializing
in a certain industry. Unlike more objective, narrowly defined benchmarks
like network throughput, UX testing typically can’t be performed with
automated tools. Instead, knowledge in the field is held by consultants
like Fresh Consulting, Measuring U, and the Nielsen Norman Group, and
often described in sources like blogs rather than formal research papers.

The projected growth of UX benchmarking can be roughly estimated by
looking at the overall Web performance monitoring, optimization, and
testing market, of which it is a part. That market is forecast to grow at
a compound annual rate of over 9 percent through 2022, when it is projected
to reach $5.45 billion.

Current View


The boom in cloud and mobile computing has created the need for new
benchmarking services and tools. For instance, IBM offers Rational
Performance Tester, which lets customers test how their Web applications
perform in a simulated environment. The products from Micro Focus
discussed above also fall into this category.

In addition, there are many small and mid-sized companies in the field.
In a Wall Street Journal article about how startups are
competing in the performance testing market, the CEO of testing company
SOASTA, Tom Lounibos, said “[t]his is a battle for the application
lifecycle of the future – develop, test, deploy and manage… All the big
players are figuring out how to come in, and we all know the future is up
for grabs. I’ve never seen it more disruptive.”15 

Smaller companies competing in the market include the following:

  • Applause
  • Dynatrace
  • Parasoft

There are also many free tools available for download, some of which were
created by individual, even anonymous, developers.


Outlook

The IT industry is moving quickly and decisively toward the broader use
of cloud and mobile services that can be accessed anywhere, at any time,
and from (almost) any platform. A user who is evaluating the quality of
these services will be greatly interested in speed and reliability, which
are features on which benchmarks commonly focus. (Security, the other
factor that is of greatest concern when evaluating cloud and mobile
services, is less amenable to benchmarking.) As a result, the interest in
benchmarking will likely continue to grow as cloud and mobile technology
is used by more and more people for a greater number of functions.

But the use of benchmarking is unlikely to grow as quickly as the
technologies that are helping to motivate interest in it. This is in part
because customers often embrace a technology without having objective
measures of its performance. For instance, many wireless services that are
marketed as being fourth generation (4G) offer speeds that fall well short
of the official specifications for the standard. Yet interest in 4G
remains high, and since services branded with this label deliver
better speeds than previous generations of cell phone services,
it still makes sense to upgrade to 4G. And other issues will
likely limit the growth of benchmarking: There are many factors to
consider when choosing a product or service other than those that can be
effectively benchmarked, and benchmarking remains difficult to perform and
its results are still often open to interpretation.

Over time, benchmarking will expand into other areas, but there might be
some obstacles to overcome along the way. One area in which there are few
commonly accepted benchmarks is in Big Data. “We’ve hardly seen any
standardized performance benchmarks in big data,” writes George Gilbert.16
“Nor will we see any real benchmarks until the industry coalesces around
common workloads.” One notable effort has been made by the nonprofit
Transaction Processing Performance Council, which in 2015 released a
benchmark that pertains to a single aspect of Big Data technology. This is
just a small start, however, and the broader industry would still have to
widely adopt any standard that is proposed.


Recommendations

Pick the Right Tests to Run

Enterprises can measure their applications, devices, and services in many
ways using the gamut of tools now available. But the hard part of
benchmarking is gathering metrics that provide information that is
meaningful and useful. “Companies often rely on the most readily available
metrics rather than the most useful,” explains a Network World
analysis. “One such metric is I/Os [Input/Output Operations] per second.
This metric only addresses two secondary measures: is the I/O causing a
problem, and how optimal is it? It does not get to the heart of the most
important questions: how quickly are things getting done, and are they all
getting done?”

Organizations that pick tests that are meaningful in terms of their
business goals and that are appropriate for their technology
infrastructures will realize better results than organizations that use
generic benchmarks. Making good choices can be difficult, however. The
process is often highly complex and technically difficult, becoming more a
matter of computer science than of IT administration.18

Develop a Plan

Benchmarks are typically not effective if they are run without
consideration of what they mean. “I can tell you from hard won experience
that many weeks of hard work have [been] ruined because the person or team
performing the benchmark failed to prepare properly,” says Kevin Kline,
director of engineering services at SQL Sentry.19 “Many of
these ruinous mistakes were caused by forgetting the cardinal rule of
benchmarks – a benchmark must produce results that are both reliable and
repeatable so that we can foster conclusions that are predictable and
actionable. Keeping the ‘reliable and repeatable’ mantra in mind
necessitates a few extra steps.”

A book published by the American Society for Quality offers the following
advice about planning the benchmarking process: “You might begin
benchmarking by running the test application in a normal environment. As
you narrow down a performance problem, you can develop specialized test
cases that limit the scope of the function that you are testing. The
specialized test cases need not emulate an entire application to obtain
valuable information. Start with simple measurements, and increase the
complexity only when necessary.”20
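In that spirit of starting simple while keeping results “reliable and repeatable,” a measurement harness can discard warm-up runs and then report the spread across repeated timings; the workload below is an arbitrary example:

```python
import statistics
import time

def benchmark(fn, warmup=2, repeats=5):
    """Time fn repeatedly; warm-up runs are discarded so results are repeatable."""
    for _ in range(warmup):  # prime caches before measuring
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {"mean_s": statistics.mean(samples),
            "stdev_s": statistics.stdev(samples)}

# Hypothetical workload: sum a million integers.
print(benchmark(lambda: sum(range(1_000_000))))
```

A large standard deviation relative to the mean is a warning that the test environment is not yet stable enough to draw conclusions from.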

