Best practices for benchmarking SAN performance

Storage-area network complexity can mask what might seem to be relatively benign issues that have the potential to build up and cause an outage or brownout. To identify trouble early, you need to create a SAN performance benchmark, an essential first step to setting up metrics to gauge infrastructure performance.

The key is to establish the metrics in advance. Most companies wait until they have a problem before trying to truly understand baseline performance. Ironically, that is the worst time to look, because a) what is found is often overwhelming, b) multiple issues often appear to be the cause, making it difficult to know where to start, and c) many performance optimization opportunities are overlooked.

Here are best practices for benchmarking SAN performance:

1. Baseline when the SAN is healthy. The best time to evaluate an environment is when everything is healthy and before a cost-saving or performance-enhancing project is implemented. This provides a metric to compare the "problem" state to the baseline, making it immediately obvious where the problem resides.

Ideally, a company should be proactive with the initial baseline and address the issues that are present. Eliminating existing issues helps reduce the number of problems that can together cause a brownout. Optimization savings can be well planned and measured by comparing both consolidation effectiveness and user impact.

A good baseline will often reveal over-provisioned infrastructures, ineffective use of tiers, multi-path issues, uneven load distribution, physical layer problems, minor device incompatibilities, improper configurations (zoning, I/O request size, queue depths), out-of-control applications, unnecessary load or intermittent performance issues.
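As a rough illustration of comparing a "problem" state against a healthy baseline, the sketch below flags any metric that has grown past its baseline by more than a chosen tolerance. The metric names, the sample values and the 25% tolerance are all invented for the example; they are not taken from any particular monitoring product.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.25) -> List[str]:
    """Return names of metrics whose current value exceeds the
    baseline value by more than `tolerance` (0.25 means 25%)."""
    flagged = []
    for name, base_value in baseline.items():
        now = current.get(name)
        if now is None:
            continue  # metric was not collected in the current sample
        if now > base_value * (1 + tolerance):
            flagged.append(name)
    return sorted(flagged)


# Hypothetical values captured while the SAN was healthy...
baseline = {
    "lun7.read_ect_ms": 4.0,
    "lun7.write_ect_ms": 6.0,
    "hba2.pending_exchanges": 8.0,
}
# ...and during a suspected brownout.
current = {
    "lun7.read_ect_ms": 4.4,
    "lun7.write_ect_ms": 19.0,
    "hba2.pending_exchanges": 31.0,
}

regressions = find_regressions(baseline, current)
print(regressions)
```

With these made-up numbers, the write completion time and the pending exchanges stand out while read completion time stays within tolerance, which is the kind of narrowing-down a baseline is meant to provide.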

2. Measure what matters. The most important goal for an application user is to see their actions complete successfully and accurately in a timely fashion. There are two secondary goals for the IT organization: how to resolve user issues, and how to ensure the solutions use only the resources necessary.

Companies often rely on the most readily available metrics rather than the most useful. One such metric is I/Os per second. This metric only addresses two secondary measures: is the I/O causing a problem, and how optimal is it? It does not get to the heart of the most important questions: how quickly are things getting done, and are they all successful?

Rather than looking at I/O, for effective monitoring you need to consider:

* Minimum, maximum and average exchange completion time (ECT) for read, write and other operations (nine metrics) for every host bus adapter (HBA), storage port and logical unit number (LUN).

* Minimum, maximum and average read command to first data for every HBA, storage port and LUN.

* Minimum, maximum and average pending exchanges (queue depth) for every HBA, storage port and LUN.

* Read/write/other I/O size for every HBA, storage port and LUN.
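The first metric in the list above can be sketched as a simple aggregation. The example assumes each completed exchange is available as a record carrying its HBA, storage port, LUN, operation type and completion time; the `Exchange` record and the choice to group by the combined HBA/port/LUN path are assumptions made for illustration, not the format of any real tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Exchange(NamedTuple):
    """Hypothetical record of one completed exchange."""
    hba: str
    storage_port: str
    lun: int
    op: str        # "read", "write" or "other"
    ect_ms: float  # exchange completion time in milliseconds


Key = Tuple[str, str, int, str]


def summarize_ect(exchanges: Iterable[Exchange]) -> Dict[Key, Dict[str, float]]:
    """Compute min, max and average ECT per (HBA, port, LUN, op)."""
    groups = defaultdict(list)
    for ex in exchanges:
        groups[(ex.hba, ex.storage_port, ex.lun, ex.op)].append(ex.ect_ms)
    return {
        key: {"min": min(v), "max": max(v), "avg": sum(v) / len(v)}
        for key, v in groups.items()
    }


sample = [
    Exchange("hba0", "port1", 7, "read", 2.0),
    Exchange("hba0", "port1", 7, "read", 4.0),
    Exchange("hba0", "port1", 7, "read", 30.0),
    Exchange("hba0", "port1", 7, "write", 5.0),
]
summary = summarize_ect(sample)
for key, stats in summary.items():
    print(key, stats)
```

Keeping the maximum alongside the average matters here: the read average of 12 ms looks modest, but the 30 ms maximum shows that at least one exchange took far longer than the others.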

Another common mistake is to give a metric more credit than it deserves, for example by relying on a server response time (either from the operating system or an application on the server) to determine the health of the rest of the infrastructure.

There are several problems with this approach that make it insufficient to determine whether or not the infrastructure is causing issues. One challenge is that the measurement is impacted by all of the resources on the server. Server issues can cause this measurement to appear artificially long when in fact something as simple as a busy CPU can be the real problem, not I/O transaction times.

The other issue is that this approach relies on the same resources that are being monitored to do the monitoring. As a result, only large averages or samples are gathered. Ironically, when things are slow, fewer transactions are completed. If the slowdown applies to only one resource for the server (for example, a single LUN or virtual machine), the response times can still look good even though there is a big problem. When you average tens of thousands of good transactions with tens of thousands of bad, the result is that everything looks good. Outlying infrastructure problems can be missed.
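A small worked example may make the averaging effect concrete. The LUN names, transaction counts and response times below are invented; the only point is that a slow resource completes fewer transactions, so it carries little weight in a server-wide average.

```python
# Hypothetical response times (ms) seen by one server over an interval.
# lun3 is badly degraded, so it completes far fewer transactions.
samples = {
    "lun1": [5.0] * 10_000,
    "lun2": [5.0] * 10_000,
    "lun3": [500.0] * 100,
}

# Server-wide average: every transaction pooled together.
all_times = [t for times in samples.values() for t in times]
overall_avg = sum(all_times) / len(all_times)

# Per-LUN averages: the same data, broken out by resource.
per_lun_avg = {lun: sum(t) / len(t) for lun, t in samples.items()}

print(f"overall average: {overall_avg:.2f} ms")
for lun, avg in per_lun_avg.items():
    print(f"{lun}: {avg:.2f} ms")
```

In this sketch the pooled average comes out at roughly 7.5 ms, which would pass most health checks, while the per-LUN breakdown shows one LUN running at 500 ms.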

3. Measure the complete I/O transaction path. Because application response time is measured on the server by the server, it is only a rough indicator in a benchmark. Administrators should look to latency deltas throughout the data path to establish baselines for effective troubleshooting.
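One way to picture latency deltas along the data path is to timestamp the same transaction at several measurement points and subtract neighbors. The measurement point names and the microsecond values below are hypothetical, chosen only to show the arithmetic.

```python
from typing import Dict, List, Tuple


def hop_deltas(timestamps_us: List[Tuple[str, float]]) -> Dict[str, float]:
    """Given (point_name, timestamp) pairs in path order, return the
    elapsed time between each pair of consecutive points."""
    deltas = {}
    for (src, t_src), (dst, t_dst) in zip(timestamps_us, timestamps_us[1:]):
        deltas[f"{src} -> {dst}"] = t_dst - t_src
    return deltas


# Hypothetical timestamps (microseconds) for one read command.
observed = [
    ("host_hba", 0.0),
    ("switch_in", 40.0),
    ("switch_out", 55.0),
    ("array_port", 900.0),
]

deltas = hop_deltas(observed)
slowest_hop = max(deltas, key=deltas.get)

for hop, elapsed in deltas.items():
    print(f"{hop}: {elapsed:.0f} us")
print("largest delta:", slowest_hop)
```

Recording these deltas while the SAN is healthy gives a per-segment baseline, so a later slowdown can be attributed to a segment of the path instead of to the server's overall response time.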

Another mistake is relying on the end devices or components to tell you if the infrastructure is healthy. Storage arrays and switches provide useful information when problems are present, but they aren't designed to determine if a problem exists in the infrastructure as a whole. They are inward focused rather than infrastructure focused and lack the granularity to be conclusive. They simply cannot prove conclusively that all of the transactions are completing successfully from host to array and back again in a timely fashion.

4. Use non-intrusive instrumentation. Use instrumentation that is vendor-independent and not SAN component-derived. It will help provide accurate, comprehensive, cross-vendor benchmark metrics. The ideal way to baseline an infrastructure is to find a solution that monitors the environment without the performance impact that a component might have or the outside influences that a server has.

5. Measure every transaction, in real time. The solution needs to be able to monitor every transaction to ensure that they complete successfully in a timely fashion and present the data frequently enough (ideally every second) to ensure that outliers are not missed. The typical one- or five-minute averages most tools report are guaranteed to miss problems.
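The effect of reporting granularity can be shown with a short sketch. It assumes one ECT value is recorded per second for five minutes, with a three-second spike in the middle; the values and the 100 ms alert threshold are illustrative assumptions.

```python
# Hypothetical worst-case ECT (ms) recorded once per second for 5 minutes.
per_second_ect_ms = [5.0] * 300
for second in (120, 121, 122):
    per_second_ect_ms[second] = 800.0  # a three-second stall

# What a tool reporting a single five-minute average would show.
five_minute_avg = sum(per_second_ect_ms) / len(per_second_ect_ms)

# What per-second reporting would show, using an assumed 100 ms threshold.
threshold_ms = 100.0
spike_seconds = [i for i, v in enumerate(per_second_ect_ms) if v > threshold_ms]

print(f"five-minute average: {five_minute_avg:.2f} ms")
print(f"seconds above {threshold_ms:.0f} ms: {spike_seconds}")
```

Here the five-minute average lands just under 13 ms, well below the threshold, even though three individual seconds saw 800 ms completion times.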

Establishing metrics in advance based on data captured when the SAN is healthy and application response times are acceptable is key to identifying SAN troubles early. These benchmarks will enable you to spot what otherwise might seem like benign issues and prevent outages.

Foster is a principal architect for Virtual Instruments professional services.


Craig Foster

Network World