How to End Virtualization Administration Storage Frustration

Provisioning VMs is easy, and then the problems start…
Marc Staimer, President & CDS of Dragon Slayer Consulting
[email protected] 503-579-3763
Introduction
Virtualization has been an incredible IT operations godsend to the vast majority of IT organizations. It has tremendously simplified server implementations, management, operations, upgrades, patches, tech refresh, and, most important of all, availability. Tasks that used to require scheduled downtime are now conducted online, during production hours, instead of on weekends, holidays, and late nights. Virtualization administrators can create, configure, and provision VMs in a matter of minutes. This makes virtualization admins smile.
It is what follows VM provisioning that makes virtualization admins scream in frustration. First there is the annoying wait for storage provisioning, which can take hours, days, even weeks. Then there is the exasperating, intermittently poor performance that somehow has something to do with storage and rarely matches the pristine performance experienced in the lab. Diagnosing and fixing the problem is maddeningly difficult; the tools rarely say, "Here's the root cause of the problem!" In addition, there is the tedious, manually labor-intensive performance tuning required by virtualization's constantly changing environment. And finally (as if that weren't enough), virtualization admins have to coordinate their data protection policies and practices with those of the backup admin and the storage admin.
This leads many virtualization admins to lament: "Why can't managing performance, storage, and data protection be as easy as creating, configuring, provisioning, and managing VMs?" It is a valid question. The root cause almost always comes down to storage. More specifically, it seems to revolve around storage-based performance barriers.
Storage performance barriers are technical issues or processes that prevent virtual machines (VMs) and virtual desktops (VDs) from consistently achieving required response times. These barriers can be something as simple as the high latency inherent in hard disk drives (HDDs) that limits IOPS and throughput, or as convoluted as SAN or LUN oversubscription. Each barrier by itself will reduce virtualization performance. Combinations of these barriers can decimate it.
This document will examine in detail the storage problems virtualization administrators must deal with on
a day-to-day basis; the typical market workarounds deployed to solve those problems; how those
workarounds ultimately break down and fail in one way or another; and finally a better way to solve these
vexing storage problems for the virtualization administrator – the Astute Networks ViSX Flash-based VM
Storage Appliances.
Table of Contents

Introduction
Storage Barriers Plaguing Virtualization Performance
    HDD Limitations
        Adding More HDDs to the Storage System
        HDD Short Stroking
        PCIe Flash SSDs as Cache in the Physical Virtualization Servers
        Flash SSDs in the Storage System as Cache, Tier 0 Storage, or as Complete HDD Replacement
        Flash Cache Appliances in the Storage Network
    VM LUN Oversubscription
    Storage Network Configuration and Oversubscription
    Organizational Administrative Silos
    Data Protection, Business Continuity, and Disaster Recovery
    There Must Be a Better Way to Eliminate or Mitigate Storage Barriers Plaguing Virtualization Performance
Astute Networks ViSX G4 Flash Virtual Machine Optimized Storage Appliances
    No HDD 100% Flash SSD Appliances
    Unprecedented IOPS and Throughput
    Lower Than Expected TCO
    Purpose Built for the Virtualization Administrator
    Enterprise Class Reliability
Conclusion
Storage Barriers Plaguing Virtualization Performance
There are five major storage-based virtualization performance barriers:
1. Hard disk drive (HDD) limitations
2. VM LUN oversubscription
3. Storage network configuration/oversubscription
4. Organizational administrative silos
5. Data protection, business continuity, and disaster recovery
HDD Limitations
HDDs are electro-mechanical devices with spinning magnetic platters and a moving read/write head. Performance is directly tied to the density of the platters (the largest being 4TB, but only 900GB for higher-speed HDDs); the speed at which they spin (currently maxed out at 15,000 RPM); how fast the head can find a specific piece of data (commonly measured as seek time); how fast the head can write the data; and the delay incurred when a request comes in while the head is already reading or writing. All of these factors add up to the HDD's latency. HDD latencies are high and by definition severely limit performance. One of the biggest contributors to IO latency is the huge, well-known, and rapidly widening performance gap between storage processors and HDDs. Processors have been following Moore's law for nearly 40 years, which has led to their power, IOPS, and bandwidth improving roughly 100X over a 10-year period (per Intel). HDD improvement over that same time period has been effectively flat. In other words, processors wait an eternity for HDDs to read or write data.
Fig 1: Processor – HDD Performance Gap per Intel
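The arithmetic behind HDD latency is straightforward. The minimal sketch below estimates the random IOPS a single drive can sustain from its rotational latency and an assumed average seek time (the seek values are illustrative, not vendor-quoted):

```python
# Rough estimate of per-drive random IOPS from mechanical latency.
# Seek times below are illustrative assumptions, not measured values.

def hdd_iops(rpm: float, avg_seek_ms: float) -> float:
    """Estimate random IOPS as 1 / (avg rotational latency + avg seek time)."""
    rotational_latency_ms = (60_000 / rpm) / 2   # half a revolution, in ms
    service_time_ms = rotational_latency_ms + avg_seek_ms
    return 1000 / service_time_ms

for label, rpm, seek in [("7.2K RPM SATA", 7200, 8.5),
                         ("15K RPM SAS/FC", 15000, 3.5)]:
    print(f"{label}: ~{hdd_iops(rpm, seek):.0f} IOPS per drive")
```

Even the fastest 15K RPM drive tops out in the low hundreds of IOPS, which is why the processor-to-HDD gap shown in Figure 1 keeps widening.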
There are several workarounds virtualization admins attempt in order to overcome HDD limitations, including:
• Adding more HDDs to the storage system
• HDD short stroking
• Adding PCIe-connected Flash SSDs as cache in the physical virtualization servers
• Putting Flash SSDs into the storage system as cache, Tier 0 storage, or as complete HDD replacement
• Flash cache appliances in the storage network
Each of these workarounds ameliorates a portion of the HDD performance issues, but none necessarily improves the life of the virtualization admin. A look at each one shows why.
Adding More HDDs to the Storage System

Adding more HDDs is typically the first thing a storage administrator will do to try to improve virtualization storage performance, because it is the easiest. There is a noticeable, but ultimately limited, aggregate IOPS increase. Unfortunately, it takes a large number of drives to even partially close the processor-to-HDD gap. A storage system's HDD support is finite, which means that when that limit is reached and performance is still not enough, either more storage systems must be added (causing storage system sprawl) or the storage system must be replaced with a bigger, more expensive system. Both alternatives lead to problematic data migration and/or load balancing. More HDDs additionally mean higher capital expenses (CapEx) and much higher operating expenses (OpEx) in power, cooling, rack, floor space, and software licensing costs.
Fig 2: Lots of HDDs
Just because the workaround is easy does not mean it is effective or efficient; the results are marginal at best.
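To put "a large number of drives" in perspective, here is a small sketch, using the same per-drive estimates as above and an assumed aggregate IOPS target (the 50,000 IOPS figure is an assumption, not a number from this paper):

```python
# How many HDDs are needed to reach a target aggregate IOPS?
# Per-drive IOPS and the 50,000 IOPS target are illustrative assumptions.
import math

def drives_needed(target_iops: int, per_drive_iops: float) -> int:
    return math.ceil(target_iops / per_drive_iops)

target = 50_000  # assumed aggregate IOPS requirement for a VM farm
for label, per_drive in [("7.2K RPM SATA", 79), ("15K RPM SAS/FC", 182)]:
    print(f"{label}: ~{drives_needed(target, per_drive)} drives for {target:,} IOPS")
```

At roughly 275 15K drives (or more than 600 SATA drives) for a 50,000 IOPS target, the CapEx, OpEx, and storage-system sprawl described above arrive very quickly.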
HDD Short Stroking

HDD short stroking is another workaround storage admins regularly use to try to improve virtualization performance. Short stroking reduces latency by restricting data placement to the outer tracks of the platters, which results in faster seek times. The outer tracks deliver the lowest latencies because the head does not have to move as far when reading and writing data. Industry testing consistently shows HDD short stroking performance improvements ranging from 29 to 33%.

Fig 3: HDD Short Stroking

This workaround, too, has non-trivial drawbacks. It starts by throwing away as much as 67 to 90% of the usable HDD capacity. That wasted space still requires power, cooling, and rack/floor space, increasing the OpEx cost per usable TB, on top of the extra networking infrastructure required for the additional storage systems needed because the HDD limit is reached so much sooner. The cost and complexity of HDD short stroking make it an interim solution at best.
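A quick sketch makes the capacity penalty concrete. The drive capacity, price, and fraction of capacity sacrificed below are assumptions chosen to fall inside the 67 to 90% range cited above:

```python
# Effect of short stroking on cost per usable TB (illustrative numbers only).

drive_capacity_tb = 0.9       # 900GB 15K RPM drive
drive_price_usd = 400.0       # assumed street price per drive
capacity_discarded = 0.80     # short stroke to the outer 20% of each platter

usable_tb = drive_capacity_tb * (1 - capacity_discarded)
cost_per_raw_tb = drive_price_usd / drive_capacity_tb
cost_per_usable_tb = drive_price_usd / usable_tb

print(f"Full-stroke cost:  ${cost_per_raw_tb:,.0f} per TB")
print(f"Short-stroke cost: ${cost_per_usable_tb:,.0f} per usable TB "
      f"({cost_per_usable_tb / cost_per_raw_tb:.0f}x)")
```

Discarding 80% of each drive's capacity multiplies the effective cost per TB by 5X before power, cooling, and floor space are even counted.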
PCIe Flash SSDs as Cache in the Physical Virtualization Servers

Fig 4: PCIe Flash SSD

PCIe Flash SSDs in application servers, paired with caching software that moves the data to SAN storage, have become quite fashionable. They are most common in high performance compute (HPC) clusters. Having Flash SSDs in the physical server on the PCIe bus puts them very close to the application. This provides the lowest round-trip latency between the application and its very high performance storage. This approach would appear to solve several of the virtualization administrators' problems with control and performance. Regrettably, it does not solve all of them, and it creates several more, including:
• It places a heavy burden on the server's CPU, with utilization running as high as 20% (quantified in the sketch after this list). Virtualization oversubscribes physical server hardware to enable more virtual machines (VMs) or virtual desktops (VDs) to utilize the hardware effectively; taking a big chunk of those resources away dramatically reduces that consolidation capability.
• The caching software increases the CPU burden and further reduces each server's ability to consolidate.
• That caching software is essential for hosted VM guests because it must keep the VM image synchronized with the shared storage sitting across the storage network; otherwise, much of the virtual server's advanced functionality, including VM movement between machines, ceases to work properly. The caching software is not inexpensive, and it is licensed per server.
• The PCIe Flash SSD hardware and caching software do not help the virtualization administrator at all with the management, provisioning, or data protection of the shared external storage. A knowledgeable storage and SAN administrator is still required.
• This workaround is an expensive, non-shared solution that benefits only the VMs or VDs residing on that particular physical server. Other virtualized servers and desktops are out of luck unless they too purchase and install the same hardware and software.
• Each implementation of the PCIe Flash SSD and its accompanying caching software must be implemented, configured, and managed separately on an ongoing basis.
• Local caching does not alleviate storage network requirements or issues.
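The first bullet is easy to quantify. The sketch below assumes a hypothetical host sized for 30 VMs and uses the 20% CPU overhead figure cited above to show how server-side caching eats into consolidation:

```python
# Impact of server-side Flash caching CPU overhead on VM consolidation.
# The 30-VM baseline is an assumption; the 20% overhead comes from the text.

vms_per_host_baseline = 30     # VMs a host can run with no caching overhead
caching_cpu_overhead = 0.20    # CPU fraction consumed by PCIe Flash caching

vms_per_host_with_cache = int(vms_per_host_baseline * (1 - caching_cpu_overhead))
hosts_for_300_vms_before = -(-300 // vms_per_host_baseline)      # ceiling division
hosts_for_300_vms_after = -(-300 // vms_per_host_with_cache)

print(f"VMs per host drop from {vms_per_host_baseline} to {vms_per_host_with_cache}")
print(f"Hosts needed for 300 VMs: {hosts_for_300_vms_before} -> {hosts_for_300_vms_after}")
```

Losing a fifth of each host's CPU to caching can translate directly into buying additional physical servers.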
Flash SSDs in the Storage System as Cache, Tier 0 Storage, or as Complete HDD Replacement

Fig 5: Flash SSD

Putting Flash SSDs into the storage system will greatly speed up read IOPS. The Tier 0 storage and complete HDD replacement approaches will also increase write performance. This approach should definitely solve HDD limitations for virtualization; however, once again there are significant downsides.
When Flash SSDs are used as cache or Tier 0 (a.k.a. the hybrid storage approach), many storage systems place severe limits on the number of SSDs that can be supported, which limits cache or Tier 0 size. A smaller cache means that as data sets continue to grow there will be more cache misses redirected to HDDs, and subsequently much lower performance. Smaller Tier 0 sizes mean a reduced ability to hold larger data sets, resulting in more active data residing on much lower performance Tier 1 or Tier 2 storage (HDDs). And the Tier 0 approach typically requires quite costly storage tiering software.

Fig 6: Storage System w/several SSDs
Storage systems with 100% Flash SSDs would seem to alleviate the problems of caching and tiering. Yet they too have issues for the virtualization administrator. These systems will have significantly higher upfront CapEx (and TCO) than hybrid systems, and they will still require a knowledgeable storage administrator. They will not be under the control of the virtualization administrator and will do nothing to alleviate the storage and storage networking issues of setup, provisioning, management, data protection, and operations. Furthermore, these systems tend to move the performance bottleneck to the storage system controller or the storage network.

Fig 7: Storage System w/100% SSDs
Flash Cache Appliances in the Storage Network

The Flash cache appliance is similar to putting Flash cache in the storage system, except that it sits between the virtualization initiators and the storage system target. This conceptually allows the Flash cache appliance to provide caching for multiple storage systems. The reality is a bit different. Most Flash cache appliances are capacity constrained. As data sets continue to grow, so do cache misses, causing redirects to the backend storage systems and HDDs and reducing the Flash cache appliance's effectiveness.

Fig 8: Flash Cache Appliance

By sitting between the virtualization initiators and the target storage systems, it introduces another storage management layer, another variable in troubleshooting, and another system to tech refresh. Virtualization admins are seeking simpler control over storage, not more complexity, especially when troubleshooting performance issues.
VM LUN Oversubscription

One of the bigger complaints from virtualization administrators comes when they move a virtualization environment from the lab to production. Far too often they see a noticeable drop-off in performance. What's even worse is that the performance drop is intermittent and unpredictably mystifying. Troubleshooting is an exercise in frustration.

The issue is frequently attributed to VM LUN oversubscription. VM LUN oversubscription is an indirect consequence of the virtualization hypervisor virtualizing storage LUNs. That virtualization enables the virtualization administrator to slice and dice a physical LUN into multiple virtual LUNs, giving each VM what it perceives as its own LUN.
The storage system does not see or understand that multiple virtual machines are accessing that LUN. It sees a data stream coming from the same physical server. It does not parse that data stream. It cannot provide higher levels of service or prioritize one VM over another when accessing that LUN; it can't even see the different VMs. This means there can be multiple VMs attempting to hit the same HDDs at the same time, which creates contention on the drives. LUN IO is handled on a first-in, first-out (FIFO) basis. VM read/write IO requests are put in a queue. There are HDD queues and system queues. The HDD drive type (SAS/FC or SATA) determines how many IO requests can be queued: SATA drives can queue at most 32 commands, whereas SAS/FC drives can queue 256. When the queue buffers are saturated, the VM's SCSI protocol times out and bad things happen (crashed VMs).
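The queue arithmetic explains why the drop-off is intermittent. A minimal sketch, assuming a hypothetical number of outstanding IOs per VM (the 32 and 256 queue depths come from the text above):

```python
# When does the drive queue behind a shared LUN saturate?
# Outstanding IOs per VM is an illustrative assumption; queue depths are from the text.

QUEUE_DEPTH = {"SATA": 32, "SAS/FC": 256}

def vms_before_saturation(drive_type: str, outstanding_io_per_vm: int) -> int:
    return QUEUE_DEPTH[drive_type] // outstanding_io_per_vm

for drive in ("SATA", "SAS/FC"):
    for per_vm in (4, 8):  # bursty outstanding IOs per VM
        print(f"{drive}: {vms_before_saturation(drive, per_vm)} VMs "
              f"fill the queue at {per_vm} outstanding IOs each")
```

A burst from just a handful of VMs can fill a SATA queue, which is exactly the intermittent, hard-to-reproduce behavior described above.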
VMware has a quality of service (QoS) feature workaround that allows the VM administrator to prioritize different VMs based on IOPS or latency. Unfortunately, it robs Peter to pay Paul: it takes performance away from the lower-priority VMs and gives it to the higher-priority ones. It has limited usefulness because it only treats the symptoms of the problem, not the root cause.

Fig 9: LUN Oversubscription
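A minimal sketch of proportional, share-based IOPS allocation shows why prioritization only redistributes a fixed pie. The shares and the 5,000 IOPS ceiling are assumptions for illustration, not VMware's actual Storage I/O Control implementation:

```python
# Share-based IOPS allocation: raising one VM's priority lowers the others'.
# Shares and the total IOPS ceiling are illustrative assumptions.

def allocate_iops(total_iops: int, shares: dict[str, int]) -> dict[str, int]:
    total_shares = sum(shares.values())
    return {vm: total_iops * s // total_shares for vm, s in shares.items()}

total = 5_000  # the LUN's fixed IOPS budget; prioritization cannot increase it
print(allocate_iops(total, {"vm-a": 1000, "vm-b": 1000, "vm-c": 1000}))
print(allocate_iops(total, {"vm-a": 2000, "vm-b": 1000, "vm-c": 1000}))  # boost vm-a
```

vm-a gains roughly 800 IOPS only because vm-b and vm-c each lose about 400; the total never changes.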
Another workaround is to limit the number of VMs that can access a LUN. This is a manual workaround that must be set up at the beginning. Obviously, if each VM has its own LUN there is no contention.

The third workaround is to implement a storage solution that utilizes Flash SSDs, as previously discussed.

All of these workarounds require cooperation with the storage admin and the SAN admin, as well as the facilities admin. Provisioning and changes must be scheduled in advance, leaving the virtualization admin frustrated with partial solutions.
Storage Network Configuration and Oversubscription

Storage networks (SANs) are architected for oversubscription. SANs allow multiple physical servers to access the same target storage ports. This enabled more servers to utilize the same storage resources long before server virtualization became fashionable, and SAN oversubscription is a common practice today. But when virtual servers and desktops are added to an oversubscribed SAN, the SAN quickly becomes a performance bottleneck if it is not adjusted for the virtualization oversubscription. For example: a typical SAN oversubscription rate is 8:1, or 8 physical server initiator ports to 1 storage target port. If the average number of VMs per virtual server is 10, a 10:1 ratio, then the total VM-to-target-storage-port ratio is 80:1. Obviously, an 80:1 oversubscription ratio is going to have performance problems.
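The compounding effect is simple multiplication. Here is a small sketch of the example above; the ratios are the ones given in the text, while the port bandwidth figure is an assumption added for illustration:

```python
# Effective VM-to-storage-port oversubscription and per-VM share of a target port.
# The 8 Gb/s Fibre Channel port speed is an illustrative assumption.

initiators_per_target_port = 8   # typical SAN oversubscription (8:1)
vms_per_physical_server = 10     # average VM density (10:1)
target_port_gbps = 8.0           # assumed FC target port bandwidth

effective_ratio = initiators_per_target_port * vms_per_physical_server
per_vm_share_gbps = target_port_gbps / effective_ratio

print(f"Effective oversubscription: {effective_ratio}:1")
print(f"Worst-case bandwidth share per VM: {per_vm_share_gbps * 1000:.0f} Mb/s")
```

If every VM became active at once, each would be left with roughly 100 Mb/s of an 8 Gb/s port, which is why the SAN fabric must be re-architected for virtualization, not just the storage.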
Avoiding the SAN bottleneck requires coordination between the virtualization
administrator, SAN administrator, storage administrator, and facilities
administrator because each discipline is its own administrative silo. The
scheduled planning, coordination, and time (mostly the time) are the primary
reasons virtualization administrators are so frustrated with storage.
Fig 10: SAN Bottleneck
Organizational Administrative Silos
Most IT organizations have administrative silos for virtualization, applications, networks, storage, storage networks, facilities, etc., because the knowledge and experience requirements for each discipline are vast. Those that don't have silos expect their admins to have skills they often don't have.
As previously discussed, virtualization admins get exasperated by the constant planning, coordination, and time required to work with the other admins. But even when they can control everything, including the storage and storage networking, they quickly become discouraged, because they lack the knowledge, skills, and experience they need. It is very disheartening and begins to feel like a "no-win" scenario.

Fig 11: Wasted Time
The workaround most often deployed by virtualization administrators is NAS. VMware has NFS (Network File System) built into the vSphere kernel. Microsoft Hyper-V has CIFS (Common Internet File System) built into its kernel. NAS allows the virtualization admin to forget about LUNs, SANs, pathing, oversubscription, etc., because it is file-based storage and doesn't need any of those things. Just set up the file store for each virtual machine or desktop, mount it, and it's done. The trade-off to this common, easy workaround is a significant performance loss. Latency is an order of magnitude higher, and NFS or CIFS metadata can be as much as 90% of the NAS IOPS, causing serious CPU bottlenecks for application data. Virtualization admins love the simplicity and hate the performance of this workaround.

Fig 12: File Storage (NAS)
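A small sketch of the metadata overhead claim; the 25,000 IOPS appliance figure is an assumption, while the 90% metadata share is the figure cited above:

```python
# How much of a NAS appliance's IOPS budget is left for application data
# when file-protocol metadata dominates? The 25,000 IOPS figure is assumed.

nas_total_iops = 25_000
metadata_fraction = 0.90   # NFS/CIFS metadata share of total NAS IOPS (from the text)

data_iops = nas_total_iops * (1 - metadata_fraction)
print(f"IOPS actually serving application data: {data_iops:,.0f} of {nas_total_iops:,}")
```

Only a tenth of the appliance's work ends up moving application data, which is the performance trade-off virtualization admins accept for NAS simplicity.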
Data Protection, Business Continuity, and Disaster Recovery
Data protection, business continuity, and disaster recovery (DR) are essential to the vast majority of IT organizations, especially today in an increasingly regulatory climate. Yet these disciplines tend to be highly uncoordinated, with large amounts of duplicated functionality and effort. Virtual servers and desktops are snapped and replicated as well as backed up. Storage systems are snapped and replicated. Databases are mirrored (or continuously protected on every write) synchronously and/or asynchronously. Different administrators handle all of these data protection functions with little or no knowledge of what the other admins have done. This can, and often does, create chaos during recoveries and business continuity events. Data is recovered multiple times, wasting valuable time in duplicated effort, and the result is unacceptably long times to get back up and running. It is analogous to wearing two sets of underwear, pants, a belt, suspenders, and coveralls, while having dysentery. Not a pretty picture.

Fig 13: Different Admins & Tools
One workaround has been to use a third-party data protection software management catalog that consolidates reporting on most of the data protection systems in place. This helps deliver the information required to minimize duplication of effort for recoveries, but does nothing to minimize the duplication of effort in the protection itself. And it adds yet another layer of software requiring time-consuming management.
There Must Be a Better Way to Eliminate or Mitigate Storage Barriers Plaguing Virtualization Performance
The key to solving these problems is the storage. An effective solution must offer the virtualization
administrator:
• Storage control without requiring any storage expertise;
• The consistent virtual machine and virtual desktop performance required, without high cost or the need for professional services;
• Intuitive hypervisor-integrated operations and management;
• Eradication of knotty performance problems such as LUN or SAN oversubscription, file metadata IO latencies, and iSCSI processing latencies;
• Elimination, or at minimum mitigation, of data protection/business continuity/DR duplication;
• Cost effectiveness.
Astute Networks recognized the problems and is attacking them with a series of virtualization optimized
storage appliances specifically architected to do just this.
Astute Networks ViSX G4 Flash Virtual Machine Optimized Storage Appliances

Fig 14: ViSX G4

Astute Networks purpose-built the ViSX G4 storage appliances from the ground up to specifically address each and every one of the storage barriers inhibiting virtualization and mission-critical application performance. It started with a clean sheet of paper and a fresh look at the problems. Solving only one aspect of the problems shifts the bottleneck elsewhere, leaving both users and admins frustrated. The problems must be solved holistically, as a package, and that is what the ViSX does.
No HDD 100% Flash SSD Appliances

First, it eliminates all HDDs (i.e., there are no spinning disks). Each ViSX G4 is 100% Flash SSD (no LUN oversubscription issues). The Flash itself is high performance eMLC, which offers performance and write-cycle life similar to SLC Flash but at a cost much closer to MLC. The ViSX G4 then combines those fast eMLC Flash SSDs with its patented DataPump™ Engine (ASIC), which offloads and accelerates TCP/IP and iSCSI protocol processing. Utilizing 1 and 10G Ethernet iSCSI eliminates most SAN issues as well as oversubscription. But it is the architecture and performance of the DataPump Engine that eliminates most of the iSCSI performance issues.
Unprecedented IOPS and Throughput

The DataPump Engine eliminates both network and storage IO bottlenecks. The DataPump Engine ASIC eliminates the iSCSI network IO bottleneck by processing TCP/IP and iSCSI packets and commands much faster than can be accomplished with software stacks running on commodity processors. The result is unequalled, extremely low round-trip network latencies on 1G or 10G standard Ethernet. That offloading, plus a highly optimized software suite that marshals data to a high performance RAID controller, has the additional benefit of allowing the CPU to focus on serving up storage for reads and writes. This combination maximizes sustainable Flash IOPS and throughput performance and enables unprecedented IOPS per dollar. It is Astute's holistic design that permits the ViSX G4 to outperform conventional Ethernet-deployed Flash or hybrid storage systems by as much as 5 to 10X.

Fig 15: DataPump Engine
Lower Than Expected TCO

Cost always seems to be a factor with 100% SSD-based storage systems. Astute makes SSD cost a non-issue with ViSX G4 appliances by utilizing both eMLC and a very advanced primary data deduplication algorithm. That algorithm offers very high deduplication rates that greatly increase effective storage capacity and, unlike nearly all other storage solutions (Flash, hybrid, or disk-based), has zero impact on performance. These factors put the ViSX G4 appliances' upfront CapEx costs on a par with hybrid or HDD-based storage systems. On a total cost of ownership basis, the ViSX G4 is typically considerably less because of the very low amounts of power and cooling required, as well as minimal storage software costs. Which leads to that other virtualization administrator frustration: control.
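A short sketch of how primary deduplication changes effective capacity and cost per usable GB; the raw capacity, price, and deduplication ratio are illustrative assumptions, since the paper does not publish specific figures:

```python
# Effect of primary data deduplication on effective capacity and $/GB.
# Raw capacity, appliance price, and the dedup ratio are assumed for illustration.

raw_flash_gb = 2_000
appliance_price_usd = 30_000.0
dedup_ratio = 4.0   # assumed average reduction for VM images (many identical blocks)

effective_gb = raw_flash_gb * dedup_ratio
print(f"Raw $/GB:       ${appliance_price_usd / raw_flash_gb:.2f}")
print(f"Effective $/GB: ${appliance_price_usd / effective_gb:.2f} "
      f"at a {dedup_ratio:.0f}:1 dedup ratio")
```

Because VM and VDI images share large numbers of identical blocks, even a modest deduplication ratio moves all-flash cost per usable GB toward HDD territory, which is the basis of the TCO claim above.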
Purpose Built for the Virtualization Administrator

The ViSX VM storage appliances are designed primarily for virtualization administrators. Right out of the box, each appliance has pre-configured LUNs and RAID, so no expertise is required to provision storage for virtual machines or desktops. However, if the virtualization administrator has some storage system knowledge and wants to make changes, they can. Each ViSX appliance is intuitive and manageable via a VMware vCenter plug-in for vSphere environments, or via its FlashWRX™ GUI for other hypervisor platforms such as Microsoft Hyper-V, Citrix XenServer, and RHEV, making it feel like a virtualization feature.
Astute then did something quite smart by not duplicating the data protection, business continuity, and DR
capabilities within the hypervisors. Each ViSX appliance takes advantage of the virtualization software’s
snapshots, backup, replication, HA, data migration, and DR. No duplication of effort in protecting or
recovering data, no wasted cost on duplicated software, no wasted time recovering data.
And the ViSX is not a "rip-out-and-replace" storage solution. It is complementary to the current SAN and NAS storage ecosystem.
Enterprise Class Reliability
It all sounds great, but in the end this is still storage. All storage systems, just like physicians, must have
the mantra of: “first do no harm.” What are the system reliability, data durability, data resilience, and
availability assurances?
The ViSX G4 appliances are the only virtualization-optimized storage with four levels of reliability built into their DNA:
1. On-chip Flash ECC – Production-hardened NAND (Flash) chip-based error detection and correction.
2. eMLC – Enterprise-grade multi-level cell flash combines SLC-like reliability (extending the life of flash-based modules to 10 years or more) and SLC-like high performance with the low cost of MLC.
3. SSD RAID Levels – Multiple RAID choices, including 0, 1, 10, 5, and 6, are supported, with extremely fast rebuild times due to the solid state architecture.
4. Flash Module Chip Redundancy – ViSX goes beyond traditional ECC data protection by using redundant flash chips to improve both overall reliability and write performance by a factor of 100 over other devices.
The result is an Enterprise-class storage appliance, purpose-built for the virtualization administrator, that eliminates the common virtualization and mission-critical application performance barriers. And it does all of this without the sticker shock.
Fig 16: ViSX G4
Conclusion
There are many storage performance barriers facing virtualization admins. Some are architectural. Some are organizational. Some are absolutely perplexing. All are extraordinarily aggravating. Astute Networks, with its family of ViSX appliances, removes those barriers at an incredible IOPS/$.
For more detailed information, please contact:
[email protected] or go to http://www.astutenetworks.com
About the author: Marc Staimer is the founder, senior analyst, and CDS of Dragon Slayer Consulting in Beaverton, OR. The
consulting practice of nearly 14 years has focused in the areas of strategic planning, product development, and market
development. With over 32 years of experience in infrastructure, storage, server, software, and virtualization, he’s considered one of
the industry’s leading experts. Marc can be reached at [email protected].
Dragon Slayer Consulting © 2012 All Rights Reserved • Q3 2012