EMC SCALEIO OPERATION OVERVIEW
Ensuring Non-disruptive Operation and Upgrade
ABSTRACT
This white paper reviews the challenges organizations face as they deal with the
growing need for “always-on” levels of service availability. It illustrates how the EMC
ScaleIO architecture provides the tools needed to address these challenges.
March 2015
EMC WHITE PAPER
To learn more about how EMC products, services, and solutions can help solve your business and IT challenges, contact your local
representative or authorized reseller, visit www.emc.com, or explore and compare products in the EMC Store
Copyright © 2015 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without
notice.
The information in this publication is provided “as is.” EMC Corporation makes no representations or warranties of any kind with
respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a
particular purpose.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
All other trademarks used herein are the property of their respective owners.
Part Number H14036
TABLE OF CONTENTS
EMC SCALEIO INTRODUCTION .................................................................. 4
ARCHITECTURE OVERVIEW ...................................................................... 4
SCALEIO DATA CLIENT (SDC) .............................................................................. 4
SCALEIO DATA SERVER (SDS) ............................................................................. 4
METADATA MANAGER (MDM) ............................................................................... 4
NON-DISRUPTIVE OPERATION ................................................................. 5
DATA STORAGE AND ACCESS .............................................................................. 5
DISTRIBUTED VOLUME ............................................................................................... 5
TWO-COPY MESH MIRRORING ..................................................................................... 6
ADDING NEW SDS NODES AND AUTO-REBALANCE ................................................ 6
REMOVING SDS NODES AND FAULT HANDLING ..................................................... 6
PROTECTION DOMAINS .............................................................................................. 7
FAULT SETS .............................................................................................................. 7
FORWARD AND BACKWARD REBUILD ........................................................................... 7
ADDING AND REMOVING STORAGE MEDIA ........................................................... 8
MDM FAILURE .................................................................................................... 8
ADDING AND REMOVING SDC ............................................................................. 8
NON-DISRUPTIVE UPGRADE ..................................................................... 9
AVAILABLE TOOLS ............................................................................................. 9
UPGRADE PREPARATION ..................................................................................... 9
UPGRADE PROCESS............................................................................................ 9
CONCLUSION .......................................................................................... 10
EMC SCALEIO INTRODUCTION
Over the past few decades, organizations have faced similar pain points in managing growth, operational efficiency, service levels, and cost. In a typical environment, the IT department operates dedicated SAN environments that are limited in how far they can scale. Acquiring new hardware and performing data migrations takes considerable effort. There is no way to pool the overall available resources so that users can share I/O operations or capacity; in some cases, one application may run out of capacity while resources remain under-utilized elsewhere. Such environments become extremely hard to maintain as they grow more complex over time, and even more so when the organization is under pressure to reduce cost.
Few alternative solutions can deliver capabilities similar to EMC ScaleIO's. Many claim to provide scalability and high performance, but they are often black-box solutions that tend to be expensive to maintain over time and difficult to scale up or down as needed. There are also open source alternatives; however, they require significant manual labor and in-house developer expertise to maintain and tune during normal operation. Most cannot take full advantage of modern media such as SSDs and PCIe/NVMe persistent storage because their software stacks cannot keep up with the performance of these devices.
ScaleIO is an industry-leading technology that offers what competitors cannot: hyper-convergence, scalability, elasticity, and performance. The software converges storage and compute resources into a single architectural layer that resides on the application server. The architecture scales out from as few as three servers to thousands simply by adding servers (nodes) to the environment. This is done elastically; capacity and compute resources can be increased or decreased "on the fly" without impact to users or applications. ScaleIO also has self-healing capabilities that enable it to recover easily from server or disk failures. ScaleIO aggregates the IOPS of all participating servers into one high-performing virtual SAN, with all servers servicing I/O requests using massively parallel processing. In addition, ScaleIO is hardware agnostic, so customers are not limited in their choice of hardware. In a VMware environment, ScaleIO even supports standard VMware block storage features such as vMotion and DRS.
This paper discusses key characteristics of ScaleIO with regard to resiliency and flexibility, both of which are required for a solution that supports an "always-on" infrastructure. These characteristics apply in two areas:

• Non-disruptive Operation (NDO): Any node in a cluster may go down at any time, whether through unplanned failure or graceful maintenance, so NDO requires a high tolerance of such events. Data migrations are normally very time consuming and costly for most data centers; ScaleIO's auto-rebalancing and rebuild capabilities allow this process to happen seamlessly.

• Non-disruptive Upgrade (NDU): An upgrade of the ScaleIO software without interruption to the storage system service or data access. ScaleIO version 1.3x and later supports rolling upgrades, so no downtime is required during the maintenance process.
ARCHITECTURE OVERVIEW
ScaleIO comprises three software components. Understanding these components is critical for ensuring non-disruptive operation and for troubleshooting cluster performance.
SCALEIO DATA CLIENT (SDC)
The SDC is a lightweight device driver that exposes ScaleIO volumes as block devices to the operating system of the server it is installed on. The SDC should be installed on every server that needs to consume ScaleIO storage. All I/O requests go through the SDC, which communicates with the SDSs over TCP/IP.
SCALEIO DATA SERVER (SDS)
The SDS contributes the local storage of the node it is installed on. It manages the capacity of a single server and acts as a backend for data access. Many SDS nodes together contribute storage capacity to the ScaleIO cluster, which aggregates not only the application servers' storage capacity but also their performance. The SDS can use Flash, SSD, or HDD devices for storage and RAID cache or RAM for caching.
METADATA MANAGER (MDM)
The MDM serves as the monitoring and configuration agent. It is important to understand that the MDM is not part of the data path: when an SDC needs to access data, it goes directly to the specific SDS that holds it. The MDM is used mainly for management, which includes migration, rebuilds, and all system-related functions. To support high availability, two instances of the MDM can run on different servers. An MDM may run on servers that also run SDCs and/or SDSs.
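The split of responsibilities described above can be sketched in a few lines of Python. The class and method names below are illustrative only and are not a ScaleIO API; the point is that the SDC caches the chunk map and sends I/O straight to the owning SDS, while the MDM is consulted only when the topology changes.

class MDM:
    """Holds the cluster mapping; consulted only when topology changes."""
    def __init__(self, chunk_map):
        self.chunk_map = chunk_map  # chunk_id -> (primary_sds, mirror_sds)

    def get_map(self):
        return dict(self.chunk_map)


class SDC:
    """Client-side driver: routes I/O straight to the owning SDS."""
    def __init__(self, mdm):
        self.mapping = mdm.get_map()  # cached once; refreshed only on topology change

    def read(self, chunk_id):
        primary, _mirror = self.mapping[chunk_id]
        return f"read chunk {chunk_id} directly from {primary}"

    def write(self, chunk_id, data):
        primary, mirror = self.mapping[chunk_id]
        # Both copies are written, but the MDM is not contacted for the I/O itself.
        return [f"write {len(data)} bytes to {primary}",
                f"mirror {len(data)} bytes to {mirror}"]


mdm = MDM({0: ("sds-1", "sds-2"), 1: ("sds-3", "sds-1")})
sdc = SDC(mdm)
print(sdc.read(0))
print(sdc.write(1, b"payload"))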
NON-DISRUPTIVE OPERATION
The ScaleIO architecture remains flexible across the typical operational lifecycle, including proactive maintenance and unplanned outages. The following summarizes the major operations and how ScaleIO handles them to ensure uptime and availability.

• Write Data: Data chunks are written directly to multiple nodes; each chunk has a primary and a secondary copy.

• Read Data: Data access goes directly to the SDS that holds the primary chunks. The MDM is not contacted, which avoids both a bottleneck and a single point of failure.

• Add SDS Node: ScaleIO triggers an auto-rebalancing process to redistribute all chunks in the cluster. This also enables linear scaling of both capacity and I/O performance.

• Remove SDS Node (gracefully or ungracefully): A graceful SDS removal via the CLI triggers rebalancing; an ungraceful removal or failure starts the rebuild process. Existing I/O operations are not disrupted.

• Add/Remove Storage Media: ScaleIO follows a process similar to adding or removing an SDS, treating the event as an SDS reconfiguration that triggers rebalancing or rebuild.

• MDM Failure: A single MDM failure does not affect I/O operations because the MDM is not in the data path and is clustered for high availability; SDCs communicate directly with SDSs for I/O requests.

• SDC Reboot or Network Issues: ScaleIO volumes are shared volumes by default, and the CLI is available to troubleshoot connectivity with the primary MDM.
DATA STORAGE AND ACCESS
This section discusses the mechanics of ScaleIO's storage engine. Its resilient design provides fault tolerance while still optimizing performance. This is possible because data is written as multiple chunks across many nodes using a distributed volume architecture and a two-copy mesh mirroring mechanism.
DISTRIBUTED VOLUME
When there is a write request, the data is broken into 1 MB chunks that are spread randomly and evenly throughout the cluster. The local SDC communicates with the relevant SDS to perform this operation. The MDM is not involved in data access unless the cluster topology changes, in which case it provides the updated mapping to the local SDC. ScaleIO uses a sophisticated algorithm to evaluate cluster balance and ensure randomness.
Note that ScaleIO is very efficient in terms of managing network bandwidth. If an application writes out 4KB of data, only 4KB are
written. The same goes for read operations—only the required data is read. This scheme is designed to maximize protection and
optimize performance.
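As a rough illustration of this distribution scheme (not ScaleIO's actual placement algorithm, which is proprietary), the following Python sketch splits a volume into 1 MB chunks and spreads them pseudo-randomly across SDS nodes; all names and numbers are hypothetical.

import random

CHUNK_SIZE = 1 * 1024 * 1024  # 1 MB chunks, as described above

def place_chunks(volume_size_bytes, sds_nodes, seed=42):
    """Assign each chunk of a volume to an SDS node, randomly and roughly evenly."""
    rng = random.Random(seed)
    num_chunks = -(-volume_size_bytes // CHUNK_SIZE)  # ceiling division
    return {chunk_id: rng.choice(sds_nodes) for chunk_id in range(num_chunks)}

layout = place_chunks(8 * CHUNK_SIZE, ["sds-1", "sds-2", "sds-3"])
for chunk_id, node in layout.items():
    print(f"chunk {chunk_id} -> {node}")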
TWO-COPY MESH MIRRORING
Mirroring is applied on data writes to ensure there is a second copy of every primary chunk. In the example in Figure 1, a volume is broken down into multiple chunks across a cluster of three SDS nodes. ScaleIO must place the mirror of each chunk (yellow) on a node other than the one holding its primary copy (blue). In a failure scenario, the volume data remains protected because ScaleIO can use the redundant copies to reconstruct the volume. Across the cluster, 50% of the chunks are primaries and 50% are mirrors. Note that only write operations require 2x data transmission over the network due to mirroring; for a read operation, the SDC retrieves the data chunks directly from the primary SDS.
Figure 1. Mesh mirroring example with 1 volume and 3 SDS nodes
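A minimal sketch of the mirroring constraint, assuming a simple random placement (the real engine is more sophisticated): for every chunk, the mirror copy must land on an SDS node other than the one holding the primary.

import random

def mirror_placement(chunk_ids, sds_nodes, seed=1):
    """Pick a primary and a mirror node for each chunk; the two are never the same."""
    rng = random.Random(seed)
    layout = {}
    for chunk_id in chunk_ids:
        primary = rng.choice(sds_nodes)
        mirror = rng.choice([n for n in sds_nodes if n != primary])  # different node
        layout[chunk_id] = (primary, mirror)
    return layout

for chunk_id, (p, m) in mirror_placement(range(6), ["sds-1", "sds-2", "sds-3"]).items():
    print(f"chunk {chunk_id}: primary={p} mirror={m}")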
ADDING NEW SDS NODES AND AUTO-REBALANCE
ScaleIO is very flexible when it comes to scaling. There is no need to invest in a costly data migration effort, because capacity can be added with no downtime and minimal planning; this is a major factor in reducing operational cost and growth complexity. To increase capacity in the cluster, new SDS nodes are added. The system reacts dynamically to addition events and recalculates its rebalancing plan. This happens automatically, with minimal data movement. Unlike traditional approaches in which only new volumes benefit from new capacity, the ScaleIO cluster rearranges the existing data between the SDS servers to optimize performance.
In Figure 2 below, when the storage administrator adds a new node to the cluster, data chunks from the existing nodes automatically migrate so that the data is distributed evenly across the new node as well.
Figure 2. Auto-rebalancing process when adding a new SDS node
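The rebalancing idea can be sketched as follows. This is a simplified illustration, not the ScaleIO algorithm: only enough chunks move to the new node to even out the distribution, while everything else stays where it is.

from collections import Counter

def rebalance(placement, new_node):
    """Plan the minimal set of chunk moves onto a newly added node."""
    nodes = sorted(set(placement.values()) | {new_node})
    target = len(placement) // len(nodes)              # rough per-node target
    counts = Counter(placement.values())
    moves = {}
    for chunk, node in placement.items():
        if counts[new_node] >= target:
            break                                      # new node has enough chunks
        if counts[node] > target:                      # donate only from overloaded nodes
            moves[chunk] = (node, new_node)
            counts[node] -= 1
            counts[new_node] += 1
    return moves

before = {i: f"sds-{(i % 3) + 1}" for i in range(9)}   # 3 chunks on each of 3 nodes
print(rebalance(before, "sds-4"))                      # only a few chunks migrate to sds-4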
REMOVING SDS NODES AND FAULT HANDLING
Protecting the SDS is important for ensuring high availability and I/O performance. SDS nodes can go down either through planned maintenance or removal, or through unplanned outages. In a planned scenario there is flexibility to ensure the cluster has enough capacity for redistribution, and ways to optimize performance by minimizing the I/O required for rebalancing activities. Removing an SDS node can be done via the CLI, the ScaleIO GUI, or the REST API, and the data always remains in a protected state. In a failure event, however, ScaleIO triggers a rebuild because the data is in a degraded protection mode. It is important to understand some key ScaleIO concepts regarding fault handling of an SDS node:

• Protection Domains
• Fault Sets
• Forward and Backward Rebuild
PROTECTION DOMAINS
Protection Domains are subsets of SDS nodes. The administrator can divide the SDSs into multiple Protection Domains of various sizes and assign volumes to domains. Both the primary and the mirror data chunks of a particular volume are stored on SDS nodes that belong to the same Protection Domain. If two SDS nodes fail but they belong to different Protection Domains, data remains available, because each Protection Domain still holds the mirror copies on its surviving SDS nodes. Such isolation increases the resilience of the overall system.
In addition, Protection Domains can be used to separate volumes for performance planning, for example by placing heavily accessed volumes in "less busy" domains or dedicating a particular domain to an application. In a Service Provider environment, this is an important feature for data location and partitioning in multi-tenancy deployments, allowing tenants to be segregated efficiently and securely. Finally, administrators may also use Protection Domains to accommodate different network constraints within the system.
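A small sketch of the Protection Domain rule under simplified assumptions: every volume is assigned to a domain, and both copies of its chunks are placed only on SDS nodes inside that domain. The domain and volume names here are made up.

import random

protection_domains = {                      # domain -> SDS nodes it contains
    "pd-finance": ["sds-1", "sds-2", "sds-3"],
    "pd-web":     ["sds-4", "sds-5", "sds-6"],
}
volume_to_domain = {"vol-erp": "pd-finance", "vol-cms": "pd-web"}

def place_in_domain(volume, chunk_id):
    """Place primary and mirror copies only on nodes of the volume's own domain."""
    nodes = protection_domains[volume_to_domain[volume]]
    rng = random.Random(chunk_id)
    primary = rng.choice(nodes)
    mirror = rng.choice([n for n in nodes if n != primary])
    return primary, mirror

print("vol-erp chunk 0:", place_in_domain("vol-erp", 0))   # stays inside pd-finance
print("vol-cms chunk 0:", place_in_domain("vol-cms", 0))   # stays inside pd-web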
FAULT SETS
Within a Protection Domain, the administrator can set up multiple Fault Sets. A Fault Set is a logical grouping of SDSs that ensures mirroring occurs outside of that grouping. Fault Sets can be defined based on various risk factors; for example, the administrator can treat an entire rack as a Fault Set. If that rack goes down, the mirror chunks in other racks (outside the Fault Set) are used to rebuild, so Fault Sets act as a "rack-level" high availability feature. This design can survive multiple simultaneous host failures without data loss.
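A sketch of Fault Set-aware mirroring under the rack example above (names are illustrative): the mirror copy of a chunk may only be placed on SDS nodes outside the fault set that holds its primary.

fault_sets = {                       # fault set (e.g. a rack) -> SDS nodes in it
    "rack-A": ["sds-1", "sds-2"],
    "rack-B": ["sds-3", "sds-4"],
    "rack-C": ["sds-5", "sds-6"],
}
node_to_set = {node: fs for fs, nodes in fault_sets.items() for node in nodes}

def valid_mirror_targets(primary_node):
    """Only nodes outside the primary's fault set may hold the mirror copy."""
    primary_set = node_to_set[primary_node]
    return [n for n, fs in node_to_set.items() if fs != primary_set]

print(valid_mirror_targets("sds-1"))   # nodes in rack-B and rack-C only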
FORWARD AND BACKWARD REBUILD
The rebuild process is triggered when there is a change in SDS topology due to the unexpected loss of a storage device or an entire node. There are two cases:

• Forward Rebuild: Once a disk or an SDS node fails, the rebuild load is balanced across all the disks and nodes of the Protection Domain. This is a many-to-many process.

• Backward Rebuild: If the same failed disk or node returns to an operational state during the forward rebuild, ScaleIO makes a smart, selective transition to a "backward" rebuild (re-silvering).
There is very little performance penalty during rebuild, because the ScaleIO algorithm optimizes this process and gives the administrator flexibility and control:

• The administrator can set policies governing rebuild I/O, such as concurrency, bandwidth, and priority relative to in-flight application I/O.

• Unlike traditional solutions, which tend to treat a node coming back online as a "blank" node, ScaleIO's intelligent engine evaluates whether the node's data chunks are out of date and decides whether to continue the forward rebuild or to reuse the chunks from that node. A shorter outage therefore results in a smaller performance penalty.
The example in Figure 3 demonstrates how rebuild works when a node fails. Within a few seconds of failure detection, the mirrored chunks outside that node are copied to other nodes in a many-to-many operation (chunk A in SDS 2 to SDS 3, and chunk C in SDS 3 to SDS N). No two copies of the same chunk are allowed to reside on the same server. Note that while this operation is in progress, all the data remains accessible to applications; the local SDC communicates with the mirror SDS to retrieve data, ensuring no outage or delay.
Figure 3. Rebuild process when SDS 1 fails
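The forward/backward rebuild decision can be sketched as follows. This is an assumption-level illustration, where chunk "versions" stand in for however ScaleIO actually tracks staleness: chunks that are still current on the returning node are reused, while stale or missing chunks continue to be rebuilt forward from their mirrors.

def plan_rebuild(returning_node_chunks, cluster_chunk_versions):
    """returning_node_chunks: chunk_id -> version held by the node that came back.
    cluster_chunk_versions: chunk_id -> latest version in the cluster."""
    backward, forward = [], []
    for chunk, latest in cluster_chunk_versions.items():
        held = returning_node_chunks.get(chunk)
        if held == latest:
            backward.append(chunk)   # chunk is still valid: reuse it, nothing to copy
        else:
            forward.append(chunk)    # stale or missing: rebuild from the mirror copy
    return backward, forward

reuse, rebuild = plan_rebuild({0: 7, 1: 5}, {0: 7, 1: 6, 2: 3})
print("reusable chunks:", reuse)     # [0]
print("needs rebuild:", rebuild)     # [1, 2]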
ADDING AND REMOVING STORAGE MEDIA
In these operations, the number of SDS nodes does not change. The media added to (or removed from) one of the SDS nodes can be HDDs, SSDs, or PCIe flash cards. ScaleIO treats the event as an SDS reconfiguration and redistributes the data accordingly and seamlessly; the administrator does not need to redistribute data manually when a storage device is taken out. Note that if a storage device is removed for planned maintenance, there must be enough spare capacity for its data to be "evacuated"; otherwise, ScaleIO will not allow the removal.
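The spare-capacity condition amounts to a simple gate, sketched below with made-up numbers; this is only a conceptual illustration, not the actual capacity accounting ScaleIO performs.

def can_remove_device(device_used_gb, cluster_spare_gb):
    """Removal is allowed only if the remaining spare capacity can absorb the data."""
    return cluster_spare_gb >= device_used_gb

print(can_remove_device(device_used_gb=800, cluster_spare_gb=1200))  # True: removal allowed
print(can_remove_device(device_used_gb=800, cluster_spare_gb=500))   # False: removal blocked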
MDM FAILURE
It is important not to misunderstand the purpose of the MDM. It does not keep the metadata or index of the actual data, nor does it perform data operations. The MDM exists for system maintenance, management operations, monitoring the state of all components, and calculating system reconfigurations to optimize the available resources. The MDM is not on the ScaleIO data path. This design prevents a bottleneck in which multiple clients would have to consult a single point of failure to locate data chunks. The resource consumption of the MDM cluster is minimal and does not impact overall cluster performance or bandwidth.
A ScaleIO system requires three management nodes (Primary MDM, Secondary MDM, and Tie-Breaker) for management redundancy. In a failure event, ScaleIO provides automated failover and also allows manual intervention, depending on the scenario (a small sketch follows the list below):

• If the Primary MDM goes down, the system fails over to the Secondary MDM. When the failed MDM becomes operational again, the administrator can add a new MDM IP address via the CLI.

• If the Secondary MDM goes down, there is no impact on management traffic, as the Primary MDM continues to handle management functions.

• If the Tie-Breaker goes down, the system continues to operate normally, since the Tie-Breaker exists only for HA and conflict resolution.
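As a rough mental model, and an assumption on our part rather than documented protocol detail, the three-member management cluster stays available as long as a majority of its members are up, which is why any single failure above is tolerated:

MDM_CLUSTER = {"primary-mdm", "secondary-mdm", "tie-breaker"}

def management_available(alive):
    """Assumed rule of thumb: a majority of the 3-node management cluster must be alive."""
    return len(MDM_CLUSTER & set(alive)) >= 2

print(management_available({"secondary-mdm", "tie-breaker"}))  # True: primary failed over
print(management_available({"primary-mdm", "tie-breaker"}))    # True: secondary is down
print(management_available({"tie-breaker"}))                   # False: both MDMs lost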
ADDING AND REMOVING SDC
Since the SDC is simply a client, or consumer, of storage, ScaleIO only exposes volumes for I/O operations on that node. It is possible to designate which SDCs can access a given volume: access can be restricted or shared among multiple SDCs, provided the operating system is configured to support clustered access. To make troubleshooting easy, the ScaleIO CLI offers several options, such as those listed below (a toy sketch of these checks follows the list):

• Check whether the volume is mapped to any of the SDC servers
• Determine whether the SDC is installed
• Determine whether the SDC is connected to an MDM
• Scan for new volumes
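These checks can be pictured with a toy data model. This is not the ScaleIO CLI and all names are invented; it only shows what "volume mapped", "SDC installed", and "SDC connected" mean as queries.

# Toy inventory, purely illustrative
volume_mappings = {"vol-01": {"sdc-10", "sdc-11"}, "vol-02": set()}   # volume -> mapped SDCs
installed_sdcs  = {"sdc-10", "sdc-11", "sdc-12"}
connected_sdcs  = {"sdc-10", "sdc-12"}                                # SDCs the MDM currently sees

def volume_is_mapped(volume):
    return bool(volume_mappings.get(volume))

def sdc_status(sdc):
    return {"installed": sdc in installed_sdcs, "connected_to_mdm": sdc in connected_sdcs}

print(volume_is_mapped("vol-01"), volume_is_mapped("vol-02"))   # True False
print(sdc_status("sdc-11"))                                     # installed but not connected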
NON-DISRUPTIVE UPGRADE
AVAILABLE TOOLS
It is recommended to use the existing tools for automated deployment and upgrade unless a manual method is absolutely required. These tools help the administrator ensure that the upgrade process is non-disruptive:

• ScaleIO Gateway Installation Manager (IM) validates environment compatibility before performing installation, upgrade, extend, and uninstall operations.

• ScaleIO Light Installation Agent (LIA) is installed on all ScaleIO nodes and creates trust with the Installation Manager to facilitate its operation.

• ScaleIO vSphere Plug-In installs ScaleIO in an ESX environment and also installs the ScaleIO Gateway, which can be used for "Get Info" and upgrade operations.
UPGRADE PREPARATION
Starting from ScaleIO version 1.30, all components can be upgraded during normal operation with no downtime. There are some considerations to ensure a smooth upgrade process:

• ESX servers should be in a cluster with proper HA settings configured. This allows the administrator to reboot an ESX server after upgrading the SDC component without impacting virtual machine availability.

• Although the ScaleIO vSphere web plug-in is not required, it is highly recommended for ensuring NDU.

• The IM and LIA must be upgraded to the latest version before they can trigger the component upgrade. If they were not installed previously (for example, after a manual installation), they should be installed.

• Before the upgrade, the IM checks the system for degraded capacity and will not proceed if any is found.

• In some cases the organization may require a manual upgrade, for example due to internal policy or a previous failure during an automated upgrade. The administrator must then verify in the ScaleIO GUI that (a sketch of the equivalent checks follows this list):
  o No rebuild or rebalance is running in the background
  o No degraded capacity exists
  o No SDS or SDC is disconnected
  o No SDS device is in an error state
  o The MDM cluster, including the Tie-Breaker, is not in degraded mode
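The manual pre-upgrade verification can be summarized as a single gate, sketched below. The field names in the state dictionary are illustrative, not a real ScaleIO API.

def ready_for_manual_upgrade(state):
    """Evaluate the manual pre-upgrade checklist; returns (ok, list of failed checks)."""
    checks = {
        "no rebuild or rebalance running": not state["rebuild_or_rebalance_running"],
        "no degraded capacity":            not state["degraded_capacity"],
        "no SDS or SDC disconnected":      not state["disconnected_components"],
        "no SDS device in error state":    not state["device_errors"],
        "MDM cluster not degraded":        not state["mdm_cluster_degraded"],
    }
    failed = [name for name, ok in checks.items() if not ok]
    return len(failed) == 0, failed

ok, problems = ready_for_manual_upgrade({
    "rebuild_or_rebalance_running": False,
    "degraded_capacity": False,
    "disconnected_components": False,
    "device_errors": True,               # one device reports an error state
    "mdm_cluster_degraded": False,
})
print(ok, problems)                      # False ['no SDS device in error state']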
UPGRADE PROCESS
At a high level, the upgrade process will happen step-by-step as described in Figure 4 below.
Figure 4. Upgrade process: ScaleIO vSphere Web Plug-in (ESXi environments) → Installation Manager → Light Installation Agent → Secondary MDM → Primary MDM → Tie-Breaker → SDS → SDC
The vSphere web plug-in is only needed in VMware ESXi environments; it also orchestrates the update of the IM and LIA components. For Windows and Linux, the IM alone is used to trigger the upgrade. There are two scenarios in an ESX environment:
1. If the plug-in was never installed previously, it can be triggered from outside the cluster. The administrator needs to make sure that the host has connectivity to vCenter and that its credentials are available. It is also possible to set up the ScaleIO Gateway first and point to its URL, because the vSphere Web Client needs to access a web server to retrieve the plug-in.
2. If an older version of the plug-in is installed, it must be updated. The administrator needs to unregister it first. Removing the older plug-in does not affect the existing ScaleIO system in the cluster, because the plug-in is used only to trigger installation and upgrade from within the vSphere Web Client. After the old plug-in is removed, the administrator can register the new plug-in and upload the OVA template.
Upgrading the IM and LIA has no operational impact because these components are used neither for management nor for data transmission. The entire process can be executed from the IM GUI under the Maintenance Operation screen. The IM communicates with all ScaleIO nodes via the LIA components to retrieve the system topology and performs the upgrade in the order shown in Figure 4:
1. Secondary MDM: The binary is replaced and the MDM process is restarted. The host is not rebooted, and the operation completes within about a second. After the upgrade, the system switches this MDM's role to Primary.
2. Primary MDM: This MDM is now operating as the Secondary. It follows the same upgrade process, with no impact on management traffic. The system automatically returns control to this MDM after the upgrade.
3. Tie-Breaker: Since the Tie-Breaker does not sit in the management path, it can be upgraded after the Primary and Secondary MDM upgrades are complete.
4. SDS: The SDS nodes are upgraded one at a time so that the rebuild process finishes before the next SDS is upgraded. ScaleIO leverages spare capacity for this rebuild; it is good practice to set the spare capacity equal to the largest node. Once the upgrade is complete, only the services are restarted and no reboot is required.
5. SDC: ScaleIO installs the newest version on the host while the old version continues to operate. The SDC is backward compatible, so the old SDC works with the new SDSs and MDMs until the system is rebooted and the SDC is replaced. The new version does not take effect until the reboot, which is necessary whenever a device driver is changed. The administrator needs to make sure the volume is not mapped and locked before the reboot; it is good practice to wait for the next maintenance window to perform this step. In an ESX environment, where the SDC is installed in the kernel, the HA cluster capability assists in migrating VMs to a different host during the reboot.
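The ordering above can be summarized in a small planning sketch. The helper below only lists the sequence of operations and its names are hypothetical; in practice the Installation Manager drives the actual upgrade.

UPGRADE_ORDER = [
    "secondary-mdm",   # upgraded first, then takes over as primary
    "primary-mdm",     # now acting as secondary; upgraded next
    "tie-breaker",     # outside the management path, upgraded after both MDMs
]

def rolling_upgrade(sds_nodes, sdc_nodes):
    """Return the rolling-upgrade plan in the order described in steps 1-5 above."""
    plan = list(UPGRADE_ORDER)
    # SDS nodes are upgraded one at a time so each rebuild finishes before the next starts.
    plan += [f"sds:{n} (serial, wait for rebuild)" for n in sds_nodes]
    # SDC upgrades stage the new driver now; it only becomes active after a reboot.
    plan += [f"sdc:{n} (new driver staged, active after reboot)" for n in sdc_nodes]
    return plan

for step in rolling_upgrade(["sds-1", "sds-2"], ["sdc-1"]):
    print(step)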
CONCLUSION
ScaleIO is a hyper-converged infrastructure with enterprise-grade resilience that offers greater choices. Customers have the option
of deploying ScaleIO using existing storage, new storage or a mix of both. This can be done using any physical environment or
virtualization/cloud platform including Linux, OpenStack, VMware or Hyper-V. ScaleIO is a great solution for Service Providers or
Service Provider-like Enterprises to deliver Infrastructure-as-a-Service (IaaS) to customers or internal users. Customers not only
achieve lower Total Cost of Ownership (TCO) but also gain complete control over performance, capacity and data location.
The ScaleIO architecture is designed to have no bottleneck and no single point of failure. In other storage systems, a virtualization layer that keeps track of the data (for example, an index or journal) often causes massive failure and disruption when that layer becomes unavailable. A ScaleIO cluster uses many-to-many communication in a mesh network, which enables high parallelism and I/O performance.
Performing maintenance and lifecycle tasks without downtime is critical, and the benefits customers gain from non-disruptive operations are significant. ScaleIO provides this resiliency and flexibility in many ways:

• No downtime when changing, scaling, or upgrading the storage infrastructure

• An efficient, distributed self-healing process that overcomes media and node failures without requiring administrator involvement

• Fine control over the auto-rebalancing and rebuild processes to prevent application "hogging" scenarios

• High tolerance of multiple simultaneous failures via Protection Domains and Fault Sets

• Easy physical separation of tenants in multi-tenancy deployments

• An intelligent data protection engine that makes rebuild decisions on a chunk-by-chunk basis

• Flexible linear scaling by adding nodes "on the fly"

• True elasticity by supporting any commodity hardware and any storage media