WHITE PAPER
Answering the Call of Virtualization for Databases
The American anthropologist Ruth Benedict once remarked, “The trouble with life isn’t that there is no answer, it’s that
there are so many answers.” The observation fits the term “database” perfectly, since it covers a wide spectrum of use
cases, with On-Line Transaction Processing (OLTP), Reporting, On-Line Analytical Processing (OLAP) and Analytics
being the most common. With such diverse and varied sets of requirements, what is the best way to optimize
“database performance”? This paper explores the characteristics of the predominant database implementations and
offers suggestions on building a highly efficient and performant infrastructure to support this application category.
First, let’s establish a taxonomy. Typically, OLTP is aligned with a single application such as Customer Relationship
Management (CRM) or Enterprise Resource Planning (ERP) and is focused on handling business processes;
Reporting, OLAP and Analytics, on the other hand, can be seen as a part of the Data Warehousing family. Generally,
through data integration via Extract, Transform, and Load (ETL), data is pulled from sources such as OLTP databases
and placed into data warehouses. Consequently, a data warehouse can be seen as a layer optimized for reporting
and analytics that sits on top of the OLTP environment.
Figure 1: Database taxonomy. Data sources such as Customer Relationship Management (CRM), Enterprise Resource Planning (ERP) and Supply Chain Management systems feed the data warehouse through ETL; the warehouse in turn serves Reporting, OLAP analysis and Data Mining users.
As with any complex problem, database architectures require a clear understanding of the problem being solved
and the technical requirements of the workload. So let’s dive into the four database types to understand their
requirements and the impact on the design:
OLTP
OLTP workloads involve a healthy mix of read and write operations. They are latency sensitive and require high levels
of concurrency. Let’s use the example of an Automated Teller Machine (ATM) to clarify the concept of concurrency.
Each customer at an ATM generates a connection to an OLTP database to execute a few simple instructions. For
instance, if a customer wants to make a withdrawal, the process needs to read the customer’s records in the
database to confirm that she has sufficient funds and then record (write) the transaction; in Database Administrator
(DBA) speak, that would be a SQL SELECT statement followed by an UPDATE. However, the bank may have many
ATMs servicing many customers simultaneously. A proper OLTP database must be able to handle many users
at the same time with minimal delay, as any latency negatively impacts the user experience. From an
availability perspective, the service must also be running at all times (i.e., 24/7/365).
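In SQL terms, the withdrawal boils down to a read followed by a write inside one transaction. The sketch below is illustrative only; the table and column names are hypothetical, not taken from any particular banking system.

```sql
-- Illustrative ATM withdrawal: a SELECT to check funds, then an UPDATE.
BEGIN TRANSACTION;

-- Read: confirm the customer has sufficient funds.
SELECT balance
FROM accounts
WHERE account_id = 1234;

-- Write: record the withdrawal, guarding against overdraft.
UPDATE accounts
SET balance = balance - 200
WHERE account_id = 1234
  AND balance >= 200;

COMMIT;
```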
|                     | OLTP    | Reporting | OLAP | Analytics |
|---------------------|---------|-----------|------|-----------|
| Availability        | +++     |           |      |           |
| Concurrency         | +++     |           |      |           |
| Latency sensitivity | +++     |           |      |           |
| Throughput oriented | +       |           |      |           |
| Ad hoc              | +       |           |      |           |
| I/O Operations      | Mix R/W |           |      |           |

Table 1: Characterization of OLTP databases
Reporting
Reporting is usually triggered by a desire to better understand the performance of the business (e.g., how many
accounts were opened or closed this week, or whether the private banking account team is hitting its quota for
acquiring new customers). It is a set of predictable requests: the user has a good idea of what data he wants to see,
and a specific report design structures the data consistently. Because the operations are repetitive, the DBA can
design and optimize the database and schema (e.g., star and snowflake schemas) so that these queries execute in
a predictable and efficient manner.
Reporting databases are predominantly read-intensive and require throughput more than anything else. Concurrency
and availability are not as important as they are for OLTP, since reporting serves back-office processes.
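Because report queries are known in advance, they typically take the shape of a fixed aggregate over a star schema. The following is a sketch; the fact and dimension table names (fact_account_events, dim_branch, dim_date) are assumptions for illustration.

```sql
-- Illustrative weekly report: accounts opened per branch.
SELECT b.branch_name,
       COUNT(*) AS accounts_opened
FROM fact_account_events e
JOIN dim_branch b ON b.branch_key = e.branch_key
JOIN dim_date   d ON d.date_key   = e.date_key
WHERE e.event_type = 'OPEN'
  AND d.week_of_year = 42
GROUP BY b.branch_name
ORDER BY accounts_opened DESC;
```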
|                     | OLTP    | Reporting      | OLAP | Analytics |
|---------------------|---------|----------------|------|-----------|
| Availability        | +++     | +              |      |           |
| Concurrency         | +++     | +              |      |           |
| Latency sensitivity | +++     | +              |      |           |
| Throughput oriented | +       | +++            |      |           |
| Ad hoc              | +       | +              |      |           |
| I/O Operations      | Mix R/W | Read intensive |      |           |

Table 2: Characterization of Reporting databases
OLAP
OLAP can be seen as the analytical counterpart of OLTP. Where OLTP is the original source of data, OLAP is the
consolidation point of these various data streams. A common remark in the database world is that OLAP provides
a multi-dimensional view, meaning that you drill down into data coming from mixed sources and then analyze it
across different attributes. This workload is more ad hoc than Reporting, as you examine the data in a variety of
ways depending on the nature of the query. It is also primarily read-intensive and can run complex queries involving
aggregations across multiple databases, so it is throughput oriented as well. An example of an OLAP query would be
determining how many additional insurance services gold credit card customers opted for during the summer months.
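The example above might translate into a multi-dimensional aggregation along the following lines. This is a sketch under assumed names (fact_service_signup, dim_customer, dim_product, dim_date), not a query from any real system.

```sql
-- Illustrative OLAP drill-down: insurance add-ons taken by gold card
-- customers during the summer months, broken down by month and product.
SELECT d.month_name,
       p.product_name,
       COUNT(*)       AS services_added,
       SUM(f.premium) AS total_premium
FROM fact_service_signup f
JOIN dim_customer c ON c.customer_key = f.customer_key
JOIN dim_product  p ON p.product_key  = f.product_key
JOIN dim_date     d ON d.date_key     = f.date_key
WHERE c.card_tier = 'GOLD'
  AND d.month_number IN (6, 7, 8)
GROUP BY d.month_name, p.product_name;
```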
|                     | OLTP    | Reporting      | OLAP           | Analytics |
|---------------------|---------|----------------|----------------|-----------|
| Availability        | +++     | +              | +              |           |
| Concurrency         | +++     | +              | +              |           |
| Latency sensitivity | +++     | +              | ++             |           |
| Throughput oriented | +       | +++            | +++            |           |
| Ad hoc              | +       | +              | ++             |           |
| I/O Operations      | Mix R/W | Read intensive | Read intensive |           |

Table 3: Characterization of OLAP databases
Analytics
Analytics is truly ad hoc in nature. Whereas Reporting aims to present the numbers, Analytics strives to deliver
insight into why the numbers are what they are. As an example, Reporting can show how many new accounts the
private banking account team acquired in the previous fiscal quarter; Analytics aims to answer why the team did
not hit its quota. Analytics queries often span multiple databases and can be multi-step processes: a query writes
out a large intermediate result that is then analyzed by the next query (the output of one query is the input of the
next), and this can happen multiple times. This means that both read and write performance are crucial, as the
intermediate results need to be stored as fast as possible and then made available quickly for the next query.
Another issue with Analytics is the sort process. The dataset to be retrieved needs to be sorted, but it may be so
large that it cannot be fully held in memory during the sort, so some data spills to disk. The large discrepancy in
performance between memory and disk inevitably adds significant delay to the process.
Because of these challenges, it is difficult to design an efficient schema for Analytics queries upfront, which makes
this workload an especially difficult use case from a performance perspective.
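A minimal sketch of such a multi-step flow is shown below, assuming a SQL dialect that supports temporary tables (the table and column names are hypothetical). The first statement materializes a large intermediate result; the second reads it back for further analysis, which is why both write and read speed matter.

```sql
-- Step 1: write out a large intermediate result as a temporary table.
CREATE TEMPORARY TABLE quarterly_activity AS
SELECT customer_id,
       COUNT(*)    AS txn_count,
       SUM(amount) AS total_amount
FROM transactions
WHERE txn_date >= DATE '2015-01-01'
  AND txn_date <  DATE '2015-04-01'
GROUP BY customer_id;

-- Step 2: read the intermediate result back and analyze it again.
SELECT c.segment,
       AVG(q.total_amount) AS avg_spend
FROM quarterly_activity q
JOIN customers c ON c.customer_id = q.customer_id
GROUP BY c.segment;
```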
|                     | OLTP    | Reporting      | OLAP           | Analytics |
|---------------------|---------|----------------|----------------|-----------|
| Availability        | +++     | +              | +              | +         |
| Concurrency         | +++     | +              | +              | +         |
| Latency sensitivity | +++     | +              | ++             | +++       |
| Throughput oriented | +       | +++            | +++            | +++       |
| Ad hoc              | +       | +              | ++             | +++       |
| I/O Operations      | Mix R/W | Read intensive | Read intensive | Mix R/W   |

Table 4: Characterization of Analytics databases
Database Schema Design
A database schema can be seen as a container object that allows for the logical grouping of objects such as tables,
views and stored procedures. OLTP is an excellent example of a normalized schema. In a normalized schema,
large tables are split into smaller ones. For example, assume a bank database has only a single table that logs all
the activity generated by its customers, so there are multiple rows in this table for each customer. Now, if a
customer updates her address, you may need to update many rows, one for each transaction attributable to her in
the given time frame, for the database to remain consistent. This can hurt the performance and concurrency of
the database.
Instead, you could build out a schema with multiple tables, only one of which holds customer information. This way,
when the customer changes her address, you need to update just one row in that table. This process of organizing
attributes and tables to minimize data redundancy is called normalization, and it greatly improves concurrency and
performance. If you normalize your database enough, every insert, delete and update statement will hit only a single
table. However, these small updates, consisting of the transfer of small blocks, require a very fast response, making
OLTP databases very latency sensitive.
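As a hedged illustration of the normalized design described above (all names hypothetical), customer details live in exactly one table, so an address change touches a single row:

```sql
-- Customer attributes are stored once, in one table.
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100),
    address     VARCHAR(200)
);

-- Each transaction references the customer by key instead of
-- duplicating her details on every row.
CREATE TABLE transactions (
    transaction_id INT PRIMARY KEY,
    customer_id    INT REFERENCES customers (customer_id),
    amount         DECIMAL(12, 2),
    txn_date       DATE
);

-- An address change now updates exactly one row.
UPDATE customers
SET address = '1 New Street'
WHERE customer_id = 42;
```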
Data warehouses, on the other hand, tend to have fewer tables and are, therefore, denormalized. Case in point:
when querying the database to find out who owns Account #1234, it needs to join two tables, the Account table
with the Customer table. This example represents a two-way join, but it is not uncommon for a data warehousing
system to perform many-way joins (that is, joining multiple tables at once). These operations are generally
throughput bound, as they require the transmission of large amounts of data.
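The Account #1234 lookup from the example would be a two-way join along the following lines (table and column names assumed for illustration):

```sql
-- Illustrative two-way join: which customer owns account #1234?
SELECT c.name
FROM accounts a
JOIN customers c ON c.customer_id = a.customer_id
WHERE a.account_id = 1234;
```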
Figure 2 outlines the various database types and illustrates how, moving from OLTP on the left to Analytics on the
right, they become more throughput bound as the average block size grows. Conversely, the further left the use
case, the more latency sensitive it becomes.
Figure 2: Database characteristics. The spectrum runs from OLTP on the left (most latency sensitive) through Reporting and OLAP to Analytics on the right (most throughput bound).
Business Processes
Another interesting way to study databases is to examine their place in the business process. This provides
tremendous insight into their availability, concurrency and response-time requirements.
Typically, OLTP databases sit at the front of the process, facing the customer. Dramatically put, they are in the line
of fire. When an OLTP database performs slowly (or worse, is unavailable), it impacts revenue-generating processes,
so these databases are heavily normalized to read, insert and update data as quickly as possible.
Data warehousing operations, by contrast, generally occur away from customer-facing operations. Data is
typically loaded into the data warehouse from multiple sources to give the business insight into its day-to-day
operations. For example, an analyst may want to identify opportunities to drive quality and cost improvements.
However, while we talk about a data warehouse as a single entity, this is seldom the case. Often you will find
one large data warehouse and many so-called “data marts” hanging off it. The resulting database proliferation is
a real problem in the enterprise, and managing all these databases and providing them with the storage performance
they need can be challenging.
Designing and Testing Your Storage Architecture In Line with the Workload
Data is not a static entity in the enterprise; it flows through the organization continuously and is used for different
purposes at various points. As discussed previously, data usually enters the pipeline through OLTP databases and
is moved by ETL (through a solution such as Informatica or Hadoop) into the data warehouse. Sometimes the data
of the OLTP database is transferred in its raw form and the transformation occurs inside the data warehouse
itself. In either approach, transforming the data into a different shape requires reading the data, modifying it, and
writing it into new tables. Any mention of “nightly loads” into the data warehouse refers to this process.
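When the transformation happens inside the warehouse, it often amounts to a read, reshape and write statement such as the sketch below (the staging and warehouse table names are assumptions for illustration):

```sql
-- Illustrative in-warehouse transform: read raw staged orders,
-- aggregate them, and write the result into a warehouse table.
INSERT INTO dw_daily_sales (sale_date, store_id, total_amount)
SELECT CAST(order_timestamp AS DATE),
       store_id,
       SUM(amount)
FROM staging_orders
GROUP BY CAST(order_timestamp AS DATE), store_id;
```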
Figure 3: Data pipeline. Data is extracted from sources such as ERP, CRM, POS and web systems, transformed, and loaded into the enterprise data warehouse and its data marts, which in turn feed BI, SAS/R and Analytics tools.
Using PernixData FVP software to Optimize the Data Pipeline
By adding storage intelligence to high-speed server resources like flash and RAM, increasing data throughput
and reducing application latency, PernixData FVP software can deliver huge performance gains throughout the
data pipeline.
• OLTP databases can leverage the low-latency characteristics of server-side flash and RAM. This means more
transactions per second and higher levels of concurrency, all while protecting against data loss via FVP’s
Write Back replication capabilities.
• Loads into the data warehouse become tremendously faster in Write Back mode because new rows are
acknowledged as soon as they touch the server-side flash or RAM.
• Reports and analytics may execute joins, aggregations, sorts and other operations that require rapid access to
large volumes of data and can generate large intermediate results. High read and write throughput are therefore
beneficial, and serving it from the host right next to the database helps performance tremendously.
• Analytics can be ad hoc and defeat any tuning the DBA has implemented. Having the base tables on flash via
FVP can greatly help performance.
• Analytics workloads tend to create and leverage temporary tables within the database. Using server-side
resources accelerates both read and write access to these temporary tables.
In addition, there is a huge operational benefit. The entire data pipeline (OLTP databases, ETL, data warehouse,
data marts, etc.) can now be virtualized because high performance can be delivered consistently via server-side
resources and FVP. This brings together the best of both worlds: the operational benefits of a virtualization platform,
such as vSphere HA, DRS and vMotion, can be leveraged while the entire data pipeline is standardized on it, without
sacrificing performance. For the first time, rather than employing separate techniques and implementing disparate
infrastructure to meet the varied requirements of different database workloads, a single solution, PernixData FVP
software, offers a clear answer to the age-old question: how do I improve database performance?
www.pernixdata.com | 1-855-PERNIX-D | @PernixData
Copyright © 2015 PernixData, Inc. All rights reserved. This product is protected by U.S. and international copyright and intellectual property laws.
PernixData is a registered trademark and FVP, FVP Cluster and Flash Cluster are trademarks of PernixData, Inc. in the United States and/or other jurisdictions. All other brands,
products, marks and names mentioned herein may be trademarks or service marks of, and are used to identify, products or services of, their respective owners.