WHITE PAPER

Answering the Call of Virtualization for Databases

The English writer Samuel Johnson once remarked, "The trouble with life isn't that there is no answer, it's that there are so many answers." This statement is entirely appropriate for the term "database," which covers a wide spectrum of use cases, with On-Line Transaction Processing (OLTP), Reporting, On-Line Analytical Processing (OLAP) and Analytics being the most common. With such diverse and varied sets of requirements, what is the best way to optimize database performance? This paper explores the characteristics of the predominant database implementations and offers suggestions on building a highly efficient and performant infrastructure to support this application category.

First, let's establish a taxonomy. OLTP is typically aligned with a single application, such as Customer Relationship Management (CRM) or Enterprise Resource Planning (ERP), and is focused on handling business processes. Reporting, OLAP and Analytics, on the other hand, can be seen as part of the Data Warehousing family. Through data integration via Extract, Transform, and Load (ETL), data is pulled from sources such as OLTP databases and placed into data warehouses. A data warehouse can therefore be seen as a layer, optimized for reporting and analytics, that sits on top of the OLTP environment.

[Figure 1: Database taxonomy. Data sources such as CRM, ERP and Supply Chain Management systems feed the data warehouse through ETL; the warehouse in turn serves OLAP analysis, data mining and reporting users.]

As with any complex problem, database architectures require a clear understanding of the problem being solved and the technical requirements of the workload. So let's dive into the four database types to understand their requirements and their impact on the design.

OLTP

OLTP workloads have a good mix of read and write operations. They are latency sensitive and require high levels of concurrency. Let's use the example of an Automated Teller Machine (ATM) to clarify the concept of concurrency. Each customer at an ATM generates a connection to an OLTP database to execute a few simple instructions. For instance, if a customer wants to make a withdrawal, the process needs to read the customer's records in the database to confirm that she has sufficient funds, then record (write) the transaction; in Database Administrator (DBA) speak, that would be a SQL SELECT statement followed by an UPDATE (sketched below, after Table 1). However, the bank may have several ATMs servicing many customers simultaneously. A proper OLTP database should be able to handle many users at the same time with minimal delay, as any latency negatively impacts the user experience. From an availability perspective, the service must also be running at all times (i.e. 24/7/365).

                     OLTP     Reporting       OLAP            Analytics
Availability         +++
Concurrency          +++
Latency sensitivity  +++
Throughput oriented  +
Ad hoc               +
I/O Operations       Mix R/W

Table 1 Characterization of OLTP databases
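To make the SELECT-then-UPDATE pattern concrete, here is a minimal sketch of the withdrawal transaction described above. The table and column names (account, balance, account_no) are hypothetical, and exact transaction syntax varies by database engine:

    BEGIN;  -- start the transaction (syntax varies by engine)

    -- The read: confirm the customer has sufficient funds.
    SELECT balance
    FROM account
    WHERE account_no = 1234;

    -- The write: record the withdrawal if the balance covers it.
    UPDATE account
    SET balance = balance - 100
    WHERE account_no = 1234;

    COMMIT;

Thousands of such short transactions, each touching only a handful of rows, may run concurrently, which is why OLTP rewards low storage latency far more than raw throughput.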
Reporting

Reporting is usually triggered when there's a desire to better understand the performance of the business (e.g. how many accounts were opened or closed this week, or whether the private banking account team is hitting its quota of acquiring new customers). It's a set of predictable requests: the user has a good idea of what data he wants to see, and a specific report design structures the data consistently. This means the operations are repetitive, which allows the DBA to design and optimize the database and schema (e.g. star and snowflake schemas) so that these queries execute in a predictable and efficient manner. Reporting databases are predominantly read-intensive and require throughput more than anything else. Concurrency and availability are not as important as they are for OLTP, since Reporting serves back office processes.

                     OLTP     Reporting       OLAP            Analytics
Availability         +++      +
Concurrency          +++      +
Latency sensitivity  +++      +
Throughput oriented  +        +++
Ad hoc               +        +
I/O Operations       Mix R/W  Read intensive

Table 2 Characterization of Reporting databases

OLAP

OLAP can be seen as the analytical counterpart of OLTP. Where OLTP is the original source of data, OLAP is the consolidation point of these various data streams. A common remark made in the database world is that OLAP provides a multi-dimensional view, meaning that you drill down into the data coming from mixed sources and then analyze it across different attributes. This workload is more ad hoc than Reporting, as you examine the data in a variety of ways depending on the nature of the query. It is also primarily read-intensive and can run complex queries involving aggregations across multiple databases, so it's throughput oriented as well. An example of an OLAP query would be the amount of additional insurance services gold credit card customers opted for during the summer months.

                     OLTP     Reporting       OLAP            Analytics
Availability         +++      +               +
Concurrency          +++      +               +
Latency sensitivity  +++      +               ++
Throughput oriented  +        +++             +++
Ad hoc               +        +               ++
I/O Operations       Mix R/W  Read intensive  Read intensive

Table 3 Characterization of OLAP databases

Analytics

Analytics is truly ad hoc in nature. Whereas Reporting aims to provide a perspective on the numbers being presented, Analytics strives to deliver insights into why the numbers are what they are. As an example, Reporting can show how many new accounts were acquired by the private banking account team in the previous fiscal quarter; Analytics aims to answer why they did not hit their quota. Analytics often queries multiple databases and can involve multi-step processes. Analytic queries write out large temporary results before analyzing them again (the output of one query is the input of the next), and this can happen multiple times; a sketch of this pattern follows Table 4 below. This means that both read and write performance are crucial, as the intermediate results need to be stored as fast as possible and, shortly after, made available quickly for the next query. Another issue with Analytics is the sort process. The dataset to be retrieved needs to be sorted, but it may be so large that it cannot be fully held in memory during the sort, so some data may spill to disk. The large discrepancy in performance between memory and disk will inevitably add significant delay to the process. Because of these challenges, it is difficult to design an efficient schema for Analytics queries upfront. It also makes the workload an especially difficult use case from a performance perspective.

                     OLTP     Reporting       OLAP            Analytics
Availability         +++      +               +               +
Concurrency          +++      +               +               +
Latency sensitivity  +++      +               ++              +++
Throughput oriented  +        +++             +++             +++
Ad hoc               +        +               ++              +++
I/O Operations       Mix R/W  Read intensive  Read intensive  Mix R/W

Table 4 Characterization of Analytics databases
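As a minimal sketch of the multi-step pattern described above, an analyst might first materialize a large intermediate result and then query it again. All table and column names are hypothetical, and temporary-table syntax varies by engine:

    -- Step 1: write a large intermediate result to a temporary table.
    CREATE TEMPORARY TABLE summer_gold_spend AS
    SELECT customer_id, SUM(amount) AS total_spend
    FROM card_transactions
    WHERE card_type = 'GOLD'
      AND purchase_date BETWEEN '2015-06-01' AND '2015-08-31'
    GROUP BY customer_id;

    -- Step 2: the output of step 1 becomes the input of the next query.
    SELECT c.region, AVG(s.total_spend) AS avg_spend
    FROM summer_gold_spend s
    JOIN customer c ON c.customer_id = s.customer_id
    GROUP BY c.region;

Step 1 is write-heavy and step 2 is read-heavy, which is why Analytics depends on fast writes and fast reads in equal measure.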
Database Schema Design

A database schema can be seen as a container object that allows for the logical grouping of objects such as tables, views and stored procedures. OLTP is an excellent example of a normalized schema. When using a normalized schema, tables are usually split into smaller tables. For example, let's assume a bank database has only a single table that logs all the activity generated by its customers. This means there are multiple rows in this table for each customer. Now, if a customer updates her address, you may need to update many rows in this table, one for every transaction attributable to her in the given time frame, for the database to be consistent. This can have an impact on the performance and concurrency of the database. Instead, you could design the schema so that there are multiple tables but only one holding customer information. This way, when the customer changes her address you only need to update a single row in that table. This process of organizing the attributes and tables to minimize data redundancy is called normalization, and it greatly improves concurrency and performance. If you normalize your database enough, every insert, delete and update statement will only hit a single table. However, these small updates, consisting of the transfer of small blocks, require a very fast response, making OLTP databases very latency sensitive.

Data warehouses, on the other hand, tend to have fewer tables and are, therefore, denormalized. Case in point: when querying the database to find out who owns Account #1234, it needs to join two tables, the Account table with the Customer table (a sketch of both patterns follows Figure 2 below). This example represents a two-way join, but it is not uncommon for a data warehousing system to do many-way joins (that is, joining multiple tables at once). These operations are generally throughput bound, as they require the transmission of large amounts of data. Figure 2 outlines the various database types and illustrates how, moving from OLTP on the left to Analytics on the right, they become more throughput bound as the average block size grows. Conversely, the further left the use case, the more latency sensitive it becomes.

[Figure 2: Database characteristics. From OLTP on the left through Reporting and OLAP to Analytics on the right, workloads become less latency sensitive and more throughput bound.]
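To illustrate both sides of the schema discussion above, here is a minimal sketch using hypothetical customer and account tables:

    -- Normalization: customer details live in exactly one table,
    -- so an address change touches a single row.
    UPDATE customer
    SET address = '12 New Street'
    WHERE customer_id = 42;

    -- Warehouse-style query: who owns Account #1234?
    -- A two-way join across the account and customer tables.
    SELECT c.first_name, c.last_name
    FROM account a
    JOIN customer c ON c.customer_id = a.customer_id
    WHERE a.account_no = 1234;

A many-way join follows the same shape with additional JOIN clauses; each extra table increases the volume of data that must be read and shipped, which is why such queries are throughput bound.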
Business Processes

Another interesting way to study databases is to examine their place in the business process. This provides tremendous insight into the availability, concurrency and response-time requirements. OLTP databases typically sit at the front of the process, facing the customer; dramatically put, they are in the line of fire. When an OLTP database performs slowly (or worse, is unavailable), it impacts revenue-generating processes, so these databases are heavily normalized to read, insert and update data quickly and respond as fast as possible. Data warehousing operations, on the contrary, generally occur away from customer-facing operations. Data is typically loaded into the data warehouse from multiple sources to give the business insight into its day-to-day operations. For example, an analyst may want to identify opportunities to drive quality and cost improvements. However, while we talk about a data warehouse as a single entity, this is seldom the case. Often you will find one large data warehouse and many so-called "data marts" that hang off it. The resulting database proliferation is a real problem in the enterprise, and managing all these databases and providing them the storage performance they need can be challenging.

Designing and Testing Your Storage Architecture In Line with the Workload

Data is not a static entity in the enterprise; it flows through the enterprise continuously and at various points is used for different things. As discussed previously, data usually enters the pipeline through OLTP databases and moves via ETL (through a solution such as Informatica or Hadoop) into the data warehouse. Sometimes the data of the OLTP database is transferred in its raw form and the ETL process occurs inside the data warehouse itself. In either case, transforming the data into a different shape requires reading the data, modifying it and writing it into new tables. Any mention of nightly loads into the data warehouse refers to this process (a sketch follows Figure 3 below).

[Figure 3: Data pipeline. Data is extracted from sources such as ERP, CRM, POS and web systems, transformed, and loaded into the enterprise data warehouse and its data marts, which in turn feed BI, SAS/R and analytics tools.]
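As a minimal sketch of such a transform-and-load step, again with hypothetical table and column names, a nightly job might read raw OLTP rows, reshape them and write the result into a warehouse table:

    -- Nightly load: extract raw OLTP rows, transform (aggregate),
    -- and load them into a (hypothetical) warehouse fact table.
    INSERT INTO dw_daily_sales (sale_date, branch_id, total_amount)
    SELECT CAST(created_at AS DATE) AS sale_date,
           branch_id,
           SUM(amount) AS total_amount
    FROM oltp_transactions
    GROUP BY CAST(created_at AS DATE), branch_id;

The read-modify-write nature of this step is what makes load windows sensitive to both read and write performance.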
Using PernixData FVP software to Optimize the Data Pipeline

By adding storage intelligence to high-speed server resources like flash and RAM to increase data throughput and reduce application latency, PernixData FVP software can deliver large performance gains throughout the data pipeline.

• OLTP databases can leverage the low latency of server-side flash and RAM. This means more transactions per second and higher levels of concurrency, all while protecting against data loss via FVP's write back replication capabilities.
• Loads into the data warehouse become dramatically faster in Write Back mode, because new rows are acknowledged as soon as they touch the server-side flash or RAM.
• Reports and analytics may execute joins, aggregations, sorts and similar operations that require rapid access to large volumes of data and can generate large intermediate results. High read and write throughput are therefore beneficial, and delivering them on the server, right next to the database, helps performance tremendously.
• Analytics can be ad hoc and bypass any tuning the DBA has implemented. Having the base tables on flash via FVP can greatly help performance.
• Analytics workloads tend to create and use temporary tables within the database. Using server-side resources enhances both read and write performance on these temporary tables.

In addition, there is a huge operational benefit. The entire data pipeline (OLTP databases, ETL, data warehouse, data marts, etc.) can now be virtualized, because high performance can be delivered consistently via server-side resources and FVP. This brings together the best of both worlds: the operational benefits of a virtualization platform, such as vSphere HA, DRS and vMotion, can be leveraged while the entire data pipeline is standardized on it, without sacrificing performance.

For the first time, rather than employing separate techniques and implementing disparate infrastructure to meet the varied requirements of different database workloads, a single solution, PernixData FVP software, offers a clear answer to the age-old question: How do I improve database performance?

www.pernixdata.com | 1-855-PERNIX-D | @PernixData

Copyright © 2015 PernixData, Inc. All rights reserved. This product is protected by U.S. and international copyright and intellectual property laws. PernixData is a registered trademark and FVP, FVP Cluster and Flash Cluster are trademarks of PernixData, Inc. in the United States and/or other jurisdictions. All other brands, products, marks and names mentioned herein may be trademarks or service marks of, and are used to identify, products or services of their respective owners.