
Best Practices for HP Enterprise Data Warehouse
Appliance/Microsoft SQL Server 2008 Parallel Data
Warehouse R2
Technical white paper
Table of contents
Introduction
What is big data?
Hub and spoke business intelligence infrastructure
Data quality issues
Performance in the BI environment
Factors contributing to I/O throughput demand in a data warehouse or data mart
Classes of users
Business intelligence user expectations from a performance perspective
Data compression variants
Traditional database designs vs. EDW and Fast Track
EDW and Ultra-Shared Nothing technology
Other PDW software benefits
How loading data can affect EDW/PDW performance
Best practices for loading data
EDW performance metrics
Summary
For more information
Introduction
The HP Enterprise Data Warehouse (EDW) Appliance has been optimized for Microsoft® SQL Server 2008 Parallel
Data Warehouse (PDW) software. The system has been integrated to provide organizations with excellent
performance in a scalable, highly available environment.
HP provides this document as a guide to our customers so that you will be able to take advantage of the EDW
appliance's capabilities. Each section discusses basic EDW/PDW concepts from different points of view, and digs
deeper into each concept while recommending various best practices along the way.
It has been assumed that the reader is familiar with business intelligence environments, EDW architecture and
terminology. If not, we recommend reviewing the first half of the “HP Enterprise Data Warehouse Appliance
architecture overview and performance guide – Introduction to Business Intelligence architectures” as a prerequisite to
reading this document, available at:
http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-5625ENW.pdf.
Several performance issues discussed in the latter half of the above document have been readdressed in this white
paper because they pertain to some fundamental EDW performance concepts which bear repeating. The good news
is that this document extends the performance discussions in earlier HP white papers by delving into additional EDW
and PDW best practices in more detail.
It is important to note that the industry terms "Enterprise Data Warehouse" (EDW) and the HP "Enterprise Data
Warehouse (EDW) Appliance" may sound the same; however, the former EDW term typically relates to a generic
centralized, enterprise-wide hub for an organization's historical data. The latter interpretation refers to the HP
Enterprise Data Warehouse Appliance, which is physical hardware that was engineered to be a loosely-coupled,
massively parallel server running Microsoft's Parallel Data Warehouse (PDW) software. The HP EDW Appliance was
designed to support not only a company's Enterprise Data Warehouse functionality (hence, its name), but also the
company's high-end data marts or Operational Data Stores (ODSs).
To recap, even though the product called the HP Enterprise Data Warehouse Appliance implies that it is to be used
for an organization's Enterprise Data Warehouse hub, more frequently the HP Enterprise Data Warehouse Appliance
is used to support high-end data marts which require a data store for structured "big data" analytics or large ODSs.
What is big data?
For the sake of this document, we would like to define the term “big data.” This term can be very subjective. After all,
the word “big” to some may be “small” to others and vice versa. In the past, people thought of big data as being
things like phone call detail records, POS credit card transactions or stock/financial transactions. While these forms
of data are truly “big”, today, many people also think of big data as including RFID, web logs, social data, etc.
All of the above are examples of big data. Other examples will vary from customer to customer and from application
to application. At this point in time, the HP Enterprise Data Warehouse Appliance running SQL Server Parallel Data
Warehouse software supports big data in the form of databases of up to 600TB.
HP believes that the EDW Appliance's real benefit is not only supporting big data in terms of database size but,
more importantly, supporting the various business intelligence, analytical, mining and reporting tools which can unlock
the data's business value while meeting end user service level agreements.
Hub and spoke business intelligence infrastructure
When designing a business intelligence environment, users tend to be mainly concerned with data quality and
performance. A hub and spoke infrastructure can address these issues in several ways.
Data quality issues
Data quality issues may be addressed by providing users and executives with a single version of the truth. Many
business intelligence (BI) systems have evolved over time. Typically, each organization extracts data from various
OLTP systems at different times during the day. They also massage data differently using various algorithms which
may yield different results that may be misinterpreted when data is shared across organizations.
This problem is compounded and becomes more visible when executive management obtains different answers to
their business questions depending upon which system is queried. See Figure 1.
Figure 1. Data mart evolution provides users with different variations of truth
In a hub and spoke architecture, the Enterprise Data Warehouse may act as a hub for a company's various OLTP
systems using Extract, Transform and Load (ETL) tools such as SQL Server Integration Services (SSIS). Once the
data is loaded into the Enterprise Data Warehouse, it may act as an enterprise-wide central repository for an
organization's historical data. Ideally, the data in the Enterprise Data Warehouse should be stored and structured
using "conformed dimensions." This design ensures that all departments within a company agree upon a common set
of names, terms and algorithms which are documented and maintained in the company's metadata management
system.
The use of conformed dimensions, if at all possible, should become standardized throughout a company and form the
basis of providing management with a “single version of the truth.” See Figure 2. All users should be able to view the
business intelligence environment's metadata using tools such as Microsoft SQL Server Master Data Services. More
information on Master Data Services may be located at:
http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/master-data-services.aspx
Once the data is stored in the conformed dimensional Enterprise Data Warehouse (hub), it may be exported to one or
more data marts (spokes). At this point, users, business analysts, or management may query the data marts or
EDW directly, depending upon security and network connectivity.
Figure 2. Hub and spoke architecture – “single version of the truth”
Note
The PDW software supports a feature called Remote Table Copy (Parallel Data
Export). This capability allows the EDW Appliance to perform high speed data
exports to external, SMP-based SQL Server 2008 systems which are connected
to the EDW via InfiniBand.
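To make the Remote Table Copy concept concrete, the following is a minimal sketch of how such an export might be expressed using the PDW CREATE REMOTE TABLE statement. The database, table, column and connection details are hypothetical, and the exact syntax and connection-string options should be verified against the PDW documentation for your appliance release.

-- Hedged sketch: export a result set to an external SMP SQL Server 2008 instance
-- (server address, credentials and object names below are placeholders)
CREATE REMOTE TABLE SalesMart.dbo.FactSales_Recent
AT ('Data Source = 172.16.254.10, 1433; User ID = ExportUser; Password = <password>;')
AS SELECT DateKey, CustomerKey, SalesAmount
   FROM SalesDW.dbo.FactSales
   WHERE DateKey >= 20110101;

Because the transfer flows over the appliance's InfiniBand network, this style of export is well suited to feeding SMP-based data mart spokes from the EDW hub.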
Performance in the BI environment
Performance in the BI environment may be measured from different points of view:
• I/O throughput (GB/second)
• Queries per hour rate
• Number of simultaneous users
• End user response time
Performance metrics need to be evaluated based upon your BI infrastructure and may vary from customer to customer.
For example, if end users are submitting queries directly against front-end OLAP cube spokes (SQL Server Analysis
Services), these memory-based OLAP cubes will have the fastest possible response times, which will, in turn, support
high query per hour rates.
As a practical matter, OLAP cubes today should be kept to less than 1TB for a single SQL Server Analysis Services (SSAS)
data mart. However, OLAP cubes may scale out by running multiple cubes on multiple servers in an OLAP cube
“farm” type of architecture.
If your data mart will grow to more than 1TB within a single cube, it is recommended that the data mart be kept
relational in flavor using the HP Business Data Warehouse (BDW) Appliance, SQL Server Fast Track reference
architectures, or EDW appliances, etc. As of October 2011, the following metrics may be used to guide customers in
finding the correct balance of storage space and I/O throughput required to support their data warehouse or data
mart spoke.
• Business Data Warehouse (BDW) Appliance
– Symmetric Multi-processing (SMP)
Scale up to 5TB
Up to 2GB/sec of I/O throughput
• Fast Track (as of March 2011) – Overview: http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA28181ENW.pdf
– Symmetric Multi-processing (SMP)
Entry reference architecture (RA) (4.5 – 14TB); 1 – 1.8GB/sec of I/O throughput
http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-2184ENW.pdf
Basic RA (10 – 20TB); up to 2.8GB/sec
http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-2186ENW.pdf
Mainstream RA (20 – 40TB); up to 5.1GB/sec
http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-2690ENW.pdf
Premium RA (40 – 80TB); 11 – 14GB/sec of I/O throughput
http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-3609ENW.pdf
• HP EDW Appliance
– Loosely-coupled, massively parallel (MPP)
Table 1. HP EDW Appliance user data capacity (3.5x compression ratio is assumed)

User data capacity              1 data rack   2 data racks   3 data racks   4 data racks
11 300GB LFF disks per MSA      38TB          76TB           114TB          153TB
24 300GB SFF disks per MSA      76TB          153TB          229TB          305TB
11 1TB LFF disks per MSA        127TB         254TB          381TB          509TB
24 600GB SFF disks per MSA      153TB         305TB          458TB          610TB
LFF disks; 12.8 GB/sec per data rack = 51.2 GB/sec per four data rack EDW
SFF disks; 16.5 GB/sec per data rack = 66 GB/sec per four data rack EDW
As you can see, the HP EDW Appliance may function as a hub or a spoke. The main determining factors will be
storage space, throughput and scalability requirements.
I/O throughput vs. query per hour rates
I/O throughput and query per hour rates are closely related. Regardless of a query's complexity, all query types will
want to drive the I/O subsystem at its maximum throughput rate.
SQL Server Parallel Data Warehouse's Ultra-Shared Nothing technology attempts to eliminate data movement
between nodes when some queries perform “join” operations between replicated dimension tables and fact tables. In
addition, other queries may want to perform parallel table scans. This is common when large subsets of data are
selected for data mining or if users want to perform complex analytics.
Since DBAs may not be able to predict how business analysts may want to look at their data, the EDW Appliance has
been optimized to stream data off disk at very high rates of speed to support ad-hoc queries. Tuning and managing
the HP EDW Appliance has also been greatly simplified, even when compared to SMP versions of SQL Server.
When profiling workloads, simple queries will return result sets fastest because they are typically fast table lookups in
dimension tables or else the query predicate allows for only a small number of rows to be accessed. In general, most
users execute simple queries most of the time because they are trying to determine which dimensions will be of most
use to them when formulating medium or complex queries. On occasion, some customers may even create aggregate
or summary tables to minimize query service times. In general, the largest populations of casual users fall into the
simple or medium query complexity category.
Medium and complex queries require large data sets to be accessed or scanned and may involve complex join
operations. The EDW Appliance excels in executing these query types.
While query per hour rates may sound like a useful metric, unless the metric is qualified with an in-depth
understanding of the mix of simple, medium and complex query profiles, a simple query per hour rate is not easily
mapped to your organization's respective workload.
Since profiling user workloads is a time consuming task, coupled with the fact that workload mixes evolve over time, it
is frequently more practical to use I/O throughput as a simpler gauge when monitoring data warehouse performance.
Figure 3 shows how the EDW Appliance with a "fatter" I/O throughput pipe can support more queries per hour than
an SMP-based SQL Server Fast Track reference architecture. Notice that although the workload mix of simple, medium
and complex queries remained the same in both examples in the graphic, the EDW MPP architecture's "fat pipe" is
able to funnel more queries per hour through the system.
Figure 3. How query workload translates into I/O throughput requirements
Different classes of queries have very different workload profiles. As users become more sophisticated, over periods
of time, and as business needs change, you will notice that workload profiles and query mixes also evolve.
The good news is that the HP Enterprise Data Warehouse Appliance allows your workload mix to vary and evolve
over time. The appliance may be thought of as a “scan engine” because its goal is to scan data, efficiently, at high
rates of speed. Regardless of your mix of simple, medium or complex queries, the HP EDW Appliance will try to
maintain high throughput rates which, in turn, lead to consistent, rapid response times for ad-hoc queries which
change and evolve as your business demands change.
Figure 4 shows how the ratio of simple, medium and complex queries typically changes over time.
Note
The numbers shown in Figure 4 are for illustrative purposes only.
Figure 4. How the ratio of simple, medium and complex queries typically changes over time
Now compare Figure 3 above to Figure 5 following, which is a visual example of how workload mixes can change
over time. Even though the workload mix changes, the I/O rate supported by the data warehouse remains relatively
consistent.
Figure 5. How workload mixes can change over time
One of the many benefits that the EDW Appliance and PDW software bring to the table is more consistent end user
response times. This is accomplished by using the PDW index-lite design techniques and other best practices
addressed in the remainder of this document.
Number of supported users
It is common for people to ask “how many users will the system support?” It should be noted that the number of
supported users or the number of simultaneous users may be a misleading and sometimes meaningless metric.
In actuality, hundreds or thousands of users could conceivably be connected to a very small server, perhaps even a
laptop. The real question is not necessarily “the number of users”, but rather, a determination of the kind of I/O
throughput demand which will be generated by a given user population. Some issues which explain why the number
of “simultaneous users” may give false impressions are:
• Thousands of users may be connected, but may be sitting idle, at lunch or doing non-business intelligence related activities
• Only users who actively submit queries to the data warehouse will be consuming resources
• The amount of resources users consume is closely related to query complexity
• Query complexity is tied to I/O throughput demand
• The query mix multiplied by the volume of each query type is what generates the I/O throughput demand on the data warehouse
• The mix and volume of queries will vary over time
In one organization, 10,000 users may generate a low query volume and if that query mix contains simple queries,
then the I/O throughput demand will be low. On the other hand, if 10 dedicated power users execute mostly
complex queries or extract large data sets for data mining and other analytics, the demand will be high. These 10
power users have the potential to fully saturate the most powerful MPP servers in the world. Consequently, the
question of how many users the system will support is best answered by the kind of I/O throughput demand, not by
the number of simultaneous users.
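As a rough, hypothetical illustration (the figures below are assumptions chosen only to show the arithmetic, not measurements), sustained I/O demand can be estimated by multiplying the number of active queries per hour by the data each query must scan:

10 power users * 4 complex queries/hour * 2TB scanned per query = 80TB/hour ≈ 22GB/sec sustained demand
10,000 casual users * 1 simple query/hour * 50MB scanned per query = 500GB/hour ≈ 0.14GB/sec sustained demand

In this hypothetical example, the 10 power users place well over one hundred times the sustained throughput demand on the warehouse that the 10,000 casual users do, which is why user counts alone say little about required capacity.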
Fortunately, the HP Enterprise Data Warehouse Appliance can scan data at exceptionally fast speeds and execute
queries, in parallel, at low affordable costs. The intention of this document is to discuss various EDW/PDW specific
best practices which will enable you to provide your user population with excellent response times and high
throughput rates.
Response time
Response time is the other critical metric about which end users are concerned. Without getting into queuing theory,
the most basic formula for response time is:
Response time = Service time + wait time
“Service time” is a function of how much data a query must read and evaluate, based upon your predicate. This
would include the time to perform joins, unions, sorts, etc. Generally, it is a function of how fast you can stream
massive amounts of data off disk (I/O throughput).
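As a simple, hypothetical illustration of the service time component (the data volume is an assumption; the throughput figures are the per-rack LFF scan rates quoted earlier), service time for a scan-bound query can be approximated as the data to be read divided by sustained I/O throughput:

Service time ≈ data scanned / I/O throughput
2TB scan / 12.8GB/sec (one LFF data rack) ≈ 160 seconds
2TB scan / 51.2GB/sec (four LFF data racks) ≈ 40 seconds

When several such queries queue behind one another, the wait time component grows as well, which is why high scan rates also help keep total response times down.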
Some people wonder if flash cache (SSD) will speed up query performance by reducing service times. While flash
cache will help many OLTP systems, it is unlikely that flash cache will provide the performance gains that most people
expect in a large data warehouse. The three main reasons why flash cache does not significantly boost business
intelligence-style query performance as much as people hope are:
• Unless the entire database can fit in cache, data will most likely be flushed out of cache when a table scan is performed
• If you plan on doing frequent database updates or trickle feed/real time updates to the data warehouse or ODS, flash cache needs to be maintained in order to make sure it is in sync with the database
• The flash cache path to the CPU is not much faster than multiple streaming disks, and disks are much less expensive than flash cache
The second component of the response time equation is wait time. “Wait time” is basically queue time, which is the
amount of time spent waiting for resources. Queuing tends to increase as more simultaneous queries are executed.
When compared to SMP systems, the HP EDW Appliance will lower “wall clock” service times by automatically
decomposing a query and executing multiple scans in parallel on multiple servers (each of which have multiple
cores).
Basically, you have multiple multi-core SMP servers working on a workload in parallel. “Wall clock” service time is
further improved because the EDW utilizes multiple, shared nothing, I/O subsystems (separate and distinct HP P2000
G3 MSA arrays). These lower “wall clock” service times reduce the risk of incurring queuing delays waiting for
system resources when the system is under load. In essence, this feedback loop of providing faster service times to
reduce wait times provides users with overall better performance on the EDW Appliance than on a large scale-up
SMP system.
It should also be noted that the EDW Appliance's shared nothing design has a major advantage over systems with
shared storage because loosely coupled, shared nothing architectures are not competing for and queuing on shared
storage resources when multiple queries hit your warehouse simultaneously.
Factors contributing to I/O throughput demand in a data
warehouse or data mart
Classes of users
Users today realize that they do not want to look through paper reports or large PDF files to find an answer to their
question or make a business decision. They prefer to issue ad-hoc queries which allow the data warehouse or
data mart to return the specific result set that answers their question quickly. In addition, the
questions asked of the system will tend to change weekly, daily or hourly.
Classes of users and types of queries
In general, there are three classes of business intelligence users within an organization. Typically, the nature of queries
generated by each user class tends to vary. Here are some generalizations which are commonly associated with
each user class:
• Casual users
This class of users usually has the largest population. Traditionally, this user needs static reports (such as PDF files or
hard copy), or to access aggregates or summary tables by using simple query tools. Casual users also tend to like
OLAP cubes (SSAS) because of memory speed access to aggregated data. In other situations, these users may access
front-end relational data marts.
Generally, queries generated by casual users have a highly qualified predicate, such as a narrow time range, which limits the
number of rows that have to be accessed.
The main question this class of users tries to answer is “What happened?”
• Business analyst users
This class of users tends to be more sophisticated and is smaller in population in most organizations. Traditionally,
they also use static reports such as PDF files or summary tables. However, by using more powerful tools, they are now
able to make business decisions faster by using dashboards, scorecards, Microsoft Excel, PowerPivot, etc. The queries
these tools generate may be targeted at a relational data mart, OLAP cube or directly to an EDW Appliance.
The main question business analysts try to answer is “Why did it happen?”
This type of user also tends to have a hypothesis they are trying to prove by querying the data warehouse. They may
take advantage of drill down functionality in an effort to gather data to prove their theory.
A good front-end appliance to use for business analysts is the HP Business Decision Appliance (BDA), which comes
pre-configured and pre-installed with Microsoft SharePoint and PowerPivot. Another appliance which serves as an
excellent relational data mart (spoke) is the HP Business Data Warehouse (BDW) Appliance.
HP appliances tend to be easier to purchase and allow customers to become productive faster and more easily because most of
the system build is performed in the HP factory. The goal is to make appliances as close to plug-and-play as possible.
Additionally, HP has run tests and has published a white paper titled: “High speed data export from an HP Enterprise
Data Warehouse Appliance to an SMP Microsoft SQL Server” available at:
http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA3-8382ENW&cc=us&lc=en
The above white paper is a step-by-step guide to implement high speed data export using the EDW Appliance
InfiniBand switches connected to the HP Business Decision Appliance or any other SMP-based SQL Server 2008
database. The Remote Table Copy feature is perfect to use when implementing an EDW hub and spoke architecture.
Microsoft also has a paper entitled "Parallel Data Warehouse Data Portability Overview", which may be of interest
to the readers who plan on exporting large quantities of data from the EDW Appliance:
http://download.microsoft.com/download/1/7/5/175A7833-6F75-418D-8800-86D44D1D712D/[MSDPPDW].pdf.
• Power users
Power users tend to be the fewest in number yet are critical in advising or setting a company's strategic
direction. They are looking for customer patterns, financial trends and may even use external data (such as
demographic or weather data) to gain insight into their company's current strategy with an eye open to
changing course as the business or market forces dictate.
The faster data can be analyzed and business decisions made, the higher the odds that these companies can gain a
competitive advantage in the market.
Power users tend to perform interactive, ad-hoc queries, OLAP operations and data mining. They also favor the use of
analytics, statistics and occasional modeling or simulations which may require selecting large data sets from the
warehouse. The PDW Remote Table Copy feature is a high speed, InfiniBand based method to extract these data sets
to be used by front end analytical tools.
In summary, power users frequently use EDW data for knowledge discovery; they explore the data looking to create
a new hypothesis, theory or market insight. Bear in mind that having the data assist in creating a theory is very
different than having a theory and looking for data to prove your hypothesis.
Classes of users and types of data accessed
In understanding how the nature of the questions will vary between different classes of users, we also need to drill
down and consider the type of data accessed by each user class. Frequently, the groups of users are likely to be
interested in different types of data. Similar to the generalizations used in the attempt to classify user queries, the
following discussion regarding data access also contains generalizations. Therefore, the actual data accessed by
users in your organization may vary.
• Casual users
These users tend to be interested in the subset of data which is aligned with their department or organization. They
also tend to be interested in recent data (days or months) vs. business/power users who typically perform longer term
trend analysis. More frequently than not, casual users access aggregated or summarized data. However, detailed
data is sometimes required for day-to-day operational use. In this case, some companies build an Operational Data
Store (ODS) within their business intelligence infrastructure for detailed, time critical, information. Rather than
implementing an ODS, some companies give users direct secure access to low level “fact” data in the EDW
Appliance.
The bottom line is that casual users traditionally access a relatively small percent of the data in the warehouse.
Therefore, the disk I/O bandwidth requirement to service their queries is relatively low when compared to business or
power users.
• Business analysts and power users
A business analyst's or power user's view of the world and the data they look at can vary significantly from company
to company. Some business and power users are organization specific and others may require a more global view of
multiple divisions or the whole corporation.
In general, this class of users frequently needs to access low level detail data to support their analysis. They also tend
to access detail data over longer periods of time and hit a higher percentage of the database than casual users. This
translates into increased disk I/O bandwidth demand when their demand is compared to that of casual users.
Based upon the above discussion, you can infer that queries which look for long term trends, perform time window
comparisons and are more global in nature are likely to demand more I/O bandwidth. Oftentimes people will try to
reduce this demand by implementing summary or aggregate tables (sometimes called materialized views). However,
summarization or aggregation pre-supposes that you know the business question or how the user plans on accessing
the data ahead of time. This scenario is unlikely, especially as queries tend to evolve as the business environment
changes. Therefore, it is conservative to assume that business or power users will most likely want to access large
amounts of low level raw data stored on the EDW Appliance instead of concentrating on aggregated data which is
frequently found in OLAP cubes or in smaller front-end data marts (spokes) in a hub and spoke architecture.
How time plays into user workload profiling
The time factor is also an important dimension to understand. Some classes of users only care about what happened
in the last few days, weeks or months; whereas other users may be interested in trend analysis relating to data
maintained over years. Obviously, the larger the time window for each class of users, the more I/O needs to be
performed to service their respective queries. For example, it will take longer to scan the last 60 months of data (5
years) vs. scanning the previous 12 months.
In summary, the above discussion relating the typical user‟s business needs to query profiling provides you with some
insight when trying to understand your own user population, query workload and mix. Once you understand your
end user workload profile, you will be able to estimate the I/O throughput demands expected of your data
warehouse.
The current release of the HP Enterprise Data Warehouse can scale to over 500 TB and support scan rates up to
3.9TB per minute. This will support most of the data warehouses or data marts which major corporations or
government agencies are implementing today.
Business intelligence user expectations from a performance perspective
When interacting with data warehouses, users generally want easy-to-use tools which will provide them with answers
to their business questions or insight into their data with consistent rapid response time.
The PDW software running on the HP EDW Appliance supports a wide range of Microsoft and third party user tools
which connect to the HP EDW Appliance via ODBC, OLE DB and ADO.Net drivers. These tools may present
information to users using deluxe executive dashboards, dials, or charts in tabular, report, 2-D or 3-D format. Other
tools may have a more technical, yet user-friendly interface, such as Nexus, which allow users to write their own SQL
queries, provided they have the security to do so. It is important that these tools be selected based upon each class of
users' technical capabilities and their business needs.
Regardless of the tools selected, most of them generate SQL or MDX queries. These queries are then
submitted to and executed on a data warehouse or data mart (the data mart may be relational in flavor or a memory-based OLAP cube). In either case, the user generally does not care whether the actual query is sent to a back-end data
warehouse or data mart. Most users are concerned about:
• Consistent response times to their queries
• Rapid response times when making a request
• High data quality
• System availability
• Support
Since this document concentrates on performance-related best practices, we will not be addressing data quality,
system availability and support. However, HP and Microsoft are very much aware of the fact that these are all critical
issues when implementing a successful BI environment.
Trying to qualify and set expectations regarding service level agreements (SLAs) for ad-hoc queries may be
challenging from a technical point of view. However, business users quickly realize that dimension table lookups are
typically extremely fast. Conversely, complex queries may take several or many minutes before a result set is returned
to the user.
Traditionally, DBAs spend a great deal of their time, effort and energy monitoring query response times, profiling
queries, and monitoring end user response time complaints. They attempt to solve these performance issues by
creating indexes in an attempt to reduce the number of I/Os servicing a specific query or report. However, since the
BI environment is very dynamic, trying to create, delete and maintain indexes may present a double-edged sword.
Advantages of using indexes
• Tends to reduce the number of I/Os to service a set of known queries/reports
• If you are lucky, an ad-hoc query may find the data it needs in an index
Disadvantages of using indexes
• Typical, non-technical end users expect all queries to perform in a similar fashion, with consistent response times.
Since most users do not know whether a simple, medium or complex query is being generated by their ad-hoc query tool,
they tend to complain or, even worse, stop using the system out of frustration when large variations in response times occur.
If users don't complain, the "lack of feedback" is worse because DBAs may not even be aware that a performance
problem exists and users may find an unofficial workaround to answer their business questions. These workarounds
may result in poor business decisions because users may be viewing data from different sources, giving them an
inconsistent point of view of what they think is the same information in the warehouse. In other words, their
workaround may be looking at derived information from an unknown source, which may skew the analysis even
further.
It is desirable to set user response time expectations correctly (meeting end user Service Level Agreements (SLAs)) and
to maintain relatively consistent I/O throughput levels.
• The nature of an index is to force random physical disk I/O (extra seek time is required to locate the indexed rows
on the disk). This extra seek time, potentially interrupting a table scan, slows down disk I/O throughput, resulting in
slower query response time and throughput.
Contrary to index-heavy designs, the EDW Appliance running SQL Server Parallel Data Warehouse software was
designed to support massively parallel, highly optimized scan rates. This was accomplished by minimizing seek
time to provide users with consistent, rapid performance and throughput by using the index-lite design techniques
that are discussed later in this document.
• Index-heavy designs require more DBA time spent analyzing frequently executed queries/reports and constantly creating new indexes in an effort to reduce the number of disk I/Os per query. DBAs also tend to be wary of deleting indexes because trying to understand an existing index's effectiveness is time-consuming and may be difficult. For these reasons, index-heavy designs tend to get heavier over time and performance can actually degrade due to extra disk seek time as ad-hoc query volume increases.
• As indexes get added or deleted, SQL statistics should be updated (a hedged example follows this list). Depending upon the size of the data warehouse and the sample interval specified when updating statistics, this updating event can be a time-consuming procedure that can extend your ETL batch window.
• The more indexes that need to be maintained during load/insert/update/delete operations, the longer it will take you to update your database.
• Index-heavy designs require longer batch ETL windows if indexes need to be dropped and rebuilt.
• Index-heavy designs require extra disk space (indexes may require 4 – 10 times more space due to redundant data being stored). If unused indexes are not identified and deleted, they simply take up extra space in your warehouse, causing more disk fragmentation and slowing down performance.
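As noted in the list above, statistics should be kept current as indexes and data change. The following is a minimal sketch, assuming a hypothetical fact table and column name; sampling options and naming conventions vary, so verify the exact syntax against the PDW documentation for your release.

-- Hypothetical object names; create and then refresh optimizer statistics
CREATE STATISTICS stats_FactSales_DateKey ON dbo.FactSales (DateKey);
UPDATE STATISTICS dbo.FactSales;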
In summary, PDW best practices strongly recommend that all indexes be eliminated and added back only when
absolutely necessary. This strategy not only improves EDW/PDW performance, but also shortens the ETL windows
and frees up a DBA's time for more productive tasks. End users are also happier because their response times are more
predictable.
Data compression variants
The EDW Appliance's PDW software uses SQL Server 2008 page compression. This feature automatically saves disk
space and allows more data to be retrieved per physical disk I/O. An advantage of SQL Server 2008 page
compression algorithms is that all the columns for multiple rows are compressed into a single page. If a query selects
columns or has a predicate which needs to access several columns, the PDW SQL Server 2008 R2 software does not
have to perform multiple reads to reconstruct rows before the columns may be used because the entire row is
compressed in each data block. Therefore, PDW page compression has the potential to save disk I/Os and consume
fewer CPU cycles for many query types. The end result is better performance for your executives, business analysts
and end users.
Some vendors have tried to improve query performance by storing data and compressing data by columns instead of
using row compression. While this technique may appear to be useful for some types of queries, it is not a golden
key to yield optimal performance for all types of analytic queries.
For example, it is common for column store and column compression algorithms to incur high reconstruction costs
when a query needs multiple columns returned to the user. These row reconstruction costs, when using column
compression algorithms, are likely to result in consuming more CPU resources as well as potentially requiring multiple
disk I/Os -- in the event that the compressed columns are stored in separate blocks on physical disk. Therefore, many
column store/compression techniques may not give users the response time gains they expect due to the longer than
expected query path lengths (extra CPU and disk I/O) when row reconstruction is required.
Microsoft Parallel Data Warehouse R2 software's row-level compression algorithms do not consume extra resources
for row reconstruction when a query needs to return multiple columns.
The PDW software allows "real time" trickle insert/update/delete operations. When a row is inserted into the
EDW/PDW database, only the page containing the affected row is compressed instead of having to compress
multiple pages or blocks. Some column store algorithms do not allow row level, trickle feed, operational data store
style insert/update/deletes.
In the final analysis, each customer's actual performance will be query workload dependent, but bear in mind that
end user query workloads will change over time and the HP EDW Appliance will tend to provide users with consistent
response times as your business grows and changes. In addition, SQL Server page level compression works efficiently
as more customers move toward real time business intelligence environments.
Traditional database designs vs. EDW and Fast Track
Traditional index-heavy trade-offs
This section recaps and expands the discussion above, explaining why the HP EDW Appliance (optimized for
SQL Server Parallel Data Warehouse software) and Fast Track index-lite database designs tend to provide users with
more rapid and consistent response times than traditional index-heavy database designs. This section also briefly
addresses how Fast Track reference architectures and the HP EDW Appliance are related.
Traditional index-heavy database designs
Traditional data warehouse/data mart database designs use fact tables, dimension tables, summary tables and
indexes in an effort to minimize I/Os when servicing queries.
When loading data, database tables are typically striped and balanced across one or more LUNs and file groups. In
turn, these filegroups are spread across multiple physical disks. Unfortunately, little attention is paid to the physical
location of where the data is loaded and where indexes are maintained.
Traditional index-heavy database load procedures
Typically, most customers try to load multiple fact tables and dimension tables simultaneously. Simultaneous load
execution, dynamic index updates or index rebuilds tend to result in data being physically fragmented on the disk, as
shown in Figure 6.
Figure 6. Hard drive fragmentation
Traditional index-heavy miscellaneous issues
Index-heavy database designs have some inherent issues discussed in the “Business intelligence user expectations
from a performance perspective – Advantages/disadvantages of using indexes” section of this document (above).
In brief, index-heavy designs result in extra disk head movement which is typically due to inefficient data and index
placement on the physical disk media. More importantly, index usage forces disk head movement that will slow down
table scan operations that typically occur for ad-hoc queries.
Figure 7. Index-heavy design encourages disk head movement which slows down throughput
Excessive disk head movement (seek times) can result in at least 2–4 times longer disk access times than expected.
Therefore, significantly more disk drives will need to be purchased to support a database with tables and indexes that
were loaded in a sub-optimal manner.
Traditional database design conclusions
Traditional database designs, load procedures and maintenance issues are likely to result in slower query response
times (scan rates), and index-heavy designs will be more difficult for DBAs to manage. In addition, there is likely to be
a variance between the throughput of canned queries, which were tuned by managing indexes, and that of ad-hoc queries, which
frequently scan data. Full or partial table scans are slowed down due to extra, index-induced seek time.
EDW and Fast Track trade-offs
EDW and Fast Track index-lite database designs
This section of the technical white paper will discuss how EDW and Fast Track design philosophies are similar. In
addition, this section will provide a more in-depth discussion regarding the advantages of EDW and Fast Track index-lite designs versus index-heavy designs.
EDW and Fast Track index-lite design
EDW, Fast Track and traditional data warehouse/data mart database designs typically use multi-dimensional, star-schema or snowflake schema designs. These are all variations on a similar database design philosophy that includes
the use of fact tables and dimension tables. Sometimes Enterprise Data Warehouse and ODS implementations normalize
data structures to a higher degree. The EDW appliance supports both multi-dimensional and more normalized
schemas.
An EDW and Fast Track best practice is to initially load data without any indexes. The performance of the data
warehouse executing large block sequential scans usually meets the user's SLAs. Care should be taken if indexes are
added. Performance tests should be performed to verify that the addition of an index actually helped EDW
performance. These tests should be done under load to more realistically represent multiple users querying the data
warehouse at the same time. Simply executing one query at a time will not accurately reflect a real world customer
environment which supports simultaneous query execution. This is because the extra seek time imposed by adding
indexes may not be observed during single query execution, but tends to become obvious as a mixed workload is
added.
EDW and Fast Track load procedures
As mentioned earlier, most executives, business analysts and BI users today do not necessarily know what questions
(queries) they will ask ahead of time. Users also tend to get frustrated if they find that query response times are wildly
inconsistent. EDW and Fast Track both deliver relatively consistent performance by optimizing the system to
maintain extremely high scan rates.
EDW and Fast Track are efficient “scan rate engines” which provide excellent price/performance. This is
accomplished by using some of the following best practices techniques.
Initial load best practices for Fast Track
• Create filegroups and pre-define filegroup storage space (see the sketch after this list)
• Allocate extents sequentially in the filegroups during the initial load so that data is stored sequentially on disk
• In order to maintain optimal performance, it is desirable to maintain about 15-20 percent free disk space per storage node
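The following is a minimal sketch of what pre-defining filegroup space might look like on an SMP SQL Server system built along Fast Track lines; the database name, file path and sizes are hypothetical and would be dictated by your specific reference architecture. (On the EDW Appliance, as described in the box that follows, the PDW software performs this kind of placement automatically.)

-- Hypothetical names, path and sizes; pre-size the file and disable autogrow
-- so extents are laid down sequentially during the initial load
ALTER DATABASE SalesDW ADD FILEGROUP FG_Data1;
ALTER DATABASE SalesDW
ADD FILE
( NAME = 'SalesDW_Data1a',
  FILENAME = 'E:\MSSQL\Data\SalesDW_Data1a.ndf',
  SIZE = 100GB,
  FILEGROWTH = 0 )
TO FILEGROUP FG_Data1;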
Example of EDW/PDW simplified database management
Unlike Fast Track or any other SMP SQL Server implementation, managing filegroup, extent, database and table
creation operations across up to 40 independent instances of SQL Server is simplified on the EDW appliance. The
PDW software on the HP EDW Appliance automatically performs these operations on your behalf across all 10-40 EDW compute nodes. Therefore, your operations staff or DBA does not have to manage multiple
SQL Server instances or servers separately. Hence, managing an EDW Appliance is easier than managing
multiple SMP-based SQL Server environments (such as Fast Track).
Since the goal of both EDW and Fast Track is to physically store data sequentially, data may be streamed at high
rates of speed off of the disks because there should be little or no disk head movement to seek and find data.
Figure 8. Example of how data may be stored sequentially for optimal performance
Subsequent loads
• Subsequent loads will tend to have data physically clustered on the disk in their respective filegroups. This
clustering of data allows for efficient scan operations.
How data can become fragmented over time
• Loading data into multiple fact tables may cause a slight increase in disk latency and/or seek time
• Changes to dimension tables after the initial load may also cause some fragmentation. However, in most data
warehouses and data marts, dimension tables are relatively static.
Later in this document we will discuss how EDW/PDW best practices optimize dimension table performance.
Figure 9. Example of how data may be fragmented over time
Note
Similar colors represent the same table. Notice how all similar colors are
clustered in large blocks, even though they are not sequential. For example,
this type of large block fragmentation may be clustered date ranges;
therefore, query performance should not be significantly affected because
most queries have a date range as part of the predicate.
• This clustering of fact table data across the disk is typically not a problem because most queries access the most
recent data, which tends to be clustered in groups (e.g., users may query what happened last week).
• If a query accesses multiple date ranges, each date range will likely be clustered together on disk. Therefore, disk
head movement will be minimal. Even though seek time will not be completely eliminated, scan rates will still be
excellent.
How to defragment data on EDW to encourage sequential scan rates
• To re-org, perform a CTAS (CREATE TABLE as SELECT) operation
OR
• Backup and restore
In general, CTAS is fast and the preferred method to re-org a table. However, it should be noted that a CTAS
operation may consume as much as three times the storage space as the original table during execution, so make
sure that you have enough disk storage available to execute your CTAS commands.
EDW & Fast Track index-lite design conclusions
Try to eliminate all indexes from EDW and Fast Track database designs. The concept is to stream data as fast as
possible because fast table scans will service most queries faster than performing fewer, slower random disk I/Os
while the system is under load.
EDW and Ultra-Shared Nothing technology
EDW is more than just an index-lite design. Other parallel database implementations on the market share storage.
EDW has been intentionally designed as a true, InfiniBand-based, loosely-coupled architecture that does not share
storage. Ethernet is only used for connecting the EDW Appliance to the customer's network and for internal command
and control functionality.
From a performance point of view, it is important to note that shared nothing MPP implementations are more efficient
than shared storage architectures. A major reason for the added efficiency is that all of the disks in a shared
storage environment get bombarded with requests from all the database servers requesting data on behalf of their
respective database instance. The more database servers and instances in the database layer, the more pressure is
exerted on the shared storage subsystem. In addition, all of these simultaneous requests result in the disks performing
more random I/O because of this contention. Hence, shared disk architectures tend to not stream data as fast as
systems which do not share storage when multiple queries are executing simultaneously.
Conversely, loosely-coupled, shared nothing (MPP) architectures like EDW have storage subsystems that are
separated and isolated. These isolated units are called compute nodes. Each compute node contains a database
node and a storage node. See Figure 10. This loosely coupled, compute node concept allows EDW to stream data
off disk faster due to less disk contention and more efficient disk head movement than shared storage systems can
provide because the storage nodes do not accept requests from competing database instances/nodes.
Figure 10. EDW Appliance is a shared nothing, MPP architecture where multiple instances do not compete for storage nodes
Note
Each EDW compute node does more than simple predicate evaluation. Each
node has a fully functional copy of SQL Server 2008, which allows the
compute node to execute most SQL operations (joins, sorts, etc.) as close to
the physical data as possible. In addition, each compute node is aligned with
its own storage. This non-shared storage design avoids the contention that
shared storage subsystems encounter. Database servers send data to each
other, via InfiniBand, only when needed.
EDW's loosely coupled architecture encourages the use of "ultra-shared nothing" technology. The use of this
technology can dramatically reduce inter-nodal traffic, which allows the EDW Appliance to scale effectively.
In order to exploit the power of ultra-shared nothing technology, you need to know when to use distributed or
replicated tables. The concept of distributed or replicated tables tends to be easiest to explain using multi-dimensional database designs. The concepts work equally well when implementing a more normalized schema
sometimes found in operational data stores or in some Enterprise Data Warehouse environments.
Ultra-shared nothing – replicated tables
In order to minimize inter-nodal traffic for join operations or dimension table look-ups, a best practice for EDW is to
define small tables (typically dimension tables), to be replicated on each compute node. In other words, each
compute node has its own copy of the replicated table on disk. It is also likely that frequently accessed replicated
tables are cached for rapid response time.
This replication process is transaction-protected and automated, so the operations staff or DBAs do not have to
manually replicate tables across the 10-40 servers housed in your EDW Appliance. The PDW software automatically
assumes that tables will be replicated (default) unless the table was specifically defined as a distributed table.
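As a minimal sketch of what a replicated dimension table definition might look like (the table and column names are hypothetical; consult the PDW documentation for the complete CREATE TABLE syntax):

-- Hypothetical dimension table, replicated to every compute node
CREATE TABLE dbo.DimProduct
( ProductKey    int          NOT NULL,
  ProductName   varchar(100) NOT NULL,
  Category      varchar(50)  NOT NULL )
WITH ( DISTRIBUTION = REPLICATE );

Because replicate is the default, the WITH clause could be omitted here; it is shown explicitly for clarity.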
Figure 11. The replication process across multiple compute nodes
Ultra-shared nothing – distributed tables
As shown in Figure 11, replicated tables allow for very efficient join operations, especially star joins, because the
dimension tables will always be local on each compute node, thus minimizing or eliminating dimension table inter-nodal traffic. Ultra-shared nothing replicated tables are not available in a shared disk architecture. Hence, EDW ultra-shared nothing design techniques will place less of a demand on the InfiniBand network and many join operations
will be more efficient.
On the other hand, fact tables can contain billions of rows, so they are not practical to replicate. Therefore, fact
tables are typically “distributed.”
Distribution is accomplished by defining a "distribute" key column. This key column is subsequently hashed so that the large
(fact table) data is distributed across all of the compute nodes in the EDW.
The goal of the distribution (hash) column is to evenly distribute data across all the compute nodes in the data racks.
See Figure 12.
Note
One data rack has 10 active compute nodes and four data racks contain 40
active compute nodes.
Figure 12. Distributed table
Distributing data evenly allows table scans to execute efficiently, in parallel, in order to provide users with rapid
response times.
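A hedged sketch of a distributed fact table definition follows; the table, column and distribution key names are hypothetical, and in practice the hash column should be a high-cardinality column that spreads rows evenly across the compute nodes.

-- Hypothetical fact table, hash-distributed across all compute nodes
CREATE TABLE dbo.FactSales
( DateKey       int    NOT NULL,
  CustomerKey   int    NOT NULL,
  ProductKey    int    NOT NULL,
  SalesAmount   money  NOT NULL )
WITH ( DISTRIBUTION = HASH (CustomerKey) );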
Other PDW software benefits
Partitioning distributed data
In addition to distributing data, PDW software also allows for “partitioning” and “clustering.” These features allow
DBAs and your operations staff to manage large quantities of data more effectively.
A good example would be to partition data by date range. As the data warehouse grows, old historical
partitions may be archived and then dropped.
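Building on the distributed table sketch above, a hedged example of adding date-range partitioning to the table definition might look like the following; the boundary values and column names are hypothetical, and the supported partitioning options should be confirmed in the PDW documentation.

-- Hypothetical names; hash-distributed and partitioned by date range
CREATE TABLE dbo.FactSales
( DateKey       int    NOT NULL,
  CustomerKey   int    NOT NULL,
  SalesAmount   money  NOT NULL )
WITH ( DISTRIBUTION = HASH (CustomerKey),
       PARTITION ( DateKey RANGE RIGHT FOR VALUES (20100101, 20110101, 20120101) ) );

Older date-range partitions can then be archived and removed as the warehouse grows.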
Loading data into the EDW
Data loaded into the EDW is stored on the Landing Zone node before it is loaded into the PDW database. Once
data is on the landing zone, it may be loaded using DWLoader, which loads data in parallel for maximum speed.
It is also possible to have application software, such as SQL Server Integration Services (SSIS) perform direct
insert/update or delete operations to the PDW data warehouse/mart. These insert/update or delete operations may
also be transaction protected.
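As a rough sketch of what a dwloader invocation might look like, the file name, target table, delimiters and flag values below are hypothetical, and the flag names themselves should be verified against the dwloader documentation shipped with the appliance before use:

dwloader.exe -S <control node> -U LoadUser -P <password>
  -i \loads\fact_sales_20111015.txt
  -T SalesDW.dbo.FactSales
  -t "|" -r "\r\n" -M append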
External ETL tier
Some customers may already have a BI infrastructure with ETL software on a dedicated
server. Therefore, input data may have already been remapped, cleansed and formatted by the ETL software running
on the external ETL server. In this case, it is likely that the landing zone functions as a true “landing zone” which
temporarily stores data to be loaded into the EDW Appliance compute nodes.
EDW/PDW internal ETL tier (landing zone)
Customers also have the option to have their OLTP systems or other external feeds store raw data, directly, on the
landing zone. Then, SQL Server Integration Services (ETL software) may remap, cleanse and reformat data to be
loaded into the EDW compute nodes.
Figure 13. Landing zone is used to load data into various schemas
The landing zone is physically a ProLiant DL370 G6 (X5690) with 36 GB of memory and 10 * 2 TB disks (RAID 5),
which may be used for storage. If your ETL workload requires more resources than what is provided on the EDW
Landing Zone, an external ETL tier would be desirable to run ETL software.
Note
The landing zone is the only server on which the customer is allowed to install
application code. The server has a CD/DVD and USB port for software
installation.
SQL Server Integration Services (SSIS) optimization
This document will not cover SSIS optimization. We are assuming that the OLTP data was already cleansed by ETL
software (such as SSIS) on an external server or using ETL software executing on the landing zone. Microsoft has
useful documentation regarding SSIS tuning tips, available at:
http://technet.microsoft.com/en-us/library/cc966529.aspx
How loading data can affect EDW/PDW performance
Earlier in this document we discussed how PDW table replication and distribution features are leveraged to optimize
query response time by performing more efficient join operations and executing faster (parallel) scan rates.
In addition, PDW software also provides the option of using “staging tables” to automatically optimize how data is
physically stored on disk.
When loading data, you simply have to specify the staging database in the load command and "dwloader" will
automatically optimize load operations by ensuring that data in the target PDW data warehouse is stored sequentially
on disk.
Sequential storage of your data warehouse data on disk and efficiently organized SQL Server block structures allow
replicated and distributed tables to perform more efficiently. Scan operations are faster and more data is retrieved
per data block due to these load optimization features. In addition, PDW data is automatically compressed to further
enhance the system's effective throughput. Compression allows more rows to be read in each block retrieved from the
physical disk, and no row reconstruction is required if multiple columns need to be accessed when executing the user's
query.
The default for PDW is to compress rows in the database and PDW compression algorithms allow:
• Data to be compressed during bulk load or during trickle insert/update/delete operations. This means that EDW
performance can be maintained in real time operational data stores or data warehouses. Queries can also execute
while loads, inserts, updates or deletes are taking place.
• All the columns in your selected rows to be decompressed as each block is read into EDW memory. This is an
advantage over column compression algorithms because multiple compression groups do not need to be read to
reassemble rows in an effort to retrieve all the columns required to service queries.
• PDW row-level compression to use CPU resources efficiently, freeing up CPU time for other useful work.
Reorganizing EDW/PDW tables
CTAS (CREATE TABLE as SELECT)
CTAS is different from SQL Server's INSERT with a SELECT. This is because a SQL Server INSERT with a SELECT:
 Runs “serially” – each distribution is run separately to provide transaction safety across the entire appliance
 Runs within a SQL Server transaction
 In the event of failure, a full transaction rollback (up to the last commit point) will occur
On the other hand, CTAS using PDW software is significantly faster because it runs in parallel and does minimal
logging.
CTAS will, most likely, be used by your operations staff or DBAs if data becomes fragmented. If this occurs, it may be
remedied by performing a CTAS (Create Table As Select) operation. This task will restructure the target table to allow
for more efficient sequential scan performance.
In addition to defragmentation, CTAS operations are useful in order to perform the following tasks:
 Create another copy of a table
 Create distributed from replicated or vice versa
 Create new table distributed on a different column
 Create new table with different clustered index, or with no index
 Create new table with different partitioning
 Create a temporary version of a table
CTAS will always create a new table when executed. If no WITH option is used, CTAS will copy the existing table
with no indexing or partitioning. (See the .chm help file for CTAS syntax.)
CTAS example:
CREATE TABLE [ database_name . [ dbo ] . | dbo. ] table_name
[ ( { column_name } [ ,...n ] ) ]
WITH (
DISTRIBUTION = { HASH( distribution_column_name ) | REPLICATE }
[ , <CTAS_table_option> [ ,...n ] ]
)
AS SELECT select_criteria
[;]
The RENAME statement may be used after a CTAS operation to rename a table back to its original name.
RENAME <class_type> entity_name
TO new_entity_name[;]
<class_type> ::= {
DATABASE
| OBJECT}
Examples:
RENAME DATABASE SalesNew TO Sales;
RENAME OBJECT CustomerNew TO Customer;
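As an illustration, the following is a minimal sketch (all table and column names are hypothetical) of combining CTAS and RENAME to rebuild a fragmented table, redistributed on a different hash column, and then swap it back under its original name:
-- Rebuild the table, redistributing on a different hash column
CREATE TABLE FactSalesNew
WITH ( DISTRIBUTION = HASH (CustomerKey) )
AS SELECT * FROM FactSales;
-- Swap the rebuilt table in under the original name
RENAME OBJECT FactSales TO FactSalesOld;
RENAME OBJECT FactSalesNew TO FactSales;
The old copy (FactSalesOld) can then be dropped once the new table has been verified.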
Defragmenting indexes
Despite the fact that PDW best practices discourage the use of indexes to improve I/O throughput, it is likely that
some indexes may be required in some situations. Performance testing is encouraged to verify whether the addition of
an index helps or actually slows down EDW/PDW throughput.
If your database does have indexes, in the event that these indexes become fragmented, you should use the ALTER
INDEX command to defragment these indexes.
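For example, a simple sketch using the ALTER INDEX syntax shown later in this document (the table name is hypothetical):
-- Rebuild all indexes on the table; the rebuild runs in parallel on each compute node
ALTER INDEX ALL ON FactSales REBUILD;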
Creating database & tables
Creating a database
At this point, the reader will recognize the differences between replicated and distributed tables. We will now take a
look at the next level of technical detail. This involves actually creating the database and tables.
Behind the scenes, the PDW software manages anywhere from 10 to 40 active SQL Server instances when executing
the Create Database command. Each compute node has its own copy of SQL Server 2008, but commands are
executed as if you are managing a single database.
Transparent to your Database Administrator (DBA) and operations staff, files and filegroups are organized across
each SQL Server instance and are managed by PDW software. This virtually eliminates the need for manual filegroup
and file placement. Hence, the EDW Appliance is automatically tuned to maximize parallel I/O throughput while
providing users with excellent query performance and fast load times.
Let's take a look at the syntax by using an example:
CREATE DATABASE db_name
WITH (
AUTOGROW = ON | OFF (default is OFF)
,REPLICATED_SIZE = 1024 (in GB)
In this example, the total space required on all 10 compute nodes is 10 terabytes:
1024 GB per replicated table * 10 (compute nodes) = 10 terabytes
In the background, each compute node will have files created:
Filesize = 1024GB/8 Files = 128GB per file
Each 128GB file is stored in each Filegroup spanned across all 8 LUNs (16 RAID10 physical disks).
,DISTRIBUTED_SIZE = 16384 (in GB)
Total disk space required: 16384 GB or 16 terabytes
Space is divided evenly across all compute nodes in the EDW (10 - 40 nodes) using 1 to 4 data racks.
Each compute node has 8 distributions (1 distribution on each of the 8 LUNs in the storage node). Each data
rack has 10 compute nodes, so each data rack will have 80 distributions.
One LFF (Large Form Factor) Storage node is associated with each compute node.
Note that Small Form Factors are configured differently.
In the example shown in Figure 14, we will assume 1 data rack with 80 distributions. Therefore, each
distribution in our single data rack example equals 204.8GB:
Distribution Size = 16,384GB/80 Distributions = 204.8GB/distribution
Figure 14. One P2000 G3 MSA (LFF disk example) associated with a single compute/storage node pair
,LOG_SIZE = 100 (in GB)
The log file should be at least 2X the size of the largest load file or CTAS operation for a Clustered Index Table,
especially if varchar is used
In this example, each compute node: Filesize = 100GB per node (10-nodes = 1TB Total)
)
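Putting these clauses together, a consolidated version of the example above might look like the following (the database name is hypothetical; the sizes match the walk-through):
CREATE DATABASE SalesDW
WITH (
    AUTOGROW = OFF,
    REPLICATED_SIZE = 1024,    -- GB per compute node reserved for replicated tables
    DISTRIBUTED_SIZE = 16384,  -- GB divided evenly across all compute nodes
    LOG_SIZE = 100             -- GB per compute node for the log
);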
The CREATE DATABASE command is a good example of how the PDW software gives your DBA and operations
staff the appearance of a single system. When the command is executed, it simplifies management behind the scenes
by accessing all 10 to 40 instances of SQL Server, while giving the appearance of a single database to the person
executing the command.
Creating a table
Creating a table in a PDW database is relatively straightforward from a syntax point of view. However, in the next
section we will be discussing best practices for choosing good “distribute_on”, “cluster_on” and “partition_on”
columns.
Figure 15. How to define the column to “distribute_on”
Earlier in the document, we talked about distributing large tables. These are traditionally fact tables. In order to
improve performance, it sometimes makes sense to cluster data within each distribution. Finally, partitioning is
frequently used to make database management easier, especially as partitions need to be dropped from the tables
over time. For example, it is common to partition by date so the oldest months may be more easily dropped and
potentially archived.
Important
If the DISTRIBUTE_ON option is not specified, the default is to replicate the
table. Therefore, you must use the “DISTRIBUTE_ON” option for large,
multi-billion row tables.
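For example, a minimal sketch of a large fact table explicitly distributed on a hash key (hypothetical table and column names, using the DISTRIBUTION clause shown in the examples later in this section):
CREATE TABLE FactSales
(
    SaleID bigint NOT NULL,
    CustomerKey integer NOT NULL,
    SaleAmount decimal(18,2)
)
WITH ( DISTRIBUTION = HASH (CustomerKey) );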
Partitioning a replicated table
Frequently, dimension tables are relatively small to medium sized. Replication of these tables is simple and
straightforward. However, you may want to consider partitioning a large, replicated dimension table. It is recommended
that performance testing is done with and without partitioning to determine if the partitioning helps with performance
for your specific query workload. For example, sometimes partitioning a table with a non-clustered index may cause
extra random I/O which may inhibit query response time and throughput.
Partitioning example:
CREATE TABLE TestTable
(
partition integer NOT NULL,
FirstName varchar(25),
PostalCode integer)
WITH
( PARTITION ( partition RANGE LEFT
FOR VALUES (10, 20, 30, 40, 50, 60, 70, 80, 90)) );
The data is partitioned based upon the range of partition values in the example above.
Figure 16. Example of partitions based upon range of partition values
Partition    Value
1            col <= 10
2            10 < col <= 20
3            20 < col <= 30
4            30 < col <= 40
5            40 < col <= 50
6            50 < col <= 60
7            60 < col <= 70
8            70 < col <= 80
9            80 < col <= 90
10           col > 90
Currently, PDW software allows:
 Up to 2 billion tables per database
 Up to 1,024 columns per table
 Number of rows limited only by available storage
 Maximum bytes per row is 8,060
Tables with varchar columns may be defined with a total width that exceeds 8,060 bytes; however, the actual width of
the data in all columns of a row must still fall within the 8,060 byte limit.
Best practices for distributed tables
Supported schemas
The HP EDW Appliance supports multi-dimensional, star, snowflake schemas which are common in data warehouses
and data marts. In addition, the appliance also supports normalized database designs which may also be found in
ODSs or in some large Enterprise Data Warehouse designs. The use of distributed tables is one of the features which
allow the HP EDW Appliance to provide customers with the performance of truly scalable, massively parallel, shared
nothing architecture.
Distribution based upon hash keys
In most data warehouse schemas, it is common to find large and small tables. Generally, large tables are evenly
distributed across all the available compute nodes by using a hash key which is generated from a single column in its
respective table. However, it should be noted that it is sometimes desirable to replicate a large table in order to
minimize InfiniBand network traffic if large tables are joined. Very large table replication is rarely desirable and
replicating large tables should only be considered on a case-by-case basis dependent upon the nature of the end user
query workload.
When creating distributed tables, in the background, the PDW software creates eight physical distributions per DBMS
instance (80 distributions per 10-node data rack, or 320 distributions in a 4-data-rack appliance). Behind the
scenes, each distribution results in a distinct physical table within each compute node's instance of SQL Server. This
fact implies that the data in the column selected to be used to generate the hash key for any given table needs to
have many more possible values than distributions on the appliance.
How to determine a good hash key column
Before considering a hash key column, it is highly recommended that you understand your data and the nature of
your query workload.
Test the cardinality (high cardinality means many unique values) of a candidate distribution key by using a query with
count(*) and a GROUP BY on the targeted distribution column to detect likely skew.
 Use SELECT COUNT/GROUP BY to determine cardinalities of candidate columns
Example: Select top 1000 <key>, count(*) as cnt from <table> group by <key> order by cnt desc
 The “hash” key column should have a large number of distinct values (high cardinality) with significant variation in
values and in volume.
 Note that, even more important than simply having a large number of possible values, the values in the hash key
column should have a relatively even distribution (minimal skew).
 Minimal skew can be controlled by ensuring that no single key value represents a large percentage of rows on its
own (including NULLs). Watch out for NULL because all nulls will hash to a single distribution and skew rows
across your distributions poorly.
 It is important that the distinct values of distribution key be at least 10 times the number of table distributions on the
appliance. The larger the number of distribution key values, the less concern there will be with a few dominating
values that can skew the number of rows in each distribution.
Example:
If we have a single data rack appliance, then we would expect 80 distributions for any given distributed table.
Therefore, it is desirable to have at least 800 possible values in the column that you would like to use to hash the
distributed table. Conversely, if you have four data racks, then you want your hash key column to have at least 3,200
possible unique values because four data racks will have 320 distributions.
Example:
If you decide to distribute your table on “StoreID” column and there were only 100 stores in the company, then
“StoreID” is likely to give you a poor distribution on a single data rack with 80 distributions. If you have four data
racks, in this example, “StoreID” would be inappropriate because there are too few possible values.
Time-based elements tend to be poor choices for hash key distributions.
Once you define the hash key column in the “distribute_on” clause of the “Create Table” statement, PDW will
automatically create tables on each compute node to reflect this design choice.
In the rare case that you do not have a good hash key column allowing for a good distribution of data, you may
have to create an artificial hash key column or possibly concatenate columns into a single column in your ETL
software. In this way, the new column may be used to generate hash keys which are more evenly distributed across
all of the table's distributions.
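Although this derived column would normally be produced by your ETL software, the idea can be sketched in SQL; the following hypothetical CTAS builds a composite key from two low-cardinality columns and distributes on it:
-- StoreID alone has too few distinct values, so combine it with another column
CREATE TABLE FactSalesNew
WITH ( DISTRIBUTION = HASH (StoreOrderKey) )
AS SELECT
    CAST(StoreID AS varchar(10)) + '-' + CAST(OrderID AS varchar(20)) AS StoreOrderKey,
    StoreID,
    OrderID,
    SaleAmount
FROM FactSales;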
Important
Sometimes a poor distribution key is selected during the design phase of a
project or, as time passes, the cardinality and distribution of your data
change so your distributions are no longer even. In these cases, you may
decide to redistribute the table on a different column. The easiest way to do
this is to execute a CTAS statement based upon the new hash key column
distribution. CTAS is generally faster than doing a reload of the table.
Tip: If you distribute a large dimension table, try to distribute commonly
joined fact tables on the same key. This distribution will minimize network
traffic within the EDW Appliance.
Data skew
Data skew is defined as having a disproportionate number of rows in one or more distributions when compared to
other distributions across all nodes in the appliance. A user table is said to have data skew if one or many
distributions has significantly more rows than others.
While desirable, it is unlikely that a hash key column can be found that provides a perfectly even distribution of data
across all of the distributions on all of the compute nodes. As a general rule of thumb, data skew between 10% and
30% may have an impact on performance, and data skew greater than 30% for a given distribution will have a
significant impact on performance. Hence, data skew should be kept within 1% - 7% for optimal and consistent
performance.
What causes skew?
Data skew tends to be caused by a poor, non-uniform, distribution frequency of data within the hash key column. An
example may be when a retailer sells a few items disproportionately higher than other items. In this case, if an
“item_number” column was defined as a hash key for this retailer, it is possible that some distributions would have
significant data skew. Therefore, queries that require a table scan would have to wait for the distributions that have
more rows to complete before query execution will complete.
If data structures allow for non-unique distribution keys, this can also become an issue if the key is heavily weighted to
certain values.
Another cause of data skew can be poor data quality. A disproportionate number of NULL values in the distribution
column can result in a skewed distribution of data among distributions. Bad data can also create repeating values in
a distribution key.
Detecting skew
Before you go into production, it is desirable to make sure that data is evenly distributed across your distributions.
This may be accomplished by:
 Loading a representative sample of data or the complete initial load
 Perform a “Select 1 from <table>”
 Look at the query plan and see how many rows came from each distribution
Or issue:
“Select top 1000 <key>, count(*) as cnt from <table>
group by <key> order by cnt desc”
If the data in your distributions are skewed, try another key column or you can try creating a random or surrogate
key, if necessary.
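A concrete version of that query, assuming a hypothetical FactSales table and a candidate distribution column named CustomerKey, would be:
-- Look for individual key values (including NULL) that dominate the row count
SELECT TOP 1000 CustomerKey, COUNT(*) AS cnt
FROM FactSales
GROUP BY CustomerKey
ORDER BY cnt DESC;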
Partitioning distributed tables
By definition, distributed tables are already segmented by hashed distributions to allow for massively parallel
performance. Tables are distributed with the intent of optimizing appliance-wide CPU and I/O subsystem
consumption. Partitioning distributed tables is typically done for manageability. In essence, partitioning will further
partition rows within a distribution. Based on a partition function:
 Partitions are within each respective compute node
 Date/time or integer surrogate based partitioning is common
 Typically the partition column is the same as the clustered index key
For those readers who are interested, partitions (applied within each schema) are all placed in the same filegroup.
Examples of why partitioning is useful include operational efficiency when: adding, loading, archiving, dropping,
and switching/sliding partitions. Partitioning may also be desirable to SWITCH partitions OUT for fast delete of
history or IN to modify or add a specific historical slice. In general, partitioning allows DBAs to manage data for very
large tables in smaller chunks by using the ALTER TABLE syntax.
ALTER TABLE [ database_name.[ dbo ]. | dbo.] source_table_name
    [ { SPLIT | MERGE } RANGE boundary_value ]
  | [ SWITCH [ PARTITION source_partition_number ]
      TO target_table_name
      [ PARTITION target_partition_number ] ] [;]
Some customers partition distributed tables in order to optimize the loading of large tables. For example, partitioning
allows for fast loading of an unused partition and switching it in after the loading process.
Even though PDW best practices encourage the use of an index-lite database design, if a table has an index,
partitioning allows for re-indexing of smaller partitions, which results in a faster operation.
Finally, in the event that data needs to be reorganized, partitioning allows for piecemeal data reorganization at the
partition level.
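For example, using the ALTER TABLE syntax above, a freshly loaded partition could be switched into a large fact table as follows (the table names and partition numbers are hypothetical):
-- Switch the loaded partition from the load table into the corresponding partition of the fact table
ALTER TABLE FactSalesLoad
SWITCH PARTITION 12 TO FactSales PARTITION 12;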
Note
You cannot move partitions between distributed and replicated tables. If you
feel that a replicated table needs to be partitioned, consider making it a
distributed table.
For more information on SQL Server partitioning, see
http://technet.microsoft.com/en-us/library/cc966457.aspx
Example of distributed table with partitions:
CREATE TABLE TestTable
(
id integer NOT NULL,
LastName varchar(25),
ShipDate datetime
)
WITH
(DISTRIBUTION = HASH (id),
CLUSTERED INDEX (ShipDate),
PARTITION (ShipDate
RANGE RIGHT FOR VALUES
(
'1995-01-01', '1995-02-01', '1995-03-01',
'1995-04-01', '1995-05-01', ... )));
The number of distributions equals eight times the number of compute nodes. Therefore, one data rack will have
80 distributions created because there are 10 active compute nodes per data rack. Four data racks will have
320 distributions.
Partitioning into 10 ranges per compute node will effectively segment a table into 800 horizontal (row based)
partitions on a single data rack EDW Appliance.
Partitions for a given distribution are all placed in the same filegroup on its respective compute node.
Generally, it is a best practice to cluster and partition on the same column; this best practice holds true most of
the time, but it is query-dependent. Consequently, it is always a good idea to do performance testing to verify
your database design.
Clustered index vs. a heap table
Up until this point, we have discussed two of the most critical issues which will affect EDW Appliance
performance -- distributed vs. replicated tables and selecting a good hash key column for distributed tables. Another
common issue which should be evaluated when developing a PDW data warehouse involves the decision of whether
to use clustered indexes or a heap table structure.
Generally, a clustered index is recommended for most tables because query performance tends to be better if your
primary clustered index is frequently used in many queries. If your clustered index is not frequently used, then a heap
table may be a better alternative.
In line with the index-lite philosophy, secondary (non-clustered) indexes should be minimized or, preferably,
eliminated on the EDW Appliance in order to reduce random I/O. In addition to throughput gains, elimination (or
minimization) of indexes also makes the job of the DBA easier, makes system management simpler and ensures that
your ETL batch window will be shorter.
A clustered index is different than a secondary index. A clustered index is built into the table structure itself. This
means that a clustered index cannot be added or dropped later. However, if desired, you can use the CTAS
operation to change a clustered index. A clustered index is only valid when you use CREATE TABLE. On the other
hand, secondary indexes are physically stored and managed as separate files and may be added or dropped at any
time.
Even though query response times are generally improved when a clustered index is used, the reader should be
aware of the fact that clustered indexes will affect load windows by:
 Introducing a “sort” step during table loads (load slower than heap loads)
 Requiring periodic maintenance to defragment (use CTAS operation)
On the other hand, heap tables may be worth considering if batch load windows are weighted as a more important
issue to your business than end user query performance. Heap tables also have an advantage in that fragmentation is
minimized, so defragmentation is not necessary.
If you decide that you want a clustered index for your table, please consider the following best practices:
 Cluster fact tables on same column used for partitioning. This will allow optimal sequential I/O
 Minimize secondary (non-clustered) indexes on PDW to reduce random I/O
 Avoid using multi-column (compound) clustered indexes
 Use CREATE TABLE AS SELECT (CTAS) to defragment a clustered table
Heap table advantages:
 Provides for the fastest data loading
 Can be used on replicated tables if a clustered index cannot be easily determined
Now that you have a grounding in the tradeoffs between a clustered index and a heap, if you are still not sure you
have found a good clustered index that will help your query performance, your best course of action would be to test
some of your frequently used queries against both types of table structures. Monitoring query throughput, response
time and resource utilization during the testing will allow you to make the best decision for your database and
workload environment.
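As a sketch of the two alternatives (hypothetical table, modeled on the partitioned example shown earlier), the only syntactic difference is the CLUSTERED INDEX clause:
-- Clustered index version: rows are stored in ShipDate order (adds a sort step during loads)
CREATE TABLE FactShipments
( id integer NOT NULL, ShipDate datetime )
WITH ( DISTRIBUTION = HASH (id), CLUSTERED INDEX (ShipDate) );
-- Heap version: fastest loading, no index fragmentation to maintain
CREATE TABLE FactShipmentsHeap
( id integer NOT NULL, ShipDate datetime )
WITH ( DISTRIBUTION = HASH (id) );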
Non-clustered indexes
When designing a new data warehouse using the HP EDW Appliance and PDW software, it is desirable to start with
no indexes. Generally, the best performance under load comes from tables with no secondary indexes. The negative
effect of non-clustered indexes may not be noticed under a light workload, but as the query load increases, seek time
delays and disk queuing become more noticeable.
Non-clustered index considerations:
 Maximum index size for a non-clustered key is 900 bytes (Varchar columns may bypass this limit, but enforcement
is based on actual data)
 Requires ALTER permission on the table
 To view information on existing indexes, you can query the sys.indexes catalog view
 Indexes cannot be created on views
 Maximum number of non-clustered indexes is 999 per table
If you decide to rebuild an index via the ALTER INDEX command, consider the fact that indexes are rebuilt in
parallel on each compute node. This allows for rebuilding an index to complete faster, yet it also consumes more
resources during the rebuild process.
The ALL option will alter all indexes on the table, both clustered and non-clustered:
ALTER INDEX {index_name | ALL }
ON [ database_name.[ dbo ]. | dbo.]
table_name
REBUILD [;]
Remember that a PDW best practice is to use non-clustered indexes sparingly.
Best practices for loading data
PDW loading tools
Now that you have created your database and its distributed, partitioned and replicated tables, it is time to load data
into your EDW/PDW database. Determining the best data loading logistics for your environment is an important step
in maintaining high levels of performance. The following are some important issues to consider when implementing a
loading methodology:
 Reduce disk fragmentation
Loading data into your warehouse using the best practices outlined below will help reduce table fragmentation as
much as possible and allow for maximum I/O throughput because data will be able to stream off the disk drives
with minimal disk head movement.
 Provide system recovery capabilities in the event of data load failure
Some thought should be given to restart and/or recovery procedures in the event that a failure occurs in the middle
of a batch load.
 Load data while respecting concurrent queries and loads
The PDW software allows you to load data into your data warehouse while maintaining query performance.
However, despite the fact that PDW uses SQL Server Resource Governor internally to balance query vs. load
resources, you do not want to push this feature to its limit by loading excessively large volumes of data during your
peak query volume windows. End users who do so may see some degradation in performance.
The PDW software supports several different methods to load data into the PDW database. Extract transformation
and load software, such as SQL Server Integration Services (SSIS), may execute on the landing zone or on an
external ETL tier which generates flat files which are to be loaded into the PDW database. The landing zone is
physically configured with a DVD and USB, which allows customers to load and execute 3rd party software of their
choosing.
The PDW software allows customers to use multiple load options.
Loading tools:
 DWLoader Utility
 SQL Server Integration Services (SSIS)
SSIS may be used for bulk loads or for “close to” real-time trickle updates by having very tight windows for mini-batch loads.
 Standard SQL DML statements: INSERT/SELECT
SQL Insert/Select is useful for real time updates into the data warehouse or operational data store database. It is
generally not the preferred method for bulk loads.
 CREATE TABLE AS SELECT (CTAS)
CTAS is not used to load data from flat files. As stated earlier in the document, CTAS is good for defragmenting
tables, creating another copy of a table, creating distributed from replicated tables or vice versa, creating a new
table distributed on a different column, creating a new table with different clustered index or with no index, and
creating a new table with different partitioning or creating a temporary version of a table.
DWLoader best practices
 DWLoader is a bulk load, command line oriented utility. It is the tool which is most frequently used when loading
flat files which were already remapped and cleansed by Extract Transformation and Load (ETL) software.
 DWLoader is integrated with PDW's Data Movement Service (DMS) and streamlines I/O while minimizing data load times.
 The DWLoader utility input file must reside on the landing zone and DWLoader is not intended to support more
than 2,147,483,647 (2 billion) rows in the input file.
 DWLoader provides transactional protection and configurable batch size (10,000)
 The PDW software can load a maximum of 10 concurrent loads, running in parallel on the same appliance. At that
point it queues up subsequent load operations. No error is generated.
SQL Server Integration Services (SSIS)
SSIS can run either on the landing zone or on a separate server tier outside the PDW appliance. However, if you
choose to execute SSIS on the landing zone, make sure that it has enough horsepower to meet your service level
agreements.
Similar to DWLoader, SSIS can perform parallel load operations.
When using SSIS to load data the following may be useful tips to provide optimal SSIS performance:
 For good PDW loading throughput, it is important to keep a steady stream with minimal starts and stops.
 PDW connections and queries are very costly to initiate. Using fewer connections will allow for more efficient and
faster load rates.
 Data type conversion in the PDW destination adapter is very expensive. Be sure the input types match the
destination types, especially for strings and decimals.
An SSIS “.dtsx” package may be used to load data directly into the PDW database or create a flat file for
DWLoader. In the package workflow for SQL Server PDW, you can load and merge data from multiple sources and
load data to multiple destinations. The loads occur in parallel, both within a package and among multiple packages
running concurrently.
The combined total of loads with SSIS and DWLoader is 10 active loads per appliance. Each SSIS package
destination from the same data source counts as one load when the package is running. Multiple packages can run
concurrently as long as each package does not use more than the maximum active loads. For example, if the
maximum is 10 active loads and you submit two packages that each use 10 destinations, one package will run while
the other one waits in the load queue.
For maximum appliance throughput using SSIS loading, you will want to stream to multiple concurrent PDW
destination adapters.
In general, if you have a choice between SSIS and DWLoader, tests have shown that DWLoader is faster and is the
preferred method when loading from flat files. One reason why DWLoader is faster is SSIS locking behavior:
 When loading data with SSIS, SQL Server PDW uses row-level locks to update data in the destination table. Bear
in mind that the rows in the destination table are not locked while the data is loaded into the staging table.
 Loading data with SSIS initiates these locks:
- A SharedUpdate lock on the destination table and the destination database
- A SharedUpdate lock on the staging table (when a staging database is used)
The best suggested practice in general is that DWLoader is the most optimal way to load data into your PDW
database. However, if you require more ETL functionality (such as data cleansing, table lookups, mapping, etc.), then
SSIS is a better choice because SQL Server Integration Services is a fully functional ETL tool.
Staging database
Even though the staging database is technically optional, benchmarks have shown that the use of a staging database
generally improves the EDW Appliance's load rates for most data warehouses.
 There should only be one staging database per EDW Appliance.
The staging database is shared by all load jobs for all destination databases on the EDW Appliance.
 When creating the staging database, use the following guidelines:
- Replicated table size should be the estimated size per compute node of all the replicated tables that will load concurrently.
- Distributed table size should be the estimated size per appliance of all the distributed tables that will load concurrently.
- Log size is typically similar to the replicated table size.
 When your data warehouse first goes into production, your initial PDW load jobs tend to be the largest. Therefore,
you may want to consider reducing the size of your staging database to more closely represent your largest
incremental loads vs. the initial load.
To reduce the size, you can drop the staging database and create it again with smaller allocations, or use the
ALTER DATABASE statement.
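A hypothetical staging database sized along these guidelines might be created as follows (all sizes are illustrative placeholders, not recommendations):
CREATE DATABASE StagingDB
WITH (
    AUTOGROW = OFF,
    REPLICATED_SIZE = 50,      -- estimated GB per compute node of concurrently loading replicated tables
    DISTRIBUTED_SIZE = 1024,   -- estimated GB per appliance of concurrently loading distributed tables
    LOG_SIZE = 50              -- typically similar to the replicated table size
);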
Miscellaneous loading tips
This section will point out a few more PDW best practices which were alluded to, but that were not necessarily
spelled out in the earlier sections of this document.
 Make loads partition-friendly by grouping data by partition prior to loading.
 If necessary, drop non-clustered indexes before loading and re-index after all loads are complete. This action will
help minimize fragmentation. During development, performance tests should be performed to find out which
method performs best in your environment.
 There is no benefit to sorting data before hitting the landing zone.
 The main reason why there is no performance gain from pre-sorting has to do with the staging table and the use of a
clustered index on the destination table. Since most customers use a staging table as part of the loading process,
data is automatically sorted during the load procedure. Hence, pre-sorting data does not save any time during the
load.
 Pre-sorting data can also hurt load performance because pre-sorted input rows may result in unbalanced compute
node activity during the load window.
 Use CTAS to expand or grow data (e.g., for testing) instead of using a standard SQL DML INSERT/SELECT
statement.
 Minimize page breaks (fragmentation) by designing partition-friendly loads.
 Loading and updating “wall clock” times for replicated tables tend to be much longer than for distributed tables.
This is because a full copy of each replicated table must be written on every compute node in
the appliance.
After data is loaded into the PDW database
During the PDW database development cycle, it is desirable to define and use a few “master queries” which
represent your end user query workload. These queries should be run once every few months to verify that end user
response time and I/O throughput remained relatively consistent over time. Pro-active periodic query performance
monitoring is more desirable than having end users bring up query performance as an issue.
PDW software will automatically update statistics after approximately 20% of a table's rows have changed. However,
based upon your performance monitoring tests, it may be desirable to increase your update statistics frequency. Don't
update statistics too often, though, because there may be a trade-off between improved query plans and the
recompilation time caused by updated statistics.
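If you do decide to refresh statistics on a schedule, and assuming the standard SQL Server UPDATE STATISTICS form is available in your PDW release, a minimal example (hypothetical table name) would be:
-- Refresh the statistics on the table
UPDATE STATISTICS FactSales;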
Miscellaneous facts about update statistics:
 All statistics use standard SQL Server sampling.
 PDW creates only single-column stats. However, compound, multi-column (covering) statistics may be implemented,
but they must be manually created.
If your joins are on multiple key columns, multicolumn stats help avoid mistaken Nested Loop plans on the SQL Server
nodes. The following are some clues which may warrant the exploration of using multi-column statistics:
 A long-running query step involving multicolumn join key and experiencing an inordinate amount of CXPACKET
waits
 SQL query plan (estimated) shows Nested Loop and a very small number of estimated result rows from the join
 Actual query has millions of rows running through a Nested Loop operation
Multi-column statistics example:
SELECT ...
FROM fPageView A ...
JOIN fCookiePageView cpv
  ON cpv.PageViewID = A.PageViewID
  AND cpv.SessionID = A.SessionID ...
Performance can be improved significantly by creating multicolumn stats in the above query which joins two fact
tables and multiple columns:
Create Statistics pv_page_sess
on fPageView(PageViewID, SessionID)
Create Statistics cpv_page_sess
on fCookiePageView(PageViewID, SessionID)
PDW Create Statistics uses syntax similar to the standard SQL Server command but the current release of
PDW software has these limitations:
 No filter predicates
 No sampling percentage options
In order to complete creating or updating statistics as fast as possible, PDW executes the command in parallel on all
the compute nodes.
EDW performance metrics
HP and Microsoft have run various performance tests internally and with customers. The following are a few
reasonable performance metrics which may be used when evaluating EDW throughput.
Since the EDW is a loosely coupled MPP architecture, system scan rates are expected to perform in a linear fashion.
Therefore, based on the 16.5 GB/sec per-rack scan rate shown in Table 2, a 4-rack EDW can be expected to scan at about 66 GB/second (4 x 16.5 GB/sec).
Table 2 shows what is considered to be reasonable performance metrics for each data rack.
Table 2. One data rack performance metrics
EDW Performance Metrics                        Small Form Factor disks per data rack    Large Form Factor disks (15k) per data rack
Load rates                                     1.7 TB/Hr                                1.1 TB/Hr
Scan rates compressed (raw disk, per rack)     16.5 GB/sec                              12.8 GB/sec
Backup rates (to disk)                         5.4 TB/Hr                                5.1 TB/Hr
Summary
This technical white paper discusses many of the major issues relating to EDW/PDW best practices. While these
recommendations are practical and very useful in most customer environments, HP recognizes that each customer's
workload and set of requirements will vary. Therefore, it is highly suggested that customers test the various best
practice suggestions presented in this document in their own environment using a master query set that represents the
most frequently executed queries of their users.
These queries should be run single-threaded and under load in such a way as to reflect the customer's “real world”
workload volume. Once complete, these tests will provide the customer's technical staff with a great deal of insight
regarding data warehouse performance. By changing one variable at a time, the data warehouse may be custom-tuned to provide optimal performance for all users.
In the meantime, please feel free to contact HP or Microsoft to provide more in-depth insight relating to EDW/PDW
best practices to implement a successful and efficient foundation for your business intelligence environment.
For more information
EDW solutions page: http://h71028.www7.hp.com/enterprise/us/en/partners/microsoft-enterprise-data-warehousesolution.html?jumpid=ex_r2858_us/en/large/tsg/microsoft_edw
EDW product page: http://h10010.www1.hp.com/wwpc/pscmisc/vac/us/en/sm/solutions/enterpriseoverview.html
HP Business Intelligence Solutions for Microsoft SQL Server: http://www.hp.com/solutions/microsoft/sqlbi
HP Business Data Warehouse Appliance You Tube video: http://www.youtube.com/watch?v=uoXD9nJKmcw
BDW product page: http://h10010.www1.hp.com/wwpc/us/en/sm/WF05a/457757-457757-42926994346415-4292808-5104076.html
HP Business Decision Appliance Overview (Useful for a data mart spoke, in a hub and spoke BI architecture or a
standalone data mart): http://h10010.www1.hp.com/wwpc/pscmisc/vac/us/en/sm/solutions/businessoverview.html
ActiveAnswers page: http://www.hp.com/solutions/activeanswers
Microsoft Remote Data Copy (Parallel Data Export) documentation:
http://download.microsoft.com/download/1/7/5/175A7833-6F75-418D-8800-86D44D1D712D/[MSDPPDW].pdf
To help us improve our documents, please provide feedback at
http://h71019.www7.hp.com/ActiveAnswers/us/en/solutions/technical_tools_feedback.html.
© Copyright 2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The
only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services.
Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or
omissions contained herein.
Microsoft is a U.S. registered trademark of Microsoft Corporation.
4AA3-8477ENW, Created November 2011