Best Practices for HP Enterprise Data Warehouse Appliance/Microsoft SQL Server 2008 Parallel Data Warehouse R2

Technical white paper

Table of contents

Introduction
What is big data?
Hub and spoke business intelligence infrastructure
Data quality issues
Performance in the BI environment
Factors contributing to I/O throughput demand in a data warehouse or data mart
Classes of users
Business intelligence user expectations from a performance perspective
Data compression variants
Traditional database designs vs. EDW and Fast Track
EDW and Ultra-Shared Nothing technology
Other PDW software benefits
How loading data can affect EDW/PDW performance
Best practices for loading data
EDW performance metrics
Summary
For more information

Introduction

The HP Enterprise Data Warehouse (EDW) Appliance has been optimized for Microsoft® SQL Server 2008 Parallel Data Warehouse (PDW) software. The system has been integrated to provide organizations with excellent performance in a scalable, highly available environment. HP provides this document as a guide to our customers so that you will be able to take advantage of the EDW Appliance's capabilities. Each section discusses basic EDW/PDW concepts from different points of view, and digs deeper into each concept while recommending various best practices along the way. It has been assumed that the reader is familiar with business intelligence environments, EDW architecture and terminology. If not, we recommend reviewing the first half of the "HP Enterprise Data Warehouse Appliance architecture overview and performance guide – Introduction to Business Intelligence architectures" as a prerequisite to reading this document, available at: http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-5625ENW.pdf.
Several performance issues discussed in the latter half of the above document are readdressed in this white paper because they pertain to some fundamental EDW performance concepts which bear repeating. The good news is that this document extends the performance discussions in earlier HP white papers by delving into additional EDW and PDW best practices in more detail.

It is important to note that the industry term "Enterprise Data Warehouse" (EDW) and the HP "Enterprise Data Warehouse (EDW) Appliance" may sound the same; however, the former typically refers to a generic, centralized, enterprise-wide hub for an organization's historical data. The latter refers to the HP Enterprise Data Warehouse Appliance, which is physical hardware engineered as a loosely-coupled, massively parallel server running Microsoft's Parallel Data Warehouse (PDW) software. The HP EDW Appliance was designed to support not only a company's Enterprise Data Warehouse functionality (hence its name), but also the company's high-end data marts or Operational Data Stores (ODSs). To recap, even though the name of the HP Enterprise Data Warehouse Appliance implies that it is to be used for an organization's Enterprise Data Warehouse hub, more frequently the appliance is used to support high-end data marts which require a data store for structured "big data" analytics, or large ODSs.

What is big data?

For the sake of this document, we would like to define the term "big data." This term can be very subjective; after all, what is "big" to some may be "small" to others and vice versa. In the past, people thought of big data as things like phone call detail records, point-of-sale (POS) credit card transactions or stock/financial transactions. While these forms of data are truly "big", today many people also think of big data as including RFID, web logs, social data, etc. All of the above are examples of big data. Other examples, not mentioned above, will vary from customer to customer and from application to application. At this point in time, the HP Enterprise Data Warehouse Appliance using SQL Server Parallel Data Warehouse software defines big data as that which supports databases up to 600TB. HP believes that the EDW Appliance's real benefit is not only supporting big data in terms of database size, but, more importantly, supporting the various business intelligence, analytical, mining or reporting tools which can unlock the data's business value while meeting end user service level agreements.

Hub and spoke business intelligence infrastructure

When designing a business intelligence environment, users tend to be mainly concerned with data quality and performance. A hub and spoke infrastructure can address these issues in several ways.

Data quality issues

Data quality issues may be addressed by providing users and executives with a single version of the truth. Many business intelligence (BI) systems have evolved over time. Typically, each organization extracts data from various OLTP systems at different times during the day. They also massage data differently, using various algorithms which may yield different results that may be misinterpreted when data is shared across organizations. This problem is compounded and becomes more visible when executive management obtains different answers to their business questions depending upon which system is queried. See Figure 1.
Figure 1. Data mart evolution provides users with different variations of the truth

In a hub and spoke architecture, the Enterprise Data Warehouse may act as a hub for a company's various OLTP systems using Extract, Transformation and Load (ETL) tools such as SQL Server Integration Services (SSIS). Once the data is loaded into the Enterprise Data Warehouse, it may act as an enterprise-wide central repository for an organization's historical data. Ideally, the data in the Enterprise Data Warehouse should be stored and structured using "conformed dimensions." This design ensures that all departments within a company agree upon a common set of names, terms and algorithms which are documented and maintained in the company's metadata management system. The use of conformed dimensions, if at all possible, should become standardized throughout a company and form the basis of providing management with a "single version of the truth." See Figure 2. All users should be able to view the business intelligence environment's metadata using tools such as Microsoft SQL Server Master Data Services. More information on Master Data Services may be located at: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/master-data-services.aspx

Once the data is stored in the conformed dimensional Enterprise Data Warehouse (hub), it may be exported to one or more data marts (spokes). At this point, the users, business analysts, or management may query the data marts or the EDW directly, depending upon security and network connectivity.

Figure 2. Hub and spoke architecture – "single version of the truth"

Note
The PDW software supports a feature called Remote Table Copy (Parallel Data Export). This capability allows the EDW Appliance to perform high speed data exports to external, SMP-based SQL Server 2008 systems which are connected to the EDW via InfiniBand.

Performance in the BI environment

Performance in the BI environment may be measured from different points of view:

• I/O throughput (GB/second)
• Queries per hour rate
• Number of simultaneous users
• End user response time

Performance metrics need to be evaluated based upon your BI infrastructure and may vary from customer to customer. For example, if end users are physically submitting queries against front-end OLAP cube spokes (SQL Server Analysis Services), these memory-based OLAP cubes will have the fastest possible response times, which will, in turn, support high query per hour rates. Today, as a practical matter, OLAP cubes should be kept to less than 1TB for a single SQL Server Analysis Services (SSAS) data mart. However, OLAP cubes may scale out by running multiple cubes on multiple servers in an OLAP cube "farm" type of architecture. If your data mart will grow to more than 1TB within a single cube, it is recommended that the data mart be kept relational in flavor using the HP Business Data Warehouse (BDW) Appliance, SQL Server Fast Track reference architectures, EDW appliances, etc. As of October 2011, the following metrics may be used to guide customers in finding the correct balance of storage space and I/O throughput required to support their data warehouse or data mart spoke.
• Business Data Warehouse (BDW) Appliance – Symmetric Multi-processing (SMP)
  – Scales up to 5TB
  – Up to 2GB/sec of I/O throughput
• Fast Track (as of March 2011) – Symmetric Multi-processing (SMP)
  – Overview: http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA28181ENW.pdf
  – Entry reference architecture (RA) (4.5 – 14TB); 1 – 1.8GB/sec of I/O throughput
    http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-2184ENW.pdf
  – Basic RA (10 – 20TB); up to 2.8GB/sec
    http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-2186ENW.pdf
  – Mainstream RA (20 – 40TB); up to 5.1GB/sec
    http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-2690ENW.pdf
  – Premium RA (40 – 80TB); 11 – 14GB/sec
    http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-3609ENW.pdf
• HP EDW Appliance – Loosely-coupled, massively parallel (MPP)

Table 1. HP EDW Appliance user data capacity (3.5x compression ratio is assumed)

  User data capacity             1 data rack   2 data racks   3 data racks   4 data racks
  11 x 300GB LFF disks per MSA   38TB          76TB           114TB          153TB
  24 x 300GB SFF disks per MSA   76TB          153TB          229TB          305TB
  11 x 1TB LFF disks per MSA     127TB         254TB          381TB          509TB
  24 x 600GB SFF disks per MSA   153TB         305TB          458TB          610TB

  LFF disks: 12.8 GB/sec per data rack = 51.2 GB/sec for a four data rack EDW
  SFF disks: 16.5 GB/sec per data rack = 66 GB/sec for a four data rack EDW

As you can see, the HP EDW Appliance may function as a hub or a spoke. The main determining factors will be storage space, throughput and scalability requirements.

I/O throughput vs. query per hour rates

I/O throughput and query per hour rates are closely related. Regardless of a query's complexity, all query types will want to drive the I/O subsystem at its maximum throughput rate. SQL Server Parallel Data Warehouse's Ultra-Shared Nothing technology attempts to eliminate data movement between nodes when queries perform join operations between replicated dimension tables and fact tables. In addition, other queries may want to perform parallel table scans. This is common when large subsets of data are selected for data mining or when users want to perform complex analytics. Since DBAs may not be able to predict how business analysts may want to look at their data, the EDW Appliance has been optimized to stream data off disk at very high rates of speed to support ad-hoc queries. Tuning and managing the HP EDW Appliance has also been greatly simplified, even when compared to SMP versions of SQL Server.

When profiling workloads, simple queries will return result sets fastest because they are typically fast lookups in dimension tables, or else the query predicate allows only a small number of rows to be accessed. In general, most users execute simple queries most of the time because they are trying to determine which dimensions will be of most use to them when formulating medium or complex queries. On occasion, some customers may even create aggregate or summary tables to minimize query service times. In general, the largest populations of casual users fall into the simple or medium query complexity category. Medium and complex queries require large data sets to be accessed or scanned and may involve complex join operations. The EDW Appliance excels in executing these query types. While the query per hour rate may sound like a useful metric, unless it is qualified with an in-depth understanding of the mix of simple, medium and complex query profiles, a simple query per hour rate is not easily mapped to your organization's respective workload.
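To make the distinction between query classes concrete, the hedged sketch below contrasts a simple dimension lookup with a complex, scan-heavy query. The table and column names (DimCustomer, FactSales, OrderDateKey, etc.) and the integer date-key format are illustrative assumptions, not objects from any specific customer schema.

-- A "simple" query: a highly qualified dimension lookup that touches few rows.
SELECT CustomerKey, CustomerName, Region
FROM   DimCustomer
WHERE  CustomerKey = 1873;

-- A "complex" query: a join between a multi-billion row fact table and a
-- dimension table, with an aggregate over a five-year window. Servicing it
-- requires scanning a large share of the fact table, so its cost is driven
-- by I/O throughput rather than by index lookups.
SELECT d.Region,
       SUM(f.SalesAmount) AS TotalSales
FROM   FactSales AS f
JOIN   DimCustomer AS d
       ON f.CustomerKey = d.CustomerKey
WHERE  f.OrderDateKey BETWEEN 20060101 AND 20101231
GROUP BY d.Region;

A workload dominated by the first pattern can post a very high query per hour rate on modest hardware, while a handful of queries like the second can saturate the I/O subsystem; this is why the raw rate means little without the underlying mix.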
Since profiling user workloads is a time consuming task, coupled with the fact that workload mixes evolve over time, it is frequently more practical to use I/O throughput as a simpler gauge when monitoring data warehouse performance. Figure 3 shows how the EDW Appliance, with a "fatter" I/O throughput pipe, can support more queries per hour than an SMP-based SQL Server Fast Track reference architecture. Notice that although the workload mix of simple, medium and complex queries remains the same in both examples in the graphic, the EDW MPP architecture's "fat pipe" is able to funnel more queries per hour through the system.

Figure 3. How query workload translates into I/O throughput requirements

Different classes of queries have very different workload profiles. As users become more sophisticated over time, and as business needs change, you will notice that workload profiles and query mixes also evolve. The good news is that the HP Enterprise Data Warehouse Appliance allows your workload mix to vary and evolve over time. The appliance may be thought of as a "scan engine" because its goal is to scan data efficiently, at high rates of speed. Regardless of your mix of simple, medium or complex queries, the HP EDW Appliance will try to maintain high throughput rates which, in turn, lead to consistent, rapid response times for ad-hoc queries which change and evolve as your business demands change. Figure 4 shows how the ratio of simple, medium and complex queries typically changes over time.

Note
The numbers shown in Figure 4 are for illustrative purposes only.

Figure 4. How the ratio of simple, medium and complex queries typically changes over time

Now compare Figure 3 above to Figure 5 following, which is a visual example of how workload mixes can change over time. Even though the workload mix changes, the I/O rate supported by the data warehouse remains relatively consistent.

Figure 5. How workload mixes can change over time

One of the many benefits that the EDW Appliance and PDW software bring to the table is more consistent end user response times. This is accomplished by using the PDW index-lite design techniques and other best practices addressed in the remainder of this document.

Number of supported users

It is common for people to ask "how many users will the system support?" It should be noted that the number of supported users, or the number of simultaneous users, may be a misleading and sometimes meaningless metric. In actuality, hundreds or thousands of users could conceivably be connected to a very small server, perhaps even a laptop. The real question is not necessarily "the number of users", but rather a determination of the kind of I/O throughput demand which will be generated by a given user population. Some issues which explain why the number of "simultaneous users" may give false impressions are:

• Thousands of users may be connected, but may be sitting idle, at lunch or doing non-business intelligence related activities
• Only users who actively submit queries to the data warehouse will be consuming resources
• The amount of resources users consume is closely related to query complexity
• Query complexity is tied to I/O throughput demand
• The query mix multiplied by the volume of each query type is what generates the I/O throughput demand on the data warehouse. The mix and volume of queries will vary over time

In one organization, 10,000 users may generate a low query volume, and if that query mix contains simple queries, then the I/O throughput demand will be low.
On the other hand, if 10 dedicated power users execute mostly complex queries or extract large data sets for data mining and other analytics, the demand will be high. These 10 power users have the potential to fully saturate the most powerful MPP servers in the world. Consequently, the question of how many users the system will support is best answered by the kind of I/O throughput demand, not by the number of simultaneous users. Fortunately, the HP Enterprise Data Warehouse Appliance can scan data at exceptionally fast speeds and execute queries, in parallel, at low, affordable cost. The intention of this document is to discuss various EDW/PDW specific best practices which will enable you to provide your user population with excellent response times and high throughput rates.

Response time

Response time is the other critical metric about which end users are concerned. Without getting into queuing theory, the most basic formula for response time is:

Response time = Service time + Wait time

"Service time" is a function of how much data a query must read and evaluate, based upon its predicate. This includes the time to perform joins, unions, sorts, etc. Generally, it is a function of how fast you can stream massive amounts of data off disk (I/O throughput). Some people wonder if flash cache (SSD) will speed up query performance by reducing service times. While flash cache will help many OLTP systems, it is unlikely that flash cache will provide the performance gains that most people expect in a large data warehouse. The three main reasons why flash cache does not boost business intelligence style query performance as much as people hope are:

• Unless the entire database can fit in cache, data will most likely be flushed out of cache when a table scan is performed
• If you plan on doing frequent database updates or trickle feed/real time updates to the data warehouse or ODS, flash cache needs to be maintained in order to make sure it is in sync with the database
• The flash cache path to the CPU is not much faster than multiple streaming disks, and disks are much less expensive than flash cache

The second component of the response time equation is wait time. "Wait time" is basically queue time, which is the amount of time spent waiting for resources. Queuing tends to increase as more simultaneous queries are executed. When compared to SMP systems, the HP EDW Appliance will lower "wall clock" service times by automatically decomposing a query and executing multiple scans in parallel on multiple servers (each of which has multiple cores). Basically, you have multiple multi-core SMP servers working on a workload in parallel. "Wall clock" service time is further improved because the EDW utilizes multiple, shared nothing I/O subsystems (separate and distinct HP P2000 G3 MSA arrays). These lower "wall clock" service times reduce the risk of incurring queuing delays waiting for system resources when the system is under load. In essence, this feedback loop of providing faster service times to reduce wait times provides users with overall better performance on the EDW Appliance than on a large scale-up SMP system. It should also be noted that the EDW Appliance's shared nothing design has a major advantage over systems with shared storage, because loosely coupled, shared nothing architectures are not competing for and queuing on shared storage resources when multiple queries hit your warehouse simultaneously.
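To put rough numbers on the response time formula (the workload figures are assumptions chosen purely to show the arithmetic; the throughput rates are the per-rack LFF figures quoted earlier), suppose a complex query must scan 1TB (1,024GB) of data on a single data rack sustaining 12.8 GB/sec:

Service time ≈ 1,024GB / 12.8 GB/sec = 80 seconds
Wait time (assumed queuing behind other concurrent queries) ≈ 20 seconds
Response time ≈ 80 + 20 = 100 seconds

Spreading the same scan across four data racks (51.2 GB/sec) would cut the service time to roughly 20 seconds, and because the query spends less time on the system, there is also less opportunity for queuing delays to accumulate.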
Factors contributing to I/O throughput demand in a data warehouse or data mart

Classes of users

Users today realize that they do not want to look through paper reports or large PDF files to find an answer to their question or make a business decision. They prefer to make ad-hoc queries which allow the data warehouse or data mart to return the specific result set which answers their question specifically and quickly. In addition, the questions asked of the system will tend to change weekly, daily or hourly.

Classes of users and types of queries
In general, there are three classes of business intelligence users within an organization. Typically, the nature of queries generated by each user class tends to vary. Here are some generalizations which are commonly associated with each user class:

Casual users
This class of users usually has the largest population. Traditionally, these users need static reports (such as PDF files or hard copy), or access aggregates or summary tables by using simple query tools. Casual users also tend to like OLAP cubes (SSAS) because of memory-speed access to aggregated data. In other situations, these users may access front-end relational data marts. Generally, queries generated by casual users have a highly qualified time range and predicate which limits the number of rows that have to be accessed. The main question this class of users tries to answer is "What happened?"

Business analyst users
This class of users tends to be more sophisticated and is fewer in population in most organizations. Traditionally, they also use static reports such as PDF files or summary tables. However, by using more powerful tools, they are now able to make business decisions faster by using dashboards, scorecards, Microsoft Excel, PowerPivot, etc. The queries these tools generate may be targeted at a relational data mart, an OLAP cube or directly at an EDW Appliance. The main question business analysts try to answer is "Why did it happen?" This type of user also tends to have a hypothesis they are trying to prove by querying the data warehouse. They may take advantage of drill-down functionality in an effort to gather data to prove their theory. A good front-end appliance to use for business analysts is the HP Business Decision Appliance (BDA), which comes pre-configured and pre-installed with Microsoft SharePoint and PowerPivot. Another appliance which serves as an excellent relational data mart (spoke) is the HP Business Data Warehouse (BDW) Appliance. HP appliances tend to be easier to purchase and allow customers to be productive faster and more easily because most of the building is performed in the HP factory. The goal is to make appliances as close to plug-and-play as possible. Additionally, HP has run tests and has published a white paper titled "High speed data export from an HP Enterprise Data Warehouse Appliance to an SMP Microsoft SQL Server", available at:
http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA3-8382ENW&cc=us&lc=en

The above white paper is a step-by-step guide to implementing high speed data export using the EDW Appliance InfiniBand switches connected to the HP Business Decision Appliance or any other SMP-based SQL Server 2008 database. The Remote Table Copy feature is perfect to use when implementing an EDW hub and spoke architecture.
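As a hedged illustration of what such an export might look like, the sketch below uses the PDW CREATE REMOTE TABLE statement that underlies Remote Table Copy. The server name, credentials and all table/column names are placeholders, and the exact connection-string options should be verified against the PDW documentation for your appliance release.

-- Hedged sketch: push an aggregate result set from the appliance to an
-- SMP SQL Server 2008 spoke (for example, a BDW data mart) over InfiniBand.
-- "SpokeServer", "LoadUser" and the table/column names are placeholders.
CREATE REMOTE TABLE [SpokeDB].[dbo].[RegionalSalesSummary]
AT ( 'Data Source = SpokeServer; User ID = LoadUser; Password = ********;' )
AS
SELECT d.Region,
       f.OrderDateKey,
       SUM(f.SalesAmount) AS TotalSales
FROM   FactSales AS f
JOIN   DimCustomer AS d
       ON f.CustomerKey = d.CustomerKey
GROUP BY d.Region, f.OrderDateKey;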
Microsoft also has a paper entitled "Parallel Data Warehouse Data Portability Overview", which may be of interest to readers who plan on exporting large quantities of data from the EDW Appliance:
http://download.microsoft.com/download/1/7/5/175A7833-6F75-418D-8800-86D44D1D712D/[MSDPPDW].pdf

Power users
Power users tend to be the fewest in number, yet they are critical in advising on or setting a company's strategic direction. They are looking for customer patterns and financial trends, and may even use external data (such as demographic or weather data) to gain insight into their company's current strategy, with an eye open to changing course as business or market forces dictate. The faster data can be analyzed and business decisions made, the higher the odds that these companies can gain a competitive advantage in the market. Power users tend to perform interactive ad-hoc queries, OLAP operations and data mining. They also favor the use of analytics, statistics and occasional modeling or simulations which may require selecting large data sets from the warehouse. The PDW Remote Table Copy feature is a high speed, InfiniBand-based method to extract these data sets for use by front-end analytical tools. In summary, power users frequently use EDW data for knowledge discovery; they explore the data looking to create a new hypothesis, theory or market insight. Bear in mind that having the data assist in creating a theory is very different from having a theory and looking for data to prove your hypothesis.

Classes of users and types of data accessed
In understanding how the nature of the questions will vary between different classes of users, we also need to drill down and consider the type of data accessed by each user class. Frequently, the groups of users are likely to be interested in different types of data. Similar to the generalizations used in the attempt to classify user queries, the following discussion regarding data access also contains generalizations. Therefore, the actual data accessed by users in your organization may vary.

Casual users
These users tend to be interested in the subset of data which is aligned with their department or organization. They also tend to be interested in recent data (days or months) vs. business/power users, who typically perform longer term trend analysis. More frequently than not, casual users access aggregated or summarized data. However, detailed data is sometimes required for day-to-day operational use. In this case, some companies build an Operational Data Store (ODS) within their business intelligence infrastructure for detailed, time critical information. Rather than implementing an ODS, some companies give users direct, secure access to low level "fact" data in the EDW Appliance. The bottom line is that casual users traditionally access a relatively small percentage of the data in the warehouse. Therefore, the disk I/O bandwidth required to service their queries is relatively low when compared to business or power users.

Business analysts and power users
A business analyst or power user's view of the world, and the data they look at, can vary significantly from company to company. Some business and power users are organization specific, and others may require a more global view of multiple divisions or the whole corporation. In general, this class of users frequently needs to access low level detail data to support their analysis.
They also tend to access detail data over longer periods of time and hit a higher percentage of the database than casual users. This translates into increased disk I/O bandwidth demand compared to that of casual users. Based upon the above discussion, you can infer that queries which look for long term trends, perform time window comparisons and are more global in nature are likely to demand more I/O bandwidth. Oftentimes people will try to reduce this demand by implementing summary or aggregate tables (sometimes called materialized views). However, summarization or aggregation presupposes that you know the business question, or how the user plans on accessing the data, ahead of time. This scenario is unlikely, especially as queries tend to evolve as the business environment changes. Therefore, it is conservative to assume that business or power users will most likely want to access large amounts of low level raw data stored on the EDW Appliance, instead of concentrating on aggregated data, which is frequently found in OLAP cubes or on smaller front-end data marts (spokes) in a hub and spoke architecture.

How time plays into user workload profiling
The time factor is also an important dimension to understand. Some classes of users only care about what happened in the last few days, weeks or months, whereas other users may be interested in trend analysis relating to data maintained over years. Obviously, the larger the time window for each class of users, the more I/O needs to be performed to service their respective queries. For example, it will take longer to scan the last 60 months of data (5 years) than to scan the previous 12 months.

In summary, the above discussion relating the typical user's business needs to query profiling provides you with some insight when trying to understand your own user population, query workload and mix. Once you understand your end user workload profile, you will be able to estimate the I/O throughput demands expected of your data warehouse. The current release of the HP Enterprise Data Warehouse can scale to over 500TB and support scan rates of up to 3.9TB per minute. This will support most of the data warehouses or data marts which major corporations or government agencies are implementing today.

Business intelligence user expectations from a performance perspective

When interacting with data warehouses, users generally want easy-to-use tools which will provide them with answers to their business questions, or insight into their data, with consistent, rapid response times. The PDW software running on the HP EDW Appliance supports a wide range of Microsoft and third party user tools which connect to the HP EDW Appliance via ODBC, OLE DB and ADO.NET drivers. These tools may present information to users using deluxe executive dashboards, dials, or charts in tabular, report, 2-D or 3-D format. Other tools may have a more technical, yet user-friendly, interface, such as Nexus, which allows users to write their own SQL queries, provided they have the security to do so. It is important that these tools be selected based upon each class of users' technical capabilities and their business needs. Regardless of the tools selected, most of them generate SQL or MDX queries. These queries are then submitted and executed on a data warehouse or data mart (the data mart may be relational in flavor or a memory-based OLAP cube). In either case, the user generally does not care whether the actual query is sent to a back-end data warehouse or data mart.
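To make the time window point above concrete, here is a hedged sketch of the kind of SQL a front-end tool might generate; all table and column names are illustrative placeholders. The two statements differ only in their date predicate, yet the 60-month version must scan roughly five times as much fact data, which directly drives its I/O demand and service time.

-- 12-month trend: scans roughly one fifth of the data the 60-month query touches.
SELECT d.Region, SUM(f.SalesAmount) AS TotalSales
FROM   FactSales AS f
JOIN   DimDate AS d ON f.OrderDateKey = d.DateKey
WHERE  d.CalendarDate >= '2010-10-01' AND d.CalendarDate < '2011-10-01'
GROUP BY d.Region;

-- 60-month (5 year) trend: same query shape, much larger time window, much more I/O.
SELECT d.Region, SUM(f.SalesAmount) AS TotalSales
FROM   FactSales AS f
JOIN   DimDate AS d ON f.OrderDateKey = d.DateKey
WHERE  d.CalendarDate >= '2006-10-01' AND d.CalendarDate < '2011-10-01'
GROUP BY d.Region;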
Most users are concerned about:

• Consistent response times to their queries
• Rapid response times when making a request
• High data quality
• System availability
• Support

Since this document concentrates on performance-related best practices, we will not be addressing data quality, system availability and support. However, HP and Microsoft are very much aware of the fact that these are all critical issues when implementing a successful BI environment.

Trying to qualify and set expectations regarding service level agreements (SLAs) for ad-hoc queries may be challenging from a technical point of view. However, business users quickly realize that dimension table lookups are typically extremely fast. Conversely, complex queries may take several or many minutes before a result set is returned to the user. Traditionally, DBAs spend a great deal of their time, effort and energy monitoring query response times, profiling queries, and monitoring end user response time complaints. They attempt to solve these performance issues by creating indexes in an attempt to reduce the number of I/Os servicing a specific query or report. However, since the BI environment is very dynamic, trying to create, delete and maintain indexes may present a double-edged sword.

Advantages of using indexes
• Tends to reduce the number of I/Os to service a set of known queries/reports
• If you are lucky, an ad-hoc query may find the data it needs in an index

Disadvantages of using indexes
• Typical, non-technical end users expect all queries to perform in a similar fashion, with consistent response times. Since most users do not know whether a simple, medium or complex query is being generated by their ad-hoc query tool, they tend to complain or, even worse, may stop using the system out of frustration when large variations in response times occur. If users don't complain, the lack of feedback is worse, because DBAs may not even be aware that a performance problem exists and users may find an unofficial workaround to answer their business questions. These workarounds may result in poor business decisions because users may be viewing data from different sources, giving them an inconsistent view of what they think is the same information in the warehouse. In other words, their workaround may be looking at derived information from an unknown source, which may skew the analysis even further. It is desirable to set user response time expectations correctly (meet end user Service Level Agreements (SLAs)) and maintain relatively consistent I/O throughput levels.
• The nature of an index is to force random physical disk I/O (extra seek time is required to locate the indexed rows on the disk). This extra seek time, potentially interrupting a table scan, slows down disk I/O throughput, resulting in slower query response times and lower throughput. Contrary to index-heavy designs, the EDW Appliance running SQL Server Parallel Data Warehouse software was designed to support massively parallel, highly optimized scan rates. This is accomplished by minimizing seek time to provide users with consistent, rapid performance and throughput, using the index-lite design techniques discussed later in this document.
• Index-heavy designs require more DBA time spent analyzing frequently executed queries/reports and constantly creating new indexes in an effort to reduce the number of disk I/Os per query.
• DBAs also tend to be wary of deleting indexes, because trying to understand an existing index's effectiveness is time-consuming and may be difficult. For these reasons, index-heavy designs tend to get heavier over time, and performance can actually degrade due to extra disk seek time as ad-hoc query volume increases.
• As indexes get added or deleted, SQL statistics should be updated. Depending upon the size of the data warehouse and the sample interval specified when updating statistics, this can be a time consuming procedure that can extend your ETL batch window. The more indexes that need to be maintained during load/insert/update/delete operations, the longer it will take to update your database.
• Index-heavy designs require longer batch ETL windows if indexes need to be dropped and rebuilt.
• Index-heavy designs require extra disk space (indexes may require 4 – 10 times more space due to redundant data being stored). If unused indexes are not identified and deleted, they simply take up extra space in your warehouse, causing more disk fragmentation and slowing down performance.

In summary, PDW best practices strongly recommend that all indexes be eliminated and only considered when absolutely necessary. This strategy not only improves EDW/PDW performance, but also shortens ETL windows and frees up a DBA's time for more productive tasks. End users are also happier because their response times are more predictable.

Data compression variants

The EDW Appliance's PDW software uses SQL Server 2008 page compression. This feature automatically saves disk space and allows more data to be retrieved per physical disk I/O. An advantage of the SQL Server 2008 page compression algorithm is that all the columns for multiple rows are compressed into a single page. If a query selects columns, or has a predicate which needs to access several columns, the PDW SQL Server 2008 R2 software does not have to perform multiple reads to reconstruct rows before the columns may be used, because the entire row is compressed within each data block. Therefore, PDW page compression has the potential to save disk I/Os and consume fewer CPU cycles for many query types. The end result is better performance for your executives, business analysts and end users.

Some vendors have tried to improve query performance by storing and compressing data by column instead of using row-oriented compression. While this technique may appear to be useful for some types of queries, it is not a golden key that yields optimal performance for all types of analytic queries. For example, it is common for column store and column compression algorithms to incur high reconstruction costs when a query needs multiple columns returned to the user. These row reconstruction costs are likely to consume more CPU resources, as well as potentially requiring multiple disk I/Os in the event that the compressed columns are stored in separate blocks on physical disk. Therefore, many column store/compression techniques may not give users the response time gains they expect, due to longer than expected query path lengths (extra CPU and disk I/O) when row reconstruction is required. Microsoft Parallel Data Warehouse R2 software's row level compression algorithms do not consume extra resources for row reconstruction when a query needs to return multiple columns. The PDW software also allows "real time" trickle insert/update/delete operations.
When a row is inserted into the EDW/PDW database, only the page containing the affected row is compressed, instead of having to compress multiple pages or blocks. Some column store algorithms do not allow row level, trickle feed, operational data store style insert/update/delete operations. In the final analysis, each customer's actual performance will be query workload dependent, but bear in mind that end user query workloads will change over time, and the HP EDW Appliance will tend to provide users with consistent response times as your business grows and changes. In addition, SQL Server page level compression works efficiently as more customers move toward real time business intelligence environments.

Traditional database designs vs. EDW and Fast Track

Traditional index-heavy trade-offs
This section recaps and expands the discussion above by explaining why the HP EDW Appliance, optimized for SQL Server Parallel Data Warehouse software, and Fast Track index-lite database designs tend to provide users with more rapid and more consistent response times than traditional index-heavy database designs. This section also briefly addresses how Fast Track reference architectures and the HP EDW Appliance are related.

Traditional index-heavy database designs
Traditional data warehouse/data mart database designs use fact tables, dimension tables, summary tables and indexes in an effort to minimize I/Os when servicing queries. When loading data, database tables are typically striped and balanced across one or more LUNs and filegroups. In turn, these filegroups are spread across multiple physical disks. Unfortunately, little attention is paid to the physical location of where the data is loaded and where indexes are maintained.

Traditional index-heavy database load procedures
Typically, most customers try to load multiple fact tables and dimension tables simultaneously. Simultaneous load execution, dynamic index updates and index rebuilds tend to result in data being physically fragmented on the disk, as shown in Figure 6.

Figure 6. Hard drive fragmentation

Traditional index-heavy miscellaneous issues
Index-heavy database designs have some inherent issues, discussed in the "Business intelligence user expectations from a performance perspective – Advantages/disadvantages of using indexes" section of this document (above). In brief, index-heavy designs result in extra disk head movement, which is typically due to inefficient data and index placement on the physical disk media. More importantly, index usage forces disk head movement that slows down the table scan operations that typically occur for ad-hoc queries.

Figure 7. Index-heavy design encourages disk head movement which slows down throughput

Excessive disk head movement (seek time) can result in at least 2 – 4 times longer disk access times than expected. Therefore, significantly more disk drives will need to be purchased to support a database whose tables and indexes were loaded in a sub-optimal manner.

Traditional database design conclusions
Traditional database designs, load procedures and maintenance issues are likely to provide slower query response times (scan rates), and index-heavy designs will be more difficult for DBAs to manage. In addition, there is likely to be a variance between the throughput of canned queries, which were tuned by managing indexes, and ad-hoc queries, which frequently scan data. Full table or partial table scans are slowed down due to extra, index-induced seek time.
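For the SMP SQL Server side of this comparison (for example, a Fast Track system), the hedged sketch below shows one common way to spot indexes that are being maintained far more often than they are read, using the sys.dm_db_index_usage_stats dynamic management view. Whether such an index should then be dropped still requires testing under a representative load, and this DMV-based approach is illustrative of SMP SQL Server administration rather than a statement about how the PDW appliance itself is managed.

-- Hedged sketch for an SMP SQL Server 2008 / Fast Track system:
-- list nonclustered indexes in the current database whose maintenance cost
-- (writes) is high relative to the reads they actually serve.
SELECT  OBJECT_NAME(i.object_id)                        AS table_name,
        i.name                                          AS index_name,
        s.user_seeks + s.user_scans + s.user_lookups    AS total_reads,
        s.user_updates                                  AS total_writes
FROM    sys.indexes AS i
JOIN    sys.dm_db_index_usage_stats AS s
        ON  s.object_id = i.object_id
        AND s.index_id  = i.index_id
WHERE   s.database_id = DB_ID()
        AND i.type_desc = 'NONCLUSTERED'
ORDER BY total_writes DESC, total_reads ASC;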
EDW and Fast Track trade-offs

EDW and Fast Track index-lite database designs
This section of the technical white paper discusses how the EDW and Fast Track design philosophies are similar. In addition, this section provides a more in-depth discussion of the advantages of EDW and Fast Track index-lite designs versus index-heavy designs.

EDW and Fast Track index-lite design
EDW, Fast Track and traditional data warehouse/data mart database designs typically use multi-dimensional, star-schema or snowflake-schema designs. These are all variations on a similar database design philosophy that includes the use of fact tables and dimension tables. Sometimes Enterprise Data Warehouse and ODS implementations normalize data structures to a higher degree. The EDW Appliance supports both multi-dimensional and more normalized schemas. An EDW and Fast Track best practice is to initially load data without any indexes. The performance of the data warehouse executing large block sequential scans usually meets the users' SLAs. Care should be taken if indexes are added. Performance tests should be run to verify that the addition of an index actually helped EDW performance. These tests should be done under load to more realistically represent multiple users querying the data warehouse at the same time. Simply executing one query at a time will not accurately reflect a real world customer environment which supports simultaneous query execution. This is because the extra seek time imposed by adding indexes may not be observed during single query execution, but tends to become obvious as a mixed workload is added.

EDW and Fast Track load procedures
As mentioned earlier, most executives, business analysts and BI users today do not necessarily know what questions (queries) they will ask ahead of time. Users also tend to get frustrated if they find that query response times are wildly inconsistent. EDW and Fast Track both deliver relatively consistent performance by optimizing the system to maintain extremely high scan rates. EDW and Fast Track are efficient "scan rate engines" which provide excellent price/performance. This is accomplished by using some of the following best practice techniques.

Initial load best practices for Fast Track
• Create filegroups and pre-define filegroup storage space
• Allocate extents sequentially in the filegroups during the initial load so that data is stored sequentially on disk
• In order to maintain optimal performance, maintain about 15 – 20 percent free disk space per storage node

Example of EDW/PDW simplified database management
Unlike Fast Track or any other SMP SQL Server implementation, managing filegroup, extent, database and table creation operations across up to 40 independent instances of SQL Server is simplified on the EDW Appliance. The PDW software on the HP EDW Appliance automatically performs these operations on your behalf across all 10 to 40 EDW compute nodes. Therefore, your operations staff or DBA does not have to manage multiple SQL Server instances or servers separately. Hence, managing an EDW Appliance is easier than managing multiple SMP-based SQL Server environments (such as Fast Track). Since the goal of both EDW and Fast Track is to physically store data sequentially, data may be streamed off the disks at high rates of speed because there should be little or no disk head movement to seek and find data.
Figure 8. Example of how data may be stored sequentially for optimal performance

Subsequent loads
Subsequent loads will tend to have data physically clustered on the disk in their respective filegroups. This clustering of data allows for efficient scan operations.

How data can become fragmented over time
• Loading data into multiple fact tables may cause a slight increase in disk latency and/or seek time
• Changes to dimension tables after the initial load may also cause some fragmentation. However, in most data warehouses and data marts, dimension tables are relatively static. Later in this document we will discuss how EDW/PDW best practices optimize dimension table performance.

Figure 9. Example of how data may be fragmented over time

Note
Similar colors represent the same table. Notice how all similar colors are clustered in large blocks, even though they are not sequential. For example, this type of large block fragmentation may represent clustered date ranges; therefore, query performance should not be significantly affected because most queries have a date range as part of the predicate.

This clustering of fact table data across the disk is typically not a problem because most queries access the most recent data, which tends to be clustered in groups (e.g., users may query what happened last week). If a query accesses multiple date ranges, each date range will likely be clustered together on disk. Therefore, disk head movement will be minimal. Even though seek time will not be completely eliminated, scan rates will still be excellent.

How to defragment data on EDW to encourage sequential scan rates
• Perform a CTAS (CREATE TABLE AS SELECT) operation to re-org the table, or
• Backup and restore

In general, CTAS is fast and is the preferred method to re-org a table. However, it should be noted that a CTAS operation may consume as much as three times the storage space of the original table during execution, so make sure that you have enough disk storage available to execute your CTAS commands.

EDW and Fast Track index-lite design conclusions
Try to eliminate all indexes from EDW and Fast Track database designs. The concept is to stream data as fast as possible, because fast table scans will service most queries faster than performing fewer, slower random disk I/Os while the system is under load.

EDW and Ultra-Shared Nothing technology

EDW is more than just an index-lite design. Other parallel database implementations on the market share storage. EDW has been intentionally designed as a true, InfiniBand-based, loosely-coupled architecture that does not share storage. Ethernet is only used for connecting the EDW Appliance to the customer's network and for internal command and control functionality. From a performance point of view, it is important to note that shared nothing MPP implementations are more efficient than shared storage architectures. A major reason for the added efficiency is that all of the disks in a shared storage environment get bombarded with requests from all the database servers requesting data on behalf of their respective database instances. The more database servers and instances in the database layer, the more pressure is exerted on the shared storage subsystem. In addition, all of these simultaneous requests result in the disks performing more random I/O because of this contention. Hence, shared disk architectures tend not to stream data as fast as systems which do not share storage when multiple queries are executing simultaneously.
Conversely, loosely-coupled, shared nothing (MPP) architectures like EDW have storage subsystems that are separated and isolated. These isolated units are called compute nodes. Each compute node contains a database node and a storage node. See Figure 10. This loosely coupled compute node concept allows EDW to stream data off disk faster, due to less disk contention and more efficient disk head movement than shared storage systems can provide, because the storage nodes do not accept requests from competing database instances/nodes.

Figure 10. The EDW Appliance is a shared nothing, MPP architecture where multiple instances do not compete for storage nodes

Note
Each EDW compute node does more than simple predicate evaluation. Each node has a fully functional copy of SQL Server 2008, which allows the compute node to execute most SQL operations (joins, sorts, etc.) as close to the physical data as possible. In addition, each compute node is aligned with its own storage. This non-shared storage design avoids the contention that shared storage subsystems encounter. Database servers send data to each other, via InfiniBand, only when needed.

EDW's loosely coupled architecture encourages the use of "ultra-shared nothing" technology. The use of this technology can dramatically reduce inter-nodal traffic, which allows the EDW Appliance to scale effectively. In order to exploit the power of ultra-shared nothing technology, you need to know when to use distributed or replicated tables. The concept of distributed or replicated tables tends to be easiest to explain using multi-dimensional database designs. The concepts work equally well when implementing a more normalized schema, as sometimes found in operational data stores or in some Enterprise Data Warehouse environments.

Ultra-shared nothing – replicated tables
In order to minimize inter-nodal traffic for join operations or dimension table look-ups, a best practice for EDW is to define small tables (typically dimension tables) to be replicated on each compute node. In other words, each compute node has its own copy of the replicated table on disk. It is also likely that frequently accessed replicated tables are cached for rapid response time. This replication process is transaction-protected and automated, so the operations staff or DBAs do not have to manually replicate tables across the 10 to 40 servers housed in your EDW Appliance. The PDW software automatically assumes that tables will be replicated (the default) unless the table was specifically defined as a distributed table.

Figure 11. The replication process across multiple compute nodes

Ultra-shared nothing – distributed tables
As shown in Figure 11, replicated tables allow for very efficient join operations, especially star joins, because the dimension tables will always be local on each compute node, thus minimizing or eliminating dimension table inter-nodal traffic. Ultra-shared nothing replicated tables are not available in a shared disk architecture. Hence, EDW ultra-shared nothing design techniques place less of a demand on the InfiniBand network, and many join operations will be more efficient. On the other hand, fact tables can contain billions of rows, so they are not practical to replicate. Therefore, fact tables are typically "distributed." Distribution is accomplished by defining a "distribute" key column. This key column is subsequently hashed so that the large fact table data is distributed across all of the compute nodes in the EDW.
The goal of the distribution (hash) column is to evenly distribute data across all the compute nodes in the data racks. See Figure 12.

Note
One data rack has 10 active compute nodes, and four data racks contain 40 active compute nodes.

Figure 12. Distributed table

Distributing data evenly allows table scans to execute efficiently, in parallel, in order to provide users with rapid response times.

Other PDW software benefits

Partitioning distributed data
In addition to distributing data, PDW software also allows for "partitioning" and "clustering." These features allow DBAs and your operations staff to manage large quantities of data more effectively. A good example would be to partition data by date range; as the data warehouse grows, old historical partitions may be archived and then dropped.

Loading data into the EDW
Data loaded into the EDW is stored on the Landing Zone node before it gets loaded into the PDW database. Once data is on the landing zone, it may be loaded using DWLoader, which loads data in parallel for maximum speed. It is also possible to have application software, such as SQL Server Integration Services (SSIS), perform direct insert/update or delete operations against the PDW data warehouse/mart. These insert/update or delete operations may also be transaction protected.

External ETL tier
Some customers may already have an existing BI infrastructure with ETL software on a dedicated server. In that case, input data may have already been remapped, cleansed and formatted by the ETL software running on the external ETL server, and it is likely that the landing zone functions as a true "landing zone" which temporarily stores data to be loaded into the EDW Appliance compute nodes.

EDW/PDW internal ETL tier (landing zone)
Customers also have the option to have their OLTP systems or other external feeds store raw data directly on the landing zone. Then, SQL Server Integration Services (ETL software) may remap, cleanse and reformat the data to be loaded into the EDW compute nodes.

Figure 13. The landing zone is used to load data into various schemas

The landing zone is physically a ProLiant DL370 G6 (X5690) with 36 GB of memory and 10 x 2 TB disks (RAID 5), which may be used for storage. If your ETL workload requires more resources than what is provided on the EDW landing zone, an external ETL tier would be desirable to run the ETL software.

Note
The landing zone is the only server on which the customer is allowed to install application code. The server has a CD/DVD and USB port for software installation.

SQL Server Integration Services (SSIS) optimization
This document will not cover SSIS optimization. We are assuming that the OLTP data was already cleansed by ETL software (such as SSIS) on an external server or by ETL software executing on the landing zone. Microsoft has useful documentation regarding SSIS tuning tips, available at: http://technet.microsoft.com/en-us/library/cc966529.aspx

How loading data can affect EDW/PDW performance

Earlier in this document we discussed how PDW table replication and distribution features are leveraged to optimize query response time by performing more efficient join operations and executing faster (parallel) scans. In addition, PDW software also provides the option of using "staging tables" to automatically optimize how data is physically stored on disk.
When loading data, you simply have to specify the staging database in the load command, and dwloader will automatically optimize load operations by ensuring that data in the target PDW data warehouse is stored sequentially on disk. Sequential storage of your data warehouse data on disk and efficiently organized SQL Server block structures allow replicated and distributed tables to perform more efficiently. Scan operations are faster and more data is retrieved per data block due to these load optimization features. In addition, PDW data is automatically compressed to further enhance the system's effective throughput. Compression allows more rows to be read in each block retrieved from the physical disk, and no row reconstruction is required if multiple columns need to be accessed when executing the user's query. The default for PDW is to compress rows in the database, and PDW compression algorithms allow:

• Data to be compressed during bulk load or during trickle insert/update/delete operations. This means that EDW performance can be maintained in real time operational data stores or data warehouses. Queries can also execute while loads, inserts, updates or deletes are taking place.
• All the columns in your selected rows to be decompressed as each block is read into EDW memory. This is an advantage over column compression algorithms because multiple compression groups do not need to be read to reassemble rows in an effort to retrieve all the columns required to service queries.
• PDW row level compression to use CPU resources efficiently, freeing up CPU time for other useful work.

Reorganizing EDW/PDW tables

CTAS (CREATE TABLE AS SELECT)
CTAS is different than SQL Server's INSERT with a SELECT. This is because a SQL Server INSERT with a SELECT:

• Runs "serially" – each distribution is run separately to provide transaction safety across the entire appliance
• Runs within a SQL Server transaction
• In the event of failure, performs a full transaction rollback (up to the last commit point)

On the other hand, CTAS using PDW software is significantly faster because it runs in parallel and does minimal logging. CTAS will most likely be used by your operations staff or DBAs if data becomes fragmented. If this occurs, it may be remedied by performing a CTAS operation. This task will restructure the target table to allow for more efficient sequential scan performance. In addition to defragmentation, CTAS operations are useful for performing the following tasks:

• Create another copy of a table
• Create a distributed table from a replicated one, or vice versa
• Create a new table distributed on a different column
• Create a new table with a different clustered index, or with no index
• Create a new table with different partitioning
• Create a temporary version of a table

CTAS will always create a new table when executed. If no WITH option is used, CTAS will copy the existing table with no indexing or partitioning. (See the .chm help file for the full CTAS syntax.)

CTAS example:

CREATE TABLE [ database_name . [ dbo ] . | dbo. ] table_name
    [ ( { column_name } [ ,...n ] ) ]
WITH (
    DISTRIBUTION = { HASH( distribution_column_name ) | REPLICATE }
    [ , <CTAS_table_option> [ ,...n ] ]
)
AS SELECT select_criteria
[;]

The RENAME statement may be used after a CTAS operation to rename a table back to its original name.
RENAME syntax:

RENAME <class_type> entity_name TO new_entity_name[;]
<class_type> ::= { DATABASE | OBJECT }

Examples:

RENAME DATABASE SalesNew TO Sales;
RENAME OBJECT CustomerNew TO Customer;

Defragmenting indexes
Despite the fact that PDW best practices discourage the use of indexes in order to improve I/O throughput, some indexes may still be required in certain situations. Performance testing is encouraged to verify whether the addition of an index helps or actually slows down EDW/PDW throughput. If your database does have indexes and they become fragmented, use the ALTER INDEX command to defragment them.

Creating databases and tables
Creating a database
At this point, the reader will recognize the differences between replicated and distributed tables. We will now take a look at the next level of technical detail: actually creating the database and tables. Behind the scenes, the PDW software manages anywhere from 10 to 40 active SQL Server instances when executing the CREATE DATABASE command. Each compute node has its own copy of SQL Server 2008, but commands are executed as if you are managing a single database. Transparent to your Database Administrator (DBA) and operations staff, files and filegroups are organized across each SQL Server instance and are managed by PDW software. This virtually eliminates the need for manual filegroup and file placement. Hence, the EDW Appliance is automatically tuned to maximize parallel I/O throughput while providing users with excellent query performance and fast load times.

Let's take a look at the syntax by walking through an example:

CREATE DATABASE db_name
WITH (
    AUTOGROW = ON | OFF   (default is OFF)

    ,REPLICATED_SIZE = 1024   (in GB)
In this example, the total space required across all 10 compute nodes is 10 terabytes: 1024 GB of replicated table space per compute node * 10 compute nodes = 10 terabytes. In the background, files are created on each compute node: Filesize = 1024 GB / 8 files = 128 GB per file. Each 128 GB file is stored in a filegroup spanned across all 8 LUNs (16 RAID 10 physical disks).

    ,DISTRIBUTED_SIZE = 16384   (in GB)
Total disk space required: 16,384 GB, or 16 terabytes. Space is divided evenly across all compute nodes in the EDW (10 to 40 nodes, using 1 to 4 data racks). Each compute node has 8 distributions (1 distribution on each of the 8 LUNs in the storage node). Each data rack has 10 compute nodes, so each data rack will have 80 distributions. One LFF (Large Form Factor) storage node is associated with each compute node. Note that Small Form Factor configurations differ. In the example shown in Figure 14, we assume 1 data rack with 80 distributions; therefore, each distribution in this single data rack example equals 204.8 GB: Distribution size = 16,384 GB / 80 distributions = 204.8 GB per distribution.

Figure 14. One P2000 G3 MSA (LFF disk example) associated with a single compute/storage node pair

    ,LOG_SIZE = 100   (in GB)
The log file should be at least 2X the size of the largest load file or CTAS operation for a clustered index table, especially if varchar columns are used. In this example, each compute node has a 100 GB log file (10 nodes = 1 TB in total).
)

The CREATE DATABASE command is a good example of how the PDW software gives your DBA and operations staff the appearance of a single system.
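Collected into a single statement, the example above would look like the following sketch (the database name is hypothetical; the sizes are the illustrative values just discussed):

CREATE DATABASE SalesDW
WITH (
    AUTOGROW = OFF,            -- pre-allocate space; OFF is the default
    REPLICATED_SIZE = 1024,    -- GB of replicated table space per compute node
    DISTRIBUTED_SIZE = 16384,  -- GB of distributed table space across the appliance
    LOG_SIZE = 100             -- GB of log space, at least 2X the largest load or CTAS
);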
When the command is executed, behind the scenes the CREATE DATABASE command simplifies management by accessing all 10 to 40 instances of SQL Server while giving the appearance of a single database to the person executing the command.

Creating a table
Creating a table in a PDW database is relatively straightforward from a syntax point of view. However, in the next section we discuss best practices for choosing good "distribute_on", "cluster_on" and "partition_on" columns.

Figure 15. How to define the column to "distribute_on"

Earlier in the document, we talked about distributing large tables, which are traditionally fact tables. To improve performance, it sometimes makes sense to cluster data within each distribution. Finally, partitioning is frequently used to make database management easier, especially as partitions need to be dropped from tables over time. For example, it is common to partition by date so that the oldest months may be more easily dropped and potentially archived.

Important
If the DISTRIBUTE_ON option is not specified, the default is to replicate the table. Therefore, you must use the "DISTRIBUTE_ON" option for large, multi-billion row tables.

Partitioning a replicated table
Dimension tables are frequently small to medium sized, and replicating them is simple and straightforward. However, you may want to consider partitioning a large, replicated dimension table. It is recommended that performance testing be done with and without partitioning to determine whether partitioning helps performance for your specific query workload. For example, partitioning a table with a non-clustered index may sometimes cause extra random I/O, which can inhibit query response time and throughput.

Partitioning example:

CREATE TABLE TestTable (
    partition integer NOT NULL,
    FirstName varchar(25),
    PostalCode integer)
WITH (
    PARTITION ( partition RANGE LEFT FOR VALUES (10, 20, 30, 40, 50, 60, 70, 80, 90))
);

The data is partitioned based upon the range of partition values in the example above.

Figure 16. Example of partitions based upon range of partition values

Partition   Value
1           col <= 10
2           10 < col <= 20
3           20 < col <= 30
4           30 < col <= 40
5           40 < col <= 50
6           50 < col <= 60
7           60 < col <= 70
8           70 < col <= 80
9           80 < col <= 90
10          col > 90

Currently, PDW software allows:
- Up to 2 billion tables per database
- Up to 1,024 columns per table
- Number of rows limited only by available storage
- Maximum of 8,060 bytes per row

Tables with varchar columns may have a defined table width that exceeds 8,060 bytes; however, the actual width of all columns in any given row must still fall within the 8,060-byte limit.

Best practices for distributed tables
Supported schemas
The HP EDW Appliance supports the multi-dimensional, star and snowflake schemas which are common in data warehouses and data marts. In addition, the appliance also supports normalized database designs, which may be found in ODSs or in some large Enterprise Data Warehouse designs. The use of distributed tables is one of the features which allow the HP EDW Appliance to provide customers with the performance of a truly scalable, massively parallel, shared-nothing architecture.

Distribution based upon hash keys
In most data warehouse schemas, it is common to find large and small tables. Generally, large tables are evenly distributed across all the available compute nodes by using a hash key which is generated from a single column in its respective table.
However, it should be noted that it is sometimes desirable to replicate a large table in order to minimize InfiniBand network traffic when large tables are joined. Very large table replication is rarely desirable, and replicating large tables should only be considered on a case-by-case basis, depending upon the nature of the end user query workload.

When creating distributed tables, the PDW software creates eight physical distributions per DBMS instance in the background (80 distributions per 10-node data rack, or 320 distributions in a 4-data-rack appliance). Behind the scenes, each distribution results in a distinct physical table within each compute node's instance of SQL Server. This implies that the column selected to generate the hash key for any given table needs to have many more possible values than there are distributions on the appliance.

How to determine a good hash key column
Before choosing a hash key column, it is highly recommended that you understand your data and the nature of your query workload. Test the cardinality (high cardinality means many unique values) and the distribution of the candidate key by using a query with count(*) and a GROUP BY on the targeted distribution column to detect likely skew.

Use SELECT COUNT/GROUP BY to determine cardinalities of candidate columns. Example:

Select top 1000 <key>, count(*) as cnt from <table> group by <key> order by cnt desc

The hash key column should have a large number of distinct values (high cardinality) with significant variation in values and in volume. More important than simply having a large number of possible values, the values in the hash key column should be relatively evenly distributed (minimal skew). Minimal skew can be maintained by ensuring that no single key value (including NULL) represents a large percentage of rows on its own. Watch out for NULLs, because all NULLs hash to a single distribution and will skew rows across your distributions poorly.

It is important that the number of distinct values of the distribution key be at least 10 times the number of table distributions on the appliance. The larger the number of distribution key values, the less concern there will be with a few dominating values that skew the number of rows in each distribution.

Example: With a single data rack appliance, we would expect 80 distributions for any given distributed table. Therefore, it is desirable to have at least 800 possible values in the column that you would like to use to hash the distributed table. Likewise, if you have four data racks, you want your hash key column to have at least 3,200 possible unique values, because four data racks will have 320 distributions.

Example: If you decide to distribute your table on a "StoreID" column and there are only 100 stores in the company, then "StoreID" is likely to give you a poor distribution on a single data rack with 80 distributions. With four data racks, "StoreID" would be even less appropriate because there are too few possible values. Time-based elements also tend to be poor choices for hash key distributions.

Once you define the hash key column in the "distribute_on" clause of the "Create Table" statement, PDW will automatically create tables on each compute node to reflect this design choice.
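For illustration, the following is a minimal sketch using hypothetical table and column names: a large fact table distributed on a high-cardinality column, alongside a small dimension table left replicated (the default):

-- Large fact table: distributed on a high-cardinality, evenly spread column
CREATE TABLE dbo.FactSales (
    SaleID      bigint        NOT NULL,
    CustomerKey int           NOT NULL,
    StoreID     int           NOT NULL,
    SaleDate    datetime      NOT NULL,
    Amount      decimal(18,2) NOT NULL)
WITH (DISTRIBUTION = HASH (CustomerKey));

-- Small dimension table: replicated in full to every compute node
CREATE TABLE dbo.DimStore (
    StoreID   int NOT NULL,
    StoreName varchar(50))
WITH (DISTRIBUTION = REPLICATE);

Here CustomerKey is assumed to have many more than 800 distinct, evenly spread values; StoreID, with only a few hundred values, is kept in a replicated dimension table rather than being used as a hash key.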
In the rare case that you do not have a hash key column that allows for a good distribution of data, you may have to create an artificial hash key column, or possibly concatenate several columns into a single column, in your ETL software. The new column may then be used to generate hash keys which are more evenly distributed across all of the table's distributions.

Important
Sometimes a poor distribution key is selected during the design phase of a project or, as time passes, the cardinality and distribution of your data change so that your distributions are no longer even. In these cases, you may decide to redistribute the table on a different column. The easiest way to do this is to execute a CTAS statement that distributes on the new hash key column. CTAS is generally faster than reloading the table.

Tip: If you distribute a large dimension table, try to distribute commonly joined fact tables on the same key. This will minimize network traffic within the EDW Appliance.

Data skew
Data skew is defined as having a disproportionate number of rows in one or more distributions when compared to other distributions across all nodes in the appliance. A user table is said to have data skew if one or more distributions have significantly more rows than the others.

Figure 17. Data skew example

While desirable, it is unlikely that a hash key column can be found that provides a perfectly even distribution of data across every distribution on all of the compute nodes. As a general rule of thumb, data skew between 10% and 30% may have an impact on performance, and data skew greater than 30% for a given distribution will have a significant impact on performance. Ideally, data skew should be kept below 1% - 7% for optimal and consistent performance.

What causes skew?
Data skew tends to be caused by a poor, non-uniform distribution of values within the hash key column. An example would be a retailer that sells a few items in disproportionately higher volumes than other items. If an "item_number" column were defined as the hash key for this retailer, some distributions could have significant data skew. Queries that require a table scan would then have to wait for the distributions with more rows to finish before query execution completes. If data structures allow for non-unique distribution keys, this can also become an issue if the key is heavily weighted toward certain values.

Data skew can also result from poor data quality. Disproportionate NULL values in the distribution column can skew data among distributions, and bad data can also create repeating values in a distribution key.

Detecting skew
Before you go into production, it is desirable to make sure that data is evenly distributed across your distributions. This may be accomplished by:
- Loading a representative sample of data, or the complete initial load
- Performing a "Select 1 from <table>"
- Looking at the query plan to see how many rows came from each distribution
- Or issuing: "Select top 1000 <key>, count(*) as cnt from <table> group by <key> order by cnt desc"

If the data in your distributions is skewed, try another key column or, if necessary, try creating a random or surrogate key.

Partitioning distributed tables
By definition, distributed tables are already segmented by hashed distributions to allow for massively parallel performance.
Tables are distributed with the intent of optimizing appliance-wide CPU and I/O subsystem consumption, whereas partitioning distributed tables is typically done for manageability. In essence, partitioning further divides the rows within each distribution based on a partition function:
- Partitions exist within each respective compute node
- Date/time or integer surrogate based partitioning is common
- Typically the partition column is the same as the clustered index key

For those readers who are interested, partitions (applied within each schema) are all placed in the same filegroup.

Partitioning is useful for operational efficiency when adding, loading, archiving, dropping, and switching/sliding partitions. Partitioning may also be desirable to SWITCH partitions OUT for fast deletion of history, or IN to modify or add a specific historical slice (a sketch appears at the end of this subsection). In general, partitioning allows DBAs to manage data for very large tables in smaller chunks by using the ALTER TABLE syntax:

ALTER TABLE [ database_name.[ dbo ].| dbo.] source_table_name
    [ { SPLIT | MERGE } RANGE boundary_value ]
    | [ SWITCH [ PARTITION source_partition_number ] TO target_table_name [ PARTITION target_partition_number ] ]
[;]

Some customers partition distributed tables in order to optimize the loading of large tables. For example, partitioning allows for fast loading of an unused partition and switching it in after the loading process. Even though PDW best practices encourage an index-lite database design, if a table has an index, partitioning allows for re-indexing of smaller partitions, which is a faster operation. Finally, in the event that data needs to be reorganized, partitioning allows for piecemeal data reorganization at the partition level.

Note
You cannot move partitions between distributed and replicated tables. If you feel that a replicated table needs to be partitioned, consider making it a distributed table. For more information on SQL Server partitioning, see http://technet.microsoft.com/en-us/library/cc966457.aspx

Example of a distributed table with partitions:

CREATE TABLE TestTable (
    id integer NOT NULL,
    LastName varchar(25),
    ShipDate datetime )
WITH (DISTRIBUTION = HASH (id),
      CLUSTERED INDEX (ShipDate),
      PARTITION (ShipDate RANGE RIGHT FOR VALUES (
          '1995-01-01', '1995-02-01', '1995-03-01', '1995-04-01', '1995-05-01', ... )));

The number of distributions equals eight times the number of compute nodes. Therefore, one data rack will have 80 distributions created because there are 10 active compute nodes per data rack, and four data racks will have 320 distributions. Partitioning the table into 10 ranges will effectively segment it into 800 horizontal (row-based) partitions on a single data rack EDW Appliance (10 ranges across 80 distributions). Partitions for a given distribution are all placed in the same filegroup on its respective compute node.

Generally, it is a best practice to cluster and partition on the same column; this holds true most of the time, but it is query-dependent. Consequently, it is always a good idea to do performance testing to verify your database design.
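As a sketch of the SWITCH pattern mentioned above, the following assumes a hypothetical archive table whose columns, distribution, clustered index and partition boundaries match the TestTable example; partition 1 (ShipDate earlier than '1995-01-01') is switched out for archiving:

-- Empty table with a layout identical to TestTable (same distribution column,
-- clustered index and partition boundaries); shown here with abbreviated boundaries
CREATE TABLE TestTable_archive (
    id integer NOT NULL,
    LastName varchar(25),
    ShipDate datetime )
WITH (DISTRIBUTION = HASH (id),
      CLUSTERED INDEX (ShipDate),
      PARTITION (ShipDate RANGE RIGHT FOR VALUES (
          '1995-01-01', '1995-02-01', '1995-03-01', '1995-04-01', '1995-05-01' )));

-- Move the oldest slice out of the fact table so the history can be archived
-- or dropped quickly, without a row-by-row delete
ALTER TABLE TestTable SWITCH PARTITION 1 TO TestTable_archive PARTITION 1;

The same statement in the other direction (loading an unused partition in a staging table and switching it IN) covers the fast-load scenario described above.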
Clustered index vs. a heap table
Up until this point, we have discussed the two most critical issues affecting EDW Appliance performance: distributed vs. replicated tables, and selecting a good hash key column for distributed tables. Another common issue which should be evaluated when developing a PDW data warehouse is whether to use a clustered index or a heap table structure.

Generally, a clustered index is recommended for most tables, because query performance tends to be better if the clustered index column is frequently used in many queries. If your clustered index is not frequently used, then a heap table may be a better alternative. In line with the index-lite philosophy, secondary (non-clustered) indexes should be minimized or, preferably, eliminated on the EDW Appliance in order to reduce random I/O. In addition to throughput gains, eliminating (or minimizing) indexes also makes the job of the DBA easier, makes system management simpler and ensures that your ETL batch window will be shorter.

A clustered index is different from a secondary index. A clustered index is built into the table structure itself, which means it cannot be added or dropped later; it can only be specified when the table is created with CREATE TABLE. However, if desired, you can use a CTAS operation to change a clustered index. Secondary indexes, on the other hand, are physically stored and managed as separate structures and may be added or dropped at any time.

Even though query response times are generally improved when a clustered index is used, the reader should be aware that clustered indexes affect load windows by:
- Introducing a "sort" step during table loads (loads are slower than heap loads)
- Requiring periodic maintenance to defragment (use a CTAS operation)

On the other hand, heap tables may be worth considering if batch load windows are a more important issue to your business than end user query performance. Heap tables also have the advantage that fragmentation is minimized, so defragmentation is not necessary.

If you decide that you want a clustered index for your table, please consider the following best practices:
- Cluster fact tables on the same column used for partitioning, which allows optimal sequential I/O
- Minimize secondary (non-clustered) indexes on PDW to reduce random I/O
- Avoid using multi-column (compound) clustered indexes
- Use CREATE TABLE AS SELECT (CTAS) to defragment a clustered table

Heap table advantages:
- Provides the fastest data loading
- Can be used on replicated tables if a clustered index cannot be easily determined

Now that you have a grounding in the tradeoffs between a clustered index and a heap, if you are still not sure you have found a good clustered index that will help your query performance, your best course of action is to test some of your frequently used queries against both types of table structures. Monitoring query throughput, response time and resource utilization during the testing will allow you to make the best decision for your database and workload environment.
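One hedged way to set up such a comparison, reusing the hypothetical FactSales table from the earlier sketch, is to build both variants with CTAS and run the same query set against each:

-- Heap copy: no clustered index, fastest to load
CREATE TABLE dbo.FactSales_heap
WITH (DISTRIBUTION = HASH (CustomerKey))
AS SELECT * FROM dbo.FactSales;

-- Clustered-index copy: clustered on the column most queries filter or join on
CREATE TABLE dbo.FactSales_clustered
WITH (DISTRIBUTION = HASH (CustomerKey),
      CLUSTERED INDEX (SaleDate))
AS SELECT * FROM dbo.FactSales;

Both copies keep the same distribution column, so any response-time difference observed between them can be attributed to the table structure rather than to data placement.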
Non-clustered indexes
When designing a new data warehouse using the HP EDW Appliance and PDW software, it is desirable to start with no indexes. Generally, the best performance under load comes from tables with no secondary indexes. As the query load increases, seek-time delays and disk queuing become more noticeable, so the negative effect of non-clustered indexes may not be apparent under a light workload.

Non-clustered index considerations:
- Maximum index size for a non-clustered key is 900 bytes (varchar columns may bypass this limit at definition time, but enforcement is based on actual data)
- Creating an index requires ALTER permission on the table
- To view information on existing indexes, you can query the sys.indexes catalog view
- Indexes cannot be created on views
- Maximum number of non-clustered indexes is 999 per table

If you decide to rebuild an index via the ALTER INDEX command, consider the fact that indexes are rebuilt in parallel on each compute node. This allows the rebuild to complete faster, but it also consumes more resources during the rebuild process. The ALL option will alter all indexes on the table, both clustered and non-clustered:

ALTER INDEX { index_name | ALL } ON [ database_name.[ dbo ]. | dbo.] table_name REBUILD [;]

Remember that a PDW best practice is to use non-clustered indexes sparingly.

Best practices for loading data
PDW loading tools
Now that you have created your database and created distributed, partitioned and replicated tables, it is time to load data into your EDW/PDW database. Determining the best data loading logistics for your environment is an important step in maintaining high levels of performance. The following are some important issues to consider when implementing a loading methodology:

Reduce disk fragmentation. Loading data into your warehouse using the best practices outlined below will help reduce table fragmentation as much as possible and allow for maximum I/O throughput, because data will be able to stream off the disk drives with minimal disk head movement.

Provide system recovery capabilities in the event of data load failure. Some thought should be given to restart and/or recovery procedures in the event that a failure occurs in the middle of a batch load.

Load data while respecting concurrent queries and loads. The PDW software allows you to load data into your data warehouse while maintaining query performance. However, even though PDW uses SQL Server Resource Governor internally to balance query vs. load resources, you do not want to push this feature to its limit by loading excessively large volumes of data during your peak query windows. End users may otherwise see some degradation in performance.

The PDW software supports several different methods of loading data into the PDW database. Extract, transformation and load software, such as SQL Server Integration Services (SSIS), may execute on the landing zone or on an external ETL tier which generates flat files to be loaded into the PDW database. The landing zone is physically configured with a DVD drive and USB port, which allows customers to load and execute third-party software of their choosing. The PDW software allows customers to use multiple load options.

Loading tools:
- DWLoader utility
- SQL Server Integration Services (SSIS). SSIS may be used for bulk loads or for "close to" real-time trickle updates by using very tight windows for mini-batch loads.
- Standard SQL DML statements: INSERT/SELECT. SQL INSERT/SELECT is useful for real-time updates into the data warehouse or operational data store database; it is generally not the preferred method for bulk loads (see the sketch after this list).
- CREATE TABLE AS SELECT (CTAS). CTAS is not used to load data from flat files. As stated earlier in the document, CTAS is good for defragmenting tables, creating another copy of a table, creating a distributed table from a replicated one or vice versa, creating a new table distributed on a different column, creating a new table with a different clustered index or with no index, creating a new table with different partitioning, or creating a temporary version of a table.
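The sketch below illustrates the INSERT/SELECT option with hypothetical table names: a small daily delta that has already been landed in a table is trickled into the fact table while queries continue to run:

-- Trickle the day's delta rows into the distributed fact table
INSERT INTO dbo.FactSales (SaleID, CustomerKey, StoreID, SaleDate, Amount)
SELECT SaleID, CustomerKey, StoreID, SaleDate, Amount
FROM dbo.FactSales_delta;

Because the statement is transaction protected and rows are compressed as they are inserted, this path suits real-time feeds; DWLoader or SSIS remains the better choice for bulk volumes.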
DWLoader best practices
DWLoader is a bulk-load, command line oriented utility. It is the tool most frequently used when loading flat files which have already been remapped and cleansed by Extract, Transformation and Load (ETL) software. DWLoader is integrated with PDW's Data Movement Service (DMS) and streamlines I/O while minimizing data load times. The DWLoader input file must reside on the landing zone, and DWLoader is not intended to support more than 2,147,483,647 (roughly 2 billion) rows per input file. DWLoader provides transactional protection and a configurable batch size (10,000). The PDW software can run a maximum of 10 concurrent loads in parallel on the same appliance; beyond that point it queues up subsequent load operations, and no error is generated.

SQL Server Integration Services (SSIS)
SSIS can run either on the landing zone or on a separate server tier outside the PDW appliance. However, if you choose to execute SSIS on the landing zone, make sure that it has enough horsepower to meet your service level agreements. Similar to DWLoader, SSIS can perform parallel load operations. When using SSIS to load data, the following tips help provide optimal SSIS performance:
- For good PDW loading throughput, it is important to keep a steady stream with minimal starts and stops.
- PDW connections and queries are very costly to initiate. Using fewer connections will allow for more efficient and faster load rates.
- Data type conversion in the PDW destination adapter is very expensive. Be sure the input types match the destination types, especially for strings and decimals.

An SSIS ".dtsx" package may be used to load data directly into the PDW database or to create a flat file for DWLoader. In the package workflow for SQL Server PDW, you can load and merge data from multiple sources and load data to multiple destinations. The loads occur in parallel, both within a package and among multiple packages running concurrently. The combined total of loads with SSIS and DWLoader is 10 active loads per appliance. Each SSIS package destination from the same data source counts as one load when the package is running. Multiple packages can run concurrently as long as each package does not use more than the maximum active loads. For example, if there is a maximum of 10 active loads and you concurrently run two packages, each using 10 destinations, one package will run while the other waits in the load queue. For maximum appliance throughput using SSIS loading, you will want to stream to multiple concurrent PDW destination adapters.

In general, if you have a choice between SSIS and DWLoader, tests have shown that DWLoader is faster and is the preferred method when loading from flat files. One reason why DWLoader is faster is SSIS locking behavior. When loading data with SSIS, SQL Server PDW uses row-level locks to update data in the destination table. Bear in mind that the rows in the destination table are not locked while the data is loaded into the staging table.
Loading data with SSIS initiates these locks:
- A SharedUpdate lock on the destination table and the destination database
- A SharedUpdate lock on the staging table (when a staging database is used)

The general best practice is that DWLoader is the most efficient way to load data into your PDW database. However, if you require more ETL functionality (such as data cleansing, table lookups, mapping, etc.), then SSIS is a better choice, because SQL Server Integration Services is a fully functional ETL tool.

Staging database
Even though the staging database is technically optional, benchmarks have shown that the use of a staging database generally improves the EDW Appliance's load rates for most data warehouses. There should be only one staging database per EDW Appliance; it is shared by all load jobs for all destination databases on the appliance. When creating the staging database, use the following guidelines:
- Replicated table size should be the estimated size, per compute node, of all the replicated tables that will load concurrently.
- Distributed table size should be the estimated size, per appliance, of all the distributed tables that will load concurrently.
- Log size is typically similar to the replicated table size.

When your data warehouse first goes into production, your initial PDW load jobs tend to be the largest. Therefore, you may want to consider reducing the size of your staging database afterwards so that it more closely matches your largest incremental loads rather than the initial load. To reduce the size, you can drop the staging database and create it again with smaller allocations, or use the ALTER DATABASE statement (a sketch appears at the end of this loading section).

Miscellaneous loading tips
This section points out a few more PDW best practices which were alluded to, but not necessarily spelled out, in the earlier sections of this document.
- Make loads partition friendly by grouping data by partition prior to loading.
- If necessary, drop non-clustered indexes before loading and re-index after all loads are complete. This will help minimize fragmentation.
- During development, run performance tests to find out which method performs best in your environment.
- There is no benefit to sorting data before it hits the landing zone. The main reason there is no performance gain from pre-sorting has to do with the staging table and the use of a clustered index on the destination table. Since most customers use a staging table as part of the loading process, data is automatically sorted during the load procedure; pre-sorting the data therefore saves no time during the load. Pre-sorting data can also hurt load performance, because pre-sorted input rows may result in unbalanced compute node activity during the load window.
- Use CTAS to expand or grow data (e.g., for testing) instead of a standard SQL DML INSERT/SELECT statement.
- Minimize page breaks (fragmentation) by designing partition-friendly loads.
- Load and update "wall clock" times for replicated tables tend to be much longer than for distributed tables, because replicated tables must be copied in full to each compute node in the appliance.
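The following is a minimal sketch of resizing the staging database after the initial load; the database name is hypothetical, and the SET options are assumed to mirror the CREATE DATABASE options shown earlier, so verify the exact ALTER DATABASE syntax against your PDW release before using it:

-- Shrink staging allocations so they reflect the largest incremental load,
-- not the one-time initial load (values are illustrative, in GB)
ALTER DATABASE StageDB SET (
    REPLICATED_SIZE = 100,     -- replicated tables loading concurrently, per compute node
    DISTRIBUTED_SIZE = 2048,   -- distributed tables loading concurrently, per appliance
    LOG_SIZE = 100             -- typically similar to the replicated table size
);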
After data is loaded into the PDW database
During the PDW database development cycle, it is desirable to define and use a few "master queries" which represent your end user query workload. These queries should be run once every few months to verify that end user response time and I/O throughput have remained relatively consistent over time. Proactive, periodic query performance monitoring is more desirable than having end users raise query performance as an issue.

PDW software will automatically update statistics after a table's data changes by 20%. Based upon your performance monitoring tests, it may be desirable to increase your update statistics frequency. However, do not update statistics too often, because there is a trade-off between improving query plans and the recompilation time caused by updated statistics.

Miscellaneous facts about update statistics:
- All statistics use standard SQL Server sampling.
- PDW creates only single-column statistics. Compound, multi-column (covering) statistics may be implemented, but they must be created manually.
- If your joins are on multiple key columns, multi-column statistics help avoid mistaken Nested Loop plans on the SQL Server nodes.

The following are some clues which may warrant exploring multi-column statistics:
- A long-running query step involving a multi-column join key and experiencing an inordinate amount of CXPACKET waits
- The estimated SQL query plan shows a Nested Loop and a very small number of estimated result rows from the join
- The actual query has millions of rows running through a Nested Loop operation

Multi-column statistics example:

SELECT ...
FROM fPageView A ...
JOIN fCookiePageView cpv
    ON cpv.PageViewID = A.PageViewID
    AND cpv.SessionID = A.SessionID
...

Performance can be improved significantly by creating multi-column statistics for the above query, which joins two fact tables on multiple columns:

Create Statistics pv_page_sess on fPageView(PageViewID, SessionID)
Create Statistics cpv_page_sess on fCookiePageView(PageViewID, SessionID)

PDW Create Statistics uses syntax similar to the standard SQL Server command, but the current release of PDW software has these limitations:
- No filter predicates
- No sampling percentage options

To complete creating or updating statistics as quickly as possible, PDW executes the command in parallel on all the compute nodes.

EDW performance metrics
HP and Microsoft have run various performance tests internally and with customers. The following are a few reasonable performance metrics which may be used when evaluating EDW throughput. Since the EDW is a loosely coupled MPP architecture, system scan rates are expected to scale in a linear fashion. Therefore, it should be expected that a 4-rack EDW can scan at about 66 GB/second. Table 2 shows what are considered to be reasonable performance metrics for each data rack.

Table 2. One data rack performance metrics

EDW performance metric                        Small Form Factor disks (per data rack)   Large Form Factor disks, 15k (per data rack)
Load rates                                    1.7 TB/hr                                 1.1 TB/hr
Scan rates, compressed (raw disk, per rack)   16.5 GB/sec                               12.8 GB/sec
Backup rates (to disk)                        5.4 TB/hr                                 5.1 TB/hr

Summary
This technical white paper discusses many of the major issues relating to EDW/PDW best practices. While these recommendations are practical and very useful in most customer environments, HP recognizes that each customer's workload and set of requirements will vary. Therefore, it is highly suggested that customers test the various best practice suggestions presented in this document in their own environment, using a master query set that represents their users' most frequently executed queries. These queries should be run both single-threaded and under load, in such a way as to reflect the customer's "real world" workload volume.
Once complete, these tests will provide the customer's technical staff with a great deal of insight regarding data warehouse performance. By changing one variable at a time, the data warehouse may be custom-tuned to provide optimal performance for all users. In the meantime, please feel free to contact HP or Microsoft for more in-depth insight into EDW/PDW best practices and for help implementing a successful and efficient foundation for your business intelligence environment.

For more information
EDW solutions page: http://h71028.www7.hp.com/enterprise/us/en/partners/microsoft-enterprise-data-warehousesolution.html?jumpid=ex_r2858_us/en/large/tsg/microsoft_edw
EDW product page: http://h10010.www1.hp.com/wwpc/pscmisc/vac/us/en/sm/solutions/enterpriseoverview.html
HP Business Intelligence Solutions for Microsoft SQL Server: http://www.hp.com/solutions/microsoft/sqlbi
HP Business Data Warehouse Appliance YouTube video: http://www.youtube.com/watch?v=uoXD9nJKmcw
BDW product page: http://h10010.www1.hp.com/wwpc/us/en/sm/WF05a/457757-457757-42926994346415-4292808-5104076.html
HP Business Decision Appliance overview (useful as a data mart spoke in a hub and spoke BI architecture, or as a standalone data mart): http://h10010.www1.hp.com/wwpc/pscmisc/vac/us/en/sm/solutions/businessoverview.html
ActiveAnswers page: http://www.hp.com/solutions/activeanswers
Microsoft Remote Data Copy (Parallel Data Export) documentation: http://download.microsoft.com/download/1/7/5/175A7833-6F75-418D-8800-86D44D1D712D/[MSDPPDW].pdf

To help us improve our documents, please provide feedback at http://h71019.www7.hp.com/ActiveAnswers/us/en/solutions/technical_tools_feedback.html.

© Copyright 2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein. Microsoft is a U.S. registered trademark of Microsoft Corporation. 4AA3-8477ENW, Created November 2011