The Thinking Person’s Guide to Data Warehouse Design

The Thinking
Person’s Guide to
Data Warehouse
Design
Robin Schumacher
Director of Product Strategy
www.enterprisedb.com
www.enterprisedb.com
Building a Logical Design
Transitioning to the Physical
Monitoring for Success
www.enterprisedb.com
Building a Logical
Design
www.enterprisedb.com
Why Care
About
Design…?
www.enterprisedb.com
What is the key component for success?
*
In other words, what you do with your PostgreSQL
Server– in terms of physical design, schema design, and
performance design – will be the biggest factor on
whether a BI system hits the mark…
* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.
www.enterprisedb.com
Your Database
The #1 Cause of
Database
Downtime…?
Bad Design…
Source: Oracle Corporation
www.enterprisedb.com
First Recommendation – Use a Modeling Tool
*
www.enterprisedb.com
A logical design for OLTP Isn’t For Data Warehouse
www.enterprisedb.com
Simple reporting databases
End Users
Application Servers
Reporting
Read Shard
Database
One
OLTP Database
ETL
Replication
Just use the same design
on a different box…
www.enterprisedb.com
Data Warehouse
Horror Story
Number One…
www.enterprisedb.com
The logical design for analytics/data warehousing
www.enterprisedb.com
• Datatypes are more generally defined, not directed
toward a database engine. Still choose carefully
• Entities aren’t designed for performance necessarily
• Redundancy is avoided, but simplicity is still a goal
• Bottom line: you want to make sure your data is
correctly represented and is easily understood (new
class of user today)
www.enterprisedb.com
Manual horizontal partitioning
Modeling technique to overcome large data volumes
www.enterprisedb.com
Manual Vertical Partitioning
www.enterprisedb.com
Pro’s/con’s to manual partitioning
Pro’s
• Less I/O if design holds up
• Easy to prune obsolete data
• Possibly less object contention
Con’s
•
•
•
•
More tables to manage
More referential integrity to manage
More indexes to manage
Joins oftentimes needed to accomplish
query requests
• Oftentimes, a redesign is needed
because the rows / columns you
thought you’d be accessing together
change; it’s hard to predict ad-hoc
query traffic
www.enterprisedb.com
Use a modeling tool
Don’t use 3rd normal form
Manual partition but…
Let the DB do the heavy lifting
www.enterprisedb.com
Transitioning to a
Physical Design
www.enterprisedb.com
SQL or NoSQL…?
How to scale…?
Should I worry about High availability…?
Index or no…?
Is sharding a good idea…?
How should I partition my data…?
Row or Column database…?
www.enterprisedb.com
What technologies you should be looking at
*
* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.
www.enterprisedb.com
• Whether you choose to go NoSQL, Shard, use MPP
databases, or something similar, divide & conquer is
your best friend
• You can scale-up and divide & conquer to a point, but
you will hit disk, memory, or other limitations
• Scaling up and out is the best future proof methodology
www.enterprisedb.com
Divide & Conquer via Sharding
www.enterprisedb.com
Row or Column-Based Database?
A column-oriented architecture looks the same on the surface, but stores data
differently than legacy/row-based databases…
www.enterprisedb.com
Row or Column-Based Database…?
Yes, Row-based database!
Yes, Column-based database!
• Will need most columns in a table for query
• Only need subset of columns for query
• Will be doing lots of single inserts/deletes
• Need very fast loads; little DML
• Small-medium data
• Medium-very large data
• Know exactly what to index; won’t change
• Very dynamic; query patterns change
www.enterprisedb.com
Example: Column DB vs. “Leading” row DB
InfiniDB takes up 22% less space
InfiniDB total query times were 65% less
InfiniDB loaded data 22% faster
InfiniDB average query times were 59% less
Notice not
only are
the queries
faster, but
also more
predictable
* Tests run on standalone machine: 16 CPU, 16GB RAM, CentOS 5.4 with 2TB of raw data
www.enterprisedb.com
Hybrid Row / Column Databases
Some vendors now give you a choice of table format –
row or column – based on expected access patterns.
www.enterprisedb.com
What about NoSQL options?
• Standard model is not relational
• Typically don’t use SQL to access the data
• Take up more space than column databases
• Lack special optimizers / features to reduce I/O
www.enterprisedb.com
What about NoSQL options?
• Really are row-oriented architectures that store data in
‘column families, which are expected to be accessed
together (remember logical vertical partitioning?)
Individual columns cannot be accessed independently
• Will be faster with individual insert and delete operations
• Will normally be faster with single row requests
• Will lag in typical analytic / data warehouse use cases
www.enterprisedb.com
PostgreSQL Specific - Partitioning
• Main goal: reduce I/O via partitioning
• Partitioning in PostgreSQL is somewhat more
laborious than other RDBMS’s
• Consider when table size exceeds memory capacity
• Partitioning key is ‘key’ for many reasons
• Have seen > 90% response time reductions when
done right
• Partitioning also assists in more efficient data
pruning activities than typical DELETE operations
www.enterprisedb.com
PostgreSQL Specific - GridSQL
• One option for divide-and-conquer strategy
• Does have limitations with respect to PostgreSQL
feature and syntax support
• One customer of EnterpriseDB is running well with
8TB
www.enterprisedb.com
What About Indexing?
• If query patterns are known and predictable, and data is
relatively static, then indexing isn’t that difficult
• If the situation is a very ad-hoc environment, indexing
becomes more difficult. Must analyze SQL traffic and
index the best you can
• Over-indexing a table that is frequently loaded /
refreshed / updated can severely impact load and DML
performance. Test dropping and re-creating indexes vs.
doing in-place loads and DML. Realize, though, any
queries will be impacted from dropped indexes
• Remember that a benefit of (most) column databases is
that they do not need or use indexes
www.enterprisedb.com
Optimizing for Data Loads
• The two biggest killers of load performance are (1) very wide tables for
row-based tables; (2) many indexes on a table / foreign keys;
• Column-based tables typically load faster than row-based tables with load
utilities, however they will experience slower insert/delete rates than rowbased tables
• Move the data as close to the database as possible; avoid having
applications on remote machines do data manipulations and send data
across the wire a row at a time – perhaps the worst way to load data
• Oftentimes good to create staging tables then use procedural language to
do data modifications and/or create flat files for high speed loaders
• PostgreSQL COPY much faster than INSERT; EnterpriseDB’s
EDB*Loader faster than COPY
www.enterprisedb.com
Optimizing for Data Loads
• Increasing maintenance_work_mem to a large values (e.g. >
1GB) helps speed index and foreign key constraint creations
• Turning autovacuum off can help speed load operations
• Turning off synchronous commit (synchronous_commit )can
help improve load efficiency, but utilize generally only for allor-nothing use cases
• Minimizing checkpoint I/O is a good idea
(checkpoint_segments to 100-200 and checkpoint_timeout to
a higher value such as 1 hour or so).
www.enterprisedb.com
Monitoring and
Tuning the Design
www.enterprisedb.com
1. Bottleneck
Analysis
2. Workload
Analysis
3. Ratio Analysis
www.enterprisedb.com
Bottleneck Analysis
• The focus of this methodology is the answer to the question
“what am I waiting on?”
• With general PostgreSQL, unfortunately, it can be difficult to
determine latency in the database server
• Lock contention rarely an issue in data warehouses
• Can use EnterpriseDB’s wait interface
• Problems found in bottleneck analysis translate into better
lock handling in the app, partitioning improvements, better
indexing, or storage engine replacement
www.enterprisedb.com
Workload Analysis
• The focus of this methodology is the answer to three
questions: (1) Who’s logged on?; (2) What are they doing?;
(3) How is my machine handing it?
• Monitor active and inactive sessions. Keep in mind idle
connections do take up resources
• I/O and ‘hot objects’ a key area of analysis
• Key focus should be on SQL statement monitoring and
collection; something that goes beyond standard preproduction EXPLAIN analysis
www.enterprisedb.com
The Pain of Slow SQL
* Philip Russom, “Next Generation Data
Warehouse Platforms”, TDWI, 2009.
www.enterprisedb.com
Data Warehouse
Horror Story
Number Two…
www.enterprisedb.com
Workload Analysis
• SQL analysis basically becomes bottleneck analysis, because you’re
asking where your SQL statement is spending its time
• Once you have collected and identified your ‘top SQL’, the next step is to
do tracing and interrogation into each SQL statement to understand its
execution
• Historical analysis is important too; a query that ran fine with 5 million
rows may tank with 50 million or with more concurrent users
• Design changes usually involve data file striping, indexing, partitioning, or
parallel processing additions
www.enterprisedb.com
Workload Analysis
• The pgstatspack utility available to the community can provide some
statistics for workload analysis
• The pgfouine log analyzer tool can help you identify bad SQL
• EnterpriseDB packages a utility that duplicates Oracle Automatic
Workload Repository (AWR) reports, and shows hot objects, top wait
events, and much more
• A new SQL Profiler utility will be available from EnterpriseDB in the first
half of 2011 that will help in tracing and analyzing SQL statement
execution
www.enterprisedb.com
Ratio Analysis
• Least useful of all the performance analysis methods
• May be OK to get a general rule of thumb as to how various
resources are being used
• Do not be misled by ratios; for example, a high cache hit ratio
is sometimes meaningless. Databases can be brought to
their knees by excessive logical I/O
• Design changes from ratios typically include the altering of
configuration parameters and sometimes indexing
www.enterprisedb.com
Conclusions
www.enterprisedb.com
Ratio Analysis
• Design is the #1 contributor to the overall performance and availability of
a system
• With PostgreSQL, you have greater flexibility and opportunity than ever
before to build well-designed data warehouses
• With PostgreSQL, you now have more options and features available
than ever before
• The above translates into you being able to design data warehouses that
can be future proofed: they can run as fast as you’d like (hopefully) and
store as much data as you need (ditto)
www.enterprisedb.com
The Thinking
Person’s Guide to
Data Warehouse
Design
Thanks…!
www.enterprisedb.com