SQL Server PDW w. Polybase: What, Why, How Srinath Shankar

SQL Server PDW w. Polybase: What, Why, How
Srinath Shankar
Microsoft Gray Systems Lab
Madison, WI
graysystemslab.com
Today, many organizations have data sets in both a relational DBMS and Hadoop. While systems like
Hive and Impala can be used to query HDFS-resident data, they are not capable of answering queries
that combine data from the two universes of structured (relational) and unstructured (Hadoop) data.
In this talk I will describe Polybase, an extension to the SQL PDW parallel database system that is
capable of executing queries against both types of data.
Polybase is agnostic on both the type of the Hadoop cluster (Linux or Windows) and whether it is a
separate cluster or whether the Hadoop nodes are co-located with the nodes of the PDW appliance.
Using PDW’s CTAS (“create table as select”) syntax Polybase provides the ability to move data in parallel
between nodes of the Hadoop and PDW clusters. In addition, users can create external tables over
HDFS-resident data (all standard HDFS file formats are supported as well as custom file formats as long
as RecordReader and RecordWriter implementations are provided) allowing queries to reference data
stored in HDFS as if the data had been loaded into a relational table.
While some other parallel database systems provide similar capabilities, Polybase pushes the state-ofthe-art further into two significant ways. First, when optimizing a SQL query that references data stored
in HDFS, the Polybase query optimizer makes a cost-based decision (using statistics that it has stored on
the HDFS file in the PDW catalog) on whether or not it should transform relational operators over HDFSresident data into MapReduce jobs for execution on the Hadoop cluster. Consider for example a simple
query with two selections, a join, and an aggregate. If one of the two input tables is stored in HDFS, the
Polybase query optimizer will evaluate whether or not to perform the select operator on that table as a
Map job on the Hadoop cluster or whether it is more efficient to pull the entire file in PDW and perform
the selection using the SQL Server instances running on the PDW appliance. If both input tables are in
HDFS, not only will the Polybase QO consider pushing the selection as Map jobs, but it will also consider
the benefits of pushing the join and aggregate as well. All of this is totally transparent to the user and is
driven by a state-of-the art parallel query optimizer initially developed at the Gray Systems Lab.
About the Speaker
Srinath Shankar is a Research SDE working for Microsoft at the Jim Gray Systems Lab in Madison, WI. His
primary field of work is database management systems, with a focus on parallel and distributed data
management. He obtained his PhD at the University of Wisconsin, Madison and his bachelor’s degree at
IIT Madras before that. At Microsoft, he helped develop a cost-based query optimizer for SQL Server
Parallel DataWarehouse (PDW), a shared-nothing parallel DBMS. He currently works on Polybase, a splitquery execution framework capable of processing both relational data in PDW and unstructured data in
Hadoop.