QMapper for Smart Grid: Migrating SQL-based Application
to Hive
Yue Wang1,2, Yingzhong Xu1,2, Yue Liu1,2, Jian Chen3 and Songlin Hu1,2
1 Institute of Computing Technology, Chinese Academy of Sciences, China
2 University of Chinese Academy of Sciences, China
3 Zhejiang Electric Power Corporation, China
{wangyue89,xuyingzhong,liuyue01,husonglin}@ict.ac.cn, 3 [email protected]
ABSTRACT
Apache Hive has been widely used by Internet companies for
big data analytics applications. It can provide the capability of compiling high-level languages into efficient MapReduce workflows, which frees users from complicated and time
consuming programming. The popularity of Hive and its
HiveQL-compatible systems like Impala and Shark attracts
attention from traditional enterprises as well. However, enterprise big data processing systems such as Smart Grid applications often have to migrate their RDBMS-based legacy
applications to Hive rather than directly writing new logic
in HiveQL. Considering their differences in syntax and cost
model, manual translation from SQL in RDBMS to HiveQL
is very difficult, error-prone, and often leads to poor performance.
In this paper, we propose QMapper, a tool for automatically translating SQL into proper HiveQL. QMapper consists of a rule-based rewriter and a cost-based optimizer.
The experiments based on the TPC-H benchmark demonstrate that, compared to manually rewritten Hive queries
provided by Hive contributors, QMapper dramatically reduces the query latency on average. Our real world Smart
Grid application also shows its efficiency.
Categories and Subject Descriptors
H.2 [Database Management]: Systems
Keywords
SQL on Hadoop; Hive; System Migration; Join Optimization
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from [email protected].
SIGMOD’15, May 31–June 4, 2015, Melbourne, Victoria, Australia.
Copyright © 2015 ACM 978-1-4503-2758-9/15/05 ...$15.00.
http://dx.doi.org/10.1145/2723372.2742792.
1. INTRODUCTION
In recent years, high-level query languages such as Hive
[17, 18], Pig [16] and JAQL [3] based on MapReduce have
been widely used to deal with big data problems [4]. For example, more than 95% of MapReduce jobs running in Facebook are generated by Hive [12]. By providing the SQL-like
query language HiveQL, Hive enables users who have experience using traditional RDBMS to run familiar queries
on MapReduce. Moreover, HiveQL queries can also be run
in systems like Shark [20] and Impala¹, which utilize in-memory storage to accelerate iterative computations and are
very popular in Internet companies as well.
Due to an increasing amount of data produced, performance bottlenecks of current RDBMS-based infrastructure
appear in traditional enterprises. It is critical for them
to leverage a new sustainable and scalable platform, which
can guarantee high performance, robust data processing and
controllable budget at the same time. For example, in the
Zhejiang province, the State Grid Corporation of China (SGCC) has deployed more than 17 million smart meters, which
will grow to 22 million within two years. Data collected by
these sensors was previously stored with complex indexes in
a commercial RDBMS to facilitate selective data reading.
On top of that, statistical data processing is implemented in
the form of SQL stored procedures. We observe that as both
the data collection frequency and the number of sensors grow,
write performance degrades sharply and global statistical
analysis on big tables tends to become a bottleneck as well.
The weak scalability of traditional
RDBMS and expensive cost of commercial parallel databases
force Zhejiang Grid to find another way to meet its business
requirements. Considering the high write throughput, good
scalability, high performance offline batch data processing
capability and lower cost of Hadoop/Hive, SGCC tries to
leverage Hive as an extra engine to perform metering data
collection and offline analysis.
However, different from Internet companies, traditional
enterprises have many legacy applications running on RDBMS.
Though new applications can be implemented on Hadoop/Hive environment, legacy ones still need to be smoothly migrated to Hive. As in the Smart Electricity Consumption
Information Collection System (SECICS) of the Zhejiang
Smart Grid, the business logic has already been implemented
by hundreds of stored procedures in RDBMS. These procedures have been accumulated for years and are usually
written by different developers. It would be a huge cost to
let engineers figure out the logic of each query which they
are probably not familiar with and re-implement them using
Hive. They will face great challenges during the course of
migrating:
¹ http://www.cloudera.com
• Hive can not fully support the SQL syntax at the moment. Statements such as UPDATE and DELETE
are not well supported, so the way to modify existing
files is overwriting them entirely (we will analyze the
support for UPDATE and DELETE in Hive 0.14 in
Section 2.3). Moreover, subqueries such as EXISTS
and IN are also not fully supported. Migrating these
queries to Hive is time consuming and the correctness
can hardly be guaranteed.
• Even if some SQL queries used in RDBMS can be
directly accepted by Hive, their performance might
be very low in the Hadoop ecosystem because of the
difference between the cost models of RDBMS and
MapReduce. SQL queries run in traditional centralized RDBMS are tuned according to the cost of physical operations, mainly involving CPU cycle and Disk
I/O costs. Whereas in MapReduce, due to the materialization of intermediate data and data shuffle between
mappers and reducers, disk I/O and network I/O factors dominate the performance. Successful migration
needs to consider cost model shift in addition to syntax
mapping.
In this paper, we introduce QMapper [21], an intelligent,
rather than a "direct", translator that enables automatic rule-based SQL-to-HiveQL mapping as well as cost-based optimization of the translated queries. We divide the translation
process of QMapper into three phases. For a given SQL
query, it is first parsed by the SQL Interpreter into the Query
Graph Model (QGM)[7]. Then, the Query Rewriter applies
translation rules to generate equivalent queries, and further
figures out the near-optimal join order for each equivalent
query variation based on our cost model. Finally, a Plan
Evaluator is used to identify the optimal rewritten
query. The specific contributions of this paper include:
• A tool QMapper is implemented to automatically translate SQL to optimized HiveQL. Our current work focuses on the optimization on the stack of Hive and
Hadoop. However, since the translation rules and cost
model are pluggable, QMapper can also be extended
to support HiveQL-compatible systems such as Shark
and Impala.
• A cost model is proposed to reflect the execution time
of MapReduce jobs. Our cost model combines the
costs of MapReduce framework and the costs of internal Hive operators. With our cost model, there is
no need to pre-run the corresponding query.
• An algorithm is designed to reorganize the join structure so as to construct the near-optimal query. It is
known that a bushy-plan can exploit multiple servers
by allowing joins to run in parallel but may potentially
increase the sizes of intermediate relations. Whereas, a
left-deep plan can minimize the intermediate relations
but can not exploit parallel resources[6]. In this paper,
our algorithm makes a trade-off between intermediate
data volume and job concurrency.
• Many experiments have been done based on TPC-H
workloads. The results demonstrate that queries optimized by QMapper outperform the queries manually
rewritten by experienced users. Moreover, a real application of QMapper for Smart Grid is reported.
The rest of this paper is organized as follows: In Section
2, we describe the big smart meter data problem, the motivation for migration and our solution in detail. Section 3 gives
an overview of QMapper. Section 4 introduces our rewriting rules and the cost-based optimization. QMapper’s cost
model and MapReduce workflow estimation are discussed in
Section 5. Section 6 briefly describes the implementations of
statistics collection and query evaluation. In Section 7, we
validate QMapper with extensive experiments. In Section
8, we briefly review the recent works related to our paper.
Finally, we summarize our work in Section 9.
2. BACKGROUND
In this section, we will first give an overview of the Smart
Electricity Consumption Information Collection System in
Zhejiang Grid, then discuss the difficulties of smoothly migrating legacy RDBMS applications to Hadoop platform.
2.1 Smart Electricity Consumption Information Collection System
Figure 1 shows the data flow and system architecture
of SECICS in Zhejiang Grid. Currently, 17 million smart
meters are deployed in Zhejiang Province to collect meter
data at a fixed frequency, for example, once every 15 minutes.
Once the collected data has been decoded by the Data Collection Server Cluster, it is written to a commercial
RDBMS, which is deployed on two high-end servers. The
total amount of data is 20TB, and about 30GB of new
data is added to the database every day.
There are mainly three kinds of data in SECICS:
• Meter data. Meter data is collected by smart meters. Because of the large number of smart meters and
the high collection frequency, its volume is very large.
Since meter data is a real measurement of the physical
world, it is rarely updated. The massive meter data
needs to be stored promptly; otherwise it may be overwritten by the next data flow.
• Archive data. Archive data records the detailed archived information of meter data, such as user and device
information of a smart meter. Compared with the meter data, archive data has much smaller scale and is
updated frequently.
• Statistic data. Statistic data is the result of offline
batch analysis. Its computation is based on meter data
and archive data. It is used for the report generation
and online query for staff in Zhejiang Grid.
SECICS needs to deeply analyze the data it collects to
support scenarios such as computing user electricity consumption, district line loss calculation, statistics of data
acquisition rates, terminal traffic statistics, exception handling, and fraud detection, etc. The offline batch analysis
logic currently consists of several stored procedures, each
of which contains tens of SQL statements. They are executed in fixed frequencies every day to calculate corresponding statistic data. As many queries in the stored procedures
involve both meter data and archive data, and usually contain multiple join operations, the performance of join significantly affects the efficiency of the business application.
[Figure 1: The System Migration of the Smart Electricity Consumption Information Collection System. Diagram labels: Data Collection Server Cluster; Collection; Meter Data; Archive Data; Statistic Data; OLTP; Offline Batch Analysis; Online Query; Stored Procedure 1 ... Stored Procedure n; Workflow 1 ... Workflow n; 100,000 lines of stored procedures migration; Statistic data ETL; Copy same table schema; Archive data synchronization.]
The middle part of Figure 1 shows the previous solution
of SECICS in Zhejiang Grid. The solution is based on
an RDBMS deployed on a high-performance commercial RDBMS cluster with 2*98 cores and an advanced storage system.
Careful optimization work had been done by database administrators and experts to guarantee that all the
tasks finish between 1am and 7am; otherwise
they would block business operations during working hours.
However, with the growing number of installed meters and
the increasing collection frequency, the previous solution encounters some bottlenecks.
2.2 System Improvement Requirements and Solution
The previous solution can not provide sustainable and
scalable infrastructure, mainly because:
• Low data write throughput. RDBMS with complex indexes can not provide enough write throughput,
which would consequently result in serious problem
when the data scale becomes larger and larger. The
arriving data would be dropped if the data previously
put in queue could not be stored in time.
• Unsatisfactory statistical analysis capability. Smart Grid business involves a large number of global statistical analysis tasks. They are implemented by SQL
stored procedures, each of which may need to execute
on whole big tables and perform complex join operations among them. Even with the current scale of
smart meters and collecting frequency, the tasks can
hardly be completed in time. For example, in order to
compute users’ electricity consumption, a query needs
to perform complex join operations on 5 tables, which
contain 60GB data in total. The average processing
time even reaches 3 to 4 hours.
• Weak scalability. The number of metering devices
has increased 30-fold over the past five years, and the collection frequency is still rising. The meter data
scale grows accordingly, and the system is expected to scale with it.
However, the scalability of the previous solution is
fairly weak. Scaling out an RDBMS mostly requires
redesigning the sharding strategies as well as a lot
of application logic, while scaling up means purchasing more expensive hardware, whose limits are
easily reached. Both cases require a great deal of human
labor and financial resources.
• Uncontrollable resource competition. As online
query processing and offline batch analysis tasks are
put on a single RDBMS, they will compete for computing resources. In the worst cases, they will incur
significant performance penalty.
Moreover, continuing to scale the RDBMS requires more powerful
hardware, which brings additional cost. The budget of
SGCC is another factor that must be taken into consideration.
In order to meet the requirements, we propose a new solution that leverages both Hadoop/Hive and RDBMS. It is
shown in the right part of Figure 1. In this solution, the meter data is directly written into HDFS instead of RDBMS,
which makes use of the high writing throughput capability
of HDFS. The archive data is still stored in RDBMS to support frequent CRUD operations. The offline batch analysis
logic is migrated to Hive. Before running each offline batch
analysis logic, the archive data used in the queries is copied
to HDFS via Sqoop (an open source ETL tool). After finishing the analysis, the resulting statistic data is written back
to RDBMS for online query. To make it easy for migration
and data sharing, the schema of both meter and archive data
remains the same in Hive.
The new solution takes advantage of both RDBMS and
Hadoop. The powerful OLTP and indexing abilities make RDBMS still more suitable for archive data management and online query processing. The good write throughput of HDFS
guarantees that the meter data can be written into the system in time. In addition, migrating offline batch analysis
tasks to Hadoop and Hive frees RDBMS from complicated
big data processing. It improves the statistical analysis performance significantly by making full use of the computing
capability of Hadoop ecosystem. Moreover, uncontrollable
resource competition in RDBMS is avoided. The most important thing is that the new solution provides good scalability. In a Hadoop/Hive cluster, we only need to add cheap
servers to scale out the system. Finally, since Hadoop/Hive
platforms are open source, it is cost effective. The budget
in the future can even be reduced as the heavy burden of
the RDBMS has been removed and we do not need to buy
extra hardware for it to cope with the growing meter data.
2.3 Smooth SQL Migration
When migrating applications from RDBMS to Hadoop/Hive, we first create tables for meter data and archive data in
Hive, each of which keeps the same schema used in RDBMS.
The constraints and indexes are ignored. Instead, we create
DGFIndex [15] (a cost-effective multi-dimensional index for
Hive that we developed for the application of Smart Grid)
for specific tables as needed to improve the reading efficiency
for some selective queries. Then, lots of stored procedures
need to be migrated/translated to Hive, which is the most
technically challenging problem in system migration.
[Figure 2: The migration of Stored Procedures. Diagram labels: Stored Procedure; SQL 1 ... SQL n; SQL Interpreter; Query Rewriter; HiveQL 1 ... HiveQL n; Sample Data; Correctness Validation (Result: Count, Sum, Avg, shown for both SQL and HiveQL); Workflow; Deploy.]
Currently, the offline batch analysis logic consists of more
than 200 stored procedures, more than 4000 statements in
RDBMS, adding up to around 100,000 lines of SQL code
in total. This code has been maintained for 5 years by a
team of 100 developers, who have done enormous optimization work on these SQL statements based on the cost model
of RDBMS. Since members of the team change frequently
and the documents may not be up to date with all the
modifications and improvements to the code, it is impossible
for us to find anyone who can fully explain even a single procedure.
As the business logic is really complex and the SQL developers are not familiar with Hadoop and Hive, it is not realistic
to re-implement and optimize this code manually. Thus, it is ideal to provide a fairly "automatic" way
to enable free translation, without deep concern about the
internal business logic.
However, when we try to migrate these codes to Hive, two
challenges come into being:
First, the stored procedures can not be run directly, as
Hive does not support full SQL, for example, UPDATE,
DELETE (before Hive 0.14), MERGE INTO, EXISTS subqueries (as of Hive 0.13, only top-level conjunctive EXISTS
subqueries are supported), etc. In fact, these statements occur
frequently in real engineering. For instance, the ratio of DML statements is very high, as they form more than 70% of the offline
batch analysis logic (detailed statistics can be found in [11]).
Although Hive 0.14, released in November 2014, only several days
before the submission of this paper, starts to support DML
operations, it still has some restrictions on file format, etc.
The UPDATE and DELETE operations must be performed
on ORC (Optimized RCFile) format files and the table must
be bucketed. Its performance and stability still need to be
validated in practice.
Second, since the cost model of Hive on Hadoop is different
from that of RDBMS, the statement optimized in RDBMS
might yield worse performance in Hive. It is also far from
practical to ask developers to manually rewrite one query
and enumerate all its variations. We can not ask developers
to compute cost for each candidate variation and choose
the best one by hand either. A new approach for selecting
optimal statement for a query is needed.
We observe that there are some regular mapping
patterns between SQL and HiveQL, so we propose an automatic translation tool named QMapper. It can accelerate
SQL migration process of legacy RDBMS applications and
avoid manual mistakes. Besides, one SQL statement may
generate several equivalent candidate HiveQL statements,
and QMapper can be used to choose the near-optimal one
based on our cost model.
Figure 2 shows the translation work in detail. We divide this work into three phases. In phase one, with QMapper, each SQL statement is translated into corresponding
HiveQL, and the near-optimal variation is selected automatically. Details about translation will be given in the rest of
this paper. In phase two, we validate the correctness of each
translated HiveQL based on sample data. We run count,
sum and avg queries on SQL and its corresponding HiveQL
separately. If both results are the same, this HiveQL is
considered to be correct. In phase three, we organize the
HiveQL queries and generate HiveQL script files, each file
corresponds to a stored procedure in RDBMS. If the script
files pass the correctness validation tests, they will be carefully organized and deployed as a workflow in Oozie. The
timing of running these script files is coordinated by Oozie.
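The phase-two check described above can be sketched as follows. This is only an illustration under assumptions: an in-memory SQLite database stands in for both the RDBMS and Hive, the table contents are made-up sample data, and the helper names `aggregate_fingerprint` and `same_result` are hypothetical rather than part of QMapper.

```python
import sqlite3

def aggregate_fingerprint(conn, query, numeric_cols):
    """Return (row_count, [(sum, avg), ...]) for the rows produced by `query`."""
    count = conn.execute(f"SELECT COUNT(*) FROM ({query})").fetchone()[0]
    stats = [
        conn.execute(f"SELECT SUM({col}), AVG({col}) FROM ({query})").fetchone()
        for col in numeric_cols
    ]
    return count, stats

def same_result(conn_sql, sql, conn_hive, hiveql, numeric_cols, tol=1e-9):
    """Compare count, sum and avg of two queries; True if they all agree."""
    count_a, stats_a = aggregate_fingerprint(conn_sql, sql, numeric_cols)
    count_b, stats_b = aggregate_fingerprint(conn_hive, hiveql, numeric_cols)
    if count_a != count_b:
        return False
    for pair_a, pair_b in zip(stats_a, stats_b):
        for x, y in zip(pair_a, pair_b):
            if (x is None) != (y is None):
                return False
            if x is not None and abs(x - y) > tol:
                return False
    return True

# Sample data; the same in-memory database plays both roles here (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lvRate(uid INTEGER, deviceid INTEGER, day TEXT);
CREATE TABLE powerCut(uid INTEGER, day TEXT);
INSERT INTO lvRate VALUES (1, 10, 'd1'), (2, 20, 'd1'), (3, 30, 'd1');
INSERT INTO powerCut VALUES (2, 'd1');
""")
original = """
SELECT a.uid AS uid, a.deviceid AS deviceid FROM lvRate a
WHERE NOT EXISTS (SELECT 1 FROM powerCut b
                  WHERE a.uid = b.uid AND a.day = b.day)"""
translated = """
SELECT a.uid AS uid, a.deviceid AS deviceid FROM lvRate a
LEFT OUTER JOIN powerCut b ON a.uid = b.uid AND a.day = b.day
WHERE b.uid IS NULL"""
broken = translated.replace("IS NULL", "IS NOT NULL")  # deliberately wrong

translation_ok = same_result(conn, original, conn, translated, ["uid", "deviceid"])
broken_detected = not same_result(conn, original, conn, broken, ["uid", "deviceid"])
print(translation_ok, broken_detected)
```

Matching aggregates do not prove that two queries are equivalent; a check of this kind only catches gross translation errors on the sample data.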
3. QMAPPER OVERVIEW
[Figure 3: QMapper Architecture Overview. Diagram labels: RDBMS (DBMS-X, DBMS-Y, DBMS-…); SQL; QMapper; SQL Interpreter; Query Graph Model; Query Rewriter (Rule-Based Rewriter, Cost-Based Optimizer); Statistics Collector (Dumping Collector, Background Collector); Statistics; Cardinality Estimation; PostgreSQL; Pg_stat; HiveQL Queries; Hive Compiler; MR Plan DAGs; Plan Evaluator; HiveQL; Hive; Impala; Shark.]
Figure 3 demonstrates the architecture of QMapper, which
contains four components:
SQL Interpreter : This component resolves the SQL query provided by a user and parses that query into an Abstract Syntax Tree (AST). Then a Tree Walker traverses the AST and further interprets it into a Query Graph
Model (QGM) [7], which is more convenient to manipulate. This interpretation process includes analyzing the
relationship between tables and subqueries, etc. After a
QGM is built, it will be sent to the rewriter.
Query Rewriter : The Query Rewriter is composed of
two phases. In the first phase, a Rule-Based Rewriter (RBR)
checks if a query matches a series of static rules. These
rules can be triggered to adjust that query and new equivalent queries will be generated. Then Cost-Based Optimizer
(CBO) is used to further optimize the join structure for each
query. In this phase, cardinality estimation in PostgreSQL
is involved to calculate the cost. The cost model is discussed
in Section 5.
Statistics Collector : The Statistics Collector is responsible for collecting statistics of related tables and their columns. This information is important for estimating the cardinality used to calculate the cost. The collecting approach
is discussed in Section 6.1.
Plan Evaluator : The Plan Evaluator is the final stage in
QMapper. The queries with equivalent join cost generated
by RBR will be sent to it. It takes into account non-join
operators like UNION ALL, GROUP BY, etc. to distinguish
them. For each query, the Plan Evaluator first invokes
Hive to generate the MapReduce workflows, then uses PostgreSQL’s cardinality estimation component (Pg_stat, shown
in Figure 3, is a system table used for storing column statistics) to estimate the data selectivity of each operator in map
and reduce. Finally, our cost model is applied to the MapReduce workflows to calculate their cost and a query with the
minimal cost is returned to the user.
4. QUERY REWRITING
Query rewriting in QMapper aims at generating more
equivalent candidate HiveQL queries and increasing the probability of triggering more optimization mapping rules. The
more candidates are available, the more chances to find the
optimal solution. In QMapper, the rewriting rule itself is
pluggable to make it easy to extend. Moreover, since query
rewriting is performed outside Hive, it can also work together with MapReduce level optimizers like YSmart [12].
Consider the following example which retrieves all Shipments information of the Parts stored in Hangzhou or provided by Suppliers in the same city.
SELECT * FROM Shipments SP WHERE EXISTS ( SELECT 1
FROM Suppliers S WHERE S.SNO=SP.SNO AND S.city =
'Hangzhou') OR EXISTS ( SELECT 1 FROM Parts P WHERE
SP.PNO=P.PNO AND P.city='Hangzhou')
QMapper will generate a variety of versions of HiveQL that
can yield the same results. One version is to divide the
two disjunctive predicates into separate SEMI JOINs and
then perform a UNION ALL. Another version is to transform each EXISTS into a LEFT OUTER JOIN and replace the EXISTS predicate
with joinColumn IS NOT NULL. The first one uses SEMI
JOIN so that it can remove unnecessary tuples earlier than
OUTER JOIN. However, a tuple may satisfy both predicates
at the same time, which would introduce duplicates after the
UNION ALL. Thus, an additional operation can be added
by QMapper to eliminate the duplicates.
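The two rewritings and the duplicate problem can be reproduced on a toy data set. The sketch below is an assumption-laden illustration, not QMapper's output: SQLite stands in for Hive, the SEMI JOIN branches are emulated with single EXISTS filters because SQLite has no LEFT SEMI JOIN, the sample rows are invented, and plain UNION is used as one possible duplicate-elimination step (which is only safe if Shipments itself contains no duplicate rows).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Suppliers(SNO INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE Parts(PNO INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE Shipments(SNO INTEGER, PNO INTEGER, qty INTEGER);
INSERT INTO Suppliers VALUES (1, 'Hangzhou'), (2, 'Beijing');
INSERT INTO Parts VALUES (10, 'Hangzhou'), (20, 'Shanghai');
INSERT INTO Shipments VALUES (1, 10, 5), (1, 20, 7), (2, 10, 3), (2, 20, 9);
""")

# Each branch keeps the shipments matched by one of the two predicates.
BY_SUPPLIER = """SELECT SP.* FROM Shipments SP WHERE EXISTS (
    SELECT 1 FROM Suppliers S WHERE S.SNO = SP.SNO AND S.city = 'Hangzhou')"""
BY_PART = """SELECT SP.* FROM Shipments SP WHERE EXISTS (
    SELECT 1 FROM Parts P WHERE SP.PNO = P.PNO AND P.city = 'Hangzhou')"""

original = """SELECT * FROM Shipments SP WHERE EXISTS (
    SELECT 1 FROM Suppliers S WHERE S.SNO = SP.SNO AND S.city = 'Hangzhou')
  OR EXISTS (
    SELECT 1 FROM Parts P WHERE SP.PNO = P.PNO AND P.city = 'Hangzhou')"""
union_all = BY_SUPPLIER + " UNION ALL " + BY_PART   # may contain duplicates
union_dedup = BY_SUPPLIER + " UNION " + BY_PART     # duplicates removed
outer_join = """SELECT SP.* FROM Shipments SP
  LEFT OUTER JOIN (SELECT SNO FROM Suppliers WHERE city = 'Hangzhou') S
    ON SP.SNO = S.SNO
  LEFT OUTER JOIN (SELECT PNO FROM Parts WHERE city = 'Hangzhou') P
    ON SP.PNO = P.PNO
  WHERE S.SNO IS NOT NULL OR P.PNO IS NOT NULL"""

def rows(query):
    """Run a query and return its rows in a canonical (sorted) order."""
    return sorted(conn.execute(query).fetchall())

print(rows(original) == rows(outer_join) == rows(union_dedup))
print(len(rows(union_all)) - len(rows(original)))  # extra duplicate rows
```

The outer-join variant stays duplicate-free here only because SNO and PNO are unique keys in the joined tables; with non-unique join keys it would need a de-duplication step as well.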
Next, we first introduce the rewriting rules of RBR and
then describe the mechanism of CBO in the Query Rewriter.
4.1 Rule-based Rewriter
Rule-based Rewriter tries to detect the SQL clauses that
are not supported well by Hive and transform them into
HiveQL. As we find the regular mapping patterns between
SQL and HiveQL, we summarize them into different rules.
In the translation process, some initial rules are first invoked
to check if the query can be rewritten. If the condition is
satisfied, the rule will be applied to the query and some
new equivalent queries will be returned. Then the RBR
will traverse the subqueries of each query and apply rules
to them recursively. After that, all rewritten queries are
generated and sent to the CBO. Here, we only introduce
some typical rules, like UPDATE and (NOT)EXISTS, to
demonstrate how RBR works.
lvRate(uid,deviceid,isMissing,date,type)
dataProfile(dataid,uid,isActive)
dataRecord(dataid,date,consumption)
powerCut(uid,date)
gprsUsage(deviceid,dataid,date,gprs)
deviceInfo(deviceid,region,type)
The above tables are designed according to SECICS’s real
scenarios, and they will be used to explain the rules. Table
lvRate is calculated on a daily basis and can be looked up
to check whether a device should re-collect the consumption information or to indicate a sensor malfunction. dataRecord
stores the consumption data uploaded from sensors, and
dataProfile is the mapping between low-voltage customers
and sensors’ data. Moreover, powerCut records the power-cut information. If some users experience a power cut,
then their data should not be collected.
4.1.1 Basic UPDATE Rule
Trigger Pattern : UPDATE table SET column = CONSTANT, . . . LEFT OUTER JOIN table ON condition, . . .
WHERE simpleCondition, . . .
Example : UPDATE lvRate a SET a.isMissing=true
LEFT OUTER JOIN dataProfile b ON a.uid=b.uid
LEFT OUTER JOIN dataRecord c on b.dataid=c.dataid
AND a.date=c.date WHERE c.dataid IS NULL
Description : The above example updates lvRate.isMissing
to true if there is no corresponding record
stored in dataRecord, which means the consumption data on
that day was not collected because of some failure. Setting
isMissing to true will let the device re-upload its data.
This rule translates the UPDATE into a SELECT statement by
moving the simpleCondition into the selectList. In this case, the
output of this rule will be:
INSERT OVERWRITE TABLE lvRate
SELECT a.uid,a.deviceid,
IF(c.dataid IS NULL,true,false) as isMissing
,a.date,a.type FROM lvRate a
LEFT OUTER JOIN dataProfile b ON a.uid=b.uid
LEFT OUTER JOIN dataRecord c ON b.dataid=c.dataid
AND a.date=c.date
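The effect of this rule can be checked on sample data. The following sketch makes several assumptions that are not in the paper: SQLite stands in for both engines, the UPDATE is expressed with a correlated NOT EXISTS because SQLite does not accept the UPDATE ... LEFT OUTER JOIN form, Hive's IF is written as CASE, and the rows are invented. It also assumes that every isMissing flag starts as false and that each uid has at most one profile; otherwise the rewritten SELECT could reset flags or multiply rows, and the two forms would no longer agree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lvRate(uid INTEGER, deviceid INTEGER, isMissing INTEGER,
                    date TEXT, type TEXT);
CREATE TABLE dataProfile(dataid INTEGER, uid INTEGER, isActive INTEGER);
CREATE TABLE dataRecord(dataid INTEGER, date TEXT, consumption REAL);
INSERT INTO lvRate VALUES (1, 100, 0, 'd1', 'A'), (2, 200, 0, 'd1', 'A'),
                          (3, 300, 0, 'd1', 'B');
INSERT INTO dataProfile VALUES (11, 1, 1), (22, 2, 1);
INSERT INTO dataRecord VALUES (11, 'd1', 5.0);
""")

# SELECT part of the rewritten INSERT OVERWRITE (CASE replaces Hive's IF).
REWRITTEN = """
SELECT a.uid, a.deviceid,
       CASE WHEN c.dataid IS NULL THEN 1 ELSE 0 END AS isMissing,
       a.date, a.type
FROM lvRate a
LEFT OUTER JOIN dataProfile b ON a.uid = b.uid
LEFT OUTER JOIN dataRecord c ON b.dataid = c.dataid AND a.date = c.date"""

# The original UPDATE, restated in a form SQLite accepts (assumption).
UPDATE = """
UPDATE lvRate SET isMissing = 1
WHERE NOT EXISTS (
    SELECT 1 FROM dataProfile b JOIN dataRecord c ON b.dataid = c.dataid
    WHERE b.uid = lvRate.uid AND c.date = lvRate.date)"""

# Evaluate the rewritten SELECT first, then apply the UPDATE in place.
overwrite_rows = sorted(conn.execute(REWRITTEN).fetchall())
conn.execute(UPDATE)
updated_rows = sorted(conn.execute(
    "SELECT uid, deviceid, isMissing, date, type FROM lvRate").fetchall())
print(overwrite_rows == updated_rows)
```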
4.1.2 (NOT) EXISTS Rule
Trigger Pattern : SELECT selectList FROM table JOIN table ON condition, . . . WHERE simpleCondition, . . .
Example : DELETE FROM lvRate a WHERE NOT EXISTS (
SELECT 1 FROM powerCut b WHERE a.uid=b.uid
AND a.date=b.date )
Description : The simpleCondition includes EXISTS subqueries and comparisons between columns and constants
that are connected by conjunctions. Since (NOT) EXISTS
is not fully supported by HiveQL, this rule is applied after initial rules to flatten the subqueries into JOIN clauses.
As shown in the example demonstrated at the beginning of
Section 4, there are two approaches to flatten an EXISTS subquery: 1. Transforming the subquery into a SEMI JOIN and
then extracting the JOIN condition from the WHERE clause in
the subquery. This approach only works for EXISTS clause.
2. The second approach can be applied to both EXISTS and
NOT EXISTS subqueries. It transforms the subquery into
a LEFT OUTER JOIN and replaces an EXISTS condition with joinColumn IS NOT NULL, or a NOT EXISTS condition with joinColumn IS NULL. In this example, the
second approach is used to generate:
INSERT OVERWRITE TABLE lvRate
SELECT a.uid,a.deviceid,a.isMissing,a.date,a.type
FROM lvRate a LEFT OUTER JOIN ( SELECT uid,date
FROM powerCut) b ON a.uid=b.uid AND a.date=b.date
WHERE b.uid IS NULL
To make the rule easier to understand, the original subquery style is kept in the new JOIN clause even if
there is only one table. This does not harm performance,
since Hive’s internal optimizer can further optimize it.
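The predicate mapping used by the second approach can be verified on a small data set. In this sketch SQLite stands in for Hive, the rows are invented, and powerCut is assumed to hold at most one row per (uid, date), so the outer join cannot multiply rows of lvRate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lvRate(uid INTEGER, deviceid INTEGER, isMissing INTEGER,
                    date TEXT, type TEXT);
CREATE TABLE powerCut(uid INTEGER, date TEXT);
INSERT INTO lvRate VALUES (1, 100, 0, 'd1', 'A'), (2, 200, 0, 'd1', 'A'),
                          (3, 300, 1, 'd1', 'B');
INSERT INTO powerCut VALUES (2, 'd1');
""")

NOT_EXISTS = """SELECT a.* FROM lvRate a WHERE NOT EXISTS (
    SELECT 1 FROM powerCut b WHERE a.uid = b.uid AND a.date = b.date)"""
# Flattened form; {null_test} selects the polarity of the join-column test.
JOINED = """SELECT a.uid, a.deviceid, a.isMissing, a.date, a.type
    FROM lvRate a LEFT OUTER JOIN (SELECT uid, date FROM powerCut) b
    ON a.uid = b.uid AND a.date = b.date WHERE b.uid IS {null_test}"""

def rows(query):
    """Run a query and return its rows in a canonical (sorted) order."""
    return sorted(conn.execute(query).fetchall())

not_exists_rows = rows(NOT_EXISTS)
anti_join_rows = rows(JOINED.format(null_test="NULL"))       # NOT EXISTS
match_join_rows = rows(JOINED.format(null_test="NOT NULL"))  # EXISTS

# Rows that survive a DELETE are the complement of the DELETE predicate.
conn.execute("""DELETE FROM lvRate WHERE NOT EXISTS (
    SELECT 1 FROM powerCut b
    WHERE lvRate.uid = b.uid AND lvRate.date = b.date)""")
kept_after_delete = rows("SELECT * FROM lvRate")

print(not_exists_rows == anti_join_rows, kept_after_delete == match_join_rows)
```

When the predicate comes from a DELETE, the rows written back by INSERT OVERWRITE are the ones the DELETE keeps, that is, the complement of its predicate, so it is worth checking which polarity the rewritten WHERE clause needs.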
4.2 Cost-based Optimizer
[Figure 4: Join Plan Example. (a) Left-deep Plan; (b) Bushy Plan. Both plans join tables A, B, C, D and E.]
As the cost models of Hive and RDBMS are different, an
efficient SQL query may not achieve high performance in Hive if
we only translate it directly into HiveQL. QMapper’s Cost-based Optimizer is used to optimize the join order of a query.
Consider the following example, which tries to get the daily
GPRS usage of normally running devices in a region according
to the device type:
SELECT sum(gprs), type FROM gprsUsage A
JOIN deviceInfo B ON A.deviceid = B.deviceid
JOIN dataRecord C ON A.dataid = C.dataid
AND A.date = C.date
JOIN dataProfile D ON C.dataid = D.dataid
LEFT OUTER JOIN powerCut E ON D.uid = E.uid
AND A.date = E.date
WHERE E.uid IS NULL AND A.date=’2014-01-01’
GROUP BY B.type
A left-deep tree plan is applied for this query just as Figure
4(a) shows. One goal of the optimization here is to reduce
the execution time by running jobs concurrently. QMapper
can adjust the join order by rewriting the query like this:
SELECT sum(gprs), type FROM(
SELECT T1.gprs, T1.date, T1.type, T2.uid FROM
(SELECT A.gprs, A.dataid, A.date, B.type FROM
gprsUsage A
JOIN deviceInfo B ON A.deviceid = B.deviceid
WHERE A.date=’2014-01-01’
)T1
JOIN (
SELECT C.dataid, C.date, D.uid FROM
dataRecord C JOIN dataProfile D
ON C.dataid = D.dataid
)T2 ON T1.dataid = T2.dataid
AND T1.date = T2.date
)T
LEFT OUTER JOIN powerCut E
ON T.uid = E.uid AND T.date = E.date
WHERE E.uid IS NULL GROUP BY type
Now, an equivalent bushy plan is generated, just as Figure
4(b) shows. Different from traditional databases, MapReduce-based query processing will write join intermediate results
back to HDFS and the next join operation will read it from
HDFS too, causing big I/O costs. So, another significant
goal for join optimization in MapReduce is to reduce the
size of intermediate results. Comparing the above left-deep
plan A ⋈ B ⋈ C ⋈ D ⋈ E with bushy plan (A ⋈ B) ⋈
(C ⋈ D) ⋈ E, the main difference in intermediate results is
that the left-deep plan generates A ⋈ B ⋈ C and the bushy
plan generates C ⋈ D. Thus, the sizes of A ⋈ B ⋈ C and
C ⋈ D will be important for comparing the two plans. On
the other hand, in the bushy tree plan, A ⋈ B and C ⋈ D
may execute concurrently, reducing the total executing time.
So, concurrent jobs should also be taken into consideration.
QMapper’s CBO will evaluate these join plans according to
our cost model and choose the best one for the query.
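That the left-deep and bushy formulations return the same answer can be checked on sample data. The sketch below runs both queries from this section in SQLite as a stand-in for Hive; the rows are invented, and explicit column aliases are added inside the subqueries so that the outer references resolve unambiguously.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gprsUsage(deviceid INTEGER, dataid INTEGER, date TEXT, gprs INTEGER);
CREATE TABLE deviceInfo(deviceid INTEGER, region TEXT, type TEXT);
CREATE TABLE dataRecord(dataid INTEGER, date TEXT, consumption REAL);
CREATE TABLE dataProfile(dataid INTEGER, uid INTEGER, isActive INTEGER);
CREATE TABLE powerCut(uid INTEGER, date TEXT);
INSERT INTO gprsUsage VALUES (1, 11, '2014-01-01', 5), (2, 22, '2014-01-01', 7),
                             (3, 33, '2014-01-01', 4), (1, 11, '2014-01-02', 9);
INSERT INTO deviceInfo VALUES (1, 'r1', 'X'), (2, 'r1', 'Y'), (3, 'r2', 'X');
INSERT INTO dataRecord VALUES (11, '2014-01-01', 1.0), (22, '2014-01-01', 2.0),
                              (33, '2014-01-01', 3.0);
INSERT INTO dataProfile VALUES (11, 100, 1), (22, 200, 1), (33, 300, 1);
INSERT INTO powerCut VALUES (200, '2014-01-01');
""")

LEFT_DEEP = """
SELECT sum(gprs), type FROM gprsUsage A
JOIN deviceInfo B ON A.deviceid = B.deviceid
JOIN dataRecord C ON A.dataid = C.dataid AND A.date = C.date
JOIN dataProfile D ON C.dataid = D.dataid
LEFT OUTER JOIN powerCut E ON D.uid = E.uid AND A.date = E.date
WHERE E.uid IS NULL AND A.date = '2014-01-01'
GROUP BY B.type"""

BUSHY = """
SELECT sum(gprs), type FROM (
  SELECT T1.gprs AS gprs, T1.date AS date, T1.type AS type, T2.uid AS uid FROM
    (SELECT A.gprs AS gprs, A.dataid AS dataid, A.date AS date, B.type AS type
     FROM gprsUsage A JOIN deviceInfo B ON A.deviceid = B.deviceid
     WHERE A.date = '2014-01-01') T1
  JOIN
    (SELECT C.dataid AS dataid, C.date AS date, D.uid AS uid
     FROM dataRecord C JOIN dataProfile D ON C.dataid = D.dataid) T2
  ON T1.dataid = T2.dataid AND T1.date = T2.date
) T
LEFT OUTER JOIN powerCut E ON T.uid = E.uid AND T.date = E.date
WHERE E.uid IS NULL GROUP BY type"""

left_deep_rows = sorted(conn.execute(LEFT_DEEP).fetchall())
bushy_rows = sorted(conn.execute(BUSHY).fetchall())
print(left_deep_rows == bushy_rows)
```

The check says nothing about cost; it only confirms that reorganizing the join structure preserves the result on this data.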
In the query optimization, both left-deep trees and bushy
trees are explored, as shown in Figure 4. In left-deep trees, intermediate results are pushed to join with a
base table until all join tables are covered. Because the
join operations proceed step by step, we can add up the cost of
each join step to get the final cost of the plan. However, this
may not yield the optimal plan, as computing resources may
not be fully exploited. In bushy trees, intermediate results
can join with each other, so join operations can be executed
concurrently. Concurrent jobs may seem more
efficient, but they can lead to worse performance as jobs
compete for computing resources. We believe that it is
enough to get a near-optimal plan by considering left-deep
plans and bushy plans. So, we establish a cost model to
evaluate them and choose the one with minimum cost.
We use a bottom-up method to construct the join tree. In
order to prune the search space, a dynamic programming
algorithm is used to get the best join plan for a query. The
cost model applied by the algorithm will be introduced in
Section 5. The input is a query with its join tables and their
relations. Initially, the tables involved are recorded in the
query plan as base tables. Then, the number of joined tables
gradually increases from two to all, and we build left-deep
and bushy plans for each set of tables. Meanwhile, we compare
the plans by our cost model and continually prune the inferior
ones. In the end, the best plan for all join tables is
returned as the result.
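The bottom-up enumeration can be sketched as follows. This is an illustration under stated assumptions rather than QMapper's implementation: the function name `best_join_plan`, the per-pair `selectivity` map, and the toy cost (the number of rows materialized by all joins so far) are all hypothetical stand-ins, and the cost model of Section 5 would take the place of the toy cost. The join graph is assumed to be connected, and cross products are skipped.

```python
from itertools import combinations


def best_join_plan(sizes, selectivity):
    """Bottom-up dynamic programming over sets of joined tables.

    sizes:       table name -> estimated number of rows
    selectivity: frozenset({a, b}) -> selectivity of the join predicate on a and b
    Returns (cost, rows, plan), where plan is a nested tuple of table names.
    """
    names = sorted(sizes)
    # best[S] holds the cheapest plan found so far for the table set S.
    best = {frozenset([t]): (0.0, float(sizes[t]), t) for t in names}

    for k in range(2, len(names) + 1):        # grow from two tables to all
        for subset in combinations(names, k):
            target = frozenset(subset)
            # r == 1 joins one base table with the rest;
            # r >= 2 joins two intermediate results (bushy).
            for r in range(1, k // 2 + 1):
                for left_names in combinations(subset, r):
                    left = frozenset(left_names)
                    right = target - left
                    if left not in best or right not in best:
                        continue
                    # Combine the selectivities of all predicates linking both sides.
                    sel, connected = 1.0, False
                    for a in left:
                        for b in right:
                            edge = frozenset([a, b])
                            if edge in selectivity:
                                sel *= selectivity[edge]
                                connected = True
                    if not connected:         # skip cross products
                        continue
                    l_cost, l_rows, l_plan = best[left]
                    r_cost, r_rows, r_plan = best[right]
                    rows = l_rows * r_rows * sel
                    cost = l_cost + r_cost + rows   # toy cost, not the Section 5 model
                    # Keep only the cheapest plan per table set (pruning).
                    if target not in best or cost < best[target][0]:
                        best[target] = (cost, rows, (l_plan, r_plan))
    return best[frozenset(names)]


# Made-up statistics for a chain query A - B - C - D.
sizes = {"A": 1000, "B": 1000, "C": 1000, "D": 1000}
selectivity = {
    frozenset("AB"): 0.001,
    frozenset("BC"): 0.01,
    frozenset("CD"): 0.001,
}
cost, rows, plan = best_join_plan(sizes, selectivity)
```

With these made-up numbers the selective joins A ⋈ B and C ⋈ D are evaluated first and then joined with each other, so the chosen plan is bushy. Keeping a single best plan per table set is what prunes the inferior alternatives.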
5. COST MODEL
QMapper’s cost model is inspired by [8], which is used by
[9] for tuning MapReduce parameters. However, that cost
model cannot be directly applied to QMapper, since it tries
to simulate the whole process of the MapReduce framework and
some parameters used in the model are highly related to
each specific task, such as the CPU cost for compressing
the output per byte. As a result, sampling is required to
gather the parameters for each task. In QMapper, the cost
model is also designed to capture the cost of each phase in
MapReduce. We focus on the cost of Hive’s operators and
the most time-consuming factors such as I/O operations.
By filling the parameters with data collected by probes and
the estimated cardinality, the cost model can be universally
applied.

Table 1: Parameters of QMapper’s Cost Model
Symbol               Description
$Disk_r$             Disk reading speed (bytes/second)
$Disk_w$             Disk writing speed (bytes/second)
$HDFS_w$             HDFS writing speed (bytes/second)
$M_{local}$          Map Local Ratio
$M_{sp\_size}$       Map Split Size (bytes)
$M_{out.rec}$        Number of Map Output Records
$M_{out.avgbytes}$   Average Bytes of Map Output Records
$R_{out.rec}$        Number of Reduce Output Records
$R_{out.avgbytes}$   Average Bytes of Reduce Output Records
$Network$            Data Transfer Speed via Network (bytes/second)
$N_{map}$            Estimated Number of mappers
$N_{m\_max}$         Maximum Number of mappers
$N_{reduce}$         Estimated Number of reducers
$N_{r\_max}$         Maximum Number of reducers

It is well known that disk and network I/O costs are the main
reasons that slow down MapReduce tasks. But there is no data
indicating how and to what extent the I/O costs affect each
phase of MapReduce. We add a few counters to the MapReduce
framework in order to find the time-consuming factors. Figure 5
shows the detailed time usage of one representative MapReduce
job generated by Hive. We run the job on our cluster and collect
the time usage in each phase. It is obvious that Hive’s
operators, together with I/O dominated phases such as Merge and
Shuffle, take up more than 80% of the execution time. In this
figure, other costs include sorting during the Merge phase and
data serialization as well as deserialization in map and reduce
functions.

[Figure 5: Execution Time of Each MapReduce Phase. Stacked bars of time usage (seconds) for the Map and Reduce sides of one job. Legend entries: Map Read, Map Operators, Map Other, Spill Sort, Spill Write, Merge Read, Merge Write, Shuffle, Merge Sort, Reduce Operators, Reduce Write, Reduce Other.]

In this section, we first introduce the cost model for
evaluating a single MapReduce job. Then, the approach of
estimating Hive operators’ costs is discussed. Finally, we
describe how to evaluate the cost of MapReduce workflows.

5.1 Cost of MapReduce
MapReduce’s programming model is abstracted as two
parts, Map and Reduce. The Map phase can be divided
into three subphases, which are Map, Spill and Merge.
The Reduce phase also includes three parts, Shuffle, Merge
and Reduce. We will analyze the cost in each phase and
form the overall cost of one MapReduce job.
In the Map phase, the mappers read data from DFS and process
them. For the reading part, Hadoop tries to assign each
mapper to the node where the input data is stored, and
each mapper processes $M_{sp\_size}$ bytes of data concurrently.
We can model this part by estimating the time consumption
of one mapper. The expected cost for reading data will be:

$$Cost(M_{read}) = M_{local} \times \frac{M_{sp\_size}}{Disk_r} + (1 - M_{local}) \times \frac{M_{sp\_size}}{Network}$$

So the total cost of map phase is:

$$Cost(M_{map}) = Cost(M_{read}) + Cost(M_{ops})$$

where $Cost(M_{ops})$ denotes the time spent on processing
data, which will be discussed in Section 5.2.
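As a worked instance of the read-cost formula, the sketch below plugs in the cluster values that the paper reports in Table 2 (disk read speed, network speed, map local ratio and split size). The function name `map_read_cost` and its parameter names are chosen here for illustration only.

```python
def map_read_cost(split_bytes, local_ratio, disk_read_bps, network_bps):
    """Expected seconds for one mapper to read its input split.

    A fraction `local_ratio` of the data is read from local disk,
    the remainder is fetched over the network.
    """
    local = local_ratio * split_bytes / disk_read_bps
    remote = (1.0 - local_ratio) * split_bytes / network_bps
    return local + remote


# Values as listed in Table 2 of the paper.
read_cost = map_read_cost(
    split_bytes=67_108_864,      # M_sp_size, dfs.block.size
    local_ratio=0.67,            # M_local
    disk_read_bps=66_584_576,    # Disk_r
    network_bps=44_040_192,      # Network
)
```

With these inputs a mapper is expected to spend a little over one second reading its 64 MB split, and roughly 40% of that time comes from the one third of the data that is read remotely.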
After the (key, value) pairs are partitioned and collected
into a memory buffer, a spill-thread may simultaneously
clear the buffer and write the records into a Segment file if
the size of data stored in memory exceeds the maximum of
either accounting buffer (storing metadata for each record)
or key/value buffer. Even if the memory buffer has enough
space to store the data, the spill-thread will materialize them
into a single file before the merge phase. Thus, the writing
cost for spill phase can be simply estimated as:
$$Cost(M_{spill}) = \frac{M_{out.rec} \times M_{out.avgbytes}}{Disk_w}$$
The spill phase may generate multiple Segment files and
the goal of merge phase is to sort and merge these Segment
files into a single file. If there is only one Segment, merge
phase is bypassed since it can be shuffled to reducer directly.
This phase performs an external merge-sort, thus it may
need multiple rounds to generate the final output. Here we
simplify the processing logic and assume the merge phase
being done in a single round. The cost of merge phase can
be simplified into:
$$Cost(M_{merge}) = \frac{M_{out.rec} \times M_{out.avgbytes}}{Disk_r} + \frac{M_{out.rec} \times M_{out.avgbytes}}{Disk_w}$$
Different from normal MapReduce jobs, in Hive, the internal logic of mappers may vary depending on the specific
table to be processed. Thus, the costs of processing each
table are evaluated independently. Suppose the number of
input tables is n, the total cost of Map can be evaluated by
the mapper that takes the longest time:

$$Cost(M) = \max_{1 \le i \le n} \{ Cost(M^i_{map}) + Cost(M^i_{spill}) + Cost(M^i_{merge}) \}$$
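The map-side terms can be combined in a few lines. The sketch below is only an illustration: the `TableMapper` record and the function names are hypothetical, the per-table value of $Cost(M_{map})$ is taken as a given input, and the disk speeds are the ones the paper lists in Table 2. It follows the simplification in the text that merging takes a single round.

```python
from dataclasses import dataclass

DISK_R = 66_584_576  # Disk_r in bytes/second (Table 2 of the paper)
DISK_W = 61_027_123  # Disk_w in bytes/second (Table 2 of the paper)


@dataclass
class TableMapper:
    """Per-table mapper statistics (hypothetical container for this sketch)."""
    map_cost: float      # Cost(M_map): read plus operator time, in seconds
    out_rec: int         # M_out.rec
    out_avgbytes: float  # M_out.avgbytes


def spill_cost(t):
    # All map output is written to local disk once.
    return t.out_rec * t.out_avgbytes / DISK_W


def merge_cost(t):
    # Single-round merge: read the segments once and write the merged file once.
    out_bytes = t.out_rec * t.out_avgbytes
    return out_bytes / DISK_R + out_bytes / DISK_W


def map_side_cost(mappers):
    # The slowest kind of mapper determines the cost of the whole Map side.
    return max(t.map_cost + spill_cost(t) + merge_cost(t) for t in mappers)


small = TableMapper(map_cost=2.0, out_rec=100_000, out_avgbytes=100)
big = TableMapper(map_cost=1.5, out_rec=1_000_000, out_avgbytes=100)
total_map_cost = map_side_cost([small, big])
```

In this made-up example the table with the larger map output dominates even though its mappers spend less time reading and processing, because spilling and merging 100 MB of output outweighs the difference.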
In the reduce phase, shuffle is responsible for fetching
mappers’ outputs to their corresponding reducers. Here
we treat the input tables as a whole. For simplicity, we
assume the output data is evenly distributed, thus the total data size each reducer receives from Segment files is

$$Seg_{r.size} = \frac{\sum_{i=1}^{n} (M^i_{out.rec} \times M^i_{out.avgbytes} \times N^i_{map})}{N_{reduce}}$$

where n is the number of input tables. So, the network cost of
the shuffle phase is:

$$Cost(R_{shuffle}) = \frac{Seg_{r.size}}{Network}$$

The merge phase is executed concurrently with shuffle.
When the reduce memory buffer reaches its threshold, the
merge threads will be triggered. Similar to the merge phase
on the Map side, the merge phase of Reduce is applied to sort
and merge the Segment files from the Map tasks. The difference is that some Segment files may have already been
merged during the shuffle phase. Here, we also assume that
it only needs one single round to merge the Segment files
into the final one. Thus, the cost of this phase can be simplified as:
$$Cost(R_{merge}) = \frac{Seg_{r.size}}{Disk_r} + \frac{Seg_{r.size}}{Disk_w}$$

The cost of the reduce function includes Hive operators’ costs
(discussed in Section 5.2) and writing costs. In Hive, the
writing part is handled by FileSinkOperator, which directly
writes data to HDFS. So the cost will be:

$$Cost(R_{reduce}) = Cost(R_{ops}) + \frac{R_{out.avgbytes} \times R_{out.rec}}{HDFS_w}$$

5.1.1 Total Cost
Note that the cost model does not consider the costs and
effects of Combiner in spill and merge phases, since Hive
does not use Combiner for data processing. The total cost
of the whole MapReduce job will be the sum of the costs discussed
above. In the cases that the required number of mappers or
reducers exceeds the cluster’s capacity, some mappers and
reducers have to wait, and $P_m = \frac{N_{map}}{N_{m\_max}}$ and
$P_r = \frac{N_{reduce}}{N_{r\_max}}$ are in turn used as punishment
coefficients. Thus, the total cost is:

$$Cost(Job) = P_m \times Cost(M) + P_r \times \{Cost(R_{shuffle}) + Cost(R_{merge}) + Cost(R_{reduce})\}$$

5.2 Cost of Operators in Map and Reduce
Hive embeds the operators in map and reduce functions
and the records are processed by each operator iteratively.
The cost of each operator is considered as CPU cost. In
order to calculate the costs, a few sample queries based on
TPC-H are designed as probes to collect the execution time
of operators such as FilterOperator, JoinOperator and
GroupByOperator. Also, since we have added counters in
Hive’s operators, these statistics can be updated according
to logs. The cost of FileSinkOperator is ignored, since its
responsibility is to write output data to HDFS, whose I/O
cost is already accounted for in Section 5.1. Because of the
variety of operators’ internal implementation, it is hard to
precisely estimate the time consumption of each operator
with different input data and filters. As we observe in real
applications, the time usage of a Hive operator might increase almost linearly as the amount of input records grows.
Thus, we treat each operator as a black box whose cost is defined as the execution time of processing a given amount
of records. Since there may be multiple tables as input,
$N^j_{tab.rec}$ is used to denote the number of records of table j.
For an operator, the Linear-Regression approach is used to
build the cost function, represented as $f(\sum_{j=1}^{m} N^j_{tab.rec})$.
After calculating the cost function of operators through
analyzing logs of sample queries, given a chain with n operators, the cost is evaluated as:

$$\sum_{i=1}^{n} f_i\left(\sum_{j=1}^{m} N^j_{tab.rec}\right)$$

Notice that, in each iteration i, the internal m and table parameters may be different due to the effects of JoinOperator
and FilterOperator. The approach of how to calculate these
parameters is discussed in Section 6.2.

[Figure 6: MapReduce Workflow Depth. Example workflow whose jobs (Stage-1, Stage-2, Stage-3, Stage-4, Stage-5, Stage-6, Stage-24) are grouped into Depth 1 to Depth 4.]

5.3 Cost of Workflow
A HiveQL query is finally compiled to MapReduce workflows (a directed acyclic graph) where each node is a single
MapReduce job and each edge represents the dataflow. As
the example in Figure 6 shows, QMapper groups MapReduce jobs (each stage stands for a MapReduce job) and task
chains that can be executed concurrently within the same
depth. Here, we just assume the resources are enough for
concurrent jobs. Since the cost of each job is already known,
the summation of each depth’s cost will be the total cost of
a query:

$$Cost(MR_{workflow}) = \sum_{i=1}^{n} Cost(Depth_i)$$

where $Cost(Depth_i) = \max\{\sum_{k=1}^{m} Cost(Stage^k_i)\}$. The expression
$\sum_{k=1}^{m} Cost(Stage^k_i)$ denotes the cost of a task chain
within Depth-i.
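The aggregation steps of Sections 5.1.1, 5.2 and 5.3 can be sketched together. Everything here is illustrative: the function names, the probe measurements and the job costs are made up, and clamping the punishment coefficients at 1 is an assumption of this sketch, reflecting the statement that they apply only when the required number of tasks exceeds the cluster's capacity.

```python
def fit_linear(records, seconds):
    """Least-squares fit of seconds = a * records + b from probe measurements."""
    n = len(records)
    mean_x = sum(records) / n
    mean_y = sum(seconds) / n
    sxx = sum((x - mean_x) ** 2 for x in records)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(records, seconds))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda total_records: a * total_records + b


def chain_cost(operators):
    """operators: list of (cost_function, [records of each input table])."""
    return sum(f(sum(table_records)) for f, table_records in operators)


def job_cost(map_cost, shuffle, merge, reduce_cost,
             n_map, n_m_max, n_reduce, n_r_max):
    # Assumption of this sketch: coefficients never drop below 1.
    p_m = max(1.0, n_map / n_m_max)
    p_r = max(1.0, n_reduce / n_r_max)
    return p_m * map_cost + p_r * (shuffle + merge + reduce_cost)


def workflow_cost(depths):
    """depths: list of depths; a depth is a list of task chains;
    a task chain is a list of job costs executed one after another."""
    return sum(max(sum(chain) for chain in depth) for depth in depths)


# Made-up probe data: 1, 2 and 4 million records take 1, 2 and 4 seconds.
filter_cost = fit_linear([1e6, 2e6, 4e6], [1.0, 2.0, 4.0])
ops = chain_cost([(filter_cost, [1e6, 2e6])])

# Twice as many mappers as slots, half as many reducers as slots.
job = job_cost(10.0, 2.0, 3.0, 5.0,
               n_map=280, n_m_max=140, n_reduce=42, n_r_max=84)

# Three depths; the second depth holds two concurrent task chains.
total = workflow_cost([[[job]], [[12.0, 8.0], [15.0]], [[5.0]]])
```

In the example the second depth costs as much as its slower chain (12 + 8 seconds), so the concurrent 15-second chain does not add to the total.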
6. IMPLEMENTATION
6.1 Statistics Collection
QMapper leverages PostgreSQL for cardinality estimation. To do that, PostgreSQL’s source code has
been modified so as to fetch statistics from an external
source and write them into the system table Pg_stat. In
this way, there is no need to actually insert data into PostgreSQL. The statistics are collected by the Statistics Collector. As in PostgreSQL, besides the number of tuples, 5 metrics are collected for each column: (1) Most Common Values, (2) Histogram, (3) Null Fraction, (4) Distinct
Ratio and (5) Average Bytes. We implement MapReduce programs based on PostgreSQL’s analyzing module and the
Statistics Collector runs the programs periodically to calculate these statistics. The mappers perform sampling on
the tuples and send them to reducers, where the key is the attribute name and the value is the value corresponding to that
attribute. Thus, one reducer is only responsible for calculating the statistics of one attribute and all the attributes
can be calculated concurrently.
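The map and reduce roles described above can be imitated in a single process. The sketch below is not the Statistics Collector itself: the function names are hypothetical, the shuffle is replaced by an in-memory dictionary, the histogram is omitted, and the exact definitions used for the distinct ratio (distinct values over non-null values) and the average bytes (UTF-8 length of the string form) are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def map_sample(rows, rate, rng):
    """Mapper: sample tuples and emit one (attribute, value) pair per column."""
    for row in rows:
        if rng.random() < rate:
            for attr, value in row.items():
                yield attr, value


def reduce_stats(values, top_k=3):
    """Reducer: compute the statistics of a single attribute."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    n = len(non_null)
    return {
        "null_fraction": (total - n) / total,
        # Assumed definition: distinct values over non-null values.
        "distinct_ratio": len(counts) / n if n else 0.0,
        # Assumed definition: UTF-8 length of the value's string form.
        "avg_bytes": sum(len(str(v).encode("utf-8")) for v in non_null) / n if n else 0.0,
        "most_common_values": counts.most_common(top_k),
    }


def collect_statistics(rows, rate=1.0, seed=0):
    rng = random.Random(seed)
    grouped = defaultdict(list)  # stands in for the shuffle by attribute name
    for attr, value in map_sample(rows, rate, rng):
        grouped[attr].append(value)
    # Each attribute is handled independently, as one reducer would do.
    return {attr: reduce_stats(vals) for attr, vals in grouped.items()}


rows = [
    {"uid": 1, "type": "a"},
    {"uid": 2, "type": "a"},
    {"uid": 3, "type": None},
    {"uid": 4, "type": "b"},
]
stats = collect_statistics(rows)
```

Because the attribute name is the key, the work for different columns never overlaps, which is what allows all attributes to be processed concurrently.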
6.2 Query Cost Evaluation
Query cost evaluation is used in 2 stages: in and after
CBO. For example, given an input query A, RBR may generate two candidate queries A1 and A2. Then, the 2 queries
are sent to CBO for optimizing the join order. For each
query, CBO adjusts its join plan and performs evaluation
using the cost model in Section 5. In this stage, we do not
need Hive to compile the query to MapReduce workflows.
Because only Join and Filter are concerned in this stage and
Hive’s strategy of processing join operations is predictable,
CBO can generate the workflows for evaluation. After the
processing of CBO, there are still two queries A1 and A2,
which have been optimized. Then, QMapper will invoke
Hive’s planner to compile them into MapReduce workflows
and the cost model is used again to evaluate the whole plan.
The one with the minimum cost is then returned to the end user.
As previously mentioned, QMapper leverages PostgreSQL
for cardinality estimation, and the key problem is how to
map the estimated results to the MapReduce plan. First, in
order to make sure that PostgreSQL’s optimizer does not
rewrite the SQL query, the join optimizer in PostgreSQL is
disabled by setting join_collapse_limit = 1. Then, by using
the EXPLAIN command, the query is sent to PostgreSQL to get its
cardinality estimation result in JSON format. The estimation
result can also be treated as a DAG. The difference between
this DAG and the MapReduce plan DAG is that it is composed of operators such as Hash Join, Seq Scan, Sort and
GroupAggregate. The operators’ dependency is constructed based on the structure of blocks in a query, which
can be mapped to our MapReduce plan. For example, the
output row number of a Seq Scan operator with filters can
be mapped to a table scan Mapper for calculating the output data size, and the output row number of a Hash Join
operator can be used to estimate the output data size of
the Reduce phase. We use filters, aliases, table names
and join conditions as identifications to correctly retrieve the
corresponding operator’s cardinality information.

6.3 Tuning the Number of Reducers
We observe that another important factor affecting the
performance of a MapReduce job is the number of reducers.
By default, Hive sets this parameter by $\frac{MapInputSize}{2 \times 1024^3}$.
However, it is not always a good choice due to the assumption that lots of records are filtered in the Map phase. In some
cases, a single reducer will have too much workload. In
QMapper, since the total output data size of the Map side
can be estimated, we enable Hive to allow setting reducer
numbers per stage. The recommended setting provided by
QMapper is:

$$N_{reduce} = \frac{M_{out.avgbytes} \times M_{out.rec} \times N_{map}}{512 \times 1024^2}$$

7. EXPERIMENTS
In this section, we will evaluate the correctness and efficiency of QMapper. Here, efficiency contains two aspects:
the efficiency of translating SQL into HiveQL and the efficiency of HiveQL execution, comparing QMapper with manually translated work. Experiments based on TPC-H will
demonstrate the execution efficiency of HiveQL generated
by QMapper, and the Smart Grid application will show the correctness and translation efficiency of QMapper.

7.1 TPC-H workloads Evaluation
7.1.1 Experiment Environment
We perform the experiments on our in-house cluster, which
consists of 30 virtual nodes. Each of them has 8 cores and
8GB RAM. All nodes are installed with CentOS 6.4, Java
1.6.0_22 and Hadoop 1.2.1. Hive 0.12.0 is deployed and
10GB, 20GB, 50GB and 100GB TPC-H data sets are generated
as workload. By default, the cluster contains 30 nodes and
the size of the data set is 10GB. The replica number of files
stored in HDFS is set to 3. The values of the parameters collected from this cluster are listed in Table 2. In most cases,
more than 2 tasks are running on the same machine, so the
values of $Disk_w$, $Disk_r$ and $Network$ are smaller than ideal
ones since there is resource contention. Other parameters
are set to default values.

Table 2: Parameters of Experiments
Symbol              Value
$Disk_r$            66584576 (bytes/second)
$Disk_w$            61027123 (bytes/second)
$HDFS_w$            46137344 (bytes/second)
$P_{r.task.mem}$    1024 (MB)
$M_{local}$         0.67
$M_{sp\_size}$      Default: dfs.block.size=67108864 (bytes)
$Network$           44040192 (bytes/second)
$N_{m\_max}$        140
$N_{r\_max}$        84

7.1.2 Overall Performance
[Figure 7: Performance of Full TPC-H Queries. Execution time (seconds) of TPC-H queries 1 to 22 for HiveM and QMapper.]
Figure 7 shows the overall performance of full TPC-H
queries. We choose HiveM and QMapper for comparison, where
HiveM denotes the TPC-H queries rewritten by Hive contributors
for benchmark purposes (https://issues.apache.org/jira/browse/HIVE-600).
In general, we can divide the TPC-H queries into three types:
queries with few translation variations, queries containing
subqueries and queries with multiple join tables.
For queries with few translation variations (Q1, Q6, Q12,
Q13, Q14, Q19), the translation results of QMapper are
quite similar to HiveM, thus the execution time of these
queries is almost the same.
For queries containing subqueries (Q2, Q4, Q11, Q15,
Q16, Q17, Q18, Q20, Q21, Q22), QMapper and HiveM
choose different translation strategies. QMapper converts
the subqueries into join operations, whereas HiveM separates the subqueries from the original ones, so that the final
result contains several queries. Usually, as the translation
methods of HiveM involve more MapReduce tasks, the execution time of QMapper’s results is better than that of
HiveM (Q4, Q16, Q17, Q18), but the performance improvement is not that obvious. Sometimes the results of HiveM
are even better, since the extra join operations introduced by
QMapper take longer (Q15, Q22). Q11 is very special,
as the join operations in the subquery are the same as the ones
outside, so HiveM reuses the results of the subquery and
dramatically reduces the overall execution time. For the other
queries (Q2, Q20, Q21), QMapper further adjusts their join
orders, which greatly shortens the execution time.
For queries with multiple join tables (Q2, Q3, Q5, Q7,
Q8, Q9, Q10, Q20, Q21), QMapper leverages the cost-based
optimizer to adjust the join orders. For Q2, Q3 and Q10, as
there are not so many join tables, the join orders in HiveM
are already good and leave little optimization space for QMapper. For the other queries, QMapper dramatically improves
the performance. In the next section, we will analyze these
typical queries in detail.
7.1.3 Join Performance
Figure 8 shows the performance of five typical TPC-H
queries with multiple join tables. Here we add HiveM+YSmart and QMapper+YSmart for comparison. In our settings,
YSmart is employed by setting hive.optimize.correlation =
true in Hive. We can see that, in these 5 cases, QMapper
gets the best performance. For Q5, Q8 and Q20, QMapper
achieves about 44% improvement. Also, for Q7 and Q9,
QMapper improves the performance by more than 50%. Even
for experienced engineers, it is very difficult to manually
optimize complex queries which involve five or more tables.
Intuitive decisions often lead to sub-optimal plans. Thus, it
is necessary to estimate the costs and identify the best query
automatically. QMapper provides such a solution and gets good
performance.
As shown in Figure 8, YSmart does not perform any
transformations on the queries except for Q20. Figures 9
and 10 show Q20’s MapReduce plans generated by QMapper+YSmart and QMapper respectively. We can see that
YSmart merges Stage-4 and Stage-8 into a single job. However, this does not bring any improvement, since the query
rewritten by QMapper is already highly parallelized. In this
case, reducing the number of jobs does not affect the depths
of the DAG. Thus, no improvement is gained when there are
enough resources for running multiple jobs concurrently.
7.1.4 Scalability
Figures 11, 12, 13 and 14 show the scalability of the optimized queries. Figures 11 and 12 reflect the performance of
Q7 and Q20 while the data size increases from 10GB to 100GB.
Figures 13 and 14 show the performance of Q5 and Q8
with a data size of 20GB, while the number of nodes scales
from 10 to 30. In Figures 13 and 14, the execution time of
the queries optimized by QMapper hardly shortens
when there are more than 20 nodes in the cluster, while the
execution time of the original queries keeps decreasing. Because QMapper balances the trade-off between reducing the
size of intermediate data and improving parallelism, a 20-node
cluster is already able to support the optimal plan, and adding
more nodes does not bring much help at that data size.
7.1.5 Accuracy of Cost Model
Some experiments have been done to evaluate QMapper’s
cost model. We choose Q9, which is the most complicated
query in TPC-H, to validate the accuracy of the cost model.
Figure 15 shows the execution time comparison between
query variations for Q9. These variations are selected from
the intermediate results of QMapper’s CBO. We can see that
different query variations differ hugely in performance.
Since our cost model does not consider influences such
as the CPU costs of sorting and compressing, it cannot fully
reflect the execution time of queries in reality. However, the
cost model considers the main factors in each phase, which
cover more than 80% (Figure 5) of the execution time, thus
it is enough to choose a good variation.
Furthermore, we validate the cost model by using a real
world data set. We pick out a typical query from SECICS,
which consists of six join tables and is used to calculate the
line loss rate. We execute some variations of the query
in the 30-node cluster with a 25GB real world data set. Figure 16 shows the comparison results. The performance of
QMapper’s cost model using the real world data set is a little worse than that using TPC-H. As we assume the data is
evenly distributed in our cost model (for example, we assume
the output of the map phase is evenly shuffled to each reducer),
the data skew in the real world data set influences the
accuracy of the cost model. However, the cost model is practical
enough to pick out a superior variation for the query.
7.2 Smart Grid Application
We built up the Hadoop/Hive platforms on a cluster composed of eight commodity servers (8 cores and 32GB memory
each). Then we started to translate the stored procedures.
With the help of QMapper, our 6-member team finished the
migration from SQL to HiveQL in only 6 weeks. According
to the statistics, 90% of the SQL queries can be perfectly
translated by QMapper. QMapper can handle queries with
complex subqueries very well, but we still need to adjust
some queries manually: (1) It is very difficult to find general translation rules for some SQL functions, such as the
window function, and we have to deal with them manually. (2) For queries containing for or while loops, we still
need to translate them by ourselves. (3) For queries containing rownum, we need to analyze them carefully and choose
proper translation ways to ensure the correctness of translation results. As we only performed the translation work
with QMapper, we do not know how much time it would take
to understand the logic of the SQL code and translate
it manually, but it would be much longer than the
time we spent.
The correctness validation of HiveQL and the deployment of workflows were performed manually. It took us another
8 weeks to finish this work. We had to spend a fairly long
time preparing the environment and importing data for validation.
Correctness validation and workflow deployment are time consuming, mainly because: (1) In some cases, it is difficult to
guarantee that SQL and the corresponding HiveQL generate the
same results. For example, for queries containing rownum,
different records might be chosen and we must validate
the impact of these results. (2) Considering the cost of initiating MapReduce tasks, sometimes we chose to merge
some queries to minimize the number of MapReduce tasks.
(3) It is very complicated to debug at workflow level. As we
used new data to validate the workflow every day, once we
found the results of the workflow were not the same as those
of the SQL stored procedure, we had to analyze the results of
HiveQL sentence by sentence to locate the error. (4) The
workflow deployed on Oozie should be carefully arranged to
make full use of the resources of the cluster. Some tasks were
arranged to execute concurrently to improve the overall efficiency.
We use QMapper for the migration of five main business functions
in SECICS: computing user electricity consumption, district
line loss calculation, statistics of data acquisition rates, terminal traffic statistics and exception handling. After migration, 95% of the queries are executed by Hive and the others
are deployed on Impala. In order to guarantee the stability
of migration, we have not removed the offline batch analysis
tasks from the RDBMS. The meter data is currently duplicated
and stored both in the RDBMS and HDFS. We compare the
analysis results of the RDBMS and Hadoop/Hive every day to
verify their consistency. Now the new solution has been in
production for eight months and it works very well. Results
show that, compared with the execution time in the RDBMS,
the performance in the new environment is more than 3 times
faster on average. With the growth of the meter data scale,
the Zhejiang Grid plans to let Hive take full charge of the
offline analyzing tasks.

[Figure 8: Overall Join Performance. Execution time (seconds) of Q5, Q7, Q8, Q9 and Q20 for HiveM, HiveM+YSmart, QMapper and QMapper+YSmart.]
[Figure 9: Execution Plan of Q20 (QMapper & YSmart).]
[Figure 10: Execution Plan of Q20 (QMapper).]
[Figure 11: Effect of Data Size for Q7. Execution time (seconds) of HiveM and QMapper for data sizes 10G, 20G, 50G and 100G.]
[Figure 12: Effect of Data Size for Q20. Execution time (seconds) of HiveM and QMapper for data sizes 10G, 20G, 50G and 100G.]
[Figure 13: Effect of Cluster Size for Q5. Execution time (seconds) of HiveM and QMapper for 10 to 30 nodes.]
[Figure 14: Effect of Cluster Size for Q8. Execution time (seconds) of HiveM and QMapper for 10 to 30 nodes.]
[Figure 15: Accuracy of Optimizer for Q9. Execution time and cost estimation for query variations 1 to 4 and the optimal variation of Q9.]
[Figure 16: Accuracy of Optimizer for A Real World Query. Execution time and cost estimation for variations 1 to 4 and the optimal variation of a real world query.]

8. RELATED WORK
Automatic SQL to MapReduce mapping and optimization
attract a lot of attention from both industry and academia.
Current works mainly fall into three categories.
The first category adopts a rule-based approach to guide
the SQL-to-MapReduce mapping procedure. Tenzing [14],
proposed by Google, provides an SQL92-compatible interface
and introduces four join algorithms based on MapReduce.
It also supports a simple cost-based approach to switch the
join orders. HadoopDB [1] uses MapReduce as a computing layer on top of multiple DBMS instances and pushes SQL
queries down to the DBMS so as to utilize their sophisticated optimizers. These systems provide the foundation to run SQL
on MapReduce. However, most of the translations follow a
specific pattern, which is to map each part of a SQL query
into MapReduce jobs directly. Compared to the optimizations in RDBMS, there are lots of optimization opportunities
left, such as query rewriting and cost-based optimization.
In the second category, works including [2, 9, 10] propose
cost-based approaches to determine appropriate configuration parameters for MapReduce jobs. They treat MapReduce as a black-box system. By using the statistics of pre-run jobs, more than 10 parameters are tuned and the cost of
a MapReduce job is evaluated during the optimization phase.
The work in the third category aims at optimizing MapReduce workflows directly. YSmart [12] is a rule-based, correlation-aware SQL-to-MapReduce translator, aiming at reducing the number of jobs by merging jobs according to
their relations. Stubby [13] is a cost-based optimizer for
MapReduce workflows. It is designed to perform transformations on MapReduce workflows. Stubby searches through
the workflow variations to get a near-optimal solution.
Moreover, lots of works focus on the optimization of join
operations in MapReduce. AQUA [19] is a query optimizer
which aims at reducing the cost of storing intermediate results by optimizing the order of join operations. In AQUA,
a top-down approach is utilized. The join operators are divided into several groups, each of which is processed by a
single MapReduce job. Then, a heuristic approach is used
for connecting these groups to form a final join tree.
The works above are very important improvements for
running SQL queries on MapReduce. However, current approaches mostly focus on optimization at the MapReduce level,
while ignoring the varieties of SQL queries and their influences on the performance. Moreover, unlike the well known
cost model described in [8], which simulates every stage of
MapReduce and needs to pre-run the jobs on a sampling data
set, QMapper’s cost model only captures the I/O costs of
each phase and computes the cost of each involved Hive operator
based on its cost sampling and run time I/O data size.
Besides, some benchmarks have been done to compare
Hive and Impala. [5] compares the performance of Hive and
Impala in detail, which can be a reference for the improvement of SECICS.
9. CONCLUSION
Traditional enterprises seek tools to migrate legacy
RDBMS-based data analysis applications to Hive. QMapper
is proposed in this paper, which applies rule-based rewriting
and cost-based optimization to a given SQL query to generate efficient HiveQL. The experimental evaluations based on
TPC-H demonstrate the effectiveness of QMapper as well as
the accuracy of its cost model. Our real world application in
Smart Grid also shows its superiority. One direction of our
future work is to integrate this approach with Shark.
Multiple query optimization will be another improvement
direction.
10. ACKNOWLEDGEMENTS
This work is supported by the National Natural Science
Foundation of China under Grant No.61070027, 61020106002
and 61161160566.
11. REFERENCES
[1] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi,
A. Silberschatz, and A. Rasin. Hadoopdb: an
architectural hybrid of mapreduce and dbms
technologies for analytical workloads. VLDB,
2(1):922–933, 2009.
[2] S. Babu. Towards automatic optimization of
mapreduce programs. In SoCC, pages 137–142, 2010.
[3] K. Beyer, V. Ercegovac, R. Gemulla, A. Balmin,
M. Eltabakh, C.-C. Kanne, F. Ozcan, and E. J.
Shekita. Jaql: A scripting language for large scale
semistructured data analysis. In VLDB, 2011.
[4] Y. Chen, S. Alspaugh, and R. Katz. Interactive
analytical processing big data systems: A
cross-industry study of mapreduce workloads. VLDB,
5(12):1802–1813, 2012.
[5] A. Floratou, U. F. Minhas, and F. Ozcan.
Sql-on-hadoop: Full circle back to shared-nothing
database architectures. Proceedings of the VLDB
Endowment, 7(12):1295–1306, 2014.
[6] M. J. Franklin, B. T. Jónsson, and D. Kossmann.
Performance tradeoffs for client-server query
processing. ACM SIGMOD Record, 25(2):149–160,
1996.
[7] L. M. Haas, W. Chang, G. M. Lohman, J. McPherson,
P. F. Wilms, G. Lapis, B. Lindsay, H. Pirahesh, M. J.
Carey, and E. Shekita. Starburst mid-flight: as the
dust clears. TKDE, 2(1):143–160, 1990.
[8] H. Herodotou. Hadoop performance models. arXiv
preprint arXiv:1106.0940, 2011.
[9] H. Herodotou and S. Babu. Profiling, what-if analysis,
and cost-based optimization of mapreduce programs.
VLDB, 4(11):1111–1122, 2011.
[10] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong,
F. B. Cetin, and S. Babu. Starfish: A self-tuning
system for big data analytics. In CIDR, volume 11,
pages 261–272, 2011.
[11] S. Hu, W. Liu, T. Rabl, S. Huang, Y. Liang, Z. Xiao,
H.-A. Jacobsen, X. Pei, and J. Wang. Dualtable: A
hybrid storage model for update optimization in hive.
In ICDE, 2015. to appear.
[12] R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and
X. Zhang. Ysmart: Yet another sql-to-mapreduce
translator. In ICDCS, pages 25–36, 2011.
[13] H. Lim, H. Herodotou, and S. Babu. Stubby: A
transformation-based optimizer for mapreduce
workflows. VLDB, 5(11):1196–1207, 2012.
[14] L. Lin, V. Lychagina, W. Liu, Y. Kwon, S. Mittal, and
M. Wong. Tenzing a sql implementation on the
mapreduce framework. 2011.
[15] Y. Liu, S. Hu, T. Rabl, W. Liu, H.-A. Jacobsen,
K. Wu, J. Chen, and J. Li. DGFIndex for Smart Grid:
Enhancing Hive with a Cost-Effective
Multidimensional Range Index. Proceedings of the
VLDB Endowment, 7(13):1496–1507, 2014.
[16] C. Olston, B. Reed, U. Srivastava, R. Kumar,
and A. Tomkins. Pig latin: a not-so-foreign language
for data processing. In SIGMOD, pages 1099–1110,
2008.
[17] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka,
S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive:
a warehousing solution over a map-reduce framework.
VLDB, 2(2):1626–1629, 2009.
[18] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka,
N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive-a
petabyte scale data warehouse using hadoop. In
ICDE, pages 996–1005, 2010.
[19] S. Wu, F. Li, S. Mehrotra, and B. C. Ooi. Query
optimization for massively parallel data processing. In
SoCC, page 12, 2011.
[20] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin,
S. Shenker, and I. Stoica. Shark: Sql and rich
analytics at scale. In SIGMOD, pages 13–24, 2013.
[21] Y. Xu and S. Hu. Qmapper: a tool for sql
optimization on hive using query rewriting. In WWW,
pages 211–212, 2013.