DBMS on a modern processor: where does time go?

DBMS on a modern processor:
where does time go?
Anastasia Ailamaki, David DeWitt, Mark Hill and David Wood
University of Wisconsin‐Madison
Presented by:
Bogdan Simion
Current DBMS Performance
+
=
Where is query execution time spent?
Identify performance bottlenecks in CPU and memory
Outline
• Motivation
• Background
• Query execution time breakdown
• Experimental results and discussions
• Conclusions
Hardware performance standards
• Processors are designed and evaluated with simple programs • Benchmarks: SPEC, LINPACK
• What about DBMSs? DBMS bottlenecks
• Initially, bottleneck was I/O • Nowadays ‐ memory and compute intensive apps
• Modern platforms:
– sophisticated execution hardware
– fast, non‐blocking caches and memory
• Still …
– DBMS hardware behaviour
is suboptimal compared to scientific workloads Execution pipeline
INSTRUCTION POOL
FETCH/
DECODE UNIT
DISPATCH EXECUTE
UNIT
L1 I‐CACHE
RETIRE UNIT
L1 D‐CACHE
L2 CACHE
MAIN MEMORY
Stalls overlapped with useful work !!!
Execution time breakdown
TQ = TC + TM + TB + TR ‐ TOVL
•TC ‐ Computation
•TM ‐ Memory stalls
L1D, L1I
L2D, L2I
DTLB, ITLB
•TB ‐ Branch Mispredictions
•TR ‐ Stalls on Execution Resources
Functional Units
Dependency Stalls
DB setup
• DB is memory resident => no I/O interference
• No dynamic and random parameters, no concurrency control among transactions
Workload choice
• Simple queries:
– Single‐table range selections (sequential, index)
– Two‐table equijoins
• Easy to setup and run
• Fully controllable parameters
• Isolates basic operations
• Enable iterative hypotheses !!!
• Building blocks for complex workloads?
Execution Time Breakdown (%)
Query execution time (%)
10% Sequential Scan
100%
100%
100%
80%
80%
80%
60%
60%
60%
40%
40%
40%
20%
20%
20%
0%
0%
0%
A
B
C
D
B
DBMS
Computation
Join (no index)
10% Indexed Range Selection
Memory
C
DBMS
D
Branch mispredictions
• Stalls at least 50% of time
• Memory stalls are major bottleneck
A
B
C
DBMS
Resource
D
Memory Stalls Breakdown (%)
Memory stall time (%)
10% Sequential Scan
100%
100%
100%
80%
80%
80%
60%
60%
60%
40%
40%
40%
20%
20%
20%
0%
0%
0%
A
B
C
D
B
DBMS
L1 Data
Join (no index)
10% Indexed Range Selection
L1 Instruction
C
DBMS
D
A
B
C
D
DBMS
L2 Data
L2 Instruction
• Role of L1 data cache and L2 instruction cache unimportant
• L2 data and L1 instruction stalls dominate
• Memory bottlenecks across DBMSs and queries vary
Effect of Record Size
10% Sequential Scan
L2 data misses / record
25
8
# of misses per record
L1 instruction misses / record
20
6
15
4
10
2
5
0
0
20
48
100
record size
System A
200
System B
20
System C
48
100
record size
System D
• L2D increase: locality + page crossing (except D)
• L1I increase: page boundary crossing costs
200
Memory Bottlenecks
• Memory is important
‐ Increasing memory‐processor performance gap
‐ Deeper memory hierarchies expected
• Stalls due to L2 cache data misses
‐ Expensive fetches from main memory
‐ L2 grows (8MB), but will be slower
• Stalls due to L1 I‐cache misses
‐ Buffer pool code is expensive
‐ L1 I‐cache not likely to grow as much as L2
Branch Mispredictions Are Expensive
25%
Query execution time (%)
Branch misprediction rates
25%
20%
15%
10%
5%
20%
15%
10%
5%
0%
0%
A
B
C
D
B
C
DBMS
DBMS
Sequential Scan
A
Index scan
Join (no index)
• Rates are low, but contribution is significant
• A compiler task, but decisive for L1I performance
D
Mispredictions Vs. L1‐I Misses
10% Sequential Scan
10% Indexed Range Selection
Events / 1000 instr.
12
9
50
20
40
15
Join (no index)
30
10
6
20
3
5
10
0
0
0
A
B
C
DBMS
D
B
Branch mispredictions
C
DBMS
D
A
B
C
DBMS
L1 I-cache misses
• More branch mispredictions incur more L1I misses
• Index code more complicated ‐ needs optimization
D
Resource‐related Stalls
% of query execution time
Dependency‐related stalls (TDEP)
Functional Unit‐related stalls (TFU)
25%
25%
20%
20%
15%
15%
10%
10%
5%
5%
0%
0%
A
B
C
D
B
C
D
DBMS
DBMS
Sequential Scan
A
Index scan
Join (no index)
• High TDEP for all systems : Low ILP opportunity
• A’s sequential scan: Memory unit load buffers?
Microbenchmarks vs. TPC
CPI Breakdown
Clock ticks
System B
System D
3.5
3.5
3
3
2.5
2.5
2
2
1.5
1.5
1
1
0.5
0.5
0
0
sequential
scan
TPC-D
2ary index
TPC-C
benchmark
Computation
sequential
scan
TPC-D
2ary
index
TPC-C
benchmark
Memory
Branch misprediction
Resource
• Sequential scan breakdown similar to TPC‐D
• 2ary index and TPC‐C: higher CPI, memory stalls (L2 D&I mostly)
Conclusions
• Execution time breakdown shows trends
• L1I and L2D are major memory bottlenecks
• We need to:
–
–
–
–
reduce page crossing costs
optimize instruction stream
optimize data placement in L2 cache
reduce stalls at all levels
• TPC may not be necessary to locate bottlenecks
Five years later – Becker et al 2004 • Same DBMSs, setup and workloads (memory resident) and same metrics • Outcome: stalls still take lots of time
–
–
–
–
Seq scans: L1I stalls, branch mispredictions much lower
Index scans: no improvement
Joins: improvements, similar to seq scans
Bottleneck shift to L2D misses => must improve data placement
– What works well on some hardware doesn’t on other Five years later – Becker et al 2004 • C on a Quad P3 700MHz, 4G RAM, 16K L1, 2M L2
• B on a single P4 3GHz, 1G RAM, 8K L1D + 12KuOp trace cache, 512K L2, BTB 8x than P3
• P3 results:
– Similar to 5 years ago: major bottlenecks are L1I and L2D
• P4 results:
– Memory stalls almost entirely due to L1D and L2D stalls – L1D stalls higher ‐ smaller cache and larger cache line
– L1I stalls removed due to trace cache (esp. for seq. scan, but still some for index)
Hardware – awareness is important !
References
• DBMS on a modern processor: where does time go? Revisited – CMU Tech Report 2004
• Anastassia Ailamaki – VLDB’99 talk slides
Questions?