Assignment

CMPSCI 445 -- Homework 4
Due April 14, before 11 am
Question 1. Output of Join Algorithms
Suppose we have two unary (one attribute only) relations, R and S:
R
S
------------7
8
2
4
9
2
8
1
3
3
9
2
1
7
3
3
6
Show the result of joining R and S using each of the following algorithms. List the results in the
order that they would be output by the join algorithm. Note that the result relation contains only
one attribute, which is the common attribute between R and S.
(1) Sort merge join algorithm.
(2) Two-phase hash join algorithm, assuming that there is enough memory to complete the join
in two passes. To run the algorithm, assume that there are two hash buckets, numbered 0 and 1.
In Phase I, the hash function sends even values to bucket 0 and odd values to bucket 1. In Phase
II, use R as the “build” relation and S as the “probe” relation. Assume that bucket 0 is read first
and that the contents of a bucket are read in the same order as they were written in the previous
phase.
Question 2. Costs of Join Algorithms
Consider the join R R.a=S.b S, given the following information about the relations to be joined.
The cost metric is the number of page I/Os unless otherwise noted, and the cost of writing out the
result should be uniformly ignored.
Relation R contains 10,000 tuples and has 10 tuples per page.
Relation S contains 2,000 tuples and also has 10 tuples per page.
Attribute b of relation S is the primary key for S.
Both relations are stored as simple heap files.
Neither relation has any indexes built on it.
52 buffer pages are available.
(1) What is the cost of joining R and S using a block nested loops join?
(2) What is the cost of joining R and S using a sort-merge join?
(3) What is the cost of joining R and S using a hash join?
(4) Assume that you can access as much memory as you want. What would be the lowest
possible I/O cost for joining R and S using any join algorithm, and what is the minimum buffer
size to achieve this cost? Explain briefly.
Question 3. Cost of Group By
Consider again the relation R(a, b, c, … ), with 10,000 tuples and 10 tuples per page. Consider
the following Group By query:
Query 1:
Select
From
Group By
min(R.b)
R
R.a
Assume that 52 buffer pages are available.
Please briefly describe the most efficient implementation of the Group By query using the sortmerge algorithm and analyze its I/O cost.
Question 4: Query Optimization
Consider the following schema:
Sailors(sid, sname, rating, age)
Boats(bid, bname, size)
Reserves(sid, bid, day)
Reserves.sid is a foreign key to Sailors. Reserves.bid is a foreign key to Boats.bid.
We are given the following information about the database:
• Sailors contains 50 pages with 20 records per page, so 1000 records in total.
• Boats contains 10 pages with 10 records per page, so 100 records in total.
• Reserves contains 250 pages with 40 records per page, so 10,000 records in total.
• There are 100 values for Reserves.sid.
• There are 50 values for Reserves.bid.
• There are 1000 values for Reserves.day
Consider Query 2:
SELECT S.sid, S.sname, B.bname
FROM
Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid and R.bid = B.bid
(1) Draw all left-deep query plans that avoid Cartesian products for this query.
(2) Now consider the query plan (SR) B, where SR takes place first and then further
joins with B. Assume that we have 22 pages of memory. Block nested loops join and hash join
are the only available join methods.
Which join algorithm works better for SR given all 22 pages? Explain your answer.
Next, we consider the two joins, (SR) B, together. Let us reserve 10 pages to hold
relation B in memory, and 1 page to hold the final output. We give the remaining 11 pages to
implement SR. Given 11 pages, which join algorithm is more efficient for SR? And
what is the total I/O cost for the entire plan, (SR) B?