
Polytechnic Institute of NYU
Computer Science and Engineering
CS308, Spring 2014
May 6, 2014

Sample Solution for Problem Set #4

Problem 1:

Problem 2:

Problem 3:

a)
Query 1: "List all user ids who have checked into 'Lunar Lounge'."
Query 2: "List all user ids who checked into 'Madison Square Garden' on January 12, 2014."
Query 3: "List all the users from Pittsburgh that have checked into a place located in Austin."
Query 4: "For every place, list the pid and the number of check-ins by users from Austin."

Query Execution Plans:

b)
Query 1: We scan Place to find the Lunar Lounge record and place it in main memory. Then we do a blocked nested-loop join by scanning CheckIn once, and finally project onto uid. So the cost is the same as scanning Place (2 million places * 100 B = 200 MB) and CheckIn (2 billion check-ins * 50 B = 100 GB), which takes 100,200 MB / 80 MB/s = 1252.5 seconds. (Note: we ignored the 20 ms for seek and rotational latency, as this is negligible.)

Query 2: First scan Place to retrieve Madison Square Garden and put it into main memory. Then do a scan on CheckIn to join with this tuple, which will result in 5000 tuples (5000 * (100 + 50) B = 0.75 MB), which fit in main memory. (The check on cdate is performed while scanning CheckIn.) Then we project onto uid. Thus, the cost is the same as scanning the CheckIn table (100 GB) and the Place table (2 million places * 100 B = 200 MB), which takes 100,200 MB / 80 MB/s = 1252.5 seconds.

Query 3: We first scan the Place table, keeping only those records that are in Austin. There are about 10,000 such places (0.5% of the 2 million places total), so these fit into main memory. Assume that places in Austin are on average as popular as other places, so there would be about 0.5% of 2 billion, or about 10 million, matching check-ins. The size of each joined record is 150 bytes (100 bytes for each Place record and 50 bytes for each CheckIn record). So these records use 1.5 GB and would not fit into main memory for the join with User. We can project away everything from the resulting records except uid, pid, and cdate, so that the size per record might be only about 40-42 bytes, or about 420 MB total. In this case, we could then use a blocked nested-loop join and scan User once or twice (depending on whether the intermediate result fits or not). Thus, the total cost is that of scanning Place and CheckIn once and User twice, or about (200 + 100,000 + 2 * 660 + 2 * 1,000) / 80 = 103,520 / 80 = 1294 seconds.

One other, maybe better, option is to first scan both Place and User, and keep the tuples from both tables that satisfy the respective conditions in main memory (there are about 50,000 users in Pittsburgh, so this would easily fit). Then scan CheckIn once and join each tuple with both the Place and the User tuples (maybe by first checking for a matching place and then for a matching user). This requires only one scan of each table, and it also results in a nice non-blocking execution after the initial scan and filtering of the small User and Place tables.

Query 4: First do a scan on the User table and obtain all records where ucity = "Chicago". This results in 10,000,000 / 50 = 200,000 records, which clearly fit in main memory. Next we can scan CheckIn once to join with these User records, then group by pid and count the rows in each group. Note that there are (at most) 2 million groups, and for each group we only need to store a pid and a count, so this group-by and count can be done in a pipelined manner on the output of the join, using, say, a hash table with key pid and the count as payload. Thus, the disk access cost consists of a scan of User and CheckIn, so the cost would be (1,000 MB + 100,000 MB) / 80 MB/s = 1262.5 seconds.
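Since all four plans in b) boil down to sequential scans, the estimates are easy to re-derive. The Python sketch below is not part of the original solution; it simply recomputes the scan costs from the table sizes and the 80 MB/s scan rate used above (seek and rotational latency are ignored, as above).

def scan_example():
    # Back-of-the-envelope check of the scan-based cost estimates in part b).
    # Sizes in MB; seek and rotational latency ignored, as in the solution above.
    PLACE_MB   = 2_000_000 * 100 / 1e6      # 2 million places * 100 B    = 200 MB
    CHECKIN_MB = 2_000_000_000 * 50 / 1e6   # 2 billion check-ins * 50 B  = 100,000 MB
    USER_MB    = 10_000_000 * 100 / 1e6     # 10 million users * 100 B    = 1,000 MB
    RATE_MB_S  = 80                         # sequential scan rate in MB/s

    def scan_seconds(*sizes_mb):
        """Time to read (or write) the given amounts of data sequentially."""
        return sum(sizes_mb) / RATE_MB_S

    # Queries 1 and 2: one scan of Place plus one scan of CheckIn.
    print(scan_seconds(PLACE_MB, CHECKIN_MB))                        # 1252.5 s

    # Query 3, first plan: the terms of the formula above (scan Place and CheckIn
    # once, handle the intermediate result twice, scan User twice).
    print(scan_seconds(PLACE_MB, CHECKIN_MB, 2 * 660, 2 * USER_MB))  # 1294.0 s

    # Query 4: scan User and CheckIn once; the group-by runs pipelined in memory.
    print(scan_seconds(USER_MB, CHECKIN_MB))                         # 1262.5 s

scan_example()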
c)

For the User table: Each tree node contains n-1 keys and n pointers. With 16 bytes per key, 8 bytes per pointer, and a node size of 4096 bytes, we can find how many keys and pointers fit in each node: (n-1) * 16 + 8 * n = 4096 => n = 171. Assuming 80% occupancy per node, each leaf node will contain about 137 index entries and each internal node will have about 137 children.

For the User table, 40 records fit into each disk page, assuming 100% occupancy in the disk blocks used by the relation. Thus a sparse index on the User table will have one index entry for every 40 records, or a total of 250,000 index entries. Since about 137 index entries fit in each leaf node, there will be about 250,000/137 = 1825 leaf nodes. On the next level there will be about 1825/137 = 14 nodes, and the level above that is the root of the tree. So the B+ tree has 3 levels of nodes: the root, one internal level, and the leaf level. Thus it takes about 4 * 10 ms = 40 ms to fetch a single record from the table using this index (3 index levels plus one access to the data page), assuming no caching. The size of the tree is dominated by the leaf level, which is about 1825 * 4 KB = 7.3 MB.

For the CheckIn table: Each index entry now has two IDs and a counter, and thus takes 40 bytes. Thus n = 102, and with 80% occupancy we get about 80 children per node. 80 records of CheckIn fit into each disk page, assuming 100% occupancy in the disk blocks used by the relation. Thus a sparse index on the CheckIn table will have one index entry for every 80 records, or a total of 25 million index entries. Since there are about 80 index entries in each leaf node, there will be about 25,000,000/80 = 312,500 leaf nodes. The next level has about 312,500/80 = 3906 nodes, the level above it about 3906/80 = 49 nodes, and the level above that is the root. Thus the tree has 4 levels of nodes, and 5 * 10 ms = 50 ms is needed to fetch a single record from the table. The size of the tree is dominated by the leaf level, which is about 312,500 * 4 KB = 1.25 GB.

d)

Query 1: First, we choose a clustered or unclustered index on pname. Getting the tuple with pname = "Lunar Lounge" will take 4 * 10 ms = 40 ms (assuming an index of height 3, plus one access to the data page). Next, we choose a clustered index on pid in CheckIn. To do the join, we then only need to perform one lookup on the pid of the Lunar Lounge record (assuming there is only one Lunar Lounge), which takes 1 * (5 * 10 ms) = 50 ms (assuming an index of height 4, and that the cost of scanning the 100 check-ins for the Lunar Lounge in the underlying table is negligible). Overall, the total cost is now about 90 ms.

Query 2: We choose an index on pname in Place. We also choose a clustered index on (pid, cdate) in CheckIn. Thus, after looking up the pid of Madison Square Garden using the first index, costing about 40 ms, we can then do a look-up in CheckIn on all records with that pid and with cdate equal to January 12, 2014, resulting in 5000 tuples that are next to each other on disk. The cost of this second step is 50 ms + 250 KB / 80 MB/s ≈ 53 ms. Note that choosing only a single-attribute index for CheckIn, either on pid or cdate, would mean that we would have to either scan all check-ins at MSG, or all check-ins to any place on January 12, 2014, which would be much more expensive.
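The fan-outs, tree heights, and per-lookup costs used in part c) and in the plans for Queries 1 and 2 above can be rechecked with a short calculation. The Python sketch below is not part of the original solution; it assumes each index entry is a key plus an 8-byte pointer (so a 32-byte two-ID key for CheckIn), 80% node occupancy, and 10 ms per random page access, so its figures differ slightly from the rounded ones above.

import math

def btree_stats(key_bytes, records_per_page, num_records, occupancy=0.8,
                node_bytes=4096, ptr_bytes=8, io_ms=10):
    # Max pointers per node from (n-1)*key_bytes + n*ptr_bytes <= node_bytes.
    n = (node_bytes + key_bytes) // (key_bytes + ptr_bytes)
    fanout = round(n * occupancy)                        # children per node at 80% occupancy
    entries = math.ceil(num_records / records_per_page)  # sparse index: one entry per data page
    leaves = math.ceil(entries / fanout)
    levels, nodes = 1, leaves                            # count levels from the leaves up to the root
    while nodes > 1:
        nodes = math.ceil(nodes / fanout)
        levels += 1
    lookup_ms = (levels + 1) * io_ms                     # traverse the tree, then fetch the data page
    leaf_mb = leaves * node_bytes / 1e6                  # tree size is dominated by the leaf level
    return fanout, leaves, levels, lookup_ms, leaf_mb

# User index: 16 B keys, 40 records per page, 10 million records
#   -> fanout ~137, ~1825 leaves, 3 levels, 40 ms per lookup, ~7.5 MB of leaves.
print(btree_stats(16, 40, 10_000_000))
# CheckIn index: 32 B keys (two IDs), 80 records per page, 2 billion records
#   -> fanout ~82 (about 80 above), 4 levels, 50 ms per lookup, ~1.25 GB of leaves.
print(btree_stats(32, 80, 2_000_000_000))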
Query 3: Our main concern here is to avoid a scan of CheckIn, so we need one index on that table. The best choice would be a clustered index on pid. We will use the other index to avoid a scan of User, the second-largest table. We use an index on ucity in User to get the 50,000 users in Pittsburgh and place them in memory. We then scan Place to get the pids of the 10,000 places in Austin, and for each place we look up the matching check-ins; for each such check-in we check for a matching user. This step requires 10,000 index lookups in CheckIn, taking about 500 seconds, and the scan of Place takes about 2.5 s, so the total cost is about 502.5 s.

Query 4: For the fourth query, we can have a clustered index on the ucity attribute of User to efficiently find the uids of users in Chicago. (Another slightly faster way to get these might involve having a dense index on (ucity, uid), and then scanning only the leaf nodes of this index with ucity = "Chicago" to get the corresponding uids; this is a bit non-standard but also possible.) The problem is that there seems to be no way to use a second index to avoid scanning CheckIn in the join, since an index-based join would result in too many lookups, as there are too many users in Chicago. Thus, we only save the cost of the scan of User. (Well, if all users in the same city are assigned consecutive uids, say all users in Chicago have uids of the form 20*, then we could do something. Can you think of what to do then?)

Problem 4

(a) The capacity of the disk is 2 * 2 * 300,000 * 3,000 * 512 bytes = 1843.2 GB. The maximum rate of a read (using 3000 RPM = 50 RPS) is 50 * 3,000 * 512 bytes = 76.8 MB/s, and the average rotational latency is half a rotation, i.e., 20 ms / 2 = 10 ms.

(b) Block Model:
Read 4 KB: t = 4 + 4 + 4/(100*1024)*1000 ≈ 8.039 ms
Read 20 KB: T = 5t = 5 * 8.039 ≈ 40.19 ms
Read 200 KB: T = 50t = 50 * 8.039 ≈ 401.95 ms
Read 20 MB: T = 5000t ≈ 40.2 s

LTR Model:
Read 20 KB: T = 4 + 4 + 20/100 = 8.2 ms
Read 200 KB: T = 4 + 4 + 200/100 = 10 ms
Read 20 MB: T = 4 + 4 + 20000/100 = 208 ms

Thus, for large reads the LTR model predicts much lower times than the block model. Since a large sequential read does not pay a seek and rotational delay for every 4 KB block, the LTR predictions are also the more accurate ones.

(c) Phase 1: Repeat the following until all data has been read: read 1 GB of data, sort it in main memory using any sorting algorithm, and write it into a new file. The time to read 1 GB is 4 + 4 + 1024/100*1000 = 10,248 ms ≈ 10.25 s. To read and write 240 such files takes 10.25 * 2 * 240 = 4920 s.

Phase 2: Merge the 240 files created in Phase 1 in one pass. The main memory is divided into 241 buffers, each of size 1024/241 = 4.25 MB. For each buffer, the read (or write) time is 4 + 4 + 4.25/100*1000 = 50.5 ms. The 240 GB can be divided into 240*1024/4.25 = 57,826 such pieces, each of which is read once and written once, so the total time is 50.5 * 57,826 * 2 = 5840.43 s.

The total time for sorting the 240 GB file with a single merge pass is therefore about 4920 + 5840.43 = 10,760.43 s.
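The read-time and sorting estimates in (b) and (c) follow from two simple formulas, so they are easy to recompute. The Python sketch below is not part of the original solution; it uses the 4 ms seek, 4 ms rotational delay, 100 MB/s transfer rate, and 1 GB of main memory assumed above, with sizes written in MB (20 KB treated as 0.02 MB, as in the text), so small rounding differences are expected.

SEEK_MS, ROT_MS = 4, 4
RATE_MB_PER_MS = 0.1            # 100 MB/s

def ltr_ms(size_mb):
    """LTR model: one seek plus one rotation, then a single sequential transfer."""
    return SEEK_MS + ROT_MS + size_mb / RATE_MB_PER_MS

def block_ms(size_mb, block_mb=0.004):
    """Block model: every 4 KB block pays its own seek, rotation, and transfer."""
    return (size_mb / block_mb) * ltr_ms(block_mb)

for mb in (0.02, 0.2, 20):      # 20 KB, 200 KB, 20 MB
    print(block_ms(mb), ltr_ms(mb))
# block model: ~40.2 ms, ~402 ms, ~40.2 s; LTR model: ~8.2 ms, ~10 ms, ~208 ms

# Part (c): sort a 240 GB file with 1 GB of memory using one merge pass.
phase1_s = 240 * 2 * ltr_ms(1024) / 1000        # read and write 240 sorted 1 GB runs
buf_mb = 1024 / 241                             # 241 buffers of ~4.25 MB each
pieces = 240 * 1024 / buf_mb                    # buffer-sized pieces of the data
phase2_s = pieces * 2 * ltr_ms(buf_mb) / 1000   # each piece is read once and written once
print(phase1_s, phase2_s, phase1_s + phase2_s)  # ~4919 s, ~5841 s, ~10,760 s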
(d) For the 12-way merge we require 13 buffers, each of size 1024/13 = 78.76 MB. For each buffer, the read (or write) time is 4 + 4 + 78.76/100*1000 = 795.6 ms. The 240 GB can be divided into 240*1024/78.76 = 3121 pieces, so this pass takes 795.6 * 3121 * 2 = 4966.13 s. Next, for the 20-way merge we require 21 buffers, each of size 1024/21 = 48.76 MB. For each buffer, the read (or write) time is 4 + 4 + 48.76/100*1000 = 495.6 ms. The 240 GB can be divided into 240*1024/48.76 = 5041 pieces, so this pass takes 495.6 * 5041 * 2 = 4996.64 s. The total time for the 12-way plus 20-way merges is 4966.13 + 4996.64 = 9962.77 s.

For a 16-way merge we require 17 buffers, each of size 1024/17 = 60.24 MB. For each buffer, the read (or write) time is 4 + 4 + 60.24/100*1000 = 610.4 ms. The 240 GB can be divided into 240*1024/60.24 = 4080 pieces, so this pass takes 610.4 * 4080 * 2 = 4980.86 s. Next, for a 15-way merge we require 16 buffers, each of size 1024/16 = 64 MB. For each buffer, the read (or write) time is 4 + 4 + 64/100*1000 = 648 ms. The 240 GB can be divided into 240*1024/64 = 3840 pieces, so this pass takes 648 * 3840 * 2 = 4976.64 s. The total time for the 16-way plus 15-way merges is 4980.86 + 4976.64 = 9957.50 s, which is slightly faster than the 12-way plus 20-way combination.
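To compare other merge fan-ins, the per-pass cost in (d) can be written as a function of the fan-in k. This Python sketch is not part of the original solution; it keeps full precision for the buffer sizes, so its totals differ slightly from the rounded figures above.

SEEK_MS, ROT_MS = 4, 4
RATE_MB_PER_MS = 0.1            # 100 MB/s
DATA_MB = 240 * 1024            # 240 GB of data to merge
MEM_MB = 1024                   # 1 GB of main memory

def merge_pass_seconds(fan_in):
    """Cost of one k-way merge pass: k input buffers plus one output buffer;
    every buffer-sized piece of the data is read once and written once."""
    buf_mb = MEM_MB / (fan_in + 1)
    io_ms = SEEK_MS + ROT_MS + buf_mb / RATE_MB_PER_MS   # read (or write) one buffer
    pieces = DATA_MB / buf_mb
    return pieces * 2 * io_ms / 1000

print(merge_pass_seconds(12) + merge_pass_seconds(20))   # ~9961 s
print(merge_pass_seconds(16) + merge_pass_seconds(15))   # ~9957 s (slightly faster)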