Multidimensional Data • Many applications of databases are ``geographic'' = 2dimensional data. Others involve large numbers of dimensions. • Example: data mining information about sales. - A sale is described by (store, day, item, color, size, etc.). Sale = point in 5dim space. - A customer is described by (age, salary, pcode, maritalstatus, etc.). Typical Queries • Range queries: ``How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?'' • Nearest neighbor : ``If I am at coordinates (x,y), what is the nearest McDonalds.'' • They are expressible in SQL. Do you see how? SQL • Range queries: ``How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?'‘ SELECT * FROM Customers WHERE age>=45 AND age<=55 AND sal<100; • Nearest neighbor : ``If I am at coordinates (a,b), what is the nearest McDonalds.'‘ Suppose we have a relation Points(x,y,name) SELECT * FROM Points p WHERE p.name=‘McDonalds’ AND NOT EXISTS ( SELECT * FROM POINTS q WHERE (q.x-a)*(q.x-a)+(q.y-b)*(q.y-b) < (p.x-a)*(p.x-a)+(p.y-b)*(p.y-b) AND q.name=‘McDonalds’ ); Big Impediment • For these types of queries, there is no clean way to eliminate lots of records that don't meet the condition of the WHEREclause. Approaches 1. Index on attributes independently. - Intersect pointers in main memory to save disk I/O. - Problem: Does this structure help with the nearestneighbor? 2. Multiple key index: Index on one attribute provides pointer to an index on the other. Computing aggregates • Sales(day,store,item,color,size) - Such relations are called “Data cube” • Example query: • “Summarize the sales of white shirts by day and store.” SELECT FROM Sales s WHERE s.item=‘Shirt’ AND s.color=‘white’ GROUP BY day, store; Attempt at using B-trees for MD-queries • Database = 1,000,000 points evenly distributed in a 1000×1000 square. Stored in 10,000 blocks (100 recs per block) • Range query {(x,y) : 450  x 550, 450  y 550} • B-tree indexes with pointer lists on x and on y • 100,000 pointers (i.e. 1,000,000/10) for x, and same for y • 10,000 pointers for answer (found by pointer intersection) • Root of each B-tree in main memory • Suppose leaves have avg. 200 keys  500 disk I/O in each B-tree to get pointer lists  1000 + 2(for intermediate B-tree level) disk I/O’s • Retrieve 10,000 records. If they are stored randomly we need to do 10,000 I/O’s • Sum 11,002 disk I/O’s • Sequential scan of file = 10,000 I/O’s (100 tuples per block) Nearest Neighbor query using B-trees • Turn NN to (10,20) into a range-query {(x,y):10-d  x 10+d, 20-d  y 20+d } • Possible problem: • No point in the selected range • The closest point inside may not be the answer • Solution: re-execute range query with slightly larger d NN-queries, example • Same relation Points and its indexes on x and y as before, and Query: NN to (10,20) • Choose d = 1  range-query = {(x,y): 9x 11, 19y21} • 2000 points in [9,11], same in [19,21]  each dimension = 10+1 I/O’s to get pointers (+1 is because points with x=9 may not start just at the beginning of the leaf) • With an extra I/O for the intermediate node for each index  24 + 1 disk I/O’s to get the answer, assuming 1 of the 4 points is the answer, which we can determine by their coordinates, prior to getting the data blocks holding the points • However, if d is too small, we have to run another range query with a larger d Grid files (hash-like structure) • Divide data into stripes in each dimension • Rectangle in grid points to bucket • Example: database records (age,salary) for people who buy gold jewelry. Data: (25,60) (45,60) (50,75) (50,100) (50,120) (70,110) (85,140) (30,260) (25,400) (45,350) (50,275) (60,260) Grid file Operations Lookup Find coordinates of point in each dimension --gives you a bucket to search. Nearest Neighbor Lookup point P . Consider points in that bucket. • Problem: there could be points in adjacent buckets that are closer. - Example: NN of (45; 200). • Problem: there could be no points at all in the bucket: widen search? Range Queries Ranges define a region of buckets. • Buckets on border may contain points not in range. • Example: 35 < age <= 45; 50 < salary <= 100. Queries Specifying Only One Attribute • Problem: must search a whole row or column of buckets. Insertion • Use overflow buckets, or split stripes in one or more dimensions • Insert (52,200). Split central bucket, for instance by splitting central salary stripe • The blocks of 3 buckets are to be processed. • In general the blocks of n buckets are to be processed during a split. • n is the number of buckets in the chosen direction • Very expensive. Insertion • Insert (52,200). Split central bucket, for instance by splitting central salary stripe (One possibility) Grid files Advantages • Good for multiple-key search • Supports PM, RQ, NN queries Disadvantages • Space, management overhead • Need partitioning ranges that evenly split keys • Possibility of overflow buckets for insertion Partitioned hashing I • If we hash the concatenation of several keys then such a hash table cannot be used in queries specifying only one dimension (key). • A preferable option is to design the hash function so it produces some number of bits, say k. These k bits are divided among n attributes. • I.e. the hash function h is a concatenation of n hash functions, one for each dimensional attribute. • h = (h1, …, hn) • the bucket where to put a tuple (v1, …, vn) is computed by concatenating the bit sequences h1(v1)…hn(vn). Partitioned hashing II • If we have a hash table with 10-bit bucket numbers (1024 buckets), we could devote four bits to attribute a and the remaining six bits to attribute b. • We hash using ha and hb. • If we are given a partial match query specifying only the value of a, we compute ha(a), which could be, say 0101. Then, we locate all the relevant buckets, which are: 0101000000 to 0101111111. Partitioned hashing III • Example: Gold jewelry with • first bit = age mod 2 • bits 2 and 3: salary mod 4 • Works well for: • partial match (i.e. just an attribute specified) • Bad for: • range • Nearest Neighbors queries Grid files vs. partitioned hashing • If many dimensions  many empty cells in grid. While partitioned hashing is OK. • Both support exact and partial match queries. • Grid files good for range and Nearest Neighbor queries, while partitioned hashing is not at all.