Download Report

Chapter 11: Indexing and Hashing
Database System Concepts, 6th Ed.
©Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use
Basic Concepts
Indexing mechanisms are used to speed up access to data.
 Search Key - attribute to set of attributes used to look up
records in a file.
 An index file consists of records (called index entries) of the
form
search-key
pointer
 Indexes are typically much smaller than the original, e.g.:

Table of contents in a book

Index in a book

Catalog in a library

Inventory in a warehouse
Database System Concepts - 6th Edition
11.2
©Silberschatz, Korth and Sudarshan
Basic Concepts
search-key
pointer
Two basic kinds of indices:
Ordered indices: search keys are stored in sorted
order
Hash indices: search keys are distributed
uniformly across “buckets” using a “hash function”.
Database System Concepts - 6th Edition
11.3
©Silberschatz, Korth and Sudarshan
Index Evaluation Metrics
 Access types supported efficiently, e.g.:

records with a specified value in the attribute

or records with an attribute value falling in a
specified range of values.
 Access time
 Insertion time
 Deletion time
 Space overhead
Database System Concepts - 6th Edition
11.4
©Silberschatz, Korth and Sudarshan
11.2 Ordered Indices
To use an ordered index, we need an index-sequential
file = file that is stored sequentially, according to some
attribute:
This attribute (PK?) will be used as primary index
Database System Concepts - 6th Edition
11.5
©Silberschatz, Korth and Sudarshan
11.2 Ordered Indices
In an ordered index, index entries are stored sorted on the
search key value

E.g. author catalog in library.
 Primary index: in a sequentially ordered file, the index
whose search key specifies the sequential order of the
file.

Also called clustering index

The search key of a primary index is usually (but not
necessarily) the PK.
 Secondary index: an index whose search key specifies
an order different from the sequential order of the file.

Also called non-clustering index.
Database System Concepts - 6th Edition
11.6
©Silberschatz, Korth and Sudarshan
Ordered index: Dense
Dense index — There is an index record for every search-key value
E.g.: Dense and primary index on ID attribute (PK) of instructor
table
Database System Concepts - 6th Edition
11.7
©Silberschatz, Korth and Sudarshan
Ordered index: Dense
E.g.: Dense and primary index on dept_name, with
instructor file sorted on dept_name (not PK)
Database System Concepts - 6th Edition
11.8
©Silberschatz, Korth and Sudarshan
Ordered index: Sparse
Sparse Index: contains index records for only some of the search-key values.
 It is applicable only as primary index, i.e. when the records are physically
ordered on the search-key
Database System Concepts - 6th Edition
11.9
©Silberschatz, Korth and Sudarshan
Ordered index: Sparse
Algorithm for locating a record with search-key value K:

Find index record with largest search-key value < K

Search the file sequentially starting at the record to which the
index record points
Database System Concepts - 6th Edition
11.10
©Silberschatz, Korth and Sudarshan
Ordered index: Sparse
Compared to dense indices, a sparse index:
 Requires less space and less maintenance overhead for insertions and
deletions.
 It is slower for locating records, b/c it requires multiple accesses to the file.
Good tradeoff: sparse index with an index entry for every block in file,
corresponding to least search-key value in the block.
Database System Concepts - 6th Edition
11.11
©Silberschatz, Korth and Sudarshan
QUIZ
What are the two main type of indices used in DBMS?
Database System Concepts - 6th Edition
11.12
©Silberschatz, Korth and Sudarshan
QUIZ
What are the two main type of indices used in DBMS?
Ordered index
Hash index
Database System Concepts - 6th Edition
11.13
©Silberschatz, Korth and Sudarshan
QUIZ
What is the difference between a primary index and a
secondary index?
Database System Concepts - 6th Edition
11.14
©Silberschatz, Korth and Sudarshan
QUIZ
What is the difference between a primary index and a
secondary index?
The key of the primary (a.k.a. clustered) index are the
ones on which the physical file is sorted.
The key of the secondary (a.k.a. non-clustered) index
is not the one on which the physical file is sorted.
Database System Concepts - 6th Edition
11.15
©Silberschatz, Korth and Sudarshan
QUIZ
What is the difference between a dense index and a
sparse index?
Database System Concepts - 6th Edition
11.16
©Silberschatz, Korth and Sudarshan
QUIZ
Which index is dense and which is sparse?
Database System Concepts - 6th Edition
11.17
©Silberschatz, Korth and Sudarshan
QUIZ
Can there be secondary sparse indices?
Give an example of explain why it’s not possible.
Database System Concepts - 6th Edition
11.18
©Silberschatz, Korth and Sudarshan
Secondary Indices
Remember:
 A secondary index is …
 Secondary indices have to be dense! (There are no
secondary sparse indices – Why?)
 Why do we need them?
Database System Concepts - 6th Edition
11.19
©Silberschatz, Korth and Sudarshan
Secondary Index Example
Secondary index on salary field of instructor
Index record points to “bucket” that contains pointers to all the
physical records with that particular search-key value.
Database System Concepts - 6th Edition
11.20
©Silberschatz, Korth and Sudarshan
QUIZ: Classify these indices as
primary/secondary and dense/sparse
?
Database System Concepts - 6th Edition
11.21
©Silberschatz, Korth and Sudarshan
QUIZ: Classify these indices as
primary/secondary and dense/sparse
Images source: http://webserver.ignou.ac.in/virtualcampus/BIT_StudyMaterials/tri8/cst203-bl1-u3.htm
Database System Concepts - 6th Edition
11.22
©Silberschatz, Korth and Sudarshan
QUIZ: What is the difference between these
secondary indices?
Database System Concepts - 6th Edition
11.23
©Silberschatz, Korth and Sudarshan
A: In the one on the right, the pointers are “approximate” –
they point to the correct block, but not necessarily the
correct record within the block
Database System Concepts - 6th Edition
11.24
©Silberschatz, Korth and Sudarshan
Performance of ordered indices
 Pro: substantial speed increase when searching for
records.
 Con: when a table in the DB is modified, every index on
that file(s) storing the table must be updated.
 Pro: Sequential scan using primary index is efficient
 Con: sequential scan using a secondary index is
expensive

Each record access may fetch a new block from disk

Block fetch requires about 5 to 10 ms (HDD), versus
≈ 10-50 ns for main memory access (RAM)
Database System Concepts - 6th Edition
11.25
©Silberschatz, Korth and Sudarshan
Hierarchical/Multilevel Index
For large table, primary index may not fit in main memory
→ store on disk → access becomes expensive.
 Solution: treat primary index kept on disk as a
sequential file and construct a sparse index on it.

outer index – a sparse index of primary index

inner index – the primary index file
Database System Concepts - 6th Edition
11.26
©Silberschatz, Korth and Sudarshan
Two-level index
Database System Concepts - 6th Edition
11.27
©Silberschatz, Korth and Sudarshan
Hierarchical/Multilevel Index
 If even the outer index is too large to fit in main
memory, yet another level of indexing can be created
(three-level), and so on.
 Overhead: Indices at all levels must be updated on
insertion or deletion from the file!
Database System Concepts - 6th Edition
11.28
©Silberschatz, Korth and Sudarshan
Index Update: Deletion
 If deleted record was the
only record in the file with its
particular search-key value,
the search-key is deleted
from the index also.
 Single-level index entry deletion:

Dense indices – deletion of search-key is similar to file record
deletion.

Sparse indices –

if an entry for the search key exists in the index, it is
deleted by replacing the entry in the index with the next
search-key value in the file (in search-key order).
 If
the next search-key value already has an index entry, the
entry is deleted instead of being replaced.
Database System Concepts - 6th Edition
11.29
©Silberschatz, Korth and Sudarshan
Index Update: Insertion
 Single-level index insertion:

Perform a lookup using the search-key value appearing in
the record to be inserted.

Dense indices – if the search-key value does not appear in
the index, insert it.

Sparse indices – if index stores an entry for each block of
the file, no change needs to be made to the index unless a
new block is created.
 If
a new block is created, the first search-key value
appearing in the new block is inserted into the index.
 Multilevel insertion and deletion: algorithms are simple
extensions of the single-level algorithms
Database System Concepts - 6th Edition
11.30
©Silberschatz, Korth and Sudarshan
Disadvantage of indexed-sequential files
 Performance of all operations degrades as file grows.
 Periodic reorganization of entire file is required

E.g. insertion of a tuple requires either:
Overflow block, or
Shifting of all records following it
Database System Concepts - 6th Edition
11.31
©Silberschatz, Korth and Sudarshan
11.3 B+-Tree Index Files
B+-tree indices are an alternative to indexed-sequential files.
B+-tree is a type of balanced tree.
a.k.a. “skinny” tree
Database System Concepts - 6th Edition
11.32
©Silberschatz, Korth and Sudarshan
11.3 B+-Tree Index Files
Why do we need balanced trees?
 Most operations in a tree (search, insert, delete)
are “Big-Oh” of the height
Database System Concepts - 6th Edition
11.33
©Silberschatz, Korth and Sudarshan
B+-Tree
At most n – 1 keys and n pointers in each node. (n = 4 here)
Nodes may be only
partially filled
All leaves must
be on the same
level
To avoid “skinny” trees, all nodes except the root and
leaves must have at least n/2 keys. We say that the
“fill factor” is at least 50%.
Ceiling function
Image
source:
http://en.wikipedia.org/wiki/B%2B_tree
Database System
Concepts
- 6th Edition
11.34

©Silberschatz, Korth and Sudarshan
There are 3 types of pointers in a B+-Tree
• Those inside the tree point to other nodes of the tree.
• Horizontal from one
leaf to the next
• Those in the leaves point to the actual tuple of data
on the disk.
Image
source:
http://en.wikipedia.org/wiki/B%2B_tree
Database System
Concepts
- 6th Edition
11.35
©Silberschatz, Korth and Sudarshan
B+-Tree Node Structure
 Logically, the keys are “in between” the pointers
 The search-keys in a node are ordered
K1 < K2 < K3 < . . . < Kn–1
(Initially assume no duplicate keys, address duplicates later)
 All keys x in the subtree pointed to by Pi lie in between the two
adjacent keys:
Ki-1 ≤ x < Ki
Sometimes we write informally:
Ki-1 ≤ Pi < Ki
Database System Concepts - 6th Edition
11.36
©Silberschatz, Korth and Sudarshan
The B+-tree is a multi-level, index for the leaf nodes!
There’s a mistake in Fig.11.9 of our text.
Database System Concepts - 6th Edition
11.37
©Silberschatz, Korth and Sudarshan
Completing the definition of B+-tree
 Each internal node (i.e. not root or leaf) has
at least n/2 pointers (children).
 A leaf node has at least (n–1)/2 keys.
In many applications, the
fill factor can be adjusted,
even with separate
values for leaves and
internal nodes!
 Special cases:

If the root is not a leaf, it has at least 2
children.

If the root is a leaf (i.e. there are no other
nodes in the tree!), it can have between
0 and (n–1) values.
Database System Concepts - 6th Edition
11.38
©Silberschatz, Korth and Sudarshan
Leaf Nodes in B+-Trees
Properties of a leaf node:
 For i = 1, 2, . . ., n–1, pointer Pi points to a file record with
search-key value Ki,
 If Li, Lj are leaf nodes and i < j, Li’s search-key values are less
than or equal to Lj’s search-key values
 Pn points to next leaf node in search-key order
Database System Concepts - 6th Edition
11.39
©Silberschatz, Korth and Sudarshan
The previous slides contain all the
formal requirements that a B+ tree
must fulfill
To do for next time/review:
 List on one page all the requirements.
 Include the page in your cheat-sheet.
Database System Concepts - 6th Edition
11.40
©Silberschatz, Korth and Sudarshan
Searching in B+-Trees
Problem: Let’s consider a key Ki in an internal node of the
tree; in which leaf node is Ki to be found?
Q: Is it certain that Ki will be found in a leaf?
Database System Concepts - 6th Edition
11.41
©Silberschatz, Korth and Sudarshan
Searching in B+-Trees
Problem: Let’s consider a key Ki in an internal node of the
tree; in which leaf node is Ki to be found?
Q: Is it certain that Ki will be found in a leaf?
A: Yes, b/c the non-leaf nodes are an index to the
leaves!
Database System Concepts - 6th Edition
11.42
©Silberschatz, Korth and Sudarshan
Searching in B+-Trees
Problem: Let’s consider a key Ki in an internal node of
the tree; in which leaf node is Ki to be found?
 Remember the ordering property:
 Two cases are possible:

If Ki is just above the leaf level, then the leaf
cannot start with a smaller key!
Database System Concepts - 6th Edition
11.43
©Silberschatz, Korth and Sudarshan
Searching in B+-Trees
Problem: Let’s consider a key Ki in an internal node of
the tree; in which leaf node is Ki to be found?

If Ki is not just above the leaf level, the first key in
the right subtree must be > Ki (why?), so from this
point on we can take only left pointers to reach Ki .
 Conclusion: In either case, Ki is
the minimal key in the right
subtree!
Image
http://www.cburch.com/cs/340/reading/btree/index.html
Databasesource:
System Concepts
- 6th Edition
11.44
©Silberschatz, Korth and Sudarshan
B+ tree example
Find:
 The value of n
 The fill limit for the leaf nodes (# of keys)
 The fill limit for internal nodes (# of
pointers/children)
 The fill limit for the root
Is this a valid B+ tree?
Database System Concepts - 6th Edition
11.45
©Silberschatz, Korth and Sudarshan
B+ tree example
 n = 6 (max # of pointers/children)
 Leaf nodes must have between at least 3 keys, acc. to
fmla. (n–1)/2.
 Internal nodes must have between at least 3 children,
acc. to fmla. (n/2.
 Root must have at least 2 children, since it’s not the only
node in the tree.
Database System Concepts - 6th Edition
11.46
©Silberschatz, Korth and Sudarshan
Another B+ tree example
Find:
 The value of n
 The fill limit for the leaf nodes (# of keys)
 The fill limit for internal nodes (# of
pointers/children)
 The fill limit for the root
Is this a valid B+ tree?
Database System Concepts - 6th Edition
11.47
©Silberschatz, Korth and Sudarshan
Design problem
 We want each node of a B+ tree to fit as closely as possible in a
block on the disk.
 The block size is 4 KB, and the total capacity of the disk is 2 TB.
 The keys are integers, represented on 4 Bytes.
 The pointers are serial numbers unique to each disk block.

Hint: How many such serial numbers are there?
 Find the optimal value of n for this system.
Database System Concepts - 6th Edition
11.48
©Silberschatz, Korth and Sudarshan
Remarks on B+-trees
 Since the inter-node connections are done by pointers, logically close
blocks need not be physically close.
 The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
 The B+-tree contains a relatively small number of levels
 Level
 Next
 ..
below root has at least 2* n/2 values
level has at least 2* n/2 * n/2 values
etc.

If there are K search-key values in the file, the tree height is no
more than  logn/2(K)

Thus searches can be conducted efficiently.
 Insertions and deletions to the main file can be handled efficiently, as
the index can be restructured in logarithmic time (next slides).
Database System Concepts - 6th Edition
11.49
©Silberschatz, Korth and Sudarshan
Queries on B+-Trees
To find record with search-key value V:
1. C=root
2. While C is not a leaf node
{
1. Let Ki be the smallest key in C such that V  Ki.
2. If no such exists
set C = last non-null pointer in C
3. Else if (V = Ki ) set C = Pi +1
4. Else
set C = Pi
}
//end while
3. Let Ki be the smallest key in C such that Ki = V
1. If there is such a value i, follow pointer Pi to desired tuple on disk
2. Else
no record with search-key value V exists.
Database System Concepts - 6th Edition
11.50
©Silberschatz, Korth and Sudarshan
QUIZ: Search
for key 55 in
this B+ tree
Database System Concepts - 6th Edition
11.51
©Silberschatz, Korth and Sudarshan
QUIZ: Search
for key 85 in
this B+ tree
Database System Concepts - 6th Edition
11.52
©Silberschatz, Korth and Sudarshan
SKIP Handling Duplicates
Database System Concepts - 6th Edition
11.53
©Silberschatz, Korth and Sudarshan
Efficiency of Queries on B+-Trees
 If there are K search-key values in the file, the height of the tree
is no more than logn/2(K).
 A node is generally the same size as a disk block, typically 4 KB

and n is typically around 100 (40 bytes per index entry).
 With 1 million search key values and n = 100

at most log50(1,000,000) = 4 nodes are accessed in a lookup.
 Contrast this with a balanced binary tree with 1 million search key
values — around 20 nodes are accessed in a lookup

above difference is significant since every node access may
need a disk I/O, costing ~ 10 milliseconds
Database System Concepts - 6th Edition
11.54
©Silberschatz, Korth and Sudarshan
Your turn!
 If there are K search-key values in the file,
h ≤ logn/2(K).
 A node is generally the same size as a disk block, typically 4 KB
 Assume:

10 million search key values

Size of one tuple on disk is 100 Bytes

n = 80

Accessing each block on the HDD requires 10 ms
Find and compare the number of nodes accessed with:

B+ tree index

Binary tree index

Sequential search of a non-indexed file
Database System Concepts - 6th Edition
11.55
©Silberschatz, Korth and Sudarshan
Updates on B+-Trees: Insertion
1. Find the leaf node in which the search-key value would appear
2. If the search-key value is already present in the leaf node
1.
Add record to the file (duplicate!)
2.
If necessary add a pointer to the bucket.
3. If the search-key value is not present, then
1.
add the record to the main file (and create a bucket if
necessary)
2.
If there is room in the leaf node, insert (key-value, pointer)
pair in the leaf node
3.
Otherwise, split the node (along with the new (key-value,
pointer) entry) as discussed in the next slide.
Database System Concepts - 6th Edition
11.56
©Silberschatz, Korth and Sudarshan
Insertion may require splitting node(s)
 Splitting a leaf node:

take the n (search-key value, pointer) pairs (including the one
being inserted) in sorted order. Place the first n/2 in the
original node, and the rest in a new node.

let the new node be p, and let k be the least key value in p.
Insert (k,p) in the parent of the node being split.

If the parent is full, split it and propagate the split further up.
 Splitting of nodes proceeds upwards till a node that is not full is
found.

In the worst case the root node may be split increasing the
height of the tree by 1.
EXAMPLES
Database System Concepts - 6th Edition
11.57
©Silberschatz, Korth and Sudarshan
B+-Tree insertion example 1
Split node containing Brandt, Califieri and Crick.
Next step: insert entry with (Califieri,pointer-to-new-node) into parent
B+-Tree before and after insertion of “Adams”
Database System Concepts - 6th Edition
11.58
©Silberschatz, Korth and Sudarshan
B+-Tree insertion example 2
B+-Tree before and after insertion of “Lamport”
Database System Concepts - 6th Edition
11.59
©Silberschatz, Korth and Sudarshan
Insertion in B+-Trees (Cont.)
 Splitting a non-leaf node: when inserting (k,p) into an already full
internal node N

Copy N to an in-memory area M with space for n+1 pointers and n
keys

Insert (k,p) into M

Copy P1,K1, …, K n/2-1,P n/2 from M back into node N

Copy Pn/2+1,K n/2+1,…,Kn,Pn+1 from M into newly allocated node
N’

Insert (K n/2,N’) into parent N
 Read full pseudocode: p.494 of text!
Califieri
Adams Brandt Califieri Crick
Database System Concepts - 6th Edition
Adams Brandt
11.60
Crick
©Silberschatz, Korth and Sudarshan
QUIZ: Insert keys 21, 22, 23 in this B+ tree
Database System Concepts - 6th Edition
11.61
©Silberschatz, Korth and Sudarshan
SKIP Deletion
Database System Concepts - 6th Edition
11.62
©Silberschatz, Korth and Sudarshan
Ways to handle non-unique search keys
 Buckets on separate block (bad idea)
 List of tuple pointers with each key

Extra code to handle long lists

Deletion of a tuple can be expensive if there are many
duplicates on search key (why?)

Low space overhead, no extra cost for queries
 Make search key unique by adding a unique record-identifier,
a.k.a. an uniquifier attribute

Extra storage overhead for keys

Simpler code for insertion/deletion

Widely used
Database System Concepts - 6th Edition
11.63
©Silberschatz, Korth and Sudarshan
Ways to handle non-unique search keys
 Buckets on separate block (bad idea)
 List of tuple pointers with each key

Extra code to handle long lists

Deletion of a tuple can be expensive if there are many
duplicates on search key (why?)

Low space overhead, no extra cost for queries
 Make search key unique by adding a unique record-identifier,
a.k.a. an uniquifier attribute

Extra storage overhead for keys

Simpler code for insertion/deletion

What support does
PostgreSQL offer for
uniquifiers?
Widely used
Database System Concepts - 6th Edition
11.64
©Silberschatz, Korth and Sudarshan
See PostgreSQL lab: SERIAL data type!
 CREATE SEQUENCE test_seq
 nextval('test_seq');
-- Increments the value of the
specified sequence and returns the new value (as an integer)
 currval('test_seq');
-- Returns the value returned by
the most recent nextval(). If nextval() hasn't been used yet, an
error is returned (see 2nd cmd. below).
 setval('test_seq', n);
-- Sets the current value of the
sequence to n.
 DROP SEQUENCE test_seq
Database System Concepts - 6th Edition
11.65
©Silberschatz, Korth and Sudarshan
Conclusion on B+-Tree Index Files
 Advantage of B+-tree index files:

automatically reorganizes itself with small, local,
changes, in the face of insertions and deletions.

Reorganization of entire file is not required to
maintain performance.
 (Minor) disadvantage of B+-trees:

extra insertion and deletion overhead, space
overhead.
 Advantages of B+-trees outweigh disadvantages

B+-trees are used extensively
Database System Concepts - 6th Edition
11.66
©Silberschatz, Korth and Sudarshan
11.4 B+-Tree File Organization
 Idea: Use the B+-Tree structure not only for organizing the
index, but for organizing the file itself
 The leaf nodes in a B+-tree file organization store records,
instead of pointers
Database System Concepts - 6th Edition
11.67
©Silberschatz, Korth and Sudarshan
11.4 B+-Tree File Organization
 Leaf nodes are still required to be half full
 Since records are larger than pointers, the maximum number of
records that can be stored in a leaf node is less than the number
of pointers in a nonleaf node.
 Insertion and deletion are handled in the same way as in a B+-tree
index.
Database System Concepts - 6th Edition
11.68
©Silberschatz, Korth and Sudarshan
B+-Tree File Organization
 Good space utilization is important since records use more space
than pointers.
 To improve space utilization, involve sibling nodes in redistribution
during splits (for insertion) (and merges, for deletion)

If node full, first attempt to redistribute to adjacent siblings

If both adjacent siblings are full, split node and redistribute to the
two new nodes and one of the adjacent siblings!

Result: in each node having at least
Database System Concepts - 6th Edition
11.69
2n / 3
entries
©Silberschatz, Korth and Sudarshan
SKIP pp.502-509 in text
EoL3
Database System Concepts - 6th Edition
11.70
©Silberschatz, Korth and Sudarshan
QUIZ B+ trees
Is this a valid B+ tree?
Insert 42.
Database System Concepts - 6th Edition
11.71
©Silberschatz, Korth and Sudarshan
11.6 Hashing
Basic ideas:
 Tree indices are efficient (height ≤ logn/2(N)) and
mathematically well-understood, but …
 When querying for individual tuples we have too much
precision: although we need only one tuple, reading from
the HDD is still done in entire blocks (~4KB), each of
which usually stores multiple tuples.
 The parable of the two storage cabinets …
Database System Concepts - 6th Edition
11.72
©Silberschatz, Korth and Sudarshan
One little drawer for each item
Image source:
www.quantum-bits.com
Database
System Concepts
- 6th Edition
Larger drawers, with multiple items in each
11.73
©Silberschatz, Korth and Sudarshan
Static Hashing
 Bucket = unit of storage containing one or more
records (typically a disk block).
 Hash file organization: we obtain the bucket of a record
directly from its search-key value using a hash
function.
 Hash function h is a function from the set of all search-
key values K to the set of all bucket addresses B.
 Hash function is used to locate records for query,
insertion and deletion.
 Records w/different search-key values will be mapped
to same bucket → entire bucket is searched
sequentially to locate a record.
Database System Concepts - 6th Edition
11.74
©Silberschatz, Korth and Sudarshan
Example of Hash File Organization
File stores instructor table, using dept_name as key
(See figure in next slide.)
 There are 8 buckets
 The binary representation of the ith character is assumed to be
the integer i.
 The hash function returns the sum of the binary representations
of the characters modulo 8
E.g. h(Music) = 1
h(Physics) = 3
Database System Concepts - 6th Edition
h(History) = 2
h(Elec. Eng.) = 3
11.75
©Silberschatz, Korth and Sudarshan
Database System Concepts - 6th Edition
11.76
©Silberschatz, Korth and Sudarshan
EXAMPLE: Hash functions in Python
The hash function returns the sum of the binary representations of
the characters modulo 8
E.g. h(Music) = 1
h(Physics) = 3
h(History) = 2
h(Elec. Eng.) = 3
Write in Python the function h(string) described
Hints:

The Python library function ord() returns the ASCII
code of its character argument

The modulus is implemented as the operator %
Database System Concepts - 6th Edition
11.77
©Silberschatz, Korth and Sudarshan
Database System Concepts - 6th Edition
11.78
©Silberschatz, Korth and Sudarshan
Hash Functions
 Worst case: hash function maps all search-key values to same
bucket; this makes access time proportional to the number of
search-key values in the file.
 Ideal hash function is uniform → each bucket is assigned the
same average number of search-key values from the set of all
possible values.
 Ideal hash function is random, so each bucket will have the
same number of records assigned to it irrespective of the actual
distribution of search-key values in the file.

To achieve (pseudo)randomness, typical hash functions
perform computations on the internal binary representation
of the search-key, as in the prev. example.
Database System Concepts - 6th Edition
11.79
©Silberschatz, Korth and Sudarshan
What to do when a bucket becomes full?
Database System Concepts - 6th Edition
11.80
©Silberschatz, Korth and Sudarshan
Handling of Bucket Overflows
Bucket overflow can occur because of
 Insufficient buckets
 Skew in distribution of records → two reasons:

multiple records have same search-key value

chosen hash function produces non-uniform
distribution of key values
Although the probability of bucket overflow can be reduced,
it cannot be eliminated; it is handled by using overflow
buckets.
Database System Concepts - 6th Edition
11.81
©Silberschatz, Korth and Sudarshan
 Overflow chaining – the overflow buckets of a given
bucket are chained together in a linked list.

This is called closed hashing.
 An alternative exists, called open hashing, which does not
use overflow buckets. Not suitable for DB apps.!
E.g. linear
probing
Database System Concepts - 6th Edition
11.82
©Silberschatz, Korth and Sudarshan
Hash Indices
 Hashing can be used not only for file organization, but
also for index-structure creation.
 A hash index organizes the search keys, with their
associated record pointers, into a hash file structure.
 Strictly speaking, hash indices are always secondary
indices

if the file itself is organized using hashing, a separate
primary hash index on it using the same search-key
is unnecessary.

However, we use the term hash index to refer to
both secondary index structures and hash organized
files.
Database System Concepts - 6th Edition
11.83
©Silberschatz, Korth and Sudarshan
Example of Hash Index
Hash index on instructor.ID
Database System Concepts - 6th Edition
11.84
©Silberschatz, Korth and Sudarshan
Deficiencies of Static Hashing
 Function h maps search-key values to a fixed set of B of bucket
addresses. Databases grow or shrink with time.

If initial number of buckets is too small, and file grows,
performance will degrade due to too much overflows.

If space is allocated for anticipated growth, a significant amount of
space will be wasted initially (and buckets will be underfull).

If database shrinks, again space will be wasted.
 One solution: periodic re-organization of the file with a new hash
function

Expensive, disrupts normal operations
 Better solution: allow the number of buckets to be modified
dynamically.
Database System Concepts - 6th Edition
11.85
©Silberschatz, Korth and Sudarshan
11.7 Dynamic Hashing
 Good for DBs that grow and shrink in size
 Allows the hash function to be modified dynamically
 Extendable hashing – one form of dynamic hashing

Hash function generates values over a large range — typically b-bit
integers, with b = 32.

At any time use only a prefix of the hash function to index into a table
of bucket addresses.
Let the length of the prefix be i bits, 0  i  32.
 Bucket address table size = 2i. Initially i = 0
 Value of i grows and shrinks as the table grows and shrinks.
Multiple entries in the bucket address table may point to the same
bucket (why?)
Thus, actual # of buckets is < 2i
 The # of buckets also changes dynamically due to merging and
splitting of buckets.



Database System Concepts - 6th Edition
11.86
©Silberschatz, Korth and Sudarshan
Extendable Hash Structure
In this structure, i2 = i3 = i, whereas i1 = i – 1 (see next
slide for details)
Database System Concepts - 6th Edition
11.87
©Silberschatz, Korth and Sudarshan
Use of Extendable Hash Structure
 Each bucket j stores a value ij

All the entries that point to the same bucket have the same values
on the first ij bits.
 To locate the bucket containing search-key Kj:
1. Compute h(Kj) = X
2. Use the first i high order bits of X as a displacement into bucket
address table, and follow the pointer to appropriate bucket
 To insert a record with search-key value Kj

follow same procedure as look-up and locate the bucket, say j.

If there is room in the bucket j insert record in the bucket.

Else the bucket must be split and insertion re-attempted (next
slide.)
 Overflow
Database System Concepts - 6th Edition
buckets used instead in some cases
11.88
©Silberschatz, Korth and Sudarshan
Insertion in Extendable Hash Structure
To split a bucket j when inserting record with search-key value Kj:
 If i > ij (more than one pointer to bucket j)

allocate a new bucket z, and set ij = iz = (ij + 1)
 Update the second half of the bucket address table entries originally
pointing to j, to point to z

remove each record in bucket j and reinsert (in j or z)
 recompute new bucket for Kj and insert record in the bucket (further
splitting is required if the bucket is still full)
 If i = ij (only one pointer to bucket j)
 If i reaches some limit b, or too many splits have happened in this
insertion, create an overflow bucket
 Else
 increment i and double the size of the bucket address table.
 replace each entry in the table by two entries that point to the
same bucket.
 recompute new bucket address table entry for Kj
Now i > ij so use the first case above.
Database System Concepts - 6th Edition
11.89
©Silberschatz, Korth and Sudarshan
Use of Extendable Hash Structure: Example
Database System Concepts - 6th Edition
11.90
©Silberschatz, Korth and Sudarshan
Example (Cont.)
 Initial Hash structure; bucket size = 2
Database System Concepts - 6th Edition
11.91
©Silberschatz, Korth and Sudarshan
Example (Cont.)

Hash structure after insertion of “Mozart”, “Srinivasan”,
and “Wu” records
Database System Concepts - 6th Edition
11.92
©Silberschatz, Korth and Sudarshan
Example (Cont.)
 Hash structure after insertion of Einstein record
Database System Concepts - 6th Edition
11.93
©Silberschatz, Korth and Sudarshan
Example (Cont.)
 Hash structure after insertion of Gold and El Said records
Database System Concepts - 6th Edition
11.94
©Silberschatz, Korth and Sudarshan
Your turn!
Insert (Katz, Comp.Sci.) record.
Database System Concepts - 6th Edition
11.95
©Silberschatz, Korth and Sudarshan
Example (Cont.)

Hash structure after insertion of Katz record
Database System Concepts - 6th Edition
11.96
©Silberschatz, Korth and Sudarshan
Example (Cont.)
And after insertion of
eleven records
Database System Concepts - 6th Edition
11.97
©Silberschatz, Korth and Sudarshan
Example (Cont.)
And after insertion of
Kim record in previous
hash structure
Database System Concepts - 6th Edition
11.98
©Silberschatz, Korth and Sudarshan
SKIP: Deletion in Extendable Hash
Structure
Database System Concepts - 6th Edition
11.99
©Silberschatz, Korth and Sudarshan
Hashing pros and cons
 Benefits of extendable hashing:
Hash performance does not degrade with growth of file
 Minimal space overhead
 Disadvantages of extendable hashing
 Extra level of indirection to find desired record
 Bucket address table may itself become very big (larger than
memory)
 Cannot allocate very large contiguous areas on disk either
 Solution: B+-tree structure to locate desired record in bucket
address table
 Changing size of bucket address table is an expensive operation
 Linear hashing is an alternative mechanism
 Allows incremental growth of its directory (equivalent to bucket
address table) …
 … at the cost of more bucket overflows.

Database System Concepts - 6th Edition
11.100
©Silberschatz, Korth and Sudarshan
SKIP
11.8, 11.9
Database System Concepts - 6th Edition
11.101
©Silberschatz, Korth and Sudarshan
11.10 Index definition in SQL
Note: In keeping with the declarative nature of SQL, there is no
standard way provided to specify what type of index, or any other
index details .
Database System Concepts - 6th Edition
11.102
©Silberschatz, Korth and Sudarshan
To do for next time:
Review problems for Ch.11:
3, 4, 5, 6
15, 16, 17, 19, 22
Database System Concepts - 6th Edition
11.103
©Silberschatz, Korth and Sudarshan
Chapter 12 review:
Indexing and Hashing
 Basic Concepts
 Ordered Indices
 B+-Tree Indices
 B-Tree Index Files
 Static Hashing
 Dynamic Hashing
 Comparison of Ordered Indexing and Hashing
Database System Concepts - 6th Edition
11.104
©Silberschatz, Korth and Sudarshan
B+-Tree properties
 Paths from the root to any leaf have the same length (tree is
perfectly balanced!)
 Keys Ki are sorted (assume no duplicates!)
 Each inner node (not root or leaf) has between n/2 and n children

Pi points to subtree with key values x s.t. Ki-1 ≤ x < Ki
 Leaf node has between (n–1)/2 and n–1 keys

Pi points to tuple with key Ki

Pn points to next leaf
 The root normally has to have at least 2 children.

If the root is a leaf (there are no other nodes in the tree), it can
have between 0 and n–1 keys.
Database System Concepts - 6th Edition
11.105
©Silberschatz, Korth and Sudarshan
QUIZ: Verify all the conditions for this B+ tree
 n=





Database System Concepts - 6th Edition
11.106
©Silberschatz, Korth and Sudarshan
QUIZ: Verify all the conditions for this B+ tree
 n=





Database System Concepts - 6th Edition
11.107
©Silberschatz, Korth and Sudarshan
QUIZ: Are these valid B+ trees?
Database System Concepts - 6th Edition
11.108
©Silberschatz, Korth and Sudarshan
QUIZ: Insert the keys 42, 22, 52, 1, 2, 3 in a
B+ tree with n = 4, that is initially empty
Database System Concepts - 6th Edition
11.109
©Silberschatz, Korth and Sudarshan
QUIZ: Insert keys 21, 55, 22, 23 in this
B+ tree file
Database System Concepts - 6th Edition
11.110
©Silberschatz, Korth and Sudarshan